I've been running xpra 0.2.0 on Ubuntu (11.10 for the server, 11.04 for the client) for the last two days. My client disconnects overnight and on both days, when I attempt to connect the next morning, the client does not attach.
doole@andammo:~$ xpra attach ssh:reorx:1
Connection failed: [Errno 2] No such file or directory
connection lost: empty marker in read queue
Connection lost
I took a look at .xpra/reorx-1.log, but I don't see anything obvious. I have attached both reorx-1.log and reorx-1.log.old.
On the server, I can see the xpra process running:
doole@reorx:~$ ps -ef | grep xpra
doole 915 1 52 Apr26 ? 11:05:18 /usr/bin/python /opt/xpra/bin/xpra start :1
but I am unable to stop it:
doole@reorx:~$ xpra stop :1
Connection failed: [Errno 2] No such file or directory
doole@reorx:~$ xpra stop
Usage: xpra start DISPLAY
       xpra attach [DISPLAY]
       xpra detach [DISPLAY]
       xpra screenshot filename [DISPLAY]
       xpra version [DISPLAY]
       xpra stop [DISPLAY]
       xpra list
       xpra upgrade DISPLAY
xpra: error: cannot find a live server to connect to
So I manually kill the xpra session (kill 915 - process is gone) and restart xpra. When I reconnect from the client, all my windows are still present (so killing xpra didn't wipe out the window manager for some reason) although the windows have all moved to +0+0 on the screen.
In case it matters, my line to start xpra is:
/usr/bin/xpra start :1
I was previously using 0.0.7.34 and did not see this behaviour.
I suspect that you'll need more diagnostics - just let me know what to gather.
Connection failed: [Errno 2] No such file or directory
Is a little odd: it means the socket is gone, though why is not entirely clear. Can you check whether the xpra process still has it open?
Once the socket is gone, xpra stop will fail, and so will any attempt to connect.
Maybe you could try to strace the process next time this happens to see what it is doing, or just run the server in debug mode (with '-d all') so we can get some diagnostics.
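One way to check whether the server process still holds the socket open is sketched below. It is Linux-only and assumes the usual /proc layout; the function names are made up for illustration and are not part of xpra. The kernel should keep listing the bound path in /proc/net/unix for as long as the socket is open, even if the file in ~/.xpra has been removed, so a match here would suggest the file was deleted from under a live server.

```python
import os


def socket_inodes_for_path(proc_net_unix_text, path):
    """Return the inode numbers of unix sockets bound to `path`,
    given the text content of /proc/net/unix."""
    inodes = set()
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        # Columns: Num RefCount Protocol Flags Type St Inode [Path]
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].strip() == path:
            inodes.add(fields[6])
    return inodes


def pid_holds_socket(pid, path):
    """True if process `pid` has an open fd on a unix socket bound to `path`."""
    with open("/proc/net/unix") as f:
        inodes = socket_inodes_for_path(f.read(), path)
    fd_dir = "/proc/%d/fd" % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed while we were scanning
        # Socket fds show up as symlinks of the form "socket:[<inode>]"
        if target.startswith("socket:[") and target[8:-1] in inodes:
            return True
    return False
```

For the process in this ticket, the call would look like pid_holds_socket(915, "/home/doole/.xpra/reorx-1"), run as the same user as the server.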
Hah, I think I got it, from the log:
wimpiggy.selection.AlreadyOwned
removing socket /home/doole/.xpra/reorx-1
When you tried to start the server again, it failed because one was already running. During cleanup, it deleted what it thought was its own socket but was in fact the socket of the active server.
Until I can fix this bug, just make sure you don't start the server again if one is already running, or use the "--use-display" flag, which will force the old server to exit and start the new one.
Hmmm, looking at the code, that cannot be the case:
is called before we register the cleanup_socket handler, so this would have failed if the socket still existed... So something must be deleting that socket...
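The ordering argument above can be illustrated with a minimal sketch. This is not xpra's code: the call referred to in the comment is not shown in the ticket, and start_server_socket and the cleanups list are hypothetical names. The point is only that when the bind happens first and fails on an existing path, the cleanup handler is never registered, so a second server cannot delete the first server's socket.

```python
import os
import socket


def start_server_socket(path, cleanups):
    """Bind a unix socket at `path`; register its removal only on success."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)  # raises OSError (EADDRINUSE) if the path exists
    except OSError:
        sock.close()
        raise
    sock.listen(5)

    def remove():
        # Only registered when this process created the file itself,
        # so removing it at exit cannot hit another server's socket.
        if os.path.exists(path):
            os.unlink(path)

    cleanups.append(remove)
    return sock
```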
Both logs say that there is already a window manager running for the :1 session. Can you try with a clean/brand new display number, clean logs and "
And maybe even run a script every few minutes to check the state of the socket in
while true; do date; ls -la ~/.xpra; xpra list; sleep 60; done
And pipe that to a logfile to check in the morning?
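A variant of that monitoring loop, as a rough sketch: it only inspects the socket directory and never connects to the server, so the check itself cannot disturb it. The names snapshot and watch are made up for illustration, and the directory and logfile paths are whatever you pass in.

```python
import os
import stat
import time


def snapshot(sockdir):
    """Map each entry in `sockdir` to 'socket', 'file', 'dir' or 'other'."""
    state = {}
    for name in sorted(os.listdir(sockdir)):
        try:
            mode = os.lstat(os.path.join(sockdir, name)).st_mode
        except FileNotFoundError:
            continue  # entry vanished between listdir() and lstat()
        if stat.S_ISSOCK(mode):
            state[name] = "socket"
        elif stat.S_ISREG(mode):
            state[name] = "file"
        elif stat.S_ISDIR(mode):
            state[name] = "dir"
        else:
            state[name] = "other"
    return state


def watch(sockdir, logfile, interval=60, iterations=None):
    """Append a timestamped snapshot to `logfile` every `interval` seconds.
    Runs forever when `iterations` is None."""
    done = 0
    with open(logfile, "a") as log:
        while iterations is None or done < iterations:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write("%s %r\n" % (stamp, snapshot(sockdir)))
            log.flush()  # keep the log readable while the loop is running
            done += 1
            if iterations is None or done < iterations:
                time.sleep(interval)
```

Something like watch(os.path.expanduser("~/.xpra"), "/tmp/xpra-watch.log") would then show the minute in which the socket entry disappears.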
Maybe a cron job is running at a specific time that breaks something? Or maybe the DHCP lease is renewed (shouldn't matter - but maybe the hostname or domainname changes?), or the network drops for a few seconds, or...
I needed to do a reboot of my server machine, so I was able to run the script on a completely fresh machine. As luck would have it, it failed fairly quickly.
The other thing I realized is that I had a cron task to restart xpra if it failed. I stopped that job.
I have attached two files:
Also, although "xpra list" is saying there are no sessions, xpra is still running at the server:
doole@reorx:~$ ps -ef | grep xpra
doole 2035 1 31 13:59 ? 00:08:53 /usr/bin/python /opt/xpra/bin/xpra start :1
doole 2505 2504 0 14:01 ? 00:00:00 /usr/bin/python /opt/xpra/bin/xpra _proxy :1
The client session that I had established before the failure is still usable (I can even spawn new windows from within the session.)
Shoot, I forgot "-d all" when I started xpra on that last run. Trying again...
Just added reorx-1.log.20120430-2. This is the server log with "-d all". (It looks like whatever the problem is happens fairly quickly.)
At the end of the log there will be a bunch of events from when I touched the window managed by xpra. This occurred after the server lost track of the session.
I just tried another run. This time I didn't establish any client connection and the server still failed. (I have the server log with "-d all" if you think it will be useful.)
In all the runs I did today, the failure seems to happen about 20 minutes after starting the server. (That may have been the case as well last week, but since the failure doesn't kill an established connection, I wouldn't have noticed.)
OK, that was easy with the reorx-1.log.20120430 logfile (the earlier reorx-1.log files were no good because they were overwritten when starting/trying-to-start a new server):
(...)
New connection received
too many connections (20), ignoring new one
Something keeps connecting (every minute from what you are saying, so this will be the script above) but fails to disconnect (or we just miss the event?), eventually causing the server to refuse new connections. The "xpra list" client then comes in, fails to connect, decides that the socket is dead and deletes it!
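To illustrate the last step of that chain, here is a sketch of a more conservative liveness probe. It is not what xpra does, and probe and cleanup_if_dead are hypothetical names. The idea is to delete a socket file only when the connection is actively refused (nothing is listening), and to leave it alone for any other failure, including a busy server that accepts the connection and then drops it.

```python
import errno
import os
import socket


def probe(path, timeout=2.0):
    """Classify a unix socket path as 'LIVE', 'DEAD', 'MISSING' or 'UNKNOWN'."""
    if not os.path.exists(path):
        return "MISSING"
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
    except OSError as e:
        if e.errno == errno.ECONNREFUSED:
            return "DEAD"  # the file exists but nothing is listening on it
        return "UNKNOWN"   # timeout, permissions, ...: do not touch the file
    finally:
        s.close()
    # The connection was accepted by the kernel, so a listener exists,
    # whatever the server decides to do with the connection afterwards.
    return "LIVE"


def cleanup_if_dead(path):
    """Remove the socket file only when the probe shows it is stale."""
    if probe(path) == "DEAD":
        os.unlink(path)
        return True
    return False
```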
What I need to do:
Fixed in r778: there was a race in the network protocol threading code, making the server not unregister connections... and leading to a DoS.
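The actual change in r778 is not shown in this ticket. As a generic illustration of the failure mode only, the sketch below uses hypothetical names (ConnectionRegistry, serve) and takes the limit of 20 from the log message above: if connections are registered but never unregistered, the limit is reached after 20 short-lived clients and every later client is refused, whereas unregistering in a finally block keeps the count accurate even when the handler fails.

```python
import threading


class ConnectionRegistry:
    """Tracks active connections under a lock and enforces a limit."""

    def __init__(self, max_connections=20):
        self._lock = threading.Lock()
        self._active = set()
        self.max_connections = max_connections

    def register(self, conn_id):
        """Return False (connection refused) once the limit is reached."""
        with self._lock:
            if len(self._active) >= self.max_connections:
                return False
            self._active.add(conn_id)
            return True

    def unregister(self, conn_id):
        with self._lock:
            self._active.discard(conn_id)

    def count(self):
        with self._lock:
            return len(self._active)


def serve(registry, conn_id, handler):
    """Run `handler` for one connection, always unregistering afterwards."""
    if not registry.register(conn_id):
        return "refused"
    try:
        handler(conn_id)
        return "ok"
    finally:
        # Runs on normal return and on exceptions, so no entry can leak.
        registry.unregister(conn_id)
```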
This may warrant a
To cause this bug with 0.2.0, simply run:
while true; do xpra list; sleep 0.1; done
oops, you will also need r779
I have confirmed that r779 fixes the problem. Thanks.
One thing is still puzzling me though: when I first saw the problem, what was causing all the connection attempts that would mess up the server? The only external connection attempt should have been my client establishing a session, and that wouldn't happen more than a few times in a 24-hour period.
So where did the rest of the connection attempts come from?
If you are running winswitch, it will run the equivalent of "xpra list" regularly (in particular whenever there is umtp traffic, login/logout, etc)
I'm not using winswitch - just xpra by itself.
Looking at the log, I don't see a lot of connection attempts. So while you've fixed the problem, I'm just curious how I hit it. (Oh well, not a big deal.)
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/118