#118 closed defect (fixed)
xpra server dies overnight
Reported by: | Doug Doole | Owned by: | Antoine Martin |
---|---|---|---|
Priority: | critical | Milestone: | 0.3 |
Component: | server | Version: | 0.2.0 |
Keywords: | Cc: |
Description (last modified by )
I've been running xpra 0.2.0 on Ubuntu (11.10 for the server, 11.04 for the client) for the last two days. My client disconnects overnight, and on both days, when I attempt to connect the next morning, the client does not attach:
    doole@andammo:~$ xpra attach ssh:reorx:1
    Connection failed: [Errno 2] No such file or directory
    connection lost: empty marker in read queue
    Connection lost
I took a look at .xpra/reorx-1.log, but I don't see anything obvious. I have attached both reorx-1.log and reorx-1.log.old.
On the server, I can see the xpra process running:
    doole@reorx:~$ ps -ef | grep xpra
    doole      915     1 52 Apr26 ?        11:05:18 /usr/bin/python /opt/xpra/bin/xpra start :1
but I am unable to stop it:
    doole@reorx:~$ xpra stop :1
    Connection failed: [Errno 2] No such file or directory
    doole@reorx:~$ xpra stop
    Usage:
        xpra start DISPLAY
        xpra attach [DISPLAY]
        xpra detach [DISPLAY]
        xpra screenshot filename [DISPLAY]
        xpra version [DISPLAY]
        xpra stop [DISPLAY]
        xpra list
        xpra upgrade DISPLAY
    xpra: error: cannot find a live server to connect to
So I manually kill the xpra session (kill 915 - process is gone) and restart xpra. When I reconnect from the client, all my windows are still present (so killing xpra didn't wipe out the window manager for some reason) although the windows have all moved to +0+0 on the screen.
In case it matters, my line to start xpra is:
    /usr/bin/xpra start :1
I was previously using 0.0.7.34 and did not see this behaviour.
I suspect that you'll need more diagnostics - just let me know what to gather.
Attachments (5)
Change History (20)
Changed 10 years ago by
Attachment: | reorx-1.log added |
---|
Changed 10 years ago by
Attachment: | reorx-1.log.old added |
---|
comment:1 Changed 10 years ago by
Description: | modified (diff) |
---|---|
Status: | new → accepted |
comment:2 Changed 10 years ago by
Hah, I think I got it, from the log:
    wimpiggy.selection.AlreadyOwned
    removing socket /home/doole/.xpra/reorx-1
When you tried to start the server again, the new server failed because one was already running, but during its cleanup code it deleted what it thought was its own socket, when it was in fact the socket of the active server.
Until I can fix this bug, just make sure you don't start the server again if one is already running, or use the "--use-display" flag, which will force the old server to exit and start the new one.
comment:3 Changed 10 years ago by
Hmmm, looking at the code, that cannot be the case: create_unix_domain_socket is called before we register the cleanup_socket handler, so this would have failed if the socket still existed... So something must be deleting that socket...
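To illustrate the ordering argument (again a sketch with assumed names, not the actual xpra code): binding a unix domain socket fails with EADDRINUSE while the socket path still exists, so a cleanup handler that is only registered after a successful bind can never end up installed for a pre-existing socket.

```python
# Illustration only: cleanup is registered after the socket is created.
import atexit
import os
import socket

def create_unix_domain_socket(path):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)       # raises OSError (EADDRINUSE) if `path` already exists
    sock.listen(5)
    return sock

def start_server(path):
    sock = create_unix_domain_socket(path)    # fails first if the socket is still there...
    atexit.register(os.unlink, path)          # ...so this handler is never registered
    return sock
```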
Both logs say that there is already a window manager running for the :1 session; can you try with a clean/brand new display number, clean logs and "-d all"?
comment:4 Changed 10 years ago by
And maybe even run a script every few minutes to check the state of the socket in .xpra:

    while true; do date; ls -la ~/.xpra; xpra list; sleep 60; done
And pipe that to a logfile to check in the morning?
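If a shell loop is awkward to leave running, a rough Python equivalent that appends everything to a logfile might look like this (the logfile path and the 60-second interval are just examples, not anything xpra provides):

```python
#!/usr/bin/env python3
# Record a timestamp, the contents of ~/.xpra and the output of "xpra list"
# every minute, so the state leading up to the failure can be reconstructed.
import os
import subprocess
import time

LOGFILE = os.path.expanduser("~/xpra-track.log")   # example path
XPRA_DIR = os.path.expanduser("~/.xpra")

while True:
    with open(LOGFILE, "a") as log:
        log.write("==== %s ====\n" % time.ctime())
        for cmd in (["ls", "-la", XPRA_DIR], ["xpra", "list"]):
            result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT,
                                    universal_newlines=True)
            log.write("$ %s\n%s\n" % (" ".join(cmd), result.stdout))
    time.sleep(60)
```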
Maybe a cron job is running at a specific time that breaks something? Or maybe the DHCP lease is renewed (shouldn't matter - but maybe the hostname or domainname changes?), or the network drops for a few seconds, or...
comment:5 Changed 10 years ago by
I needed to do a reboot of my server machine, so I was able to run the script on a completely fresh machine. As luck would have it, it failed fairly quickly.
The other thing I realized is that I had a cron task to restart xpra if it failed. I stopped that job.
I have attached two files:
- xpra.track is the output of your script. (I added in a "host reorx" call as well, just in case the IP address of the machine changed. It didn't.)
- reorx-1.log.20120430 is the log from start to failure. The interesting observation here is that a whole bunch of "New connection received" messages came in, but I wasn't creating any new connections. (I just have the one active connection from the client.)
Also, although "xpra list" is saying there are no sessions, xpra is still running at the server:
    doole@reorx:~$ ps -ef | grep xpra
    doole     2035     1 31 13:59 ?        00:08:53 /usr/bin/python /opt/xpra/bin/xpra start :1
    doole     2505  2504  0 14:01 ?        00:00:00 /usr/bin/python /opt/xpra/bin/xpra _proxy :1
The client session that I had established before the failure is still usable (I can even spawn new windows from within the session.)
Changed 10 years ago by
Attachment: | xpra.track added |
---|
Changed 10 years ago by
Attachment: | reorx-1.log.20120430 added |
---|
comment:6 Changed 10 years ago by
Shoot, I forgot "-d all" when I started xpra on that last run. Trying again...
Changed 10 years ago by
Attachment: | reorx-1.log.21020430-2 added |
---|
comment:7 Changed 10 years ago by
Just added reorx-1.log.20120430-2. This is the server log with "-d all". (It looks like whatever the problem is happens fairly quickly.)
At the end of the log there will be a bunch of events from when I touched the window managed by xpra. This occurred after the server lost track of the session.
comment:8 Changed 10 years ago by
I just tried another run. This time I didn't establish any client connection and the server still failed. (I have the server log with "-d all" if you think it will be useful.)
In all the runs I did today, the failure seems to happen about 20 minutes after starting the server. (That may have been the case as well last week, but since the failure doesn't kill an established connection, I wouldn't have noticed.)
comment:9 Changed 10 years ago by
Priority: | major → critical |
---|
OK, that was easy with the reorx-1.log.20120430 logfile (the earlier reorx-1.log files were no good because they were overwritten when starting/trying-to-start a new server):

    (...)
    New connection received
    too many connections (20), ignoring new one

Something keeps connecting (every minute from what you are saying - this will be the script above), but fails to disconnect (or we just miss the event?), eventually causing the server to refuse new connections; the "xpra list" client then comes in, fails to connect and decides that the socket is dead and deletes it!
What I need to do:
- ensure connections either become the live one or are dropped after a few seconds (that should already be the case - bug)
- try to distinguish a rejected connection from a dead socket, and only mark it as dead if it is the latter (see the sketch below)
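One way the second point could work, sketched here as an assumption rather than as the actual fix: only treat the socket as dead when nothing is listening on it (ECONNREFUSED), and leave it alone when a server accepts the connection, even if that server then rejects the client because it is overloaded.

```python
# Sketch only: one possible way to tell a dead socket from a busy server.
import errno
import socket

def probe_socket(path, timeout=5):
    """Return "dead", "alive" or "unknown" for a unix domain socket path."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except socket.error as e:
        if e.errno in (errno.ECONNREFUSED, errno.ENOENT):
            return "dead"       # nothing is listening any more: safe to clean up
        return "unknown"        # timeout, permissions, ...: better not to delete
    else:
        return "alive"          # something accepted, even if it rejects us later
    finally:
        sock.close()
```

Only the "dead" case would justify removing the socket file.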
comment:10 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
Fixed in r778: there was a race in the network protocol threading code which meant the server did not always unregister connections... leading to a DoS.
This may warrant a 0.2.1 release.
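For reference, the general pattern such a fix has to enforce (a generic sketch, not the actual r778 change): registration and unregistration of connections must be serialized, and every code path that tears a connection down, including the network thread's error handling, must go through the same unregister step.

```python
# Generic sketch of serialized connection tracking, not xpra's real code.
import threading

MAX_CONNECTIONS = 20     # the limit seen in the log above

class ConnectionRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._connections = set()

    def register(self, conn):
        with self._lock:
            if len(self._connections) >= MAX_CONNECTIONS:
                return False               # refuse: too many connections
            self._connections.add(conn)
            return True

    def unregister(self, conn):
        # Must be called from every path that closes a connection,
        # otherwise the count only ever grows and new clients are refused.
        with self._lock:
            self._connections.discard(conn)
```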
comment:12 Changed 10 years ago by
I have confirmed that r779 fixes the problem. Thanks.
One thing is still puzzling me though: when I first saw the problem, what was causing all the connection attempts that would mess up the server? The only external connection attempt should have been my client establishing a session, and that wouldn't happen more than a few times in a 24-hour period.
So where did the rest of the connection attempts come from?
comment:13 Changed 10 years ago by
If you are running winswitch, it will run the equivalent of "xpra list" regularly (in particular whenever there is utmp traffic: logins/logouts, etc.)
comment:14 Changed 10 years ago by
I'm not using winswitch - just xpra by itself.
Looking at the log, I don't see a lot of connection attempts. So while you've fixed the problem, I'm just curious how I hit it. (Oh well, not a big deal.)
comment:15 Changed 16 months ago by
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/118
The "Connection failed: [Errno 2] No such file or directory" error is a little odd: it means the socket is gone - why is not entirely clear.
Can you check if the xpra process still has it open?
After that, xpra stop will fail, and connect also..
Maybe you could try to strace the process next time this happens to see what it is doing, or just run the server in debug mode (with "-d all") so we can get some diagnostics.