
Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#118 closed defect (fixed)

xpra server dies overnight

Reported by: Doug Doole
Owned by: Antoine Martin
Priority: critical
Milestone: 0.3
Component: server
Version: 0.2.0
Keywords:
Cc:

Description (last modified by Antoine Martin)

I've been running xpra 0.2.0 on Ubuntu (11.10 for the server, 11.04 for the client) for the last two days. My client disconnects overnight and on both days, when I attempt to connect the next morning, the client does not attach.

doole@andammo:~$ xpra attach ssh:reorx:1
Connection failed: [Errno 2] No such file or directory
connection lost: empty marker in read queue
Connection lost

I took a look at .xpra/reorx-1.log, but I don't see anything obvious. I have attached both reorx-1.log and reorx-1.log.old.

On the server, I can see the xpra process running:

doole@reorx:~$ ps -ef | grep xpra
doole      915     1 52 Apr26 ?        11:05:18 /usr/bin/python /opt/xpra/bin/xpra start :1

but I am unable to stop it:

doole@reorx:~$ xpra stop :1
Connection failed: [Errno 2] No such file or directory
doole@reorx:~$ xpra stop
Usage: 
        xpra start DISPLAY
        xpra attach [DISPLAY]
        xpra detach [DISPLAY]
        xpra screenshot filename [DISPLAY]
        xpra version [DISPLAY]
        xpra stop [DISPLAY]
        xpra list
        xpra upgrade DISPLAY

xpra: error: cannot find a live server to connect to

So I manually kill the xpra session (kill 915 - process is gone) and restart xpra. When I reconnect from the client, all my windows are still present (so killing xpra didn't wipe out the window manager for some reason) although the windows have all moved to +0+0 on the screen.

In case it matters, my line to start xpra is:

/usr/bin/xpra start :1

I was previously using 0.0.7.34 and did not see this behaviour.

I suspect that you'll need more diagnostics - just let me know what to gather.

Attachments (5)

reorx-1.log (1.0 KB) - added by Doug Doole 8 years ago.
reorx-1.log.old (1.0 KB) - added by Doug Doole 8 years ago.
xpra.track (10.1 KB) - added by Doug Doole 8 years ago.
reorx-1.log.20120430 (2.4 KB) - added by Doug Doole 8 years ago.
reorx-1.log.21020430-2 (129.3 KB) - added by Doug Doole 8 years ago.


Change History (19)

Changed 8 years ago by Doug Doole

Attachment: reorx-1.log added

Changed 8 years ago by Doug Doole

Attachment: reorx-1.log.old added

comment:1 Changed 8 years ago by Antoine Martin

Description: modified (diff)
Status: new → accepted

The error

Connection failed: [Errno 2] No such file or directory

is a little odd: it means the socket is gone, though why is not entirely clear.
Can you check whether the xpra process still has it open?

After that, xpra stop will fail, and so will any attempt to connect.

Maybe you could try to strace the process next time this happens to see what it is doing, or just run the server in debug mode (with '-d all') so we can get some diagnostics.

comment:2 Changed 8 years ago by Antoine Martin

Hah, I think I got it, from the log:

wimpiggy.selection.AlreadyOwned
removing socket /home/doole/.xpra/reorx-1

When you tried to start the server again, it failed because one was already running. During cleanup, it deleted what it thought was its own socket, which was in fact the socket of the active server.

Until I can fix this bug, just make sure you don't start the server again if one is already running, or use the "--use-display" flag, which will force the old server to exit and start the new one.

comment:3 Changed 8 years ago by Antoine Martin

Hmmm, looking at the code, that cannot be the case:

create_unix_domain_socket

is called before we register the cleanup_socket handler, so this would have failed if the socket still existed... So something must be deleting that socket...
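The ordering argument above can be illustrated with a minimal sketch in modern Python 3. This is not xpra's actual code: the function name `create_socket_then_register_cleanup` and the use of `atexit` are assumptions made for illustration, and only `create_unix_domain_socket` and `cleanup_socket` are names taken from the ticket. The point is that `bind()` fails when the path already exists, so a second server never gets far enough to register a cleanup that could delete the first server's socket.

```python
import atexit
import os
import socket


def create_socket_then_register_cleanup(path):
    """Bind a Unix domain socket; register cleanup only after bind succeeds.

    Hypothetical helper for illustration, not xpra's implementation.
    """
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # bind() raises OSError (EADDRINUSE) if the path already exists,
        # so a second server stops here, before any cleanup is registered.
        listener.bind(path)
    except OSError:
        listener.close()
        raise
    listener.listen(5)

    def cleanup_socket():
        # Only reached by the process that created the socket file.
        if os.path.exists(path):
            os.unlink(path)

    atexit.register(cleanup_socket)
    return listener, cleanup_socket
```

Under these assumptions, a failed second start leaves the existing socket file untouched, which is why the deletion has to come from somewhere else.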

Both logs say that there is already a window manager running for the :1 session, can you try with a clean/brand new display number, clean logs and "-d all"?

comment:4 Changed 8 years ago by Antoine Martin

And maybe even run a script every few minutes to check the state of the socket in .xpra.

# Record the state of the socket directory once a minute
while true; do
  date;
  ls -la ~/.xpra;
  xpra list;
  sleep 60;
done

And pipe that to a logfile to check in the morning?
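As a complement to the loop above, here is a hedged sketch of a check that only inspects the filesystem and never connects to the server. It is written in Python 3, and the function names `socket_state` and `log_socket_state` are hypothetical. It records whether the socket path is missing, is a socket, or is something else, with a timestamp on each line.

```python
import os
import stat
import time


def socket_state(path):
    """Classify a path without opening a connection to it."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


def log_socket_state(path, logfile, now=None):
    """Append one timestamped line about the socket path to logfile."""
    # now=None means "use the current time".
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    line = "%s %s %s\n" % (stamp, path, socket_state(path))
    with open(logfile, "a") as f:
        f.write(line)
    return line
```

Calling `log_socket_state(os.path.expanduser("~/.xpra/reorx-1"), "socket.log")` once a minute, with `time.sleep(60)` between calls, would narrow down when the socket file disappears.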

Maybe a cron job is running at a specific time that breaks something? Or maybe the DHCP lease is renewed (shouldn't matter - but maybe the hostname or domainname changes?), or the network drops for a few seconds, or...

comment:5 Changed 8 years ago by Doug Doole

I needed to do a reboot of my server machine, so I was able to run the script on a completely fresh machine. As luck would have it, it failed fairly quickly.

The other thing I realized is that I had a cron task to restart xpra if it failed. I stopped that job.

I have attached two files:

  • xpra.track is the output of your script. (I added in a "host reorx" call as well, just in case the IP address of the machine changed. It didn't.)
  • reorx-1.log.20120430 is the log from start to failure. The interesting observation here is that a whole bunch of "New connection received" messages came in, but I wasn't creating any new connections. (I just have the one active connection from the client.)

Also, although "xpra list" is saying there are no sessions, xpra is still running at the server:

doole@reorx:~$ ps -ef | grep xpra
doole     2035     1 31 13:59 ?        00:08:53 /usr/bin/python /opt/xpra/bin/xpra start :1
doole     2505  2504  0 14:01 ?        00:00:00 /usr/bin/python /opt/xpra/bin/xpra _proxy :1

The client session that I had established before the failure is still usable (I can even spawn new windows from within the session.)

Last edited 8 years ago by Antoine Martin (previous) (diff)

Changed 8 years ago by Doug Doole

Attachment: xpra.track added

Changed 8 years ago by Doug Doole

Attachment: reorx-1.log.20120430 added

comment:6 Changed 8 years ago by Doug Doole

Shoot, I forgot "-d all" when I started xpra on that last run. Trying again...

Changed 8 years ago by Doug Doole

Attachment: reorx-1.log.21020430-2 added

comment:7 Changed 8 years ago by Doug Doole

Just added reorx-1.log.20120430-2. This is the server log with "-d all". (It looks like whatever the problem is happens fairly quickly.)

At the end of the log there will be a bunch of events from when I touched the window managed by xpra. This occurred after the server lost track of the session.

comment:8 Changed 8 years ago by Doug Doole

I just tried another run. This time I didn't establish any client connection and the server still failed. (I have the server log with "-d all" if you think it will be useful.)

In all the runs I did today, the failure seems to happen about 20 minutes after starting the server. (That may have been the case as well last week, but since the failure doesn't kill an established connection, I wouldn't have noticed.)

comment:9 Changed 8 years ago by Antoine Martin

Priority: major → critical

OK, that was easy with the reorx-1.log.20120430 logfile (the earlier reorx-1.log files were no good because they were overwritten when starting/trying-to-start a new server):

(...)
New connection received
too many connections (20), ignoring new one

Something keeps connecting (every minute from what you are saying, so this will be the script above) but fails to disconnect (or we just miss the event?), which eventually causes the server to refuse new connections. The "xpra list" client then comes in, fails to connect, decides that the socket is dead and deletes it!

What I need to do:

  • ensure connections either become the live one or are dropped after a few seconds (that should already be the case - bug)
  • try to distinguish a rejected connection from a dead socket (and only mark it as dead if it is the latter)
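
The second item can be sketched as follows, in Python 3. This is an illustration under stated assumptions, not the change that went into xpra: `probe_socket` and `remove_if_dead` are hypothetical names, and the sketch assumes that a server which is merely busy or rejecting clients still lets `connect()` succeed, whereas a stale socket file with no listener yields `ECONNREFUSED`.

```python
import os
import socket

LIVE, DEAD, MISSING, UNKNOWN = "live", "dead", "missing", "unknown"


def probe_socket(path, timeout=1.0):
    """Classify a Unix domain socket path by trying to connect to it."""
    if not os.path.exists(path):
        return MISSING
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
    except ConnectionRefusedError:
        # The file exists but nothing is listening: a stale socket.
        return DEAD
    except FileNotFoundError:
        # The path vanished between the existence check and connect().
        return MISSING
    except OSError:
        # Timeout, permission problem, etc.: not proof that it is dead.
        return UNKNOWN
    else:
        # connect() succeeded, so a server is listening, even if it
        # goes on to reject or drop this particular connection.
        return LIVE
    finally:
        s.close()


def remove_if_dead(path):
    """Delete the socket file only when it is provably stale."""
    if probe_socket(path) == DEAD:
        os.unlink(path)
        return True
    return False
```

The design choice here is to treat every ambiguous outcome as "leave the file alone", so that a client failing to connect never removes the socket of a running server.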
Last edited 8 years ago by Antoine Martin (previous) (diff)

comment:10 Changed 8 years ago by Antoine Martin

Resolution: fixed
Status: accepted → closed

Fixed in r778: there was a race in the network protocol threading code, making the server not unregister connections... and leading to a DoS.
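
The ticket does not show the race itself, so the following is only a generic Python 3 sketch of the property the fix needs to restore: every registered connection is unregistered exactly when its handler ends, whatever thread it runs on. The class `ConnectionRegistry` and the function `handle_connection` are hypothetical names, not xpra code; the limit of 20 is the figure from the "too many connections (20)" log message.

```python
import threading

MAX_CONNECTIONS = 20  # limit reported in the server log


class ConnectionRegistry:
    """Thread-safe set of active connections with an upper bound."""

    def __init__(self, limit=MAX_CONNECTIONS):
        self._lock = threading.Lock()
        self._connections = set()
        self._limit = limit

    def register(self, conn):
        """Add conn; return False if the limit is already reached."""
        with self._lock:
            if len(self._connections) >= self._limit:
                return False
            self._connections.add(conn)
            return True

    def unregister(self, conn):
        """Remove conn; safe to call more than once."""
        with self._lock:
            self._connections.discard(conn)

    def count(self):
        with self._lock:
            return len(self._connections)


def handle_connection(registry, conn, work):
    """Run work(conn), always unregistering conn afterwards."""
    if not registry.register(conn):
        return False
    try:
        work(conn)
    finally:
        # Runs on success and on error, so slots are never leaked.
        registry.unregister(conn)
    return True
```

With this shape, short-lived clients such as repeated "xpra list" calls cannot exhaust the connection slots, because each slot is released as soon as its handler returns or raises.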

This may warrant a 0.2.1 release.

To cause this bug with 0.2.0, simply run:

while true; do xpra list; sleep 0.1;done
Last edited 8 years ago by Antoine Martin (previous) (diff)

comment:11 Changed 8 years ago by Antoine Martin

Oops, you will also need r779.

comment:12 Changed 8 years ago by Doug Doole

I have confirmed that r779 fixes the problem. Thanks.

One thing is still puzzling me though: when I first saw the problem, what was causing all the connection attempts that would mess up the server? The only external connection attempt should have been my client establishing a session, and that wouldn't happen more than a few times in a 24-hour period.

So where did the rest of the connection attempts come from?

comment:13 Changed 8 years ago by Antoine Martin

If you are running winswitch, it will run the equivalent of "xpra list" regularly (in particular whenever there is umtp traffic, login/logout, etc)

comment:14 Changed 8 years ago by Doug Doole

I'm not using winswitch - just xpra by itself.

Looking at the log, I don't see a lot of connection attempts. So while you've fixed the problem, I'm just curious how I hit it. (Oh well, not a big deal.)
