Xpra: Ticket #1447: Server does not accept initial connection on remote start.

Starting xpra with xpra start ssh:....

Server on remote machine is started, but this is being spammed continiously into the log:

2017-02-21 21:37:05,892 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:05,991 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:05,992 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:05,993 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,092 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,093 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,094 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,193 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,194 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,194 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,294 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,295 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,296 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,396 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,396 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,397 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,496 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,497 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,498 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,597 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,598 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,599 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,699 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,699 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,700 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,800 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,800 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,802 New unix-domain connection received on /var/run/xpra/hostname-100

Client soon exits, log spamming continues until I connect with xpra attach. Then server's log is truncated and replaced with this message:

(EE)
Fatal server error:
(EE) Server is already active for display 100
	If this server is no longer running, remove /tmp/.X100-lock
	and start again.
(EE)
(EE)
Please consult the The X.Org Foundation support
	 at http://wiki.x.org
 for help.
(EE)
2017-02-21 21:37:21,079
2017-02-21 21:37:21,079 Xvfb command has terminated! xpra cannot continue
2017-02-21 21:37:21,079  if the display is already running, try a different one,
2017-02-21 21:37:21,079  or use the --use-display flag
2017-02-21 21:37:21,079
2017-02-21 21:37:21,080 killing xvfb with pid 14651
2017-02-21 21:37:21,080 failed to kill xvfb process with pid 14651:
2017-02-21 21:37:21,080  [Errno 3] No such process


Wed, 22 Feb 2017 04:06:17 GMT - Antoine Martin: owner changed; milestone set

I did find a bug, fixed in r15133, but this would have caused the client side to also emit a large number of warning messages to stdout - did you not get those?

Please close if that fixes things.


Tue, 28 Feb 2017 09:39:21 GMT - psycho_zs: status changed; resolution set

r15180 works well on remote start, no log spam on both ends.


Tue, 28 Feb 2017 18:22:14 GMT - psycho_zs: attachment set


Tue, 28 Feb 2017 18:22:23 GMT - psycho_zs: attachment set


Tue, 28 Feb 2017 18:23:28 GMT - psycho_zs: status changed; resolution deleted

If server has leftovers from previous run, client can not initiate remote start and same log spam happens on server. Attached server log and client output.


Wed, 01 Mar 2017 10:22:59 GMT - Antoine Martin:

I'm not sure what you mean by "has leftovers from previous run", or how I would reproduce this.

Assuming that you hit the case where a server was already running on the display that you tried to start remotely, one could argue that the code is working as intended by not matching this server, but r15190 will now allow us to match it and connect to it. (could be backported I guess - not sure that's necessarily right)


Wed, 01 Mar 2017 15:54:52 GMT - psycho_zs:

Sorry, I'm speaking about sockets:

/var/run/user/1000/xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:05,310 created unix domain socket: /var/run/user/1000/xpra/server-100
/home/user/.xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:09,315 created unix domain socket: /home/user/.xpra/server-100
/var/run/xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:13,321 created unix domain socket: /var/run/xpra/server-100

Maybe client doesn't wait for this process to finish?


Wed, 01 Mar 2017 15:59:25 GMT - Antoine Martin: owner, status, milestone changed

Those sockets are usually left behind when a server crashes. Those messages were not in the log samples your had provided.

Fixing this will be harder, we must wait in case the server is still alive, and the client also needs to give up after waiting too long... What we should do in this case is wait for all the sockets in parallel to save time. This will be a bigger change, too late for this version.


Wed, 01 Mar 2017 16:06:39 GMT - psycho_zs:

I copied that snippet directly from the attached server.log )

Viewing server and client log live it seemed that client gives up just at the same time server finises dealing with sockets. So a dirty fix would be to give the client a couple more seconds of timeout.

I will try catching the crashes.


Wed, 01 Mar 2017 16:07:48 GMT - Antoine Martin:

I copied that snippet directly from the attached server.log )

Sorry, yes - I missed it!


Sun, 19 Mar 2017 05:21:19 GMT - Antoine Martin:

r15278 raised the socket timeout to 20s (r15300 for v1.0.x branch). This will help until we can do parallel socket probing.


Fri, 24 Mar 2017 13:19:27 GMT - Antoine Martin: owner changed

It's not really done in parallel as that would be too hard to implement, but r15381 should make things more reliable. I've struggled to test this properly because if I "kill -9" we correctly detect that the socket is dead, and if I just hang the server then the socket still looks alive..

@psycho_zs: does that work for you? how can I test hung sockets?


Fri, 24 Mar 2017 14:38:43 GMT - psycho_zs:

I was left with old sockets some time ago when xpra server crashed sometimes. Now it does not, so I do not know how to test this situation now. I will test remote start in the next couple of days.


Sat, 25 Mar 2017 19:11:21 GMT - psycho_zs:

In 2.0: killing all xpra-related processes with kill -9 leaves socket $HOME/.xpra/$SERVERNAME-$PORT. Trying to remote-start after that produces described behavior despite increased timeouts. It also seems that "New unix-domain connection received on..." message spam is continued to removed log with old file descriptor (seen if tailing log throughout the whole connection attempt).

I have troubles building current trunk, it does not seem to produce some files listed in xpra.install, like etc/xpra/xorg.conf, usr/lib/tmpfiles.d/xpra.conf. Would you add a fresh beta package?


Sun, 26 Mar 2017 11:01:39 GMT - Antoine Martin:

Older branches will probably not be patched - as I can't think of a way to make things better without changing too much code.


Sun, 26 Mar 2017 15:56:37 GMT - psycho_zs:

r15425 seems to be working even with leftover sockets. Although it failed once on the first test, right after upgrade. (I killed xpra processes before upgrading).


Mon, 27 Mar 2017 05:48:50 GMT - Antoine Martin: status changed; resolution set

Closing, feel free to re-open if you can give steps to reproduce problems with trunk.


Tue, 28 Mar 2017 15:12:38 GMT - psycho_zs: status changed; resolution deleted

I reproduced the situation in which client again does not wait long enough to connect. Xpra server did not survive dbus restart after recent system upgrade, leaving behind xpra _proxy_start :100 process and and X. So I killed everything with killall -u username and did remote start via xpra start ssh:.... Server waited for 3 sockets, client disconnected just before server finished waiting on them:

Warning: some of the sockets are in an unknown state:
 /var/run/user/1000/xpra/servername-100
 /home/username/.xpra/servername-100
 /var/run/xpra/servername-100

Full server log in attachment...


Tue, 28 Mar 2017 15:16:16 GMT - psycho_zs: attachment set


Wed, 29 Mar 2017 05:30:33 GMT - Antoine Martin:

Sorry but I cannot reproduce this. Please try to provide specific steps.


Mon, 03 Apr 2017 07:13:01 GMT - psycho_zs:

With new initialization mechanism are there any potential race conditions left?

I was able to reproduce it the second time: On server with running xpra restart dbus with related daemons

systemctl restart dbus ; systemctl daemon-reexec ; systemctl restart systemd-logind systemd-journald NetworkManager

kill all user processes:

killall -u $USER

Try to do xpra start ssh:...

Client gives up and disconnects somewhere around the time server prints "created unix domain socket" messages. Attach works a moment later.


Mon, 03 Apr 2017 15:52:15 GMT - Antoine Martin:

Timeouts are inherently racy. It should be enough time for the server to start, even when waiting and then clearing the sockets, but there's no way we can be sure of that. And if the server is quite slow, it won't be enough. (add to that the ssh login delay) Raising the client timeout indefinitely is not a good solution either. It is already at 15 seconds, which is quite a long time. Are you sure your 2.1 server takes that long to start? (are you even testing with 2.1?)

No need to run "xpra shadow ssh:..." which just complicates things. Just kill it and watch it start with "xpra shadow :0 --no-daemon" - see how long that takes. The output should include timestamps, please post that information. (or the server log file if you wish) Without hard data, it is difficult to know where to go.

Example on my relatively beefy system:

2017-04-03 22:53:23,078 created unix domain socket: /run/user/1000/xpra/desktop-10
2017-04-03 22:53:23,078 created unix domain socket: /home/antoine/.xpra/desktop-10
2017-04-03 22:53:23,079 created unix domain socket: /var/run/xpra/desktop-10
(..)
2017-04-03 22:53:24,038 xpra is ready.

So that's 1 second from the socket creation until fully ready. (and the socket timeout + clearing should be around 8 seconds maximum)


Mon, 03 Apr 2017 17:16:57 GMT - psycho_zs:

See server_3sockets.log​, it is the server log from the similar situation as described in my previous comment. 2.1-20170326r15425-1


Mon, 03 Apr 2017 17:28:14 GMT - Antoine Martin:

See server_3sockets.log​, it is the server log from the similar situation as described in my previous comment.

That log is against a 2.1-r15340 server, therefore missing any of the fixes from comment:10 onwards. We want the timing information of the current code.


Mon, 03 Apr 2017 17:33:32 GMT - psycho_zs:

I've just rechecked, the package version is 2.1-20170326r15425-1, but the log reports 2.1-r15340. OK, I'll try a fresher one.


Mon, 03 Apr 2017 17:47:24 GMT - psycho_zs:

r15494 package matches. Attaching log... Remote start after dbus restart and killed user processes worked. (But there were no substantial system upgrade preceding it).


Mon, 03 Apr 2017 17:47:42 GMT - psycho_zs: attachment set


Tue, 04 Apr 2017 06:24:31 GMT - Antoine Martin:

OK, so in this log we can see:

2017-04-03 20:44:01,438 Warning: some of the sockets are in an unknown state:
19	2017-04-03 20:44:01,438  /var/run/user/1000/xpra/servername-100
20	2017-04-03 20:44:01,438  /home/username/.xpra/servername-100
21	2017-04-03 20:44:01,439  /var/run/xpra/servername-100
22	2017-04-03 20:44:01,439  please wait as we allow the socket probing to timeout
23	2017-04-03 20:44:09,464 created unix domain socket: /var/run/user/1000/xpra/servername-100
24	2017-04-03 20:44:09,465 created unix domain socket: /home/username/.xpra/servername-100
25	2017-04-03 20:44:09,465 created unix domain socket: /var/run/xpra/servername-100
(..)
41	2017-04-03 20:44:12,034 xpra is ready.

So that's:

For a total of 11 seconds, which is well under the client timeout of 15 seconds.. r15508 reduces from 8 to just 6 seconds, which should still be enough to distinguish between a server that is starting up (3 seconds in your case) and one that is just dead.


Thu, 11 May 2017 04:59:40 GMT - Antoine Martin:

@psycho_zs: can I close this ticket or can you reproduce with r15508 or later?


Thu, 11 May 2017 08:11:55 GMT - psycho_zs:

Thanks. I'll try to check in the next couple of days. Busy month.


Sun, 14 May 2017 07:23:32 GMT - psycho_zs:

Xpra from latest svn can not start server

xpra start :100 -d all
using systemd-run to wrap 'start' server command
'systemd-run' '--description' 'xpra-start' '--scope' '--user' '/usr/bin/xpra' 'start' ':100' '-d' 'all' '--systemd-run=no'
Failed to create bus connection: No such file or directory

Latest deb build 2.1-20170409r15542-1 seem to be working, at least it connected successfully with leftover sockets on server a couple of times. First timestamp visible in client output: 10:19:10 First timestamp visible in server log: 10:19:13 "Xpra is ready" message on server: 10:19:21 Handshake complete: 10:19:22


Sun, 14 May 2017 08:17:35 GMT - Antoine Martin: status changed; resolution set

The timeout issue is fixed so I am closing this bug.

If you are having systemd startup problems, please file a separate ticket. The exact systemd-run command we use is printed out, if that fails then it is quite unlikely to be an xpra bug.


Tue, 20 Jun 2017 16:18:40 GMT - Antoine Martin:

Saw something similar in ticket:1537#comment:3.


Sat, 23 Jan 2021 05:24:33 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/1447