xpra icon
Bug tracker and wiki

Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#257 closed defect (fixed)

can't reattach after losing connection

Reported by: Smo Owned by: Antoine Martin
Priority: blocker Milestone: 0.9
Component: core Version: 0.8.x
Keywords: Cc:

Description

When using windows client 0.8.2 after losing connection to the network I can't connect to the running xpra session again

2013-02-12 00:47:50,665 New connection received: SocketConnection(('1.2.3.100', 1200) - ('1.2.3.4', 61569))
2013-02-12 00:47:51,076 Handshake complete; enabling connection
2013-02-12 00:47:51,078 Python/Gtk2 win32 client version 0.8.2 connected from 'somecomputer'
2013-02-12 00:47:51,078 using x264 as primary encoding
2013-02-12 00:47:51,078 max client resolution is 1366x768 (from [[1366, 768]]), current server resolution is 1440x900
2013-02-12 00:47:51,079 best resolution matching 1366x768 is unchanged: 1440x900
2013-02-12 00:47:51,080 setting key repeat rate from client: 500ms delay / 33ms interval
2013-02-12 00:47:51,082 keyboard mapping already configured (skipped)
2013-02-12 00:47:51,236 using sound codec mp3
2013-02-12 00:47:51,315 starting sound using pulseaudio device null.monitor
2013-02-12 00:47:51,595 codec: MPEG-1 Layer 3 (MP3)
2013-02-12 00:49:03,835 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:18,901 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:33,960 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:49,005 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:49,072 get_backlog found some damage acks that have been pending for too long, expiring them: [600, 601, 602, 603]
2013-02-12 00:50:04,058 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:50:04,123 get_backlog found some damage acks that have been pending for too long, expiring them: [604]
2013-02-12 00:50:11,085 New connection received: SocketConnection(('1.2.3.100', 1200) - ('1.2.3.4', 61598))
2013-02-12 00:50:11,458 Handshake complete; enabling connection
2013-02-12 00:50:11,459 Disconnecting existing client Protocol(SocketConnection(('1.2.3.100', 1200) - ('1.2.3.4', 61569))), reason is: new valid connection received, this session does not allow sharing
2013-02-12 00:50:31,370 unknown or invalid packet type: ping
2013-02-12 00:50:31,371 connection closed after 1 packets received (3382 bytes) and 0 packets sent (0 bytes)

Change History (17)

comment:1 Changed 7 years ago by Antoine Martin

Owner: set to Antoine Martin
Status: newaccepted

Need more details please, as per wiki/ReportingBugs: full command lines, etc
Which connection is the "good" one, etc.

comment:2 Changed 7 years ago by Smo

Server OS is Fedora 18 x86_64
Client OS is Windows 7 Ultimate 64 bit

Instructions to reproduce

  1. Start xpra on server with command dbus-launch xpra --bind-tcp=0.0.0.0:1200 --no-daemon --start-child=chromium-browser start :11
  1. Connect with xpra client with command C:\Program Files (x86)\Xpra>xpra_cmd attach tcp:1.2.3.100:1200
  1. Sever your connection on purpose and wait for xpra window to time out
  2. Reestablish your connection and connect back to xpra
Last edited 7 years ago by Smo (previous) (diff)

comment:3 Changed 7 years ago by Antoine Martin

I've been unable to reproduce on a Fedora 18 host, here's how I tested:

  • start xpra and bind to all interfaces:
    xpra start --bind-tcp=0.0.0.0:10000 :10
    
  • attach via local ethernet address, ie:
    xpra attach tcp:192.168.1.1:10000
    
  • block incoming and outgoing packets from/to that port on that specific interface:
    iptables -I INPUT -p tcp -d 192.168.1.1 --dport 10000 -j DROP
    iptables -I OUTPUT -p tcp -s 192.168.1.1 --sport 10000 -j DROP
    
  • connect via local loopback address (unfiltered by above):
    xpra attach tcp:127.0.0.1:10000
    
  • delete the iptables filters:
    iptables -D INPUT 1
    iptables -D OUTPUT 1
    

What I see is that the first client eventually times out (after about 60s, and it says so) and that the second client connects immediately without problems.

Can you confirm that you get the same results?

Now, the main difference between those two tests is the client (Fedora vs Windows), but since the connection is meant to have been severed (by wifi drop out in your test case or iptables rule in my example), I don't really see why this should make any difference at all.... Ideas welcome.

Are you using the UI on MS Windows? Can you try it from the command line instead?
Can you test with a plain xterm instead of Chrome?

Last edited 7 years ago by Antoine Martin (previous) (diff)

comment:4 Changed 7 years ago by Antoine Martin

Summary: 0.8.3 Can't reattach after losing connectioncan't reattach after losing connection
Version: trunk0.8.x

Also, please try a v0.7.8 client/server so we can figure out if this is a recent regression.

(fixed ticket metadata)

comment:5 Changed 7 years ago by Antoine Martin

Looking at the log messages again and something looks suspicious:

unknown or invalid packet type: ping

This happens after the old dead connection has been dropped, so it must be dropping the new connection at this point?
But why would it? r2740 will log the source of the invalid packet so we will know for sure (so please update to r2740 or later)
Also, please try both with normal logging and with "-d all" so we can get more details from the log messages.

Last edited 7 years ago by Antoine Martin (previous) (diff)

comment:6 Changed 7 years ago by alas

Last edited 7 years ago by alas (previous) (diff)

comment:7 Changed 7 years ago by Antoine Martin

comment:6 which has been deleted shows that you are running the server with --no-deamon and via putty, this is the source of the problem - just do not do that!

Xpra is a server process, running it with --no-deamon leaves it attached to stdin/stdout and those fds end up being network connections that you then kill - the process is then unable to close properly and the kernel will leave the sockets in dead state for a while and refuse to allow a new process to use them. The exact timeout after you kill -9 the process can be tuned via sysctl IIRC - but you should just never do that.

Alternatives: use screen, use nohup, don't use --no-deamon, etc..

Please confirm so I can close this ticket as invalid. (always posting the exact command lines and setup environment does save a lot of time)

comment:8 Changed 7 years ago by Antoine Martin

Status: acceptednew

comment:9 Changed 7 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo

Please see comment:5, it would be good to figure out where this message came from.

Apart from that, I believe this bug can be closed as invalid.

comment:10 Changed 7 years ago by Smo

As requested I tried some older clients I haven't tried older server but here are my findings.

Server Fedora 18 xpra-0.8.5-1.fc18.i386
Client Windows 7 ultimate 32 bit

Client command

Xpra_cmd.exe attach tcp:1.2.3.71:10000

Server command

xpra --no-daemon --bind-tcp=0.0.0.0:10000 --start-child=xterm start :12

Windows client 0.7.8
No problem
0.7.8 disconnects cleanly when it loses its connection

Windows client 0.8.7
2 tests

fails to disconnect cleanly xterm window remains open and non responsive after killing connection on the server end
Need to kill task in the task manager to continue. After reconnecting here is the client log

xpra client version 0.8.7
2013-02-28 10:36:43,048 server ping timeout - waited 60 seconds without a respon
se
2013-02-28 10:36:43,053 connection closed after 0 packets received (0 bytes) and
 5 packets sent (3493 bytes)
2013-02-28 10:36:43,058 Connection lost

And here is the log on the server end

New connection received: SocketConnection(('1.2.3.71', 10000) - ('1.2.3.57', 49187))
Handshake complete; enabling connection
Disconnecting existing client Protocol(SocketConnection(('1.2.3.71', 10000) - ('1.2.3.57', 49186))), reason is: new valid connection received, this session does not allow sharing
unknown or invalid packet type: ping
connection closed after 1 packets received (3373 bytes) and 0 packets sent (0 bytes)

As you can see that mysterious message is back but only with the newer version of the client.

Last edited 7 years ago by Antoine Martin (previous) (diff)

comment:11 Changed 7 years ago by Antoine Martin

Owner: changed from Smo to Antoine Martin
Status: newassigned

How do you disconnect the existing client, by making a new connection to the server?
I am confused because this ticket is originally about disconnection because of network dropouts, but the log you supply indicates a disconnection due to a new client connection - those are very different things. The former will rely on a timeout, which may take some time, the latter should be instantaneous.

Do you get the same behaviour if you Control-C the client?

Last edited 7 years ago by Antoine Martin (previous) (diff)

comment:12 Changed 7 years ago by Antoine Martin

Does this patch fix the issue?

--- src/xpra/server_source.py	(revision 2850)
+++ src/xpra/server_source.py	(working copy)
@@ -398,8 +398,10 @@
         self.window_sources = {}
         self.window_metdata_cache = {}
         self.close_mmap()
-        self.protocol = None
         self.stop_sending_sound()
+        if self.protocol:
+            self.protocol.close()
+            self.protocol = None
 
     def user_event(self):
         self.last_user_event = time.time()

This should make sure we disconnect from the server end, though I would have thought that the client would realize the connection is dead within 1 minute or so as it doesn't get the ping reply.

Also, I don't see why it would help as the server code already does this:

protocol.flush_then_close(["disconnect", reason])

But maybe the write flushing spins endlessly - in which case we may need yet more code to deal with that.. (make it timeout eventually)

Last edited 7 years ago by Antoine Martin (previous) (diff)

comment:13 Changed 7 years ago by Smo

The server log is from the new connection from the client log right above it.

I was removing my network cable to test this and then plugging it back in after the time-out to reconnect.

We only stumbled upon it since there is one guy that routinely just shuts his laptop lid and never disconnects his client causing the same issue.

Afterwards the server won't accept any new connections at all.

Oddly enough I built a new server from source today and using the windows client 0.8.4 and 0.8.7 that you built I'm unable to reproduce the problem.

As soon as I connect one of those clients to an older server 0.8.5 32bit that you have built as well I can reproduce it right away.

I guess I will do some further testing and let you know what I find out.

comment:14 Changed 7 years ago by Antoine Martin

Priority: majorblocker

It's a shame it's not very easy to reproduce, because otherwise we could bisect this..

(raising priority as this can cause serious problems for people)

comment:15 Changed 7 years ago by Antoine Martin

Resolution: fixed
Status: assignedclosed

Sorry it has taken so long to fix this, I was finding it hard to reproduce reliably, but in fact this was trivial using a VirtualBox VM and using "save machine state".

Fixed in r2878, will backport to v0.8.x and release the bug fix.

comment:16 Changed 7 years ago by Antoine Martin

Follow up:

  • r2879 is the backport to v0.8.x branch, was included in v0.8.8
  • r2883 prevents us from sleeping in the UI thread as we wait for the lock/queue
  • r2884 prevents a malicious client from deadlocking the UI thread
  • this bug was present in all recent versions, including trunk where it was first fixed (I think comment:13 is wrong on that)

comment:17 Changed 7 years ago by Antoine Martin

as of r2894, the server will kill client connections when it does not get a response to its pings (default: ping every 10 seconds, kill after 1 minute)

Note: See TracTickets for help on using tickets.