Opened 5 months ago

Closed 5 months ago

#1871 closed defect (fixed)

Cannot start xpra server with 8 nvidia GPUs

Reported by: yelantf
Owned by: yelantf
Priority: major
Milestone: 2.4
Component: server
Version: 2.3.x
Keywords:
Cc:

Description (last modified by Antoine Martin)

I'm using the newest version, 2.3.1-r19533, on Ubuntu 16.04 (Xenial),
but I cannot start a session on the server. I tried on two servers and both always give the same result. When I use the command

xpra start :2233

it always says

server failure: disconnected before the session could be established
server requested disconnect: server error (failed to start a new session)

And in the log file, I see

Fatal server error:
(EE) Server is already active for display 2233
	If this server is no longer running, remove /tmp/.X2233-lock
	and start again.
(EE) 

But I have already cleared all the processes whose names start with the letter 'X'...
I also checked journalctl after setting

DEBUG=auth,proxy,util,x11

in /etc/defaults/xpra. Everything seems fine there except these lines:

get_server_state: connect(/run/user/1031/xpra/mvig-2233)=[Errno 111] Connection refused
socket_details: '/run/user/1031/xpra/mvig-2233' state does not match (UNKNOWN vs LIVE)
get_server_state: connect(/home/yelantf/.xpra/mvig-2233)=[Errno 111] Connection refused
socket_details: '/home/yelantf/.xpra/mvig-2233' state does not match (UNKNOWN vs LIVE)
identify_new_socket new_sockets=()
start_server_subprocess failed
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/xpra/server/proxy/proxy_server.py", line 263, in proxy_session
proc, socket_path, display = self.start_new_session(username, uid, gid, sns, displays)
File "/usr/lib/python2.7/dist-packages/xpra/server/proxy/proxy_server.py", line 419, in start_new_session
proc, socket_path, display = start_server_subprocess(sys.argv[0], args, mode, opts, username, uid, gid, env, cwd)
File "/usr/lib/python2.7/dist-packages/xpra/scripts/main.py", line 1795, in start_server_subprocess
socket_path, display = identify_new_socket(proc, dotxpra, existing_sockets, matching_display, new_server_uuid, display_name, uid)
File "/usr/lib/python2.7/dist-packages/xpra/scripts/main.py", line 1890, in identify_new_socket
raise InitException("failed to identify the new server display!")
InitException: failed to identify the new server display!
Error: failed to start server subprocess:
failed to identify the new server display!

I don't know what's happening, and I have no idea how to start a new session in xpra now.

Attachments (3)

2235.log (763 bytes) - added by yelantf 5 months ago.
logfile
2235.log.old (2.1 KB) - added by yelantf 5 months ago.
another log file
_2235.log.old (6.0 KB) - added by yelantf 5 months ago.
.log.old changes after a while

Change History (11)

comment:1 Changed 5 months ago by yelantf

Component: changed from android to server
Keywords: new session added

comment:2 Changed 5 months ago by Antoine Martin

Description: modified (diff)
Keywords: new session removed
Milestone: 2.4
Owner: changed from Antoine Martin to yelantf

Everything seems fine there except these lines,
socket_details: '/home/yelantf/.xpra/mvig-2233' state does not match (UNKNOWN vs LIVE)
identify_new_socket new_sockets=()
start_server_subprocess failed

That's odd, I don't have such problems in my Xenial VM.
It looks like the server takes way too long to start up. The socket will be in the UNKNOWN state until the server finishes starting up, then it will be LIVE.
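
As an illustration of the socket states mentioned here, the following is a minimal Python 3 sketch of how a unix socket could be probed. It is not xpra's actual code: the function name `probe_socket_state` and the exact mapping of errors to states are assumptions made for this example, based only on the log lines above (a refused connection being reported as UNKNOWN).

```python
import errno
import os
import socket


def probe_socket_state(sockpath, timeout=5.0):
    """Classify a unix socket path as DEAD, LIVE, INACCESSIBLE or UNKNOWN.

    Hypothetical helper for illustration only, not part of xpra.
    """
    if not os.path.exists(sockpath):
        return "DEAD"  # no socket file at all
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(sockpath)
        return "LIVE"  # something accepted the connection
    except OSError as e:
        if e.errno == errno.EACCES:
            return "INACCESSIBLE"  # file exists but we may not connect to it
        # ECONNREFUSED ([Errno 111] on Linux) lands here: the socket file
        # exists but nothing is accepting yet, e.g. a server still starting.
        return "UNKNOWN"
    finally:
        sock.close()
```

With this sketch, a socket file whose server has not started listening yet is reported as UNKNOWN, which matches the "UNKNOWN vs LIVE" mismatch in the log.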

Can you post the /run/user/1031/xpra/:2233.log? (and the matching .bak file if there is one)
The timeout should be long enough for most systems.
I assume that you've tried other display numbers and that this makes no difference?
You can find the vfb left behind, if any, with:

ps -ef | grep "Xvfb"

(on other platforms grep for Xorg)
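
To check whether a lock file such as /tmp/.X2233-lock is stale, one option is to read the PID stored in it and test whether that process still exists. The Python 3 sketch below assumes the usual X server convention of a lock file containing the owner's PID as text; the helper names `x_lock_owner` and `pid_is_running` are made up for this example.

```python
import os


def x_lock_owner(display_number, lock_dir="/tmp"):
    """Return the PID recorded in the X lock file, or None if absent or unreadable."""
    path = os.path.join(lock_dir, ".X%d-lock" % display_number)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_running(pid):
    """Return True if a process with this PID exists.

    Signal 0 performs the existence and permission checks without
    actually sending a signal.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

If `x_lock_owner(2233)` returns a PID for which `pid_is_running` is False, the lock file is probably a leftover and can be removed as the X server error message suggests.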

comment:3 Changed 5 months ago by Antoine Martin

Summary: changed from "Cannot start xpra sever..." to "Cannot start xpra server..."

Changed 5 months ago by yelantf

Attachment: 2235.log added

logfile

Changed 5 months ago by yelantf

Attachment: 2235.log.old added

another log file

comment:4 in reply to:  2 Changed 5 months ago by yelantf

Replying to Antoine Martin:

Can you post the /run/user/1031/xpra/:2233.log? (and the matching .bak file if there is one)
I assume that you've tried other display numbers and that this makes no difference?

I tried another display number, :2235, and things are still the same. By '.bak file', do you mean .log.old? There are two logs. I have attached both of them here.

Changed 5 months ago by yelantf

Attachment: _2235.log.old added

.log.old changes after a while

comment:5 Changed 5 months ago by yelantf

From reading the .log.old file, I think the problem is caused by the GPUs on my server. As you can see in this file, there are 8 GPUs on the server. Maybe initializing them takes too long when I try to start a new session. I then started a session and monitored the .log.old file: it spends a lot of time scanning the GPU devices. After quite a long time, the file shows "xpra is ready". I then tried to connect and it worked. But if I try to list the sessions before "xpra is ready" is printed in the .log.old file, the session in the unknown state gets cleaned up...

comment:6 Changed 5 months ago by Antoine Martin

Summary: changed from "Cannot start xpra server..." to "Cannot start xpra server with 8 nvidia GPUs"

2018-06-10 15:06:55,377 Error importing swscale colorspace conversion (csc_swscale)
2018-06-10 15:06:55,378 libswscale.so.5: cannot open shared object file: No such file or directory
That's not normal. The swscale library is a dependency of the package, it should be installed.

From reading the .log.old file, I think the problem is caused by the GPUs on my server. As you can see in this file, there are 8 GPUs on the server. Maybe initializing them takes too long when I try to start a new session.

As per wiki/ReportingBugs: anything that would make the setup unusual - 8 GPUs definitely qualifies there.
That's definitely the problem, it takes over a minute to initialize all the GPUs. By the time this is finished, the proxy has assumed that the session failed and returns the error.
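
The failure mode can be pictured as a polling loop with a deadline. The Python 3 sketch below is only an illustration of that pattern under assumed names (`wait_until_live`, `probe`) and an assumed "LIVE" return value; it is not the proxy's actual implementation and the timeout values are not xpra's.

```python
import time


def wait_until_live(probe, timeout, interval=0.5):
    """Poll probe() until it returns "LIVE" or `timeout` seconds have elapsed.

    `probe` is any zero-argument callable returning a state string.
    Returns True if the server became LIVE in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe() == "LIVE":
            return True
        if time.monotonic() >= deadline:
            return False  # give up: the caller reports a failed start
        time.sleep(interval)
```

A server that needs more than `timeout` seconds to initialize is reported as failed even though it would have become usable shortly afterwards, which is what happens here with 8 GPUs.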

And since the NVENC runtime API version doesn't match the one used for building, nvenc fails anyway:

NVENCException: getting API function list - returned 15: This indicates that an invalid struct version was used by the client.

So all this initialization cost was for nothing.

So you have 2 options, and both should work:

  • try r19597, which increases the timeout and gives the user visual feedback that something is happening - beta builds for Xenial with this change can be found here: https://xpra.org/beta/ - you could also apply this change to the 2.3.x branch, or just use the existing env var to increase the timeout (needed for both the proxy server and the start command)
  • just turn off nvenc: --video-encoders=vpx,x264
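
For the second option, the full command would combine the start command from the description with the encoder list given here. A small Python 3 sketch that builds that command line follows; `build_start_command` is a hypothetical helper, and it only assembles the arguments without running xpra.

```python
import shlex


def build_start_command(display, video_encoders=("vpx", "x264")):
    """Return an argv list for `xpra start` limited to the given video encoders."""
    return [
        "xpra",
        "start",
        display,
        "--video-encoders=" + ",".join(video_encoders),
    ]


def as_shell_line(argv):
    """Render an argv list as a single shell-quoted command line."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

`as_shell_line(build_start_command(":2233"))` gives `xpra start :2233 --video-encoders=vpx,x264`, which leaves nvenc out of the list of video encoders.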

To fix nvenc, use the latest drivers from nvidia - but this will not fix the timeout.

BTW,

2018-06-10 15:18:57,826 setting keyboard layout to 'cn'

Does that work OK?

comment:7 in reply to:  6 Changed 5 months ago by yelantf

Replying to Antoine Martin:

2018-06-10 15:06:55,377 Error importing swscale colorspace conversion (csc_swscale)
2018-06-10 15:06:55,378 libswscale.so.5: cannot open shared object file: No such file or directory
That's not normal. The swscale library is a dependency of the package, it should be installed.

Well, so how do I install the swscale library? I upgraded xpra by installing the .deb files manually and saw no warnings or errors. But I only upgraded xpra with the latest .deb file (without upgrading the other dependency packages). In /usr/lib/xpra, I find three files: libswscale.so, libswscale.so.4 and libswscale.so.4.6.100. There is no libswscale.so.5.
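
A check like the one described here can be scripted. The Python 3 sketch below lists the `lib<name>.so*` files in a single directory; `find_sonames` is a hypothetical helper, and it only looks at the directory it is given, not at every path the dynamic loader searches.

```python
import glob
import os


def find_sonames(libdir, name="swscale"):
    """Return the sorted file names matching lib<name>.so* in `libdir`."""
    pattern = os.path.join(libdir, "lib%s.so*" % name)
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def is_missing(libdir, soname):
    """Return True if `soname` (e.g. "libswscale.so.5") is not in `libdir`."""
    return not os.path.exists(os.path.join(libdir, soname))
```

For the directory listing above, `is_missing("/usr/lib/xpra", "libswscale.so.5")` would be True, consistent with the import error in the log.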

So you have 2 options, and both should work:

  • try r19597, which increases the timeout and gives the user visual feedback that something is happening - beta builds for Xenial with this change can be found here: https://xpra.org/beta/ - you could also apply this change to the 2.3.x branch, or just use the existing env var to increase the timeout (needed for both the proxy server and the start command)
  • just turn off nvenc: --video-encoders=vpx,x264

OK, I'll simply go with the second option.

BTW,

2018-06-10 15:18:57,826 setting keyboard layout to 'cn'

Does that work OK?

I found no difference between 'cn' and 'us'. But neither keyboard layout is able to transfer Chinese characters produced by Chinese input methods. TeamViewer can do this, so I am wondering if it is possible to support that in xpra.

Thank you for your generous help anyway.

comment:8 Changed 5 months ago by Antoine Martin

Resolution: fixed
Status: changed from new to closed

Well, so how do I install the swscale library? I upgraded xpra by installing the .deb files manually and saw no warnings or errors. But I only upgraded xpra with the latest .deb file (without upgrading the other dependency packages).

Well, again this qualifies as a non-standard installation method and should be reported.
Don't do that; let apt-get update things as needed. There is an updated ffmpeg package in the repository with the newer swscale library version that xpra is linked against.

I am wondering if it is possible to support that in xpra.

I'm sure it's doable, no idea how though.

I am closing this ticket as fixed, feel free to re-open if you still have problems.
