After playing around with wiki/Profiling, I've identified some areas where tuning could lead to some significant improvements:
Protocol.encode
calling Compressed.__len___
fixed in r7021
error.XErrorToName
called when the result is generally not used (not very expensive, but used very often!) fixed in r7022
region.contains
improved for python>=2.5 in r7025 (+fix in r7027)
Not done yet:
WindowVideoSource.get_best_encoding()
costs 10ms! (including 5ms for get_encoding_options()
!)
get_current_speed()
and get_current_quality()
cost about 1ms each, but we call them a lot
WindowMode.get_dimensions()
: 2ms
region.add_rectangle
: 2.5ms!
send_cursor
: we should compress before sending, not in the event loop!
set_client_properties
: 30ms! (huge!)
example profiling output taken at r7031
The graph attached was generated using:
./tests/scripts/pycallgraph -d 10 \ -e libs,std,misc,one_offs,net,logging,x11,connection,keyboard,clipboard,sound,mouse \ -- start :10 ...
Progress:
get_current_speed()
and get_current_quality()
removed in r7030 (+fix in r7042)
get_best_encoding
, we should pre-calculate the options, including encoding specific things like the max_webp
value - not sure this can be done for this release. Maybe we can at least catch the special case of send_full_window_update
as seen in the graph above?
send_cursor
is now doing the compression work later and per-client, which also allows us to use lz4 or lzo if supported
Some improvements on get_encoding_options
(try to pre-calculate some) in r7178, often much faster for the non-video case - the video case remains harder..
choose the encoding using a scoring mechanism
Improvements in r7203 and r7204.
The patch above is much nicer than the current code, but unfortunately it is not faster (yet? why not!?) and it is missing some features.
New important bottleneck identified (and expected): the rectangle merging/splitting code. It works fine with a few rectangles of limited size but really slows us down with more demanding tests. Needs to be cythonized.
fast encoding selection is no longer the bottleneck
Encoding selection is now much improved in r7215, to the point that this is no longer the bottleneck (though get_best_encoding_default
method could still be improved as it is now a bit ugly and inefficient), as can be seen here (synthetic benchmark - which is hiding the real problem with rectangles..):
I liked the scoring method from the patch above as it allows for smoother encoding transitions than the arbitrary limits set in the current code, maybe we can merge it without slowing things down too much? Or maybe only use it beyond a certain size, to make small regions a fast path (as there is likely to be many more of those and choosing the right encoding is less important and easier: rgb often does a fine job for those)
rectangles can become really slow
The new target is the rectangle code, which becomes dreadfully slow as we get more rectangles in the list:
Two things we should do:
profiling of rectangle code only
Whereas the "rectangle slow" screenshot was cropped from a larger capture, this one is captured using a dedicated profiling command:
./tests/scripts/pycallgraph -d 10 \ -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \ -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio
And the result is much better:
Is it a heisenbug where the profiling code slows down the rectangle loops so much that it becomes a problem? (the first capture has hundreds of expressions, the second one only the 6 shown above)
I think it is good enough now, please run the profiling as per comment:1 and comment:6 with the scenarios you are interested in, and post the resulting file pycallgraph-xpra.png
.
And look for any unusually high time per call (usually highlighted by a different colour).
It would be worth testing the video region code too with just: -I *subregion*
, assuming #410 isn't in too bad a shape.
I've updated my pycallgraph to v1.0.1 like it is listed in your graphs.
Following the commands in comment:1 and comment:6 I've ran a few tests this way and I've only once got pycallgraph-xpra.png
in my local directory.
I don't see any errors but I also don't get the resulting graph. What am I doing wrong ?
Please post the exact commands that you used for both the server and the client, the command output, etc.
Should be closed for 0.14, raising to critical.
Should be closed against milestone 0.14
I'm running the server with
./tests/scripts/pycallgraph -d 10 \ -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \ -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400
Then i'm connecting with a windows client
C:\Program Files (x86)\Xpra>Xpra_cmd.exe attach tcp:1.2.3.4:1400
I do a few things in the xterm then disconnect afterwards I stop the server with ctrl+c and there is often no .png in the current directory so i'm not sure what i'm doing wrong
Don't control-c the server, try "xpra stop" or when profiling the client, just exit from the tray.
Tried again using xpra stop instead. I got pycallgraph-xpra.png
this time but it was exactly the same as the previous one attached to this ticket.
I used the same commands at comment:12
Please re-assign tickets to me if I am supposed to act on them.
Did you wait for at least 10 seconds, which is the initial delay set? You should then see this message in the server output:
starting trace
If that still does not work for you, please post full versions, log output, etc..
Maybe we do need to optimize intersect_rect
: see ticket:670#comment:6.
converts the region code to use a cython class, doubling the speed
The reason why the patch above only doubles the speed is because most of the expensive operations still work on python lists of objects (untyped). That's not an easy problem to solve. One way would be to use numpy arrays for the list of rectangles, but we would lose the nice encapsulation.
r9146 + r9147 re-implements the rectangle code using cython, similar to the patch above but with more code unrolling.
Also adds a test to measure the impact of this change and the results are good (elapsed time - lower is better):
Test | 0.14 | 0.15 | Improvement |
---|---|---|---|
R1 create | 0.26 | 0.07 | 3.7x |
R1 add | 0.93 | 0.19 | 4.8x |
R1 remove | 0.44 | 0.14 | 3.1x |
R1 contains | 0.08 | 0.08 | 1x |
R2 create | 0.10 | 0.03 | 3.3x |
R2 add | 0.48 | 0.10 | 4.8x |
R2 remove | 0.17 | 0.05 | 3.4x |
R2 contains | 0.08 | 0.08 | 1x |
merge all | 2.09 | 0.02 | 104x |
@afarr: I am re-assigning to you because you guys need to be able to run the profiler. I have just ran it again, and now that the code is in cython, we don't get as much output - but we still get something..
Small bug fixed in r9282, makes merge_all a tiny bit slower, but we use it more and it is still plenty fast. (backport to v0.15 in r9283)
Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.
[tlaloc@jimador src]$ xpra --version 2015-06-02 17:34:20,115 rencode at '/usr/lib64/python2.7/site-packages/rencode/__init__.pyc' lacks versioning information xpra v0.16.0
I suppose I'm missing some other detail (probably fairly obvious), because when I try to run with the same command as Smo from comment:12 I get the following traceback:
[tlaloc@jimador src]$ ./tests/scripts/pycallgraph -d 10 \ > -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \ > -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400 Traceback (most recent call last): File "./tests/scripts/pycallgraph", line 7, in <module> from pycallgraph import PyCallGraph, Config ImportError: No module named pycallgraph
Checking from the src directory, I do indeed see what appears to by a pycallgraph in tests/scripts ... so - does this look like an issue related to not having rencode versioning?
[tlaloc@jimador src]$ sudo yum install python-rencode [sudo] password for tlaloc: Loaded plugins: langpacks https://winswitch.org/beta/Fedora/21/x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found Trying other mirror. Package python-rencode-1.0.3-1.fc21.x86_64 already installed and latest version Nothing to do
Trying to install python3-rencode (which I see in the dists directory for Fedora 21) I get apparent success:
Installed: python3-rencode.x86_64 0:1.0.3-1.fc21
... maybe not the right version?
Ok, Smo held my hand through the install of the pycallgraph (didn't realize it needed a separate install).
Was able to generate a graph of a session with xterm start child launching a firefox window... opening a couple of tabs and watching video - with a 0.15.0 client connected to a 0.16.0 server. Not that I'm sure exactly what the chart indicates.
pycallgraph experiment
Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.
python-rencode
is in the fedora repo, if you have this repo installed, simply doing a yum update
will get you the up to date rencode version, which will get rid of this warning.
I think I finally found a rencode library that at least causes no complaints.
Ran another test, got the same results as above, so presumably the rencode wasn't at issue.
Looking through the wiki/profiling page clarifies a number of the parameters I was copying from your/Smo's example... but I'm not entirely clear on what bits list I would use to tweak the profiling.
As an experiment I tried a small tweak:
./tests/scripts/pycallgraph -d 10 -I "*rectangle*,*encoding*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" -- \ start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm
Which generated a considerably less interesting graph (just the _main_
grey box, really...) and also managed to produce some tracebacks/access violations client-side (prior to expected ctrl-c issues with win32 client): moved to #929.
So... is there another wiki page that outlines the structure and which "pycall"s should be expected along certain ... err... "usepaths"?
subregion pycallgraph test
Looking back through this ticket I noticed the -I *subregion*
suggested setting.
Running with the following command line:
./tests/scripts/pycallgraph -d 10 -I "*subregion*" -- \ start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm
0.15.4 r10055 win32 client against 0.15.4 r10082 fedora 21 server (smo's build) I was able to generate the following graph, after using a start-child xterm to start firefox and watch some video.
So is this sufficient "mastery" (to use the term very very loosely) to close this ticket and feel (vaguely) comfortable that I will be able to run this graphing should it be called for in the future? Or is there something glaringly obvious that I'm still missing?
(Hmm... and should I run the test again to produce non-reduced to fit the ticket results?)
Looking back through this ticket I noticed the -I *subregion* suggested setting.
What is included will depend on what we're looking for. subregion is a good one to look at.
I was able to generate the following graph
This will do for now.
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/620