Xpra: Ticket #620: performance optimizations from profiling

After playing around with wiki/Profiling, I've identified some areas where tuning could lead to some significant improvements:

Protocol.encode calling Compressed.__len___ fixed in r7021
error.XErrorToName called when the result is generally not used (not very expensive, but used very often!) fixed in r7022
region.contains improved for python>=2.5 in r7025 (+fix in r7027)

Not done yet:

WindowVideoSource.get_best_encoding() costs 10ms! (including 5ms for get_encoding_options()!)
get_current_speed() and get_current_quality() cost about 1ms each, but we call them a lot
WindowMode.get_dimensions(): 2ms
region.add_rectangle: 2.5ms!
send_cursor: we should compress before sending, not in the event loop!
set_client_properties: 30ms! (huge!)

Wed, 30 Jul 2014 10:30:48 GMT - Antoine Martin: attachment set

attachment set to damage-r7031.png

example profiling output taken at r7031

Wed, 30 Jul 2014 12:38:37 GMT - Antoine Martin: owner, status, description changed

owner changed from Antoine Martin to Antoine Martin
status changed from new to assigned
description modified (diff)

The graph attached was generated using:

./tests/scripts/pycallgraph -d 10 \
   -e libs,std,misc,one_offs,net,logging,x11,connection,keyboard,clipboard,sound,mouse \
   --
   start :10 ...

Progress:

get_current_speed() and get_current_quality() removed in r7030 (+fix in r7042)
for get_best_encoding, we should pre-calculate the options, including encoding specific things like the max_webp value - not sure this can be done for this release. Maybe we can at least catch the special case of send_full_window_update as seen in the graph above?
send_cursor is now doing the compression work later and per-client, which also allows us to use lz4 or lzo if supported

Thu, 07 Aug 2014 14:38:23 GMT - Antoine Martin:

Some improvements on get_encoding_options (try to pre-calculate some) in r7178, often much faster for the non-video case - the video case remains harder..

Fri, 08 Aug 2014 16:07:02 GMT - Antoine Martin: attachment set

attachment set to encoding-scoring-v2.patch

choose the encoding using a scoring mechanism

Fri, 08 Aug 2014 16:11:29 GMT - Antoine Martin:

Improvements in r7203 and r7204.

The patch above is much nicer than the current code, but unfortunately it is not faster (yet? why not!?) and it is missing some features.

New important bottleneck identified (and expected): the rectangle merging/splitting code. It works fine with a few rectangles of limited size but really slows us down with more demanding tests. Needs to be cythonized.

Sat, 09 Aug 2014 08:53:25 GMT - Antoine Martin: attachment set

attachment set to damage-r7215.png

fast encoding selection is no longer the bottleneck

Sat, 09 Aug 2014 08:58:22 GMT - Antoine Martin:

Encoding selection is now much improved in r7215, to the point that this is no longer the bottleneck (though get_best_encoding_default method could still be improved as it is now a bit ugly and inefficient), as can be seen here (synthetic benchmark - which is hiding the real problem with rectangles..):

I liked the scoring method from the patch above as it allows for smoother encoding transitions than the arbitrary limits set in the current code, maybe we can merge it without slowing things down too much? Or maybe only use it beyond a certain size, to make small regions a fast path (as there is likely to be many more of those and choosing the right encoding is less important and easier: rgb often does a fine job for those)

Sat, 09 Aug 2014 09:11:32 GMT - Antoine Martin: attachment set

attachment set to rectangle-slow.png

rectangles can become really slow

Sat, 09 Aug 2014 09:13:44 GMT - Antoine Martin:

The new target is the rectangle code, which becomes dreadfully slow as we get more rectangles in the list:

Two things we should do:

avoid creating so many small rectangles beyond a certain point
make rectangle calculations faster

Sat, 09 Aug 2014 14:19:44 GMT - Antoine Martin: attachment set

attachment set to rectangle-only.png

profiling of rectangle code only

Sat, 09 Aug 2014 14:23:25 GMT - Antoine Martin:

Whereas the "rectangle slow" screenshot was cropped from a larger capture, this one is captured using a dedicated profiling command:

./tests/scripts/pycallgraph -d 10 \
    -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
    --
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio

And the result is much better:

Is it a heisenbug where the profiling code slows down the rectangle loops so much that it becomes a problem? (the first capture has hundreds of expressions, the second one only the 6 shown above)

Mon, 11 Aug 2014 03:24:32 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to alas
status changed from assigned to new

I think it is good enough now, please run the profiling as per comment:1 and comment:6 with the scenarios you are interested in, and post the resulting file pycallgraph-xpra.png. And look for any unusually high time per call (usually highlighted by a different colour).

It would be worth testing the video region code too with just: -I *subregion*, assuming #410 isn't in too bad a shape.

Mon, 11 Aug 2014 19:06:12 GMT - Smo:

I've updated my pycallgraph to v1.0.1 like it is listed in your graphs.

Following the commands in comment:1 and comment:6 I've ran a few tests this way and I've only once got pycallgraph-xpra.png in my local directory.

I don't see any errors but I also don't get the resulting graph. What am I doing wrong ?

Mon, 11 Aug 2014 19:07:58 GMT - Smo: attachment set

attachment set to graph-fail.png

Tue, 12 Aug 2014 02:57:12 GMT - Antoine Martin: owner changed

owner changed from alas to Smo

Please post the exact commands that you used for both the server and the client, the command output, etc.

Sun, 17 Aug 2014 14:21:13 GMT - Antoine Martin: priority changed

priority changed from major to critical

Should be closed for 0.14, raising to critical.

Tue, 19 Aug 2014 03:34:40 GMT - Antoine Martin: milestone changed

milestone changed from 0.15 to 0.14

Should be closed against milestone 0.14

Tue, 19 Aug 2014 17:14:19 GMT - Smo:

I'm running the server with

./tests/scripts/pycallgraph -d 10 \
-I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
-- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400

Then i'm connecting with a windows client

C:\Program Files (x86)\Xpra>Xpra_cmd.exe attach tcp:1.2.3.4:1400

I do a few things in the xterm then disconnect afterwards I stop the server with ctrl+c and there is often no .png in the current directory so i'm not sure what i'm doing wrong

Tue, 19 Aug 2014 17:15:05 GMT - Smo: owner changed

owner changed from Smo to Antoine Martin

Wed, 20 Aug 2014 02:12:55 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to Smo

Don't control-c the server, try "xpra stop" or when profiling the client, just exit from the tray.

Wed, 20 Aug 2014 06:26:09 GMT - Smo:

Tried again using xpra stop instead. I got pycallgraph-xpra.png this time but it was exactly the same as the previous one attached to this ticket.

I used the same commands at comment:12

Sun, 24 Aug 2014 06:15:17 GMT - Antoine Martin:

Please re-assign tickets to me if I am supposed to act on them.

Did you wait for at least 10 seconds, which is the initial delay set? You should then see this message in the server output:

starting trace

If that still does not work for you, please post full versions, log output, etc..

Mon, 08 Sep 2014 03:38:50 GMT - Antoine Martin: owner, status changed

owner changed from Smo to Antoine Martin
status changed from new to assigned

Maybe we do need to optimize intersect_rect: see ticket:670#comment:6.

Mon, 08 Sep 2014 05:58:36 GMT - Antoine Martin: attachment set

attachment set to region-cython.patch

converts the region code to use a cython class, doubling the speed

Mon, 08 Sep 2014 06:10:14 GMT - Antoine Martin:

The reason why the patch above only doubles the speed is because most of the expensive operations still work on python lists of objects (untyped). That's not an easy problem to solve. One way would be to use numpy arrays for the list of rectangles, but we would lose the nice encapsulation.

Fri, 24 Apr 2015 10:49:22 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to alas
status changed from assigned to new

r9146 + r9147 re-implements the rectangle code using cython, similar to the patch above but with more code unrolling.

Also adds a test to measure the impact of this change and the results are good (elapsed time - lower is better):

Test	0.14	0.15	Improvement
R1 create	0.26	0.07	3.7x
R1 add	0.93	0.19	4.8x
R1 remove	0.44	0.14	3.1x
R1 contains	0.08	0.08	1x
R2 create	0.10	0.03	3.3x
R2 add	0.48	0.10	4.8x
R2 remove	0.17	0.05	3.4x
R2 contains	0.08	0.08	1x
merge all	2.09	0.02	104x

@afarr: I am re-assigning to you because you guys need to be able to run the profiler. I have just ran it again, and now that the code is in cython, we don't get as much output - but we still get something..

Fri, 08 May 2015 08:38:02 GMT - Antoine Martin:

Small bug fixed in r9282, makes merge_all a tiny bit slower, but we use it more and it is still plenty fast. (backport to v0.15 in r9283)

Wed, 03 Jun 2015 00:45:05 GMT - alas:

Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.

[tlaloc@jimador src]$ xpra --version
2015-06-02 17:34:20,115 rencode at '/usr/lib64/python2.7/site-packages/rencode/__init__.pyc' lacks versioning information
xpra v0.16.0

I suppose I'm missing some other detail (probably fairly obvious), because when I try to run with the same command as Smo from comment:12 I get the following traceback:

[tlaloc@jimador src]$ ./tests/scripts/pycallgraph -d 10 \
> -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
> -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400
Traceback (most recent call last):
  File "./tests/scripts/pycallgraph", line 7, in <module>
    from pycallgraph import PyCallGraph, Config
ImportError: No module named pycallgraph

Checking from the src directory, I do indeed see what appears to by a pycallgraph in tests/scripts ... so - does this look like an issue related to not having rencode versioning?

Trying to install python-rencode I get this:

[tlaloc@jimador src]$ sudo yum install python-rencode
[sudo] password for tlaloc:
Loaded plugins: langpacks
https://winswitch.org/beta/Fedora/21/x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
Trying other mirror.
Package python-rencode-1.0.3-1.fc21.x86_64 already installed and latest version
Nothing to do

Trying to install python3-rencode (which I see in the dists directory for Fedora 21) I get apparent success:

Installed:
  python3-rencode.x86_64 0:1.0.3-1.fc21

... maybe not the right version?

Wed, 03 Jun 2015 01:22:32 GMT - alas:

Ok, Smo held my hand through the install of the pycallgraph (didn't realize it needed a separate install).

Was able to generate a graph of a session with xterm start child launching a firefox window... opening a couple of tabs and watching video - with a 0.15.0 client connected to a 0.16.0 server. Not that I'm sure exactly what the chart indicates.

Wed, 03 Jun 2015 01:23:19 GMT - alas: attachment set

attachment set to pycallgraph-xpra-2.png

pycallgraph experiment

Wed, 03 Jun 2015 01:24:02 GMT - alas:

Wed, 03 Jun 2015 01:51:45 GMT - Antoine Martin:

Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.

python-rencode is in the fedora repo, if you have this repo installed, simply doing a yum update will get you the up to date rencode version, which will get rid of this warning.

Wed, 29 Jul 2015 00:55:15 GMT - alas:

I think I finally found a rencode library that at least causes no complaints.

Ran another test, got the same results as above, so presumably the rencode wasn't at issue.

Looking through the wiki/profiling page clarifies a number of the parameters I was copying from your/Smo's example... but I'm not entirely clear on what bits list I would use to tweak the profiling.

As an experiment I tried a small tweak:

 ./tests/scripts/pycallgraph -d 10 -I "*rectangle*,*encoding*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" -- \
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm

Which generated a considerably less interesting graph (just the _main_ grey box, really...) and also managed to produce some tracebacks/access violations client-side (prior to expected ctrl-c issues with win32 client): moved to #929.

So... is there another wiki page that outlines the structure and which "pycall"s should be expected along certain ... err... "usepaths"?

Mon, 03 Aug 2015 23:49:43 GMT - alas: attachment set

attachment set to pycallgraph-xpra--i-subregion-15-4.png

subregion pycallgraph test

Mon, 03 Aug 2015 23:54:14 GMT - alas:

Looking back through this ticket I noticed the -I *subregion* suggested setting.

Running with the following command line:

./tests/scripts/pycallgraph -d 10 -I "*subregion*" -- \
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm

0.15.4 r10055 win32 client against 0.15.4 r10082 fedora 21 server (smo's build) I was able to generate the following graph, after using a start-child xterm to start firefox and watch some video.

So is this sufficient "mastery" (to use the term very very loosely) to close this ticket and feel (vaguely) comfortable that I will be able to run this graphing should it be called for in the future? Or is there something glaringly obvious that I'm still missing?

(Hmm... and should I run the test again to produce non-reduced to fit the ticket results?)

Tue, 04 Aug 2015 03:50:03 GMT - Antoine Martin: status changed; resolution set

status changed from new to closed
resolution set to fixed

Looking back through this ticket I noticed the -I *subregion* suggested setting.

What is included will depend on what we're looking for. subregion is a good one to look at.

I was able to generate the following graph

This will do for now.

Sat, 23 Jan 2021 05:01:06 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/620