Xpra: Ticket #620: performance optimizations from profiling

After playing around with wiki/Profiling, I've identified some areas where tuning could lead to some significant improvements:

Not done yet:



Wed, 30 Jul 2014 10:30:48 GMT - Antoine Martin: attachment set

example profiling output taken at r7031


Wed, 30 Jul 2014 12:38:37 GMT - Antoine Martin: owner, status, description changed

The graph attached was generated using:

./tests/scripts/pycallgraph -d 10 \
   -e libs,std,misc,one_offs,net,logging,x11,connection,keyboard,clipboard,sound,mouse \
   --
   start :10 ...

Progress:


Thu, 07 Aug 2014 14:38:23 GMT - Antoine Martin:

Some improvements on get_encoding_options (try to pre-calculate some) in r7178, often much faster for the non-video case - the video case remains harder..


Fri, 08 Aug 2014 16:07:02 GMT - Antoine Martin: attachment set

choose the encoding using a scoring mechanism


Fri, 08 Aug 2014 16:11:29 GMT - Antoine Martin:

Improvements in r7203 and r7204.

The patch above is much nicer than the current code, but unfortunately it is not faster (yet? why not!?) and it is missing some features.

New important bottleneck identified (and expected): the rectangle merging/splitting code. It works fine with a few rectangles of limited size but really slows us down with more demanding tests. Needs to be cythonized.


Sat, 09 Aug 2014 08:53:25 GMT - Antoine Martin: attachment set

fast encoding selection is no longer the bottleneck


Sat, 09 Aug 2014 08:58:22 GMT - Antoine Martin:

Encoding selection is now much improved in r7215, to the point that this is no longer the bottleneck (though get_best_encoding_default method could still be improved as it is now a bit ugly and inefficient), as can be seen here (synthetic benchmark - which is hiding the real problem with rectangles..):

fast encoding selection is no longer the bottleneck

I liked the scoring method from the patch above as it allows for smoother encoding transitions than the arbitrary limits set in the current code, maybe we can merge it without slowing things down too much? Or maybe only use it beyond a certain size, to make small regions a fast path (as there is likely to be many more of those and choosing the right encoding is less important and easier: rgb often does a fine job for those)


Sat, 09 Aug 2014 09:11:32 GMT - Antoine Martin: attachment set

rectangles can become really slow


Sat, 09 Aug 2014 09:13:44 GMT - Antoine Martin:

The new target is the rectangle code, which becomes dreadfully slow as we get more rectangles in the list:

rectangles can become really slow

Two things we should do:


Sat, 09 Aug 2014 14:19:44 GMT - Antoine Martin: attachment set

profiling of rectangle code only


Sat, 09 Aug 2014 14:23:25 GMT - Antoine Martin:

Whereas the "rectangle slow" screenshot was cropped from a larger capture, this one is captured using a dedicated profiling command:

./tests/scripts/pycallgraph -d 10 \
    -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
    --
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio

And the result is much better:

profiling of rectangle code only

Is it a heisenbug where the profiling code slows down the rectangle loops so much that it becomes a problem? (the first capture has hundreds of expressions, the second one only the 6 shown above)


Mon, 11 Aug 2014 03:24:32 GMT - Antoine Martin: owner, status changed

I think it is good enough now, please run the profiling as per comment:1 and comment:6 with the scenarios you are interested in, and post the resulting file pycallgraph-xpra.png. And look for any unusually high time per call (usually highlighted by a different colour).

It would be worth testing the video region code too with just: -I *subregion*, assuming #410 isn't in too bad a shape.


Mon, 11 Aug 2014 19:06:12 GMT - Smo:

I've updated my pycallgraph to v1.0.1 like it is listed in your graphs.

Following the commands in comment:1 and comment:6 I've ran a few tests this way and I've only once got pycallgraph-xpra.png in my local directory.

I don't see any errors but I also don't get the resulting graph. What am I doing wrong ?


Mon, 11 Aug 2014 19:07:58 GMT - Smo: attachment set


Tue, 12 Aug 2014 02:57:12 GMT - Antoine Martin: owner changed

Please post the exact commands that you used for both the server and the client, the command output, etc.


Sun, 17 Aug 2014 14:21:13 GMT - Antoine Martin: priority changed

Should be closed for 0.14, raising to critical.


Tue, 19 Aug 2014 03:34:40 GMT - Antoine Martin: milestone changed

Should be closed against milestone 0.14


Tue, 19 Aug 2014 17:14:19 GMT - Smo:

I'm running the server with

./tests/scripts/pycallgraph -d 10 \
-I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
-- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400

Then i'm connecting with a windows client

C:\Program Files (x86)\Xpra>Xpra_cmd.exe attach tcp:1.2.3.4:1400

I do a few things in the xterm then disconnect afterwards I stop the server with ctrl+c and there is often no .png in the current directory so i'm not sure what i'm doing wrong


Tue, 19 Aug 2014 17:15:05 GMT - Smo: owner changed


Wed, 20 Aug 2014 02:12:55 GMT - Antoine Martin: owner changed

Don't control-c the server, try "xpra stop" or when profiling the client, just exit from the tray.


Wed, 20 Aug 2014 06:26:09 GMT - Smo:

Tried again using xpra stop instead. I got pycallgraph-xpra.png this time but it was exactly the same as the previous one attached to this ticket.

I used the same commands at comment:12


Sun, 24 Aug 2014 06:15:17 GMT - Antoine Martin:

Please re-assign tickets to me if I am supposed to act on them.

Did you wait for at least 10 seconds, which is the initial delay set? You should then see this message in the server output:

starting trace

If that still does not work for you, please post full versions, log output, etc..


Mon, 08 Sep 2014 03:38:50 GMT - Antoine Martin: owner, status changed

Maybe we do need to optimize intersect_rect: see ticket:670#comment:6.


Mon, 08 Sep 2014 05:58:36 GMT - Antoine Martin: attachment set

converts the region code to use a cython class, doubling the speed


Mon, 08 Sep 2014 06:10:14 GMT - Antoine Martin:

The reason why the patch above only doubles the speed is because most of the expensive operations still work on python lists of objects (untyped). That's not an easy problem to solve. One way would be to use numpy arrays for the list of rectangles, but we would lose the nice encapsulation.


Fri, 24 Apr 2015 10:49:22 GMT - Antoine Martin: owner, status changed

r9146 + r9147 re-implements the rectangle code using cython, similar to the patch above but with more code unrolling.

Also adds a test to measure the impact of this change and the results are good (elapsed time - lower is better):

Test0.140.15Improvement
R1 create0.260.07 3.7x
R1 add0.930.19 4.8x
R1 remove0.440.14 3.1x
R1 contains0.080.08 1x
R2 create0.100.03 3.3x
R2 add0.480.10 4.8x
R2 remove0.170.05 3.4x
R2 contains0.080.08 1x
merge all2.090.02 104x

@afarr: I am re-assigning to you because you guys need to be able to run the profiler. I have just ran it again, and now that the code is in cython, we don't get as much output - but we still get something..


Fri, 08 May 2015 08:38:02 GMT - Antoine Martin:

Small bug fixed in r9282, makes merge_all a tiny bit slower, but we use it more and it is still plenty fast. (backport to v0.15 in r9283)


Wed, 03 Jun 2015 00:45:05 GMT - alas:

Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.

[tlaloc@jimador src]$ xpra --version
2015-06-02 17:34:20,115 rencode at '/usr/lib64/python2.7/site-packages/rencode/__init__.pyc' lacks versioning information
xpra v0.16.0

I suppose I'm missing some other detail (probably fairly obvious), because when I try to run with the same command as Smo from comment:12 I get the following traceback:

[tlaloc@jimador src]$ ./tests/scripts/pycallgraph -d 10 \
> -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
> -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400
Traceback (most recent call last):
  File "./tests/scripts/pycallgraph", line 7, in <module>
    from pycallgraph import PyCallGraph, Config
ImportError: No module named pycallgraph

Checking from the src directory, I do indeed see what appears to by a pycallgraph in tests/scripts ... so - does this look like an issue related to not having rencode versioning?

Trying to install python3-rencode (which I see in the dists directory for Fedora 21) I get apparent success:

Installed:
  python3-rencode.x86_64 0:1.0.3-1.fc21

... maybe not the right version?


Wed, 03 Jun 2015 01:22:32 GMT - alas:

Ok, Smo held my hand through the install of the pycallgraph (didn't realize it needed a separate install).

Was able to generate a graph of a session with xterm start child launching a firefox window... opening a couple of tabs and watching video - with a 0.15.0 client connected to a 0.16.0 server. Not that I'm sure exactly what the chart indicates.


Wed, 03 Jun 2015 01:23:19 GMT - alas: attachment set

pycallgraph experiment


Wed, 03 Jun 2015 01:24:02 GMT - alas:

pycallgraph experiment


Wed, 03 Jun 2015 01:51:45 GMT - Antoine Martin:

Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.


python-rencode is in the fedora repo, if you have this repo installed, simply doing a yum update will get you the up to date rencode version, which will get rid of this warning.


Wed, 29 Jul 2015 00:55:15 GMT - alas:

I think I finally found a rencode library that at least causes no complaints.

Ran another test, got the same results as above, so presumably the rencode wasn't at issue.

Looking through the wiki/profiling page clarifies a number of the parameters I was copying from your/Smo's example... but I'm not entirely clear on what bits list I would use to tweak the profiling.

As an experiment I tried a small tweak:

 ./tests/scripts/pycallgraph -d 10 -I "*rectangle*,*encoding*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" -- \
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm

Which generated a considerably less interesting graph (just the _main_ grey box, really...) and also managed to produce some tracebacks/access violations client-side (prior to expected ctrl-c issues with win32 client): moved to #929.

So... is there another wiki page that outlines the structure and which "pycall"s should be expected along certain ... err... "usepaths"?


Mon, 03 Aug 2015 23:49:43 GMT - alas: attachment set

subregion pycallgraph test


Mon, 03 Aug 2015 23:54:14 GMT - alas:

Looking back through this ticket I noticed the -I *subregion* suggested setting.

Running with the following command line:

./tests/scripts/pycallgraph -d 10 -I "*subregion*" -- \
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm

0.15.4 r10055 win32 client against 0.15.4 r10082 fedora 21 server (smo's build) I was able to generate the following graph, after using a start-child xterm to start firefox and watch some video.


subregion pycallgraph test


So is this sufficient "mastery" (to use the term very very loosely) to close this ticket and feel (vaguely) comfortable that I will be able to run this graphing should it be called for in the future? Or is there something glaringly obvious that I'm still missing?

(Hmm... and should I run the test again to produce non-reduced to fit the ticket results?)


Tue, 04 Aug 2015 03:50:03 GMT - Antoine Martin: status changed; resolution set

Looking back through this ticket I noticed the -I *subregion* suggested setting.


What is included will depend on what we're looking for. subregion is a good one to look at.


I was able to generate the following graph


This will do for now.


Sat, 23 Jan 2021 05:01:06 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/620