xpra icon
Bug tracker and wiki

Opened 3 years ago

Closed 21 months ago

#620 closed defect (fixed)

performance optimizations from profiling

Reported by: Antoine Martin Owned by: alas
Priority: critical Milestone: 0.14
Component: android Version: 0.12.x
Keywords: Cc:

Description (last modified by Antoine Martin)

After playing around with wiki/Profiling, I've identified some areas where tuning could lead to some significant improvements:

  • Protocol.encode calling Compressed.__len___ fixed in r7021
  • error.XErrorToName called when the result is generally not used (not very expensive, but used very often!) fixed in r7022
  • region.contains improved for python>=2.5 in r7025 (+fix in r7027)

Not done yet:

  • WindowVideoSource.get_best_encoding() costs 10ms! (including 5ms for get_encoding_options()!)
  • get_current_speed() and get_current_quality() cost about 1ms each, but we call them a lot
  • WindowMode.get_dimensions(): 2ms
  • region.add_rectangle: 2.5ms!
  • send_cursor: we should compress before sending, not in the event loop!
  • set_client_properties: 30ms! (huge!)

Attachments (9)

damage-r7031.png (589.3 KB) - added by Antoine Martin 3 years ago.
example profiling output taken at r7031
encoding-scoring-v2.patch (17.4 KB) - added by Antoine Martin 3 years ago.
choose the encoding using a scoring mechanism
damage-r7215.png (231.8 KB) - added by Antoine Martin 3 years ago.
fast encoding selection is no longer the bottleneck
rectangle-slow.png (42.6 KB) - added by Antoine Martin 3 years ago.
rectangles can become really slow
rectangle-only.png (81.9 KB) - added by Antoine Martin 3 years ago.
profiling of rectangle code only
graph-fail.png (4.6 KB) - added by Smo 3 years ago.
region-cython.patch (14.3 KB) - added by Antoine Martin 3 years ago.
converts the region code to use a cython class, doubling the speed
pycallgraph-xpra-2.png (61.8 KB) - added by alas 2 years ago.
pycallgraph experiment
pycallgraph-xpra--i-subregion-15-4.png (34.4 KB) - added by alas 21 months ago.
subregion pycallgraph test

Download all attachments as: .zip

Change History (36)

Changed 3 years ago by Antoine Martin

Attachment: damage-r7031.png added

example profiling output taken at r7031

comment:1 Changed 3 years ago by Antoine Martin

Description: modified (diff)
Owner: changed from Antoine Martin to Antoine Martin
Status: newassigned

The graph attached was generated using:

./tests/scripts/pycallgraph -d 10 \
   -e libs,std,misc,one_offs,net,logging,x11,connection,keyboard,clipboard,sound,mouse \
   --
   start :10 ...

Progress:

  • get_current_speed() and get_current_quality() removed in r7030 (+fix in r7042)
  • for get_best_encoding, we should pre-calculate the options, including encoding specific things like the max_webp value - not sure this can be done for this release. Maybe we can at least catch the special case of send_full_window_update as seen in the graph above?
  • send_cursor is now doing the compression work later and per-client, which also allows us to use lz4 or lzo if supported
Last edited 3 years ago by Antoine Martin (previous) (diff)

comment:2 Changed 3 years ago by Antoine Martin

Some improvements on get_encoding_options (try to pre-calculate some) in r7178, often much faster for the non-video case - the video case remains harder..

Changed 3 years ago by Antoine Martin

Attachment: encoding-scoring-v2.patch added

choose the encoding using a scoring mechanism

comment:3 Changed 3 years ago by Antoine Martin

Improvements in r7203 and r7204.

The patch above is much nicer than the current code, but unfortunately it is not faster (yet? why not!?) and it is missing some features.

New important bottleneck identified (and expected): the rectangle merging/splitting code. It works fine with a few rectangles of limited size but really slows us down with more demanding tests. Needs to be cythonized.

Changed 3 years ago by Antoine Martin

Attachment: damage-r7215.png added

fast encoding selection is no longer the bottleneck

comment:4 Changed 3 years ago by Antoine Martin

Encoding selection is now much improved in r7215, to the point that this is no longer the bottleneck (though get_best_encoding_default method could still be improved as it is now a bit ugly and inefficient), as can be seen here (synthetic benchmark - which is hiding the real problem with rectangles..):

fast encoding selection is no longer the bottleneck

I liked the scoring method from the patch above as it allows for smoother encoding transitions than the arbitrary limits set in the current code, maybe we can merge it without slowing things down too much? Or maybe only use it beyond a certain size, to make small regions a fast path (as there is likely to be many more of those and choosing the right encoding is less important and easier: rgb often does a fine job for those)

Changed 3 years ago by Antoine Martin

Attachment: rectangle-slow.png added

rectangles can become really slow

comment:5 Changed 3 years ago by Antoine Martin

The new target is the rectangle code, which becomes dreadfully slow as we get more rectangles in the list:

rectangles can become really slow

Two things we should do:

  • avoid creating so many small rectangles beyond a certain point
  • make rectangle calculations faster

Changed 3 years ago by Antoine Martin

Attachment: rectangle-only.png added

profiling of rectangle code only

comment:6 Changed 3 years ago by Antoine Martin

Whereas the "rectangle slow" screenshot was cropped from a larger capture, this one is captured using a dedicated profiling command:

./tests/scripts/pycallgraph -d 10 \
    -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
    --
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio

And the result is much better:

profiling of rectangle code only

Is it a heisenbug where the profiling code slows down the rectangle loops so much that it becomes a problem? (the first capture has hundreds of expressions, the second one only the 6 shown above)

comment:7 Changed 3 years ago by Antoine Martin

Owner: changed from Antoine Martin to alas
Status: assignednew

I think it is good enough now, please run the profiling as per comment:1 and comment:6 with the scenarios you are interested in, and post the resulting file pycallgraph-xpra.png.
And look for any unusually high time per call (usually highlighted by a different colour).

It would be worth testing the video region code too with just: -I *subregion*, assuming #410 isn't in too bad a shape.

comment:8 Changed 3 years ago by Smo

I've updated my pycallgraph to v1.0.1 like it is listed in your graphs.

Following the commands in comment:1 and comment:6 I've ran a few tests this way and I've only once got pycallgraph-xpra.png in my local directory.

I don't see any errors but I also don't get the resulting graph. What am I doing wrong ?

Changed 3 years ago by Smo

Attachment: graph-fail.png added

comment:9 Changed 3 years ago by Antoine Martin

Owner: changed from alas to Smo

Please post the exact commands that you used for both the server and the client, the command output, etc.

comment:10 Changed 3 years ago by Antoine Martin

Priority: majorcritical

Should be closed for 0.14, raising to critical.

comment:11 Changed 3 years ago by Antoine Martin

Milestone: 0.150.14

Should be closed against milestone 0.14

comment:12 Changed 3 years ago by Smo

I'm running the server with

./tests/scripts/pycallgraph -d 10 \
-I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
-- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400

Then i'm connecting with a windows client

C:\Program Files (x86)\Xpra>Xpra_cmd.exe attach tcp:1.2.3.4:1400

I do a few things in the xterm then disconnect afterwards I stop the server with ctrl+c and there is often no .png in the current directory so i'm not sure what i'm doing wrong

comment:13 Changed 3 years ago by Smo

Owner: changed from Smo to Antoine Martin

comment:14 Changed 3 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo

Don't control-c the server, try "xpra stop" or when profiling the client, just exit from the tray.

comment:15 Changed 3 years ago by Smo

Tried again using xpra stop instead. I got pycallgraph-xpra.png this time but it was exactly the same as the previous one attached to this ticket.

I used the same commands at comment:12

comment:16 Changed 3 years ago by Antoine Martin

Please re-assign tickets to me if I am supposed to act on them.

Did you wait for at least 10 seconds, which is the initial delay set?
You should then see this message in the server output:

starting trace

If that still does not work for you, please post full versions, log output, etc..

comment:17 Changed 3 years ago by Antoine Martin

Owner: changed from Smo to Antoine Martin
Status: newassigned

Maybe we do need to optimize intersect_rect: see ticket:670#comment:6.

Changed 3 years ago by Antoine Martin

Attachment: region-cython.patch added

converts the region code to use a cython class, doubling the speed

comment:18 Changed 3 years ago by Antoine Martin

The reason why the patch above only doubles the speed is because most of the expensive operations still work on python lists of objects (untyped).
That's not an easy problem to solve. One way would be to use numpy arrays for the list of rectangles, but we would lose the nice encapsulation.

comment:19 Changed 2 years ago by Antoine Martin

Owner: changed from Antoine Martin to alas
Status: assignednew

r9146 + r9147 re-implements the rectangle code using cython, similar to the patch above but with more code unrolling.

Also adds a test to measure the impact of this change and the results are good (elapsed time - lower is better):

Test0.140.15Improvement
R1 create0.260.07 3.7x
R1 add0.930.19 4.8x
R1 remove0.440.14 3.1x
R1 contains0.080.08 1x
R2 create0.100.03 3.3x
R2 add0.480.10 4.8x
R2 remove0.170.05 3.4x
R2 contains0.080.08 1x
merge all2.090.02 104x

@afarr: I am re-assigning to you because you guys need to be able to run the profiler.
I have just ran it again, and now that the code is in cython, we don't get as much output - but we still get something..

Last edited 2 years ago by Antoine Martin (previous) (diff)

comment:20 Changed 2 years ago by Antoine Martin

Small bug fixed in r9282, makes merge_all a tiny bit slower, but we use it more and it is still plenty fast. (backport to v0.15 in r9283)

Last edited 2 years ago by Antoine Martin (previous) (diff)

comment:21 Changed 2 years ago by alas

Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.

[tlaloc@jimador src]$ xpra --version
2015-06-02 17:34:20,115 rencode at '/usr/lib64/python2.7/site-packages/rencode/__init__.pyc' lacks versioning information
xpra v0.16.0

I suppose I'm missing some other detail (probably fairly obvious), because when I try to run with the same command as Smo from comment:12 I get the following traceback:

[tlaloc@jimador src]$ ./tests/scripts/pycallgraph -d 10 \
> -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \
> -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400
Traceback (most recent call last):
  File "./tests/scripts/pycallgraph", line 7, in <module>
    from pycallgraph import PyCallGraph, Config
ImportError: No module named pycallgraph

Checking from the src directory, I do indeed see what appears to by a pycallgraph in tests/scripts ... so - does this look like an issue related to not having rencode versioning?

  • Trying to install python-rencode I get this:
    [tlaloc@jimador src]$ sudo yum install python-rencode
    [sudo] password for tlaloc:
    Loaded plugins: langpacks
    https://winswitch.org/beta/Fedora/21/x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
    Trying other mirror.
    Package python-rencode-1.0.3-1.fc21.x86_64 already installed and latest version
    Nothing to do
    

Trying to install python3-rencode (which I see in the dists directory for Fedora 21) I get apparent success:

Installed:
  python3-rencode.x86_64 0:1.0.3-1.fc21

... maybe not the right version?

comment:22 Changed 2 years ago by alas

Ok, Smo held my hand through the install of the pycallgraph (didn't realize it needed a separate install).

Was able to generate a graph of a session with xterm start child launching a firefox window... opening a couple of tabs and watching video - with a 0.15.0 client connected to a 0.16.0 server. Not that I'm sure exactly what the chart indicates.

Changed 2 years ago by alas

Attachment: pycallgraph-xpra-2.png added

pycallgraph experiment

comment:23 Changed 2 years ago by alas

pycallgraph experiment

comment:24 Changed 2 years ago by Antoine Martin

Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.


python-rencode is in the fedora repo, if you have this repo installed, simply doing a yum update will get you the up to date rencode version, which will get rid of this warning.

comment:25 Changed 21 months ago by alas

I think I finally found a rencode library that at least causes no complaints.

Ran another test, got the same results as above, so presumably the rencode wasn't at issue.

Looking through the wiki/profiling page clarifies a number of the parameters I was copying from your/Smo's example... but I'm not entirely clear on what bits list I would use to tweak the profiling.

As an experiment I tried a small tweak:

 ./tests/scripts/pycallgraph -d 10 -I "*rectangle*,*encoding*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" -- \
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm

Which generated a considerably less interesting graph (just the _main_ grey box, really...) and also managed to produce some tracebacks/access violations client-side (prior to expected ctrl-c issues with win32 client): moved to #929.

So... is there another wiki page that outlines the structure and which "pycall"s should be expected along certain ... err... "usepaths"?

Last edited 21 months ago by Antoine Martin (previous) (diff)

Changed 21 months ago by alas

subregion pycallgraph test

comment:26 Changed 21 months ago by alas

Looking back through this ticket I noticed the -I *subregion* suggested setting.

Running with the following command line:

./tests/scripts/pycallgraph -d 10 -I "*subregion*" -- \
    start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm

0.15.4 r10055 win32 client against 0.15.4 r10082 fedora 21 server (smo's build) I was able to generate the following graph, after using a start-child xterm to start firefox and watch some video.


subregion pycallgraph test


So is this sufficient "mastery" (to use the term very very loosely) to close this ticket and feel (vaguely) comfortable that I will be able to run this graphing should it be called for in the future? Or is there something glaringly obvious that I'm still missing?

(Hmm... and should I run the test again to produce non-reduced to fit the ticket results?)

comment:27 Changed 21 months ago by Antoine Martin

Resolution: fixed
Status: newclosed

Looking back through this ticket I noticed the -I *subregion* suggested setting.


What is included will depend on what we're looking for. subregion is a good one to look at.


I was able to generate the following graph


This will do for now.

Note: See TracTickets for help on using tickets.