
Opened 5 years ago

Closed 2 years ago

#797 closed task (fixed)

Profile xpra and capture tests for regression

Reported by: Nick Centanni
Owned by: Smo
Priority: major
Milestone: 1.0
Component: tests
Version: trunk
Keywords:
Cc:

Description (last modified by Antoine Martin)

We can take a look at other things to educate ourselves; any result (positive or negative) should tell us more about the trade-offs we have decided to make and whether we should re-evaluate them (my bet: there are bound to be some that need another look).

Here are a few worth checking:

  • 0.14 vs 0.15: the obvious one first, we should be doing better - but are we?
  • x264 tuning: speed and quality impact on bandwidth, framerate, etc
  • x264: impact of enabling threading ("XPRA_X264_THREADS=1") - same for vpx and webp - related to the CPU tests below
  • re-run of the detailed x264 vs vp8 comparison
  • rgb lz4 vs png: are the bandwidth savings worth the CPU pain of using png?
  • nvenc: kepler vs maxwell 1 vs maxwell2 cards (the new GTX 960 is out too..)
  • the new multi-delta feature: what is the optimal number of delta buckets (currently set to 5)
  • CPU tests: measure the impact of clock speed on performance (by artificially capping it), same for number of cores
  • various environment variables that control aspects of the network layer: XPRA_READ_BUFFER_SIZE, XPRA_PACKET_JOIN_SIZE, XPRA_INLINE_SIZE, XPRA_YIELD
  • proxy tuning: PROXY_QUEUE_SIZE
  • impact of the merge regions code: "XPRA_MERGE_REGIONS"
  • impact of downscaling, by forcing it with XPRA_SCALING, maybe also forcing CSC with XPRA_FORCE_CSC_MODE

etc...

The important thing is to run these tests again and again, either narrowing down on the outliers or discarding them once we understand why they stick out.
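As a rough illustration of the kind of repeated run meant here (the exact test_measure_perf.py invocation and output handling are assumptions, not the actual harness configuration), toggling one of the tunables across several repetitions could look like this:

  # hypothetical driver: repeat the benchmark with one tunable toggled,
  # keeping each repetition separate so outliers show up across runs
  import os, subprocess
  for rep in range(3):
      for threads in ("0", "1"):
          env = dict(os.environ, XPRA_X264_THREADS=threads)
          print("repetition %i, XPRA_X264_THREADS=%s" % (rep, threads))
          subprocess.check_call(["python", "./tests/xpra/test_measure_perf.py"], env=env)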

Attachments (25)

comparing_v14_v15.tar.gz (256.7 KB) - added by Nick Centanni 5 years ago.
CSV files from a 9-repetition run of test_measure_perf comparing v14 and v15.
compare_15Feb_15Jun.tar.gz (122.6 KB) - added by Nick Centanni 4 years ago.
Compares benchmark results of v15 done Feb-2015 with those done June-2015.
compare15.tar.gz (258.6 KB) - added by Nick Centanni 4 years ago.
CSV and log files from the 15Feb_15Jun tests.
h264_glx.tar.gz (150.1 KB) - added by Nick Centanni 4 years ago.
Collection of several charts visualizing 16 repetitions of the benchmark tests.
0.16.3-perflogs.tar.bz2 (125.2 KB) - added by Smo 3 years ago.
0.16.3 performance logs from fedora 24
0.17.5-perflogs.tar.bz2 (158.4 KB) - added by Smo 3 years ago.
0.17.5 performance logs from fedora 24
xpra-1.0-testrun-fail.txt (17.7 KB) - added by Smo 3 years ago.
1.0-perflogs.tar.bz2 (233.0 KB) - added by Smo 3 years ago.
1.0-20161028r14313-1-perflogs.tar.bz2 (238.5 KB) - added by Smo 3 years ago.
1.0-20161029r14329-1-perflogs.tar.bz2 (224.3 KB) - added by Smo 3 years ago.
0.17.6-2.r14324-perflogs.tar.bz2 (214.8 KB) - added by Smo 3 years ago.
1.0-0.20161102r14384-perflogs.tar.bz2 (99.5 KB) - added by Smo 3 years ago.
1.0-0.20161102r14384-run2-perflogs.tar.bz2 (108.8 KB) - added by Smo 3 years ago.
1.0-20161104r14397-perflogs.tar.bz2 (107.7 KB) - added by Smo 3 years ago.
1.0-20161114r14416-perflogs.tar.bz2 (107.7 KB) - added by Smo 3 years ago.
1.0-20161115r14430-perflogs.tar.bz2 (107.9 KB) - added by Smo 3 years ago.
1.0-20161115r14430-run2-perflogs.tar.bz2 (168.8 KB) - added by Smo 3 years ago.
1.0-20161118r14449-perflogs.tar.bz2 (167.6 KB) - added by Smo 3 years ago.
1.0.2-1.r14944-perflogs.tar.bz2 (410.8 KB) - added by Smo 3 years ago.
2.0-0.20170302r15219-perflogs.tar.bz2 (319.3 KB) - added by Smo 3 years ago.
2.0-0.20170307r15230-perflogs.tar.bz (415.7 KB) - added by Smo 3 years ago.
1.0.3r15051_perflogs.tar.bz2 (420.0 KB) - added by Smo 3 years ago.
103v20-chart.tar.bz2 (100.1 KB) - added by Smo 3 years ago.
1.0.4r15266_perflogs.tar.bz2 (405.3 KB) - added by Smo 3 years ago.
2.0r15251_perflogs.tar.bz2 (421.5 KB) - added by Smo 3 years ago.

Change History (74)

comment:1 Changed 5 years ago by Nick Centanni

Owner: changed from Antoine Martin to Nick Centanni

comment:2 Changed 5 years ago by Nick Centanni

Status: new → assigned

comment:3 Changed 5 years ago by Antoine Martin

Description: modified (diff)
Milestone: 0.14

(just re-formatting with wiki tags)

comment:4 Changed 5 years ago by Nick Centanni

I've attached the results of a 9-repetition test of v14 and v15. These data files can be extracted anywhere.

I've added a new utility in tests/xpra, called test_measure_perf_charts.py, along with the CSS and JS files necessary to draw the charts in the resulting HTML file. These utility files were added in r8639.

Change tests/xpra/test_measure_perf_charts.py so it points to the directory where you placed the extracted data files.

The first version of this chart generator happens to be configured for this data set already (it has the correct prefix, params, and reps), but normally you would need to change those to reflect the data set you're generating charts for.
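For example, the edit described above would amount to something like the following (illustrative only - the actual variable names inside test_measure_perf_charts.py may differ):

  # point the chart generator at the extracted CSV files and describe the data set
  data_dir = "/path/to/extracted/csv/files"   # wherever the tarball was unpacked
  prefix   = "comparing_v14_v15"              # file name prefix of this data set
  params   = ["14", "15"]                     # the two versions being compared
  reps     = 9                                # number of test repetitions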

Changed 5 years ago by Nick Centanni

Attachment: comparing_v14_v15.tar.gz added

CSV files from a 9-repetition run of test_measure_perf comparing v14 and v15.

comment:5 Changed 5 years ago by Antoine Martin

This was done 3 months ago now; it would be worth running again to see if anything has changed.

Changed 4 years ago by Nick Centanni

Attachment: compare_15Feb_15Jun.tar.gz added

Compares benchmark results of v15 done Feb-2015 with those done June-2015.

comment:6 Changed 4 years ago by Nick Centanni

Attached results of the new benchmark run on v15 Jun-2015, compared to the benchmark results taken for v15 Feb-2015.

comment:7 Changed 4 years ago by Antoine Martin

Status: assigned → new

Thanks Nick! (FYI: please run with r9638 from now on: we want vp8 and vp9 tested, and "x264" is not a valid encoding name any more.)

Things that stand out (one good news first, then mostly bad news):

  • vpx is now very competitive with h264 (#832): we use more threads and more CPU, encode faster at a higher quality setting, and send more pixels
  • disappointing pixels/s figures: we seem to have lost a few percent there - and this is one of the most important metrics..
  • x264 seems to have regressed with glxgears / glxspheres (batch delay through the roof)
  • batch delay is way too high in the latest v0.15.x (is the revision shown anywhere? please show it somewhere) - it looks like less than 4 fps! Worst case (max batch delay) is just 1 fps!! Quite strange, given that the number of pixels and regions sent is more or less unchanged.
  • mmap has also regressed
  • system cpu usage is higher with mmap and vpx, not sure why that is (I guess the X11 server works harder as we request more frames with vpx, but mmap should be down not up?)
  • odd zero quality with glxspheres and rgb24 / png - quality should always be at 100% for those encodings - is the data correct? missing? was there a crash?
  • odd min-quality values with mmap: the quality is always 100% with mmap / rgb24!
  • please show the client rss

Executive summary: we have serious quality control issues. We should have tackled those regressions during the 0.14 or 0.15 development cycles, not halfway through the 0.16 release cycle.
This data should be collected daily or weekly so we can identify regressions early. And now this will need to be done anyway to identify the source of the regressions, more than doubling the amount of work required to fix them.

Last edited 4 years ago by Antoine Martin (previous) (diff)

comment:8 Changed 4 years ago by Nick Centanni

I've attached here all the CSV and log output from both the 15 (Feb) and 15_2 (June) tests. I checked the logs for crashes before preparing the data. There was a crash in 15_2 sequence 8, so I reran that sequence.

There were login timeouts in all of the Feb x11perf tests. But that test was not included in the charts.

The revision of the June v15 that was used for this test was: r9612.

Last edited 4 years ago by Antoine Martin (previous) (diff)

Changed 4 years ago by Nick Centanni

Attachment: compare15.tar.gz added

CSV and log files from the 15Feb_15Jun tests.

comment:9 Changed 4 years ago by Antoine Martin

Did you ever figure out where the variation came from?
Once the data is reliable, it would be worth running again.

comment:10 Changed 4 years ago by Nick Centanni

In addition to the first two comparisons r8585 vs r9612 (all tests) and r8585 vs r9612 (h264 glx tests only), I ran each of these 4 cases an additional 3 times to compare the deviation between tests. This can be seen in the attached archive.

Untar the archive and view index.html, which contains explanations of each chart in the set.

Changed 4 years ago by Nick Centanni

Attachment: h264_glx.tar.gz added

Collection of several charts visualizing 16 repetitions of the benchmark tests.

comment:11 Changed 4 years ago by Antoine Martin

@nick: can you give us a quick "executive summary" of what you did and why, and your initial conclusions?

I've had a look and noticed:

  • vpx will be skewing the results because it is faster now - please see comment:7: we want vp8 and vp9 tested separately going forward
  • damage latency and batch delay: although the average damage latency is improved (which is great), the max latency and max batch delays are worse, and so is the avg batch delay, in particular:
    • with vpx and glxgears - 100ms to 200ms should be our target for the "max latency" (things are allowed to get a bit high at times, as long as the average stays bearable)
    • with moebiusgears, polytopes and xterm tests: looks like a clear regression: the average batch delay is way too high, which explains why we can keep the damage latency low (not processing enough data, so we can do it faster)

Am I reading this right?

comment:12 Changed 4 years ago by Nick Centanni

@antoine: since we were getting contradictory results between the comparison of v15 and v16 done in June -- when all tests were run -- and the results we got between the same revisions in July -- when only h264 glx tests were run -- I wanted to do some sanity tests just to verify that the test results are reliable.

Specifically:

  • I wanted to run the same tests, the same way, for the same revisions, multiple times to try to reproduce the results – and to see if the contradictory results fell within the margin of error, or could be determined to be some kind of fluke.
  • I wanted to determine if there really was a significant difference in the results when all tests were run in the suite, as opposed to running just the h264 / glx tests.
  • I wanted to produce a combined chart (which can be found at the end of index.html) that summarized all these tests.
  • I also did an audit of the chart generator results (since I've made changes to it) to rule out any problem with the charting tool itself. (There were no problems, so the charts are reporting the correct values.)

About the graphs:
The graphs to look at are the last two (on index.html). They combine all repetitions of the tests. I couldn't fit legends on the graphs (let alone both apps), so here's a description:

  • The first 8 bars in each chart represent r8585, and the second 8 represent r9612.
  • Within each revision, the first 4 represent the results from tests where all apps, for all encodings were done. And the second 4 represent the results from tests where only h264 / glxtests were run.
  • Within each set of 4, the red bar represents the results from the earlier contradictory tests, and the remaining 3 are new repetitions of those tests.
  • You can hover over bars to see which of the above results each bar represents.

Observations:

  • The results were reproducible and consistent, with the exception of a few metrics where the red bar (older tests) varied quite a bit from the newer tests, particularly from the first r9612 tests: Application Bytes in/s, Application Bytes out/s, Application packets in/s, Application packets out/s, Avg Damage Latency, Encoding Pixels/s, Max Damage Latency, and Min Damage Latency. These wider variations were similar for both glxgears and glxspheres. So were those flukes? Was there a problem on the test machine that day? I'm not sure, but since the later runs of those same tests produced significantly different results for those metrics, I'm rather suspicious of those readings.
  • One interesting point was that Server cpu pct was 5 times higher when r9612 was first tested, compared with all the other r9612 tests. Could this have indicated that some process was running on the box during that test that wasn't running during the remainder of the tests? This might explain the significant variance of those particular results. I always start a test run from a freshly booted machine, so I don't know what could have been going on on that machine, that day. The Client user cpu pct didn't show the same dramatic variation. Maybe I'm just misunderstanding what's being measured there.
  • In almost all cases, running all the tests in one run, vs running just the h264 glxtests was not a factor. This can be seen on the charts by looking at how consistent the first two sets of 4 bars are with each other, and how consistent the second two sets of bars are. The metrics which did seem to vary significantly between these two testing methods were: Avg Damage Latency (in r9612), Encoding Pixels (in r9612), Min Damage Latency (in both revisions, and rather dramatically), and Decoding Pixels (r9612, glxspheres only).
  • For those charts where there were no dramatic anomalies in the result consistency, each chart can be used to compare r8585 and r9612 by comparing the average values of the first 8 bars with the average values of the second 8 bars. Given these results, I can speak to a couple of your points about the results.

v15 vs v16:

  • You mentioned that you thought Average Damage Latency had improved. But this is one of those charts where the results differed so dramatically between running all tests vs running glxtests only, that I'm not convinced those results are useful.
  • I agree that the Avg Batch Delay is pretty bad with moebiusgears, polytopes and xterm. To double-check, I took a look at the same metric in the test repetitions that I ran recently and saw the same dramatic regression.
  • To answer your first point about vpx / vp8 / vp9, since I was trying to run tests the same way as before, to get comparable results for the various test types, I used the same configs as the older tests, and these configs predated your suggestion to run vp8 and vp9 separately. Anyway, this current effort was just to investigate why the h264 results were contradictory. We can do additional tests to investigate the performance of vp8 and vp9 going forward.

comment:13 Changed 4 years ago by Antoine Martin

Stumbled upon this package which could be useful for visualizing datasets: Animator5D.

comment:14 Changed 4 years ago by Antoine Martin

Milestone: 0.14 → 0.18
Owner: changed from Nick Centanni to Smo

I believe smo is going to be running those tests again from now on.

comment:15 Changed 3 years ago by Antoine Martin

Milestone: 0.18 → 1.0

Milestone renamed

comment:16 Changed 3 years ago by Smo

Owner: changed from Smo to Antoine Martin

I am indeed running the tests again on fedora 24.

I've run into an issue running the tests on xpra 1.0 from winswitch-beta:

Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 81, in <module>
    XPRA_VERSION_NO = [int(x) for x in XPRA_VERSION.split(".")]
ValueError: invalid literal for int() with base 10: '0-r14149'

I will attach my data from the tests to this ticket along with full logs.

comment:17 Changed 3 years ago by Antoine Martin

Owner: changed from Antoine Martin to Smo

Untested: I believe r14150 fixes this.
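Presumably the fix strips the revision suffix before the integer conversion - the context lines in the patch attached later in comment:34 show exactly that. A minimal sketch of the idea:

  # "xpra --version" can report something like "1.0-r14149";
  # drop the "-rNNNN" part before parsing the numeric components
  XPRA_VERSION = "1.0-r14149".split("-")[0]                      # -> "1.0"
  XPRA_VERSION_NO = [int(x) for x in XPRA_VERSION.split(".")]    # -> [1, 0]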

Changed 3 years ago by Smo

Attachment: 0.16.3-perflogs.tar.bz2 added

0.16.3 performance logs from fedora 24

Changed 3 years ago by Smo

Attachment: 0.17.5-perflogs.tar.bz2 added

0.17.5 performance logs from fedora 24

comment:18 Changed 3 years ago by Smo

The fix works; I am running the tests on the 1.0 beta from the winswitch-beta repo now and will post the results when they are done.

comment:19 Changed 3 years ago by Smo

I see this traceback running the tests with the 1.0 beta client/server:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/client/gtk_base/gtk_tray_menu_base.py", line 404, in bell_toggled
DPI set to 96 x 96
    self.client.send_bell_enabled()
  File "/usr/lib64/python2.7/site-packages/xpra/client/ui_client_base.py", line 2626, in send_bell_enabled
    assert self.client_supports_bell, "cannot toggle bell: the feature is disabled by the client"
AssertionError: cannot toggle bell: the feature is disabled by the client

Changed 3 years ago by Smo

Attachment: xpra-1.0-testrun-fail.txt added

comment:20 Changed 3 years ago by Smo

I have attached a log of the issue I'm currently having running the tests for 1.0 beta.

It seems to me that the server isn't shutting down properly after the first test run.

Here is a list of the running processes at the end of that log.

[xpra@xpratb01 ~]$ ps x -www
  PID TTY      STAT   TIME COMMAND
 1250 ?        Ss     0:00 /usr/lib/systemd/systemd --user
 1252 ?        S      0:00 (sd-pam)
 1257 ?        S      0:00 sshd: xpra@pts/1
 1258 pts/1    Ss     0:00 -bash
 1279 pts/1    S+     0:00 /bin/bash /home/xpra/bin/runtest
 1286 pts/1    S+     0:00 tee /home/xpra/data/new-1.0-1.log
 1307 pts/1    Sl+    1:48 /usr/bin/python2 /usr/bin/xpra --bind-tcp=0.0.0.0:10000 --password-file=./test-password.txt --auth=file --tcp-auth=file start :10 --daemon=no --pulseaudio=no --notifications=no
 1308 pts/1    Sl+    0:00 /usr/bin/pkttyagent --notify-fd 5 --fallback
 1315 ?        Ssl    0:45 /usr/libexec/Xorg -noreset -nolisten tcp +extension GLX +extension RANDR +extension RENDER -auth /home/xpra/.Xauthority -logfile /run/user/1001/xpra/Xorg.:10.log -configdir /home/xpra/.xpra/xorg.conf.d -config /etc/xpra/xorg.conf :10
 1326 ?        Ssl    0:00 /usr/bin/dbus-daemon --syslog-only --fork --print-pid 5 --print-address 7 --session
 1361 ?        Sl     0:00 /usr/bin/pulseaudio --start --log-target=syslog
 1700 ?        S      0:00 sshd: xpra@pts/2
 1701 pts/2    Ss     0:00 -bash
 1735 pts/2    R+     0:00 ps x -www

comment:21 Changed 3 years ago by Antoine Martin

I'm not sure how to reproduce the bug in comment:19; r14235 fixes the symptoms only.
As for the server not shutting down properly, maybe we want to run with "systemd-run" turned off for the tests? Does that help?

comment:22 Changed 3 years ago by Smo

Changing systemd-run = no in /etc/xpra/conf.d/60_server.conf allowed me to continue with the tests on fedora 24 with the 1.0 beta.

I will post the data here shortly, then I will take a look at the logs.

Changed 3 years ago by Smo

Attachment: 1.0-perflogs.tar.bz2 added

comment:23 Changed 3 years ago by Smo

Lots of errors from vp8 in these logs:

Error during encoding:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 2185, in encode_loop
    fn_and_args[1](*fn_and_args[2:])
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1448, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1794, in make_data_packet
    ret = encoder(coding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1467, in video_encode
    return self.do_video_encode(encoding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1568, in do_video_encode
    ret = ve.compress_image(csc_image, quality, speed, options)
  File "xpra/codecs/vpx/encoder.pyx", line 584, in xpra.codecs.vpx.encoder.Encoder.compress_image (xpra/codecs/vpx/encoder.c:8327)
AssertionError
Last edited 3 years ago by Smo (previous) (diff)

comment:24 Changed 3 years ago by Smo

Saw this error as well when the server was shutting down:

got signal SIGTERM, exiting
client has requested disconnection: exit on signal SIGTERM
Disconnecting client 127.0.0.1:54674:
 client request
Warning: no matching argb function: cannot convert BGRX to one of: []
Warning: cannot convert BGRX to one of: []
error processing encode queue at index 0
item=(504, 504, 1477598147.752195, 1477598147.792868, XShmImageWrapper(BGRX: 1, 1, 504, 504), 'rgb24', 6951, {}, 0)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 823, in encode_from_queue
    self.make_data_packet_cb(*item)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1448, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1794, in make_data_packet
    ret = encoder(coding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1884, in rgb_encode
    self.rgb_zlib, self.rgb_lz4, self.rgb_lzo)
  File "/usr/lib64/python2.7/site-packages/xpra/server/picture_encode.py", line 87, in rgb_encode
    raise Exception("cannot find compatible rgb format to use for %s! (supported: %s)" % (pixel_format, rgb_formats))
Exception: cannot find compatible rgb format to use for BGRX! (supported: [])
xpra client 1 disconnected.
Version 0, edited 3 years ago by Smo (next)

comment:25 Changed 3 years ago by Antoine Martin

The vp8 one is real, saw it before so I added a better error message in r14239.

The other one looks harmless, is fixed in r14305, but inspecting the code I found more issues: see ticket:835#comment:34

Last edited 3 years ago by Antoine Martin (previous) (diff)

Changed 3 years ago by Smo

Changed 3 years ago by Smo

comment:26 Changed 3 years ago by Smo

In the latest 1.0 tests I see these tracebacks:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 2185, in encode_loop
    fn_and_args[1](*fn_and_args[2:])
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1440, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1786, in make_data_packet
    ret = encoder(coding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1512, in video_encode
    return self.do_video_encode(encoding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1614, in do_video_encode
    ret = ve.compress_image(csc_image, quality, speed, options)
  File "xpra/codecs/vpx/encoder.pyx", line 584, in xpra.codecs.vpx.encoder.Encoder.compress_image (xpra/codecs/vpx/encoder.c:9075)
AssertionError: invalid image width 1240, expected 826
Last edited 3 years ago by Smo (previous) (diff)

comment:27 Changed 3 years ago by Antoine Martin

There's a very good chance that the "invalid image width" is fixed by r14365.

Changed 3 years ago by Smo

Changed 3 years ago by Smo

comment:28 Changed 3 years ago by Smo

In the attached 1.0-r14384 logs, the vpx issue seems to be gone.

I did notice some errors like these:

/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py:803: Warning: Source ID 452 was not found when attempting to remove it
  self.source_remove(eqt)
/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py:803: Warning: Source ID 484 was not found when attempting to remove it
  self.source_remove(eqt)

Also, the mplayer test seems to be failing; I will have to fix this or disable it.

comment:29 Changed 3 years ago by Smo

Looking over the 0.17.6-2 logs, I notice there is sometimes this issue when shutting down:

2016-11-01 18:04:16,489 New tcp connection received from 127.0.0.1:54160
2016-11-01 18:04:26,114 Authentication required by password file authenticator module
2016-11-01 18:04:26,114  sending challenge for 'xpra' using hmac digest
2016-11-01 18:04:28,337 connection timedout: Protocol(tcp socket: 127.0.0.1:10000 <- 127.0.0.1:54160)

got signal SIGTERM, exiting
Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 510, in with_server
    data.update(measure_client(server_pid, name, client_cmd, get_stats_cb))
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 361, in measure_client
    initial_stats = get_stats_cb()
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 588, in xpra_get_stats
    out = getoutput(info_cmd)
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 56, in getoutput
    raise Exception("command '%s' returned error code %s, out=%s, err=%s" % (cmd, code, out, err))
Exception: command '['/usr/bin/xpra', 'info', 'tcp:127.0.0.1:10000', '--password-file=./test-password.txt', '--auth=file', '--tcp-auth=file']' returned error code 7, out=server failure: disconnected before the session could be established
server requested disconnect: login timeout
, err=None
2016-11-01 18:04:29,497 Connection lost
2016-11-01 18:04:29,502 xpra client disconnected.
2016-11-01 18:04:29,504 processed info request from None in 8ms

2016-11-01 18:04:30,472 got signal SIGTERM, exiting
2016-11-01 18:04:30,475 closing tcp socket 0.0.0.0:10000
2016-11-01 18:04:30,475 removing socket /home/xpra/.xpra/xpratb01-10
2016-11-01 18:04:30,477 killing xvfb with pid 21711
(II) Server terminated successfully (0). Closing log file.

Probably caused by the server closing while the client is trying to request info.

comment:30 Changed 3 years ago by Smo

This test fails as well because the test script has changed location:

Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 479, in with_server
    assert code is None, "test command %s failed to start: exit code is %s" % (cmd, code)
AssertionError: test command ['/usr/bin/vglrun', '--', '/usr/bin/xterm', '-geometry', '160x60', '-e', 'PYTHONPATH=`pwd` ./tests/xpra/simulate_console_user.py'] failed to start: exit code is 0

I'll make the necessary changes to my config but this should be reflected in the default one.

comment:31 Changed 3 years ago by Smo

Example of why mplayer is failing for the audio test:

Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 479, in with_server
    assert code is None, "test command %s failed to start: exit code is %s" % (cmd, code)
AssertionError: test command /usr/bin/vglrun -- while true; do /usr/bin/mplayer ./test.mp3; done failed to start: exit code is 1

When we run this command by hand we get a weird error

[xpra@xpratb01 ~]$ /usr/bin/vglrun -- while true; do /usr/bin/mplayer ./test.mp3; done
-bash: syntax error near unexpected token `do'

Suggestion: just use mplayer -loop 0 ./test.mp3 instead of the while loop and this should be good. I'll make the changes in my test config, but the defaults should be updated too.

Changed 3 years ago by Smo

Changed 3 years ago by Smo

Changed 3 years ago by Smo

Changed 3 years ago by Smo

Changed 3 years ago by Smo

comment:32 Changed 3 years ago by Smo

I sometimes see this traceback in the logs, usually when running glxgears:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_core.py", line 1156, in in_thread
    info = self.get_info(proto, *args)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 1959, in get_info
    dgi = self.do_get_info(proto, sources, wids)
  File "/usr/lib64/python2.7/site-packages/xpra/x11/server.py", line 274, in do_get_info
    info = X11ServerBase.do_get_info(self, proto, server_sources, window_ids)
  File "/usr/lib64/python2.7/site-packages/xpra/x11/x11_server_base.py", line 249, in do_get_info
    info = GTKServerBase.do_get_info(self, proto, server_sources, window_ids)
  File "/usr/lib64/python2.7/site-packages/xpra/server/gtk_server_base.py", line 117, in do_get_info
    info = ServerBase.do_get_info(self, proto, *args)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 2095, in do_get_info
    info.update(ss.get_window_info(window_ids))
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 1586, in get_window_info
    winfo[wid] = ws.get_info()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 169, in get_info
    info = WindowSource.get_info(self)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 398, in get_info
    lde = [x for x in self.statistics.last_damage_events if x[0]>=cutoff]
RuntimeError: deque mutated during iteration
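The comprehension iterates the live deque while the damage thread keeps appending to it. A typical mitigation for this kind of race - shown here only as an illustration, not as the actual fix - is to snapshot the deque before filtering it (fragment in the spirit of window_source.py's get_info):

  # copy the deque first, then filter the snapshot instead of the live
  # structure that other threads are still appending to
  snapshot = tuple(self.statistics.last_damage_events)
  lde = [x for x in snapshot if x[0] >= cutoff]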

comment:33 Changed 3 years ago by Antoine Martin

The stacktrace from comment:32 is fixed in r14615.
Have you made the other changes yet?

comment:34 Changed 3 years ago by Smo

Had to make some modifications to the test script; some of it was just for debugging, but it doesn't seem to hurt anything. Now I have a problem with the 2nd stop command: it is not called by its full path, and not using shell=True means the commands need their full paths. This way of killing the Xorg process doesn't work in newer versions anyway. I think stopping the xpra server properly is a better approach and should hopefully shut things down cleanly.

Also needed the password file for newer versions. I will attach my logs from 1.0.2 before I review them, and also see if I can start the tests with 2.0.0 from the beta repo.

Index: xpra/test_measure_perf.py
===================================================================
--- xpra/test_measure_perf.py	(revision 14995)
+++ xpra/test_measure_perf.py	(working copy)
@@ -81,7 +81,7 @@
 XPRA_VERSION = XPRA_VERSION.split("-")[0]
 XPRA_VERSION_NO = [int(x) for x in XPRA_VERSION.split(".")]
 XPRA_SERVER_STOP_COMMANDS = [
-                             [XPRA_BIN, "stop", ":%s" % config.DISPLAY_NO],
+                             [XPRA_BIN, "--password-file=./test-password.txt",  "stop", ":%s" % config.DISPLAY_NO],
                              "ps -ef | grep -i [X]org-for-Xpra-:%s | awk '{print $2}' | xargs kill" % config.DISPLAY_NO
                              ]
 XPRA_INFO_COMMAND = [XPRA_BIN, "info", "tcp:%s:%s" % (config.IP, config.PORT)]
@@ -524,8 +524,11 @@
                     for s in stop_server_commands:
                         print("stopping server with: %s" % (s))
                         try:
-                            stop_process = subprocess.Popen(s, stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
+                            stop_process = subprocess.Popen(s, stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
                             stop_process.wait()
+                            output, error = stop_process.communicate()
+                            if output:
+                                print("output: %s" % output)
                         except Exception as e:
                             print("error: %s" % e)
                     try_to_kill(server_process, 5)

Changed 3 years ago by Smo

comment:35 Changed 3 years ago by Smo

Here is an example of one of the stop commands not working now.

stopping server with: ['/usr/bin/xpra', '--password-file=./test-password.txt', 'stop', ':10']
output: server requested disconnect: server shutdown
xpra at :10 has exited.

stopping server with: ps -ef | grep -i [X]org-for-Xpra-:10 | awk '{print $2}' | xargs kill
error: [Errno 2] No such file or directory

Since there is no shell, we'd have to call all these commands by their full paths.

comment:36 Changed 3 years ago by Antoine Martin

Unix sockets shouldn't require authentication; r15021 tries to make sure of that. If that doesn't help, we could just rip out all the password stuff - we have good unit test coverage for that already.

The reason why we try to kill Xorg is that if one was left behind, this could cause later tests to fail.
Why do we have to turn shell=True off? We can turn the "xpra stop" command into a plain string if that helps - I'd rather keep the kill command in there.

comment:37 Changed 3 years ago by Smo

That fix works for the password issue. We can leave shell=True if we turn the command into a string that works.
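A sketch of what that could look like, keeping shell=True and expressing every stop command as a plain string so both the simple "xpra stop" call and the Xorg-killing pipeline go through the shell (names mirror the patch above; the exact final form is an assumption):

  # fragment in the spirit of tests/xpra/test_measure_perf.py
  XPRA_SERVER_STOP_COMMANDS = [
      "%s stop :%s" % (XPRA_BIN, config.DISPLAY_NO),
      "ps -ef | grep -i [X]org-for-Xpra-:%s | awk '{print $2}' | xargs kill" % config.DISPLAY_NO,
  ]
  for s in XPRA_SERVER_STOP_COMMANDS:
      stop_process = subprocess.Popen(s, stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
      stop_process.wait()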

Changed 3 years ago by Smo

comment:38 Changed 3 years ago by Smo

I've attached a set of logs and data from 2.0-0.20170302r15219-1 on fedora 24.

There were several crashes, and one of the machines froze completely, so there is no CSV for that one.

I will check over the logs for errors. This is the first time I've been able to run through some tests on 2.0.

comment:39 Changed 3 years ago by Smo

Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 475, in with_server
    assert code is None, "test command %s failed to start: exit code is %s" % (cmd, code)
AssertionError: test command ['/usr/bin/vglrun', '--', '/usr/bin/xterm', '-geometry', '160x60', '-e', "'PYTHONPATH=`pwd` ./tests/xpra/test_apps/simulate_console_user.py'"] failed to start: exit code is 0

This test seems to be failing to start. I'll have to try the test manually.

comment:40 Changed 3 years ago by Smo

I've seen at least one instance of this in the logs; not sure what it is, but it might be worth noting:

[xcb] Unknown sequence number while processing queue
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
python2: xcb_io.c:274: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.

comment:41 Changed 3 years ago by Antoine Martin

For the error in comment:40, see #1456

Last edited 3 years ago by Antoine Martin (previous) (diff)

comment:42 Changed 3 years ago by Smo

Great, this is starting to look better. It looks like there is an issue during one of the VLC tests with x264:

2017-03-08 13:29:21,430 item=(0, 791, 1489008561.391492, 1489008561.424384, XImageWrapper(BGRX: 0, 790, 1280, 1), 'rgb24', 4524, {}, 1)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 877, in encode_from_queue
    self.make_data_packet_cb(*item)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1488, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1822, in make_data_packet
    image.set_pixels(xored)
  File "xpra/x11/bindings/ximage.pyx", line 413, in xpra.x11.bindings.ximage.XImageWrapper.set_pixels (xpra/x11/bindings/ximage.c:5045)
AssertionError

I've checked the logs and it appears in more than one of the runs.

Changed 3 years ago by Smo

comment:43 Changed 3 years ago by Antoine Martin

The bug in comment:42 should be fixed in r15234.

It was happening when we use delta encoding (#756) with a non-scroll region as part of the scroll encoding. (#1232)

Changed 3 years ago by Smo

Changed 3 years ago by Smo

Attachment: 103v20-chart.tar.bz2 added

comment:44 Changed 3 years ago by Smo

Attached some data from tests done with 1.0.3 from the stable repo.

Also used test_measure_perf_charts to generate the chart that is attached as well (103v20).

Had to change directories and filenames from using '-' to '_', or the JavaScript fails.
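For reference, a throwaway way to do that renaming (hypothetical - it may well have been done by hand; the top-level directory name "perfdata" is an assumption):

  import os
  # rename leaves before their parent directories (topdown=False) so the
  # paths being renamed still exist when we get to them
  for root, dirs, files in os.walk("perfdata", topdown=False):
      for name in files + dirs:
          if "-" in name:
              os.rename(os.path.join(root, name),
                        os.path.join(root, name.replace("-", "_")))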

comment:45 Changed 3 years ago by Smo

There are a few values that don't seem to be working in those charts, and I'm also missing the network info since I didn't enable the use of ipchains in the performance tests.

comment:46 Changed 3 years ago by Antoine Martin

Noteworthy graphs:

  • the packet counts are missing (probably missing the firewall hooks on this test system)
  • speed and quality aren't recorded properly (maybe the info format has changed?)
  • max batch delay is through the roof (2 to 3 seconds!), same for max damage latency
  • 2.0 is marginally worse with mmap, consistently - not sure why
  • 2.0 encodes vp8 and vp9 faster, not sure why
  • both encoding and decoding "pixels/s" are faster in 2.0 - that's an important metric, assuming the quality isn't visibly worse than before - but "pixels/s sent" and "regions/s" aren't much better than before which is odd
  • batch delay / damage latency is mostly unchanged
  • memscroller seems to be causing huge latency problems with jpeg / png
  • jpeg isn't showing as much of an improvement as I expected (#1423)
  • the client uses 2 more threads! why? (from 6 to 8, more with multi-threaded video codecs)
  • server reports only one thread (this is wrong..)
  • the client uses more memory - not sure why?

We should update the test code to also run in "auto" mode - which is now the default.

Changed 3 years ago by Smo

comment:47 Changed 3 years ago by Smo

I will add another set of logs from 2.0 here soon; I'm just waiting on one machine to complete. The machine locked up on the nexuiz test last time, I think.

I'll also compare those 2 sets of results. They should have the network info as I enabled the firewall rules to track them.

Changed 3 years ago by Smo

Attachment: 2.0r15251_perflogs.tar.bz2 added

comment:48 Changed 2 years ago by Antoine Martin

No data - can I close this and assume that the code is 100% perfect? ;)

comment:49 Changed 2 years ago by Antoine Martin

Resolution: fixed
Status: new → closed

Hopefully maxmylyn can follow up in a different ticket.
