#797 closed task (fixed)
Profile xpra and capture tests for regression
Reported by: | Nick Centanni | Owned by: | Smo |
---|---|---|---|
Priority: | major | Milestone: | 1.0 |
Component: | tests | Version: | trunk |
Keywords: | Cc: |
Description
We can take a look at other things to educate ourselves: any result (positive or negative) should tell us more about the trade-offs we have decided to make and whether we should re-evaluate them (my bet: there are bound to be some that need another look).
Here are a few worth checking:
- 0.14 vs 0.15: the obvious one first, we should be doing better - but are we?
- x264 tuning: speed and quality impact on bandwidth, framerate, etc
- x264: impact of enabling threading ("XPRA_X264_THREADS=1") - same for vpx and webp - related to the CPU tests below - re-run of the detailed x264 vs vp8 comparison
- rgb lz4 vs png: are the bandwidth savings worth the CPU pain of using png?
- nvenc: kepler vs maxwell 1 vs maxwell2 cards (the new GTX 960 is out too..)
- the new multi-delta feature: what is the optimal number of delta buckets (currently set to 5)
- CPU tests: measure the impact of clock speed on performance (by artificially capping it), same for number of cores
- various environment variables that control aspects of the network layer: XPRA_READ_BUFFER_SIZE, XPRA_PACKET_JOIN_SIZE, XPRA_INLINE_SIZE, XPRA_YIELD
- proxy tuning: PROXY_QUEUE_SIZE
- impact of the merge regions code: "XPRA_MERGE_REGIONS"
- impact of downscaling, by forcing it with XPRA_SCALING, maybe also forcing CSC with XPRA_FORCE_CSC_MODE, etc.
The important thing about running those tests is to run them again and again, after either narrowing down on the outliers or discarding them when we understand why they stick out.
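Since most of these knobs are plain environment variables, the simplest way to run them "again and again" is a small driver that repeats the perf suite under each setting. A minimal sketch, assuming the suite is launched as tests/xpra/test_measure_perf.py and takes the output CSV file name as an argument (both are assumptions about the local setup, not the harness's actual interface):

```python
#!/usr/bin/env python
# Sketch only: sweep one tunable (here XPRA_X264_THREADS) across several
# repetitions so outliers can be spotted between runs.
import os
import subprocess

REPETITIONS = 5  # run each configuration several times, as suggested above

for threads in ("0", "1", "2"):
    for rep in range(1, REPETITIONS + 1):
        env = os.environ.copy()
        env["XPRA_X264_THREADS"] = threads            # the knob under test
        out_csv = "x264_threads%s_rep%d.csv" % (threads, rep)
        # assumed invocation: adjust to however the suite is actually started
        subprocess.check_call(["python", "./tests/xpra/test_measure_perf.py", out_csv], env=env)
```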
Attachments (25)
Change History (75)
comment:1 Changed 6 years ago by
Owner: | changed from Antoine Martin to Nick Centanni |
---|
comment:2 Changed 6 years ago by
Status: | new → assigned |
---|
comment:3 Changed 6 years ago by
Description: | modified (diff) |
---|---|
Milestone: | → 0.14 |
comment:4 Changed 6 years ago by
I've attached the results of a 9-repetition test of v14 and v15. These data files can be extracted anywhere.
I've added a new utility in tests/xpra, called test_measure_perf_charts.py, along with the CSS and JS files needed to draw the charts in the resulting HTML file. These utility files were added in r8639.
Change tests/xpra/test_measure_perf_charts.py so it points to the directory where you placed the extracted data files.
The first version of this chart generator is already configured for this data set (the correct prefix, params, and reps); normally you would need to change those to match the data set you're generating charts for.
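For reference, the pieces that have to match a data set look roughly like this (illustrative only - the actual variable names in test_measure_perf_charts.py may differ):

```python
# Illustrative configuration; adjust these to the data set being charted.
data_dir = "/path/to/extracted/data"   # where the attached CSV files were untarred
prefix = "comparing_v14_v15"           # file name prefix shared by the data set
params = ["14", "15"]                  # the versions being compared
reps = 9                               # repetitions per test in this data set
```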
Changed 6 years ago by
Attachment: | comparing_v14_v15.tar.gz added |
---|
CSV files from a 9-repetition run of test_measure_perf comparing v14 and v15.
comment:5 Changed 6 years ago by
This was done 3 months ago now; it would be worth running again to see if anything has changed.
Changed 6 years ago by
Attachment: | compare_15Feb_15Jun.tar.gz added |
---|
Compares benchmark results of v15 done Feb-2015 with those done June-2015.
comment:6 Changed 6 years ago by
Attached results of the new benchmark run on v15 Jun-2015, compared to the benchmark results taken for v15 Feb-2015.
comment:7 Changed 6 years ago by
Status: | assigned → new |
---|
Thanks nick! (FYI: please run with r9638 from now on: we want vp8 and vp9 tested, and "x264" is not a valid encoding name any more)
Things that stand out (one good news first, then mostly bad news):
- vpx is now very competitive with h264 (#832): we use more threads and more CPU, encode faster at a higher quality setting, and send more pixels
- disappointing pixels/s figures: we seem to have lost a few percent there - and this is one of the most important metrics..
- x264 seems to have regressed with glxgears / glxspheres (batch delay through the roof)
- batch delay is way too high in the latest v0.15.x (is the revision shown anywhere? please show it somewhere) - it looks like less than 4 fps! Worst case (max batch delay) is just 1 fps!! Quite strange seeing that the number of pixels and regions sent is more or less unchanged.
- mmap has also regressed
- system cpu usage is higher with mmap and vpx, not sure why that is (I guess the X11 server works harder as we request more frames with vpx, but mmap should be down not up?)
- odd zero quality with glxspheres and rgb24 / png - quality should always be at 100% for those encodings - is the data correct? missing? was there a crash?
- odd min-quality values with mmap: the quality is always 100% with mmap / rgb24!
- please show the client rss
Executive summary: we have serious quality control issues. We should have tackled those regressions during the 0.14 or 0.15 development cycles, not halfway through the 0.16 release cycle.
This data should be collected daily or weekly so we can identify regressions. And now this will need to be done anyway to identify the source of the regressions, more than doubling the amount of work required to fix them.
comment:8 Changed 6 years ago by
I've attached here all the CSV and log output from both the 15 (Feb) and 15_2 (June) tests. I checked the logs for crashes before preparing the data. There was a crash in 15_2 sequence 8, so I reran that sequence.
There were login timeouts in all of the Feb x11perf tests. But that test was not included in the charts.
The revision of the June v15 that was used for this test was: r9612.
Changed 6 years ago by
Attachment: | compare15.tar.gz added |
---|
CSV and log files from the 15Feb_15Jun tests.
comment:9 Changed 6 years ago by
Did you ever figure out where the variation came from?
Once the data is reliable, it would be worth running again.
comment:10 Changed 6 years ago by
In addition to the first two comparisons r8585 vs r9612 (all tests) and r8585 vs r9612 (h264 glx tests only), I ran each of these 4 cases an additional 3 times to compare the deviation between tests. This can be seen in the attached archive.
Untar the archive and view index.html, which contains explanations of each chart in the set.
Changed 6 years ago by
Attachment: | h264_glx.tar.gz added |
---|
Collection of several charts visualizing 16 repetitions of the benchmark tests.
comment:11 Changed 6 years ago by
@nick: can you give us a quick "executive summary" of what you did and why, and your initial conclusions?
I've had a look and noticed:
- vpx will be skewing the results because it is faster now; as per comment:7, we want vp8 and vp9 tested separately going forward
- damage latency and batch delay: although the average damage latency is improved (which is great), the max latency and max batch delays are worse, and so is the avg batch delay, in particular:
- with vpx and glxgears - 100ms to 200ms should be our target for the "max latency" (things are allowed to get a bit high at times, as long as the average stays bearable)
- with moebiusgears, polytopes and xterm tests: looks like a clear regression: the average batch delay is way too high, which explains why we can keep the damage latency low (not processing enough data, so we can do it faster)
Am I reading this right?
comment:12 Changed 6 years ago by
@antoine: since we were getting contradictory results between the comparison of v15 and v16 done in June -- when all tests were run -- and the results we got between the same revisions in July -- when only h264 glx tests were run -- I wanted to do some sanity tests just to verify that the test results are reliable.
Specifically:
- I wanted to run the same tests, the same way, for the same revisions, multiple times to try to reproduce the results – and to see if the contradictory results fell within the margin of error, or could be determined to be some kind of fluke.
- I wanted to determine if there really was a significant difference in the results when all tests were run in the suite, as opposed to running just the h264 / glx tests.
- I wanted to produce a combined chart (which can be found at the end of index.html) that summarized all these tests.
- I also did an audit of the chart generator results (since I've made changes to it) to rule out any problem with the charting tool itself. (There were no problems, so the charts are reporting the correct values.)
About the graphs:
The graphs to look at are the last two (on index.html). They combine all repetitions of the tests. I couldn't fit legends on the graphs (let alone both apps), so here's a description:
- The first 8 bars in each chart represent r8585, and the second 8 represent r9612.
- Within each revision, the first 4 represent the results from tests where all apps, for all encodings, were run. And the second 4 represent the results from tests where only h264 / glxtests were run.
- Within each set of 4, the red bar represents the results from the earlier contradictory tests, and the remaining 3 are new repetitions of those tests.
- You can hover over bars to see which of the above results each bar represents.
Observations:
- The results were reproducible and consistent, with the exception of a few metrics where the red bar (older tests) varied quite a bit from the newer tests, particularly from the first r9612 tests: Application Bytes in/s, Application Bytes out/s, Application packets in/s, Application packets out/s, Avg Damage Latency, Encoding Pixels/s, Max Damage Latency, and Min Damage Latency. These wider variations were similar for both glxgears and glxspheres. So were those flukes? Was there a problem on the test machine that day? I'm not sure, but since the later runs of those same tests produced significantly different results for those metrics, I'm rather suspicious of those readings.
- One interesting point was that Server cpu pct was 5 times higher when r9612 was first tested, compared with all the other r9612 tests. Could this have indicated that some process was running on the box during that test that wasn't running during the remainder of the tests? This might explain the significant variance of those particular results. I always start a test run from a freshly booted machine, so I don't know what could have been going on on that machine, that day. The Client user cpu pct didn't show the same dramatic variation. Maybe I'm just misunderstanding what's being measured there.
- In almost all cases, running all the tests in one run, vs running just the h264 glxtests was not a factor. This can be seen on the charts by looking at how consistent the first two sets of 4 bars are with each other, and how consistent the second two sets of bars are. The metrics which did seem to vary significantly between these two testing methods were: Avg Damage Latency (in r9612), Encoding Pixels (in r9612), Min Damage Latency (in both revisions, and rather dramatically), and Decoding Pixels (r9612, glxspheres only).
- For those charts where there were no dramatic anomalies in the result consistency, each chart can be used to compare r8585 and r9612 by comparing the average values of the first 8 bars with the average values of the second 8 bars. Given these results, I can speak to a couple of your points about the results.
v15 vs v16:
- You mentioned that you thought Average Damage Latency had improved. But this is one of those charts where the results differed so dramatically between running all tests vs running glxtests only, that I'm not convinced those results are useful.
- I agree that the Avg Batch Delay is pretty bad with moebiusgears, polytopes and xterm. To double-check, I took a look at the same metric in the test repetitions that I ran recently and saw the same dramatic regression.
- To answer your first point about vpx / vp8 / vp9, since I was trying to run tests the same way as before, to get comparable results for the various test types, I used the same configs as the older tests, and these configs predated your suggestion to run vp8 and vp9 separately. Anyway, this current effort was just to investigate why the h264 results were contradictory. We can do additional tests to investigate the performance of vp8 and vp9 going forward.
comment:13 Changed 5 years ago by
Stumbled upon this package which could be useful for visualizing datasets: Animator5D.
comment:14 Changed 5 years ago by
Milestone: | 0.14 → 0.18 |
---|---|
Owner: | changed from Nick Centanni to Smo |
I believe smo is going to be running those tests again from now on.
comment:16 Changed 4 years ago by
Owner: | changed from Smo to Antoine Martin |
---|
I am indeed running the tests again on fedora 24.
I've run into an issue running the tests on xpra 1.0 from winswitch-beta
```
Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 81, in <module>
    XPRA_VERSION_NO = [int(x) for x in XPRA_VERSION.split(".")]
ValueError: invalid literal for int() with base 10: '0-r14149'
```
I will attach my data from the tests to this ticket along with full logs.
comment:17 Changed 4 years ago by
Owner: | changed from Antoine Martin to Smo |
---|
Untested: I believe r14150 fixes this.
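The fix is presumably along these lines: strip the "-rNNNN" revision suffix before parsing the numeric components (this is the shape of the code shown in the later diff in comment:34, not a quote of r14150 itself):

```python
# "1.0-r14149" -> "1.0", so the per-component int() conversion no longer fails
XPRA_VERSION = XPRA_VERSION.split("-")[0]
XPRA_VERSION_NO = [int(x) for x in XPRA_VERSION.split(".")]
```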
Changed 4 years ago by
Attachment: | 0.16.3-perflogs.tar.bz2 added |
---|
0.16.3 performance logs from fedora 24
Changed 4 years ago by
Attachment: | 0.17.5-perflogs.tar.bz2 added |
---|
0.17.5 performance logs from fedora 24
comment:18 Changed 4 years ago by
The fix works; running the tests on 1.0 beta from the winswitch-beta repo now, and I will post the results when they are done.
comment:19 Changed 4 years ago by
I see this traceback running the tests with the 1.0 beta client/server
```
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/client/gtk_base/gtk_tray_menu_base.py", line 404, in bell_toggled
DPI set to 96 x 96
    self.client.send_bell_enabled()
  File "/usr/lib64/python2.7/site-packages/xpra/client/ui_client_base.py", line 2626, in send_bell_enabled
    assert self.client_supports_bell, "cannot toggle bell: the feature is disabled by the client"
AssertionError: cannot toggle bell: the feature is disabled by the client
```
Changed 4 years ago by
Attachment: | xpra-1.0-testrun-fail.txt added |
---|
comment:20 Changed 4 years ago by
I have attached a log of the issue I'm currently having running the tests for 1.0 beta.
It seems to me that the server isn't shutting down properly after the first test run.
Here is a list of the running processes at the end of that log.
```
[xpra@xpratb01 ~]$ ps x -www
  PID TTY      STAT   TIME COMMAND
 1250 ?        Ss     0:00 /usr/lib/systemd/systemd --user
 1252 ?        S      0:00 (sd-pam)
 1257 ?        S      0:00 sshd: xpra@pts/1
 1258 pts/1    Ss     0:00 -bash
 1279 pts/1    S+     0:00 /bin/bash /home/xpra/bin/runtest
 1286 pts/1    S+     0:00 tee /home/xpra/data/new-1.0-1.log
 1307 pts/1    Sl+    1:48 /usr/bin/python2 /usr/bin/xpra --bind-tcp=0.0.0.0:10000 --password-file=./test-password.txt --auth=file --tcp-auth=file start :10 --daemon=no --pulseaudio=no --notifications=no
 1308 pts/1    Sl+    0:00 /usr/bin/pkttyagent --notify-fd 5 --fallback
 1315 ?        Ssl    0:45 /usr/libexec/Xorg -noreset -nolisten tcp +extension GLX +extension RANDR +extension RENDER -auth /home/xpra/.Xauthority -logfile /run/user/1001/xpra/Xorg.:10.log -configdir /home/xpra/.xpra/xorg.conf.d -config /etc/xpra/xorg.conf :10
 1326 ?        Ssl    0:00 /usr/bin/dbus-daemon --syslog-only --fork --print-pid 5 --print-address 7 --session
 1361 ?        Sl     0:00 /usr/bin/pulseaudio --start --log-target=syslog
 1700 ?        S      0:00 sshd: xpra@pts/2
 1701 pts/2    Ss     0:00 -bash
 1735 pts/2    R+     0:00 ps x -www
```
comment:21 Changed 4 years ago by
I'm not sure how to reproduce the bug in comment:19; r14235 fixes the symptoms only.
As for the server not shutting down properly, maybe we want to run with "systemd-run" turned off for the tests? Does that help?
comment:22 Changed 4 years ago by
Changing "systemd-run = no" in /etc/xpra/conf.d/60_server.conf allowed me to continue with the tests on fedora 24 with 1.0 beta.
I will post the data here shortly, then I will take a look at the logs.
Changed 4 years ago by
Attachment: | 1.0-perflogs.tar.bz2 added |
---|
comment:23 Changed 4 years ago by
Lots of errors from vp8 in these logs:
```
Error during encoding:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 2185, in encode_loop
    fn_and_args[1](*fn_and_args[2:])
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1448, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1794, in make_data_packet
    ret = encoder(coding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1467, in video_encode
    return self.do_video_encode(encoding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1568, in do_video_encode
    ret = ve.compress_image(csc_image, quality, speed, options)
  File "xpra/codecs/vpx/encoder.pyx", line 584, in xpra.codecs.vpx.encoder.Encoder.compress_image (xpra/codecs/vpx/encoder.c:8327)
AssertionError
```
comment:24 Changed 4 years ago by
Saw this error as well when the server was shutting down:
```
got signal SIGTERM, exiting
client has requested disconnection: exit on signal SIGTERM
Disconnecting client 127.0.0.1:54674: client request
Warning: no matching argb function: cannot convert BGRX to one of: []
Warning: cannot convert BGRX to one of: []
error processing encode queue at index 0
item=(504, 504, 1477598147.752195, 1477598147.792868, XShmImageWrapper(BGRX: 1, 1, 504, 504), 'rgb24', 6951, {}, 0)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 823, in encode_from_queue
    self.make_data_packet_cb(*item)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1448, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1794, in make_data_packet
    ret = encoder(coding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1884, in rgb_encode
    self.rgb_zlib, self.rgb_lz4, self.rgb_lzo)
  File "/usr/lib64/python2.7/site-packages/xpra/server/picture_encode.py", line 87, in rgb_encode
    raise Exception("cannot find compatible rgb format to use for %s! (supported: %s)" % (pixel_format, rgb_formats))
Exception: cannot find compatible rgb format to use for BGRX! (supported: [])
xpra client 1 disconnected.
```
comment:25 Changed 4 years ago by
The vp8 one is real; I saw it before, so I added a better error message in r14239.
The other one looks harmless and is fixed in r14305, but inspecting the code I found more issues: see ticket:835#comment:34
Changed 4 years ago by
Attachment: | 1.0-20161028r14313-1-perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 1.0-20161029r14329-1-perflogs.tar.bz2 added |
---|
comment:26 Changed 4 years ago by
In the latest 1.0 tests I see these tracebacks
```
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 2185, in encode_loop
    fn_and_args[1](*fn_and_args[2:])
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1440, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1786, in make_data_packet
    ret = encoder(coding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1512, in video_encode
    return self.do_video_encode(encoding, image, options)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1614, in do_video_encode
    ret = ve.compress_image(csc_image, quality, speed, options)
  File "xpra/codecs/vpx/encoder.pyx", line 584, in xpra.codecs.vpx.encoder.Encoder.compress_image (xpra/codecs/vpx/encoder.c:9075)
AssertionError: invalid image width 1240, expected 826
```
comment:27 Changed 4 years ago by
There's a very good chance that the "invalid image width" is fixed by r14365.
Changed 4 years ago by
Attachment: | 0.17.6-2.r14324-perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 1.0-0.20161102r14384-perflogs.tar.bz2 added |
---|
comment:28 Changed 4 years ago by
In the attached 1.0-r14384 logs, the vpx issue seems to be gone.
I did notice some errors like these:
```
/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py:803: Warning: Source ID 452 was not found when attempting to remove it
  self.source_remove(eqt)
/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py:803: Warning: Source ID 484 was not found when attempting to remove it
  self.source_remove(eqt)
```
Also, the mplayer test seems to be failing; I will have to fix this or disable it.
comment:29 Changed 4 years ago by
Looking over 0.17.6-2 logs I notice there is sometimes this issue when shutting down.
```
2016-11-01 18:04:16,489 New tcp connection received from 127.0.0.1:54160
2016-11-01 18:04:26,114 Authentication required by password file authenticator module
2016-11-01 18:04:26,114 sending challenge for 'xpra' using hmac digest
2016-11-01 18:04:28,337 connection timedout: Protocol(tcp socket: 127.0.0.1:10000 <- 127.0.0.1:54160)
got signal SIGTERM, exiting
Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 510, in with_server
    data.update(measure_client(server_pid, name, client_cmd, get_stats_cb))
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 361, in measure_client
    initial_stats = get_stats_cb()
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 588, in xpra_get_stats
    out = getoutput(info_cmd)
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 56, in getoutput
    raise Exception("command '%s' returned error code %s, out=%s, err=%s" % (cmd, code, out, err))
Exception: command '['/usr/bin/xpra', 'info', 'tcp:127.0.0.1:10000', '--password-file=./test-password.txt', '--auth=file', '--tcp-auth=file']' returned error code 7, out=server failure: disconnected before the session could be established
server requested disconnect: login timeout
, err=None
2016-11-01 18:04:29,497 Connection lost
2016-11-01 18:04:29,502 xpra client disconnected.
2016-11-01 18:04:29,504 processed info request from None in 8ms
2016-11-01 18:04:30,472 got signal SIGTERM, exiting
2016-11-01 18:04:30,475 closing tcp socket 0.0.0.0:10000
2016-11-01 18:04:30,475 removing socket /home/xpra/.xpra/xpratb01-10
2016-11-01 18:04:30,477 killing xvfb with pid 21711
(II) Server terminated successfully (0). Closing log file.
```
Probably caused by the server closing while the client is trying to request info.
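If that is the case, one possible mitigation on the test-harness side would be to tolerate a failed stats poll during shutdown rather than aborting the run. A sketch only, reusing the get_stats_cb callback named in the traceback above (not a tested patch):

```python
def tolerant_get_stats(get_stats_cb):
    # Swallow a failure of the final "xpra info" poll when the server is
    # already shutting down, instead of failing the whole test run.
    try:
        return get_stats_cb()
    except Exception as e:
        print("ignoring stats poll failure (server shutting down?): %s" % e)
        return {}
```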
comment:30 Changed 4 years ago by
This test fails as well because the script it runs has changed locations:
```
Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 479, in with_server
    assert code is None, "test command %s failed to start: exit code is %s" % (cmd, code)
AssertionError: test command ['/usr/bin/vglrun', '--', '/usr/bin/xterm', '-geometry', '160x60', '-e', 'PYTHONPATH=`pwd` ./tests/xpra/simulate_console_user.py'] failed to start: exit code is 0
```
I'll make the necessary changes to my config but this should be reflected in the default one.
comment:31 Changed 4 years ago by
Example of why mplayer is failing for the audio test:
```
Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 479, in with_server
    assert code is None, "test command %s failed to start: exit code is %s" % (cmd, code)
AssertionError: test command /usr/bin/vglrun -- while true; do /usr/bin/mplayer ./test.mp3; done failed to start: exit code is 1
```
When we run this command by hand we get a weird error:
```
[xpra@xpratb01 ~]$ /usr/bin/vglrun -- while true; do /usr/bin/mplayer ./test.mp3; done
-bash: syntax error near unexpected token `do'
```
Suggestion: we just use "mplayer -loop 0 ./test.mp3" instead of the while loop and this should be good. I'll make the changes in my test config, but the defaults should be updated too.
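In the perf test config that change would look something like the sketch below; the variable name is a guess, the point is that letting mplayer loop itself removes the shell-only "while" syntax that vglrun cannot execute:

```python
# before (fails: "while ... do ... done" is shell syntax, not a command vglrun can run)
# SOUND_TEST_COMMAND = "/usr/bin/vglrun -- while true; do /usr/bin/mplayer ./test.mp3; done"

# after: mplayer loops by itself, no shell constructs needed
SOUND_TEST_COMMAND = ["/usr/bin/vglrun", "--", "/usr/bin/mplayer", "-loop", "0", "./test.mp3"]
```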
Changed 4 years ago by
Attachment: | 1.0-0.20161102r14384-run2-perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 1.0-20161104r14397-perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 1.0-20161114r14416-perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 1.0-20161115r14430-perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 1.0-20161115r14430-run2-perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 1.0-20161118r14449-perflogs.tar.bz2 added |
---|
comment:32 Changed 4 years ago by
I sometimes see this traceback when looking at the logs, usually when running glxgears:
```
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_core.py", line 1156, in in_thread
    info = self.get_info(proto, *args)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 1959, in get_info
    dgi = self.do_get_info(proto, sources, wids)
  File "/usr/lib64/python2.7/site-packages/xpra/x11/server.py", line 274, in do_get_info
    info = X11ServerBase.do_get_info(self, proto, server_sources, window_ids)
  File "/usr/lib64/python2.7/site-packages/xpra/x11/x11_server_base.py", line 249, in do_get_info
    info = GTKServerBase.do_get_info(self, proto, server_sources, window_ids)
  File "/usr/lib64/python2.7/site-packages/xpra/server/gtk_server_base.py", line 117, in do_get_info
    info = ServerBase.do_get_info(self, proto, *args)
  File "/usr/lib64/python2.7/site-packages/xpra/server/server_base.py", line 2095, in do_get_info
    info.update(ss.get_window_info(window_ids))
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 1586, in get_window_info
    winfo[wid] = ws.get_info()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 169, in get_info
    info = WindowSource.get_info(self)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 398, in get_info
    lde = [x for x in self.statistics.last_damage_events if x[0]>=cutoff]
RuntimeError: deque mutated during iteration
```
comment:33 Changed 4 years ago by
The stacktrace from comment:32 is fixed in r14615.
Have you made the other changes yet?
comment:34 Changed 4 years ago by
Had to make some modifications to the test script; some of it was just for debugging, but it doesn't seem to hurt anything. Now I have the problem that the 2nd stop command is not called by its full path, and not using shell=True means the commands need their full paths. This way of killing the Xorg process doesn't work in newer versions anyway; I think stopping the xpra server properly is a better way, and it should hopefully shut down cleanly.
Also needed the password file for newer versions. I will attach my logs from 1.0.2 before I review them, and also see if I can start the tests on 2.0.0 from the beta repo.
```
Index: xpra/test_measure_perf.py
===================================================================
--- xpra/test_measure_perf.py	(revision 14995)
+++ xpra/test_measure_perf.py	(working copy)
@@ -81,7 +81,7 @@
 XPRA_VERSION = XPRA_VERSION.split("-")[0]
 XPRA_VERSION_NO = [int(x) for x in XPRA_VERSION.split(".")]
 XPRA_SERVER_STOP_COMMANDS = [
-    [XPRA_BIN, "stop", ":%s" % config.DISPLAY_NO],
+    [XPRA_BIN, "--password-file=./test-password.txt", "stop", ":%s" % config.DISPLAY_NO],
     "ps -ef | grep -i [X]org-for-Xpra-:%s | awk '{print $2}' | xargs kill" % config.DISPLAY_NO
     ]
 XPRA_INFO_COMMAND = [XPRA_BIN, "info", "tcp:%s:%s" % (config.IP, config.PORT)]
@@ -524,8 +524,11 @@
     for s in stop_server_commands:
         print("stopping server with: %s" % (s))
         try:
-            stop_process = subprocess.Popen(s, stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
+            stop_process = subprocess.Popen(s, stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
             stop_process.wait()
+            output, error = stop_process.communicate()
+            if output:
+                print("output: %s" % output)
         except Exception as e:
             print("error: %s" % e)
         try_to_kill(server_process, 5)
```
Changed 4 years ago by
Attachment: | 1.0.2-1.r14944-perflogs.tar.bz2 added |
---|
comment:35 Changed 4 years ago by
Here is an example of one of the stop commands not working now.
```
stopping server with: ['/usr/bin/xpra', '--password-file=./test-password.txt', 'stop', ':10']
output: server requested disconnect: server shutdown
xpra at :10 has exited.
stopping server with: ps -ef | grep -i [X]org-for-Xpra-:10 | awk '{print $2}' | xargs kill
error: [Errno 2] No such file or directory
```
Since there is no shell we'd have to call all these commands by their full path.
comment:36 Changed 4 years ago by
Unix sockets shouldn't require authentication; r15021 tries to make sure of that. If that doesn't help, we could just rip out all the password stuff - we have good unit test coverage for that already.
The reason why we try to kill Xorg is that if one was left behind, this could cause later tests to fail.
Why do we have to turn shell=True off? We can turn the "xpra stop" command into a plain string if that helps - I'd rather keep the kill command in there.
comment:37 Changed 4 years ago by
That fix works for the password issue. We can leave shell=True if we turn the command into a string that works.
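For context, the trade-off the last two comments describe comes down to how subprocess treats the command; an illustrative sketch, not the test script's actual code:

```python
import subprocess

DISPLAY = ":10"

# shell=True: the command is a single string, so shell features such as the
# "ps ... | awk ... | xargs kill" fallback pipeline keep working.
subprocess.Popen("xpra stop %s" % DISPLAY, shell=True).wait()

# Without a shell the command must be a plain argument list; passing a
# pipeline string here fails with "[Errno 2] No such file or directory".
subprocess.Popen(["xpra", "stop", DISPLAY]).wait()
```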
Changed 4 years ago by
Attachment: | 2.0-0.20170302r15219-perflogs.tar.bz2 added |
---|
comment:38 Changed 4 years ago by
I've attached a set of logs and data from 2.0-0.20170302r15219-1 on fedora 24.
There were several crashes, and one of the machines froze completely, so there is no csv for that one.
I will check over the logs for errors. This is the first time I've been able to run through some tests on 2.0.
comment:39 Changed 4 years ago by
```
Traceback (most recent call last):
  File "/home/xpra/tests/xpra/test_measure_perf.py", line 475, in with_server
    assert code is None, "test command %s failed to start: exit code is %s" % (cmd, code)
AssertionError: test command ['/usr/bin/vglrun', '--', '/usr/bin/xterm', '-geometry', '160x60', '-e', "'PYTHONPATH=`pwd` ./tests/xpra/test_apps/simulate_console_user.py'"] failed to start: exit code is 0
```
This test seems to be failing to start. I'll have to try the test manually.
comment:40 Changed 4 years ago by
I've seen at least one instance of this in the logs; not sure what it is, but it might be worth noting:
```
[xcb] Unknown sequence number while processing queue
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
python2: xcb_io.c:274: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.
```
comment:42 Changed 4 years ago by
Great, this is starting to look better. It looks like there is an issue during one of the VLC tests with x264.
```
2017-03-08 13:29:21,430 item=(0, 791, 1489008561.391492, 1489008561.424384, XImageWrapper(BGRX: 0, 790, 1280, 1), 'rgb24', 4524, {}, 1)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 877, in encode_from_queue
    self.make_data_packet_cb(*item)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1488, in make_data_packet_cb
    packet = self.make_data_packet(damage_time, process_damage_time, image, coding, sequence, options, flush)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1822, in make_data_packet
    image.set_pixels(xored)
  File "xpra/x11/bindings/ximage.pyx", line 413, in xpra.x11.bindings.ximage.XImageWrapper.set_pixels (xpra/x11/bindings/ximage.c:5045)
AssertionError
```
I've checked the logs and it appears in more than one of the runs.
Changed 4 years ago by
Attachment: | 2.0-0.20170307r15230-perflogs.tar.bz added |
---|
comment:43 Changed 4 years ago by
The bug in comment:42 should be fixed in r15234.
It was happening when we use delta encoding (#756) with a non-scroll region as part of the scroll encoding (#1232).
Changed 4 years ago by
Attachment: | 1.0.3r15051_perflogs.tar.bz2 added |
---|
Changed 4 years ago by
Attachment: | 103v20-chart.tar.bz2 added |
---|
comment:44 Changed 4 years ago by
Attached some data from tests done with 1.0.3 from the stable repo.
Also used test_measure_perf_charts to generate the chart that is attached as well (103v20).
Had to change directories and filenames from using '-' to '_', or the javascript fails.
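A quick way to do that renaming, assuming the CSV files sit in one flat directory (hypothetical path):

```python
import os

data_dir = "./perfdata"  # assumed location of the extracted CSV files
for name in os.listdir(data_dir):
    if "-" in name:
        os.rename(os.path.join(data_dir, name),
                  os.path.join(data_dir, name.replace("-", "_")))
```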
comment:45 Changed 4 years ago by
There are a few values that don't seem to be working in those charts, and I'm also missing the network info since I didn't enable the use of ipchains in the performance tests.
comment:46 Changed 4 years ago by
Noteworthy graphs:
- the packet counts are missing (probably missing the firewall hooks on this test system)
- speed and quality aren't recorded properly (maybe the info format has changed?)
- max batch delay is through the roof (2 to 3 seconds!), same for max damage latency
- 2.0 is marginally worse with mmap, consistently - not sure why
- 2.0 encodes vp8 and vp9 faster, not sure why
- both encoding and decoding "pixels/s" are faster in 2.0 - that's an important metric, assuming the quality isn't visibly worse than before - but "pixels/s sent" and "regions/s" aren't much better than before, which is odd
- batch delay / damage latency is mostly unchanged
- memscroller seems to be causing huge latency problems with jpeg / png
- jpeg isn't showing as much of an improvement as I expected (#1423)
- the client uses 2 more threads! why? (from 6 to 8, more with multi-threaded video codecs)
- server reports only one thread (this is wrong..)
- the client uses more memory - not sure why?
We should update the test code to also run in "auto" mode - which is now the default.
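If the test config enumerates the encodings to cycle through, that probably just means adding "auto" to the list; the variable name below is a guess, not the script's actual setting:

```python
# Hypothetical config change: benchmark the new default "auto" mode alongside
# the explicitly selected encodings.
TEST_ENCODINGS = ["auto", "h264", "vp8", "vp9", "png", "rgb"]
```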
Changed 4 years ago by
Attachment: | 1.0.4r15266_perflogs.tar.bz2 added |
---|
comment:47 Changed 4 years ago by
I will add another set of logs from 2.0 here soon; I'm just waiting on one machine to complete. The machine locked up on the nexuiz test last time, I think.
I'll also compare those 2 sets of results. They should have the network info as I enabled the firewall rules to track them.
Changed 4 years ago by
Attachment: | 2.0r15251_perflogs.tar.bz2 added |
---|
comment:48 Changed 4 years ago by
No data - can I close this and assume that the code is 100% perfect? ;)
comment:49 Changed 4 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Hopefully maxmylyn can follow up in a different ticket.
comment:50 Changed 3 months ago by
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/797
(just re-formatting with wiki tags)