Version 12 (modified by Antoine Martin, 7 years ago) (diff)

Window Refresh

The key feature of xpra is the forwarding of individual window contents to the client.
The speed at which the server compresses and sends screen updates to the client depends on a number of factors. The most important one is the encoding used but, as can be seen below, the client rendering also has an influence.
There are only a few things the system can tune at runtime:

  • the "batch delay": this is the time that the server will wait for screen updates to accumulate before grabbing the next frame
  • the "video quality" (for the x264 video encoder only)
  • the "video speed" (for the x264 video encoder only)


Code Pointers

Since this is an area that receives constant tuning and improvements, the most reliable source of current information is the code itself:

  • server_source.py: this class is instantiated for each client connection. In particular:
    • GlobalPerformanceStatistics: collects various statistics about this client's connection speed, the various work queues, etc. The get_factors method returns a guesstimate of how the batch delay should be adjusted for the given parameters ("target latency" and "pixel count")
    • calculate_delay_thread which runs at most 4 times a second to tune the batch delay, both the global default one (default_batch_config which is a DamageBatchConfig) and the one specific to each window (see below for details)
  • window_source.py: this class is instantiated for each window and for each client, it deals with sending the window's pixels to the client. In particular:
    • DamageBatchConfig: the structure which encapsulates all the configuration and historical values related to a batch delay.
    • WindowPerformanceStatistics: per-window statistics: performance of the encoding/decoding, amount of frames and pixels in flight, etc. Again, the get_factors method returns a guesstimate of how the batch delay should be adjusted for the given parameters ("pixel_count" and current "batch delay")
  • batch_delay_calculator.py: this is where we use the factors obtained by get_factors above to tune the various attributes.


Note also that many of these classes have an add_stats method which is used by xpra info to provide detailed statistics from the command line. This is a good first step for debugging.


Background on damage request processing

Step by step (and oversimplified - so check the code for details) of the critical damage code path:

  • an X11 client application running against the xpra display updates something on one of its windows, or creates a new window, resizes it, etc. It notifies the X11 server that something needs to be redrawn, changed or updated.
  • xpra, working as a compositing window manager, receives a notification that something needs updating/changing.
  • this Damage event is forwarded through various classes (CompositeHelper, WindowModel, ..) and eventually ends up in the server's damage method - often as a Damage event directly, or in other cases we synthesize one to force the client to (re-)draw the window.
  • then for each client connected, we call damage on its ServerSource (see above)
  • then this damage event is forwarded to the WindowSource for the given window (creating one if needed)
  • then we either add this event to the current list of delayed regions (which will expire after "batch delay"), we create a new delayed region, or if things are not congested at all we just process it immediately
  • process_damage_region (called when the timer expires - or directly) will grab the window's pixels and queue a request to call "make a data packet" from it. This is done so that we minimize the amount of work done in the critical UI thread: another thread will pick items off this queue.

Note: delayed regions may get delayed more than originally intended if the client has an excessive packet or pixel backlog.
The damage thread will pick items from this queue, decide what compression to use, compress the pixels and make a data packet, then queue this packet for sending. Note that the queue is in ServerSource and it is shared between all the windows.
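
The batching step described above can be sketched roughly as follows. This is only an illustration and not the xpra code: the WindowBatcher class, its method names and the explicit now_ms / congested arguments are all hypothetical, and a plain deque stands in for the shared queue that the damage thread consumes.

```python
from collections import deque


class WindowBatcher:
    """Hypothetical, simplified stand-in for a per-window damage batcher."""

    def __init__(self, batch_delay_ms, encode_queue):
        self.batch_delay_ms = batch_delay_ms
        self.encode_queue = encode_queue   # shared between all the windows
        self.delayed = None                # (expiry_ms, [regions]) or None

    def damage(self, now_ms, region, congested=True):
        """Record a damaged region (x, y, w, h) at time now_ms."""
        if self.delayed is not None:
            # a batch is already pending: just add the region to it
            self.delayed[1].append(region)
        elif congested:
            # start a new batch which expires after the batch delay
            self.delayed = (now_ms + self.batch_delay_ms, [region])
        else:
            # nothing is backed up: process the region immediately
            self.process_damage_regions([region])

    def tick(self, now_ms):
        """Stand-in for the timer: flush the batch once it has expired."""
        if self.delayed is not None and now_ms >= self.delayed[0]:
            regions = self.delayed[1]
            self.delayed = None
            self.process_damage_regions(regions)

    def process_damage_regions(self, regions):
        # the real code grabs the window's pixels here; this sketch only
        # queues the request so compression can happen in another thread
        self.encode_queue.append(tuple(regions))


queue = deque()
win = WindowBatcher(batch_delay_ms=10, encode_queue=queue)
win.damage(0, (0, 0, 100, 100))
win.damage(4, (50, 50, 20, 20))
win.tick(5)     # too early: nothing is queued yet
win.tick(10)    # batch expired: both regions are queued together
win.damage(20, (0, 0, 10, 10), congested=False)  # processed immediately
```

Passing the time in explicitly keeps the sketch deterministic; the real server relies on timers and on a separate thread to drain the queue.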

Things not covered here:

  • auto-refresh
  • choosing between a full frame or many smaller regions
  • various tunables
  • callbacks recording performance statistics
  • callbacks between ServerSource and WindowSource
  • calculating the backlog
  • error handling
  • cancelling requests
  • client encoding options
  • mmap


The Factors

Again, this is a simplified explanation: the actual values used rely on heuristics, weighted averages, recent versus average values, trends, etc. Please refer to the source for details.

From ServerSource:

  • client latency: the client latency measured during the processing of pixel data (when we get the echo back). We know the lowest latency observed, and we try to keep the latency as low as that
  • client ping latency: as above, but measured from ping packets (client)
  • server ping latency: latency measured when the client pings the server, the client then sends this information as part of ping echo response packets
  • damage data queue: the number of damage frames waiting to be compressed, we want to keep this low - especially as this data is uncompressed at this point, so it can be quite large
  • damage packet queue size: the number of packets waiting to be sent by the network layer - we want to keep this low, but sometimes many small packets can make it look worse than it is.
  • damage packet queue pixels: the number of pixels in the packet queue, the target value depends on the current size of the window
  • mmap area % full: only used with mmap


From WindowSource:

  • damage processing latency: how long frames take from the moment we receive the damage event until the packet is queued for sending. This value increases with the number of pixels that are encoded. We want to keep this low.
  • damage processing ratios: as above, but based on the trend: this measure goes up when we queue more damage requests than we can encode.
  • damage send latency: how long frames take from the moment we receive the damage event until the packet containing it has made it out of the network layer (it may still be in the operating system's buffers though). Again, we want to keep this low. This value increases when the network becomes the bottleneck.
  • damage network delay: the difference between damage processing latency and damage send latency. This should be equal to the network latency, but because the values are running averages, it is not very reliable.
  • network send speed: we keep track of the socket's performance (in bytes per second)
  • client decode speed: how quickly the client is decoding frames
  • no damage events for X ms: when nothing happens for a while, this means that the window is not busy and therefore we ought to be able to lower the delay


Each factor provides:

  • a textual description which can be used for debugging
  • a change factor (float number) ranging from 0.0 (lower the delay) to a small positive number (generally up to 10, but that is not enforced). A value of 1.0 means no change.
  • a weight value: this is to measure how confident this particular factor is about the change it requests.

i.e.:

  • with a weight of 0.0, the factor is irrelevant as it won't be counted
  • with a high enough weight, a factor can overrule others
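
To illustrate how a change factor and its weight interact, here is a minimal sketch which combines factors using a plain weighted average. The combine_factors function and the sample values are hypothetical. The real calculation lives in batch_delay_calculator.py and also takes historical values into account, which this sketch leaves out.

```python
def combine_factors(factors):
    """factors: list of (description, change, weight) tuples.

    Returns the weighted average of the change factors,
    or 1.0 (no change) if no factor carries any weight.
    """
    total_weight = sum(weight for _, _, weight in factors)
    if total_weight <= 0:
        return 1.0
    weighted_sum = sum(change * weight for _, change, weight in factors)
    return weighted_sum / total_weight


# made-up values, for illustration only
factors = [
    ("damage send latency", 0.75, 0.50),
    ("client decode speed", 0.35, 0.25),
    ("no damage events", 1.00, 0.00),          # weight 0.0: not counted
    ("damage processing ratios", 2.00, 0.25),  # asks for a higher delay
]
change = combine_factors(factors)
new_delay = 10.0 * change   # applied to a current delay of 10 ms
```

With these values, the three weighted factors roughly cancel out: the combined change is just under 1.0, so the delay only drops slightly.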


Debugging

"xpra info" provides a good overview of the current values used for batch delay and encoding speed/quality, but very little in terms of the factors used to come up with those values (only some latency values are exposed). A good start is:

xpra info | grep batch_delay
xpra info | grep latency.avg


To dump the change in "batch delay", set:

XPRA_DELAY_DEBUG=1

when starting the server. Then every 30 seconds, or every 1000 messages (whichever comes first), the delay factors will be dumped to the log. This throttling prevents the logging itself from affecting the system and the calculations (it still does, but much more sparsely). The log messages look like this:

update_batch_delay: wid=5, last updated 249.50 ms ago, decay=1.00s, \
    change factor=9.8%, delay min=5, avg=5, max=6, cur=6.7, \
    w. average=6.0, tot wgt=227.2, hist_w=113.6, new delay=7.4

For more details, to get the actual factors used, set:

XPRA_DELAY_DEBUG=2

Then the output should be much more verbose:

Factors (change - weight - description):
  -14      38  damage processing latency: avg=0.013, recent=0.013, target=0.014, aim=0.800, aimed avg factor=0.729, div=1.000, s=<built-in function sqrt>
 +128      15  damage processing ratios 12 - 13 / 5
  -25      50  damage send latency: avg=0.014, recent=0.015, target=0.030, aim=0.800, aimed avg factor=0.557, div=1.000, s=<built-in function sqrt>
  -65       8  damage network delay: avg delay=0.001 recent delay=0.002
  -65      23  client decode speed: avg=32.3, recent=32.3 (MPixels/s)
   +0       0  no damage events for 1.2 ms (highest latency is 100.0)
  -49       8  client latency: avg=0.002, recent=0.002, target=0.006, aim=0.800, aimed avg factor=0.260, div=1.000, s=<built-in function sqrt>
  -40       4  client ping latency: avg=0.003, recent=0.003, target=0.007, aim=0.950, aimed avg factor=0.353, div=1.000, s=<built-in function sqrt>
  -54       4  server ping latency: avg=0.003, recent=0.002, target=0.006, aim=0.950, aimed avg factor=0.211, div=1.000, s=<built-in function sqrt>
 -100       0  damage packet queue size: avg=0.000, recent=0.000, target=1.000, aim=0.250, aimed avg factor=0.000, div=1.000, s=<built-in function sqrt>
 -100       0  damage packet queue pixels: avg=0.000, recent=0.000, target=1.000, aim=0.250, aimed avg factor=0.000, div=90000.000, s=<built-in function sqrt>
  -99      20  damage data queue: avg=0.346, recent=0.033, target=1.000, aim=0.250, aimed avg factor=0.019, div=1.000, s=<built-in function logp>
 -100       0  damage packet queue window pixels: avg=0.000, recent=0.000, target=1.000, aim=0.250, aimed avg factor=0.000, div=90000.000, s=<built-in function sqrt>

Note: the change and weight are shown as percentages in the output (rather than the floating point numbers used in the implementation).
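
When going through a lot of this output, it can help to turn the factor lines back into numbers. The sketch below is a hypothetical helper, not part of xpra. It assumes that the change column is a percentage relative to "no change" (so -25 maps back to a factor of 0.75) and that the weight column is a plain percentage. Check both assumptions against the logging code before relying on the values.

```python
import re

# columns: signed change percentage, weight percentage, description
FACTOR_RE = re.compile(r"^\s*([+-]\d+)\s+(\d+)\s+(.+)$")


def parse_factor_line(line):
    """Parse one factor line, return None if the line does not match."""
    m = FACTOR_RE.match(line)
    if m is None:
        return None
    change_pct, weight_pct, description = m.groups()
    return {
        # assumption: the percentage is relative to 1.0 (no change)
        "change": 1.0 + int(change_pct) / 100.0,
        "weight": int(weight_pct) / 100.0,
        # keep only the factor name, drop the details after the colon
        "description": description.split(":", 1)[0].strip(),
    }


sample = "  -25      50  damage send latency: avg=0.014, recent=0.015, target=0.030"
parsed = parse_factor_line(sample)
```

Lines which do not start with a signed number, like the "Factors (change - weight - description):" header, are skipped by returning None.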


To dump the changes to video encoding speed and quality, set:

XPRA_VIDEO_DEBUG=1

when starting the server. You will then get messages like these when using fixed quality/speed settings:

video encoder using fixed speed: 10
video encoder using fixed quality: 10

Or these when actually using the tuning code:

video encoder quality factors: wid=5, packets_bl=1.00, batch_q=0.31, \
    latency_q=94.44, target=30, new_quality=25
video encoder speed factors: wid=4, low_limit=157684, min_damage_latency=0.03, \
    target_damage_latency=0.20, batch.delay=17.69, dam_lat=0.00, dec_lat=0.05, \
    target=4.00, new_speed=5.00

Please refer to the code (update_video_encoder) for accurate information.

The auto-tuning code:

  • honours the minimum speed and quality settings set via --min-quality and --min-speed
  • tunes the speed so that encoding of one frame takes roughly as long as the batch delay (the time spent accumulating updates for the next frame), so that there is only one frame in the pipeline on average. The aim is to always keep the encoder busy, but without any backlog. (note: this does not necessarily lead to the best framerate, since a busy cpu may cause the batch delay to increase... and in turn lower the speed)
  • client speed: if the client is struggling to decode frames, then we use a higher speed (which means less effort in decompressing the stream) - the target client decoding speed is 8 MPixels/s
  • quality: we try to lower the quality if we find that the client has a backlog of frames to draw/acknowledge (usually a sign of network congestion), or if the measured client latency is higher than normal (also a sign of congestion)
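
The speed tuning rule above can be sketched as a simple feedback step. This is not the actual update_video_encoder logic: the tune_speed function, the fixed step size and the 0 to 100 speed range are assumptions made for the sake of the example.

```python
def tune_speed(current_speed, encode_time_ms, batch_delay_ms,
               min_speed=0, step=5):
    """Nudge the encoder speed so that encoding one frame takes
    roughly as long as the batch delay.

    Assumes that speed is a percentage between 0 and 100.
    """
    if encode_time_ms > batch_delay_ms:
        # the encoder is falling behind: trade compression for speed
        target = current_speed + step
    elif encode_time_ms < batch_delay_ms:
        # the encoder is partly idle: spend more effort on compression
        target = current_speed - step
    else:
        target = current_speed
    # honour the minimum speed setting and the upper bound
    return max(min_speed, min(100, target))


faster = tune_speed(50, encode_time_ms=30, batch_delay_ms=15)
slower = tune_speed(50, encode_time_ms=5, batch_delay_ms=15)
clamped = tune_speed(12, encode_time_ms=5, batch_delay_ms=15, min_speed=10)
```

A fixed step keeps the sketch short. As the debug output above suggests, the real code works from several inputs (batch delay, damage and decode latency) to pick a target speed.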