At the moment, we have a read_buffer which is a string, and we append to it each time we get more data. We read data from the network 8KB at a time, which means that for an 8MB picture (uncompressed
RGBA at 1080p), we end up copying that string buffer about 1000 times... Quick maths tells me we generate
(1000*1001)/2*8K = ~4GB of memory copies for an 8MB picture!
Now, with just lz4 compression, the average frame drops to just a few percent of the original size; at 5%, a 400KB frame is 50 packets, which means:
50*51/2*8K = ~10MB (which is still 25 times more copying than we should be doing!)
With h264, the compression is much more efficient, so the average frame size drops to around 200KB, which is still high enough that memory copies are probably costing us.
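The arithmetic above can be double-checked with a few lines (a quick sketch; copy_cost is just a helper name for this note, not anything in the codebase):

```python
def copy_cost(total_size: int, chunk: int = 8 * 1024) -> int:
    # Each append copies the whole buffer accumulated so far, so n chunks
    # of size c cost c*(1 + 2 + ... + n) = c*n*(n+1)/2 bytes of memcpy.
    n = total_size // chunk
    return chunk * n * (n + 1) // 2

# 8MB uncompressed frame, read 8KB at a time:
print(copy_cost(8 * 1024 * 1024))   # -> 4299161600, i.e. ~4GB copied
# 400KB lz4-compressed frame:
print(copy_cost(400 * 1024))        # -> 10444800, i.e. ~10MB copied
```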
PoC: populates the payload buffer using slicing so we only allocate the memory once
better patch using bytearray instead of ctypes.create_string_buffer - but still too slow..
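For illustration, the general shape such a patch could take is below. This is a sketch, not the actual attachment: read_payload is a hypothetical helper, and the use of recv_into with a memoryview is my own way of showing "allocate once, let the kernel write straight into the buffer":

```python
import socket

def read_payload(sock: socket.socket, size: int) -> bytearray:
    # Allocate the payload buffer once, then fill it in place:
    # no intermediate strings, no repeated re-copying of old data.
    buf = bytearray(size)
    view = memoryview(buf)
    read = 0
    while read < size:
        # recv_into() writes directly into the slice of our buffer
        n = sock.recv_into(view[read:], size - read)
        if n == 0:
            raise ConnectionError("connection closed mid-payload")
        read += n
    return buf
```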
The v2 patch above looks good, but there is one problem left: we don't want to spend any time allocating the buffer; a simple
malloc would do. Unfortunately, this is how long it takes to allocate 1GB in Python:
" "*(1024*1024*1024) takes ~3ms!
So the string wins, but it is immutable... All I want is a bytearray backed by malloc... why is it so hard?
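For reference, the comparison can be reproduced with something like this (a rough sketch; alloc_times is a hypothetical helper, and the absolute numbers depend heavily on the machine and interpreter version):

```python
import timeit

def alloc_times(size: int, repeats: int = 3):
    # Allocating an immutable string of the given size:
    t_str = timeit.timeit(lambda: " " * size, number=repeats) / repeats
    # bytearray(n) also zero-fills its buffer, but may take a slower path:
    t_ba = timeit.timeit(lambda: bytearray(size), number=repeats) / repeats
    return t_str, t_ba

if __name__ == "__main__":
    t_str, t_ba = alloc_times(1024 * 1024 * 1024)   # 1GB, as above
    print("str      : %.1fms" % (t_str * 1000))
    print("bytearray: %.1fms" % (t_ba * 1000))
```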
updated version using a list of strings as temporary buffer
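A minimal sketch of that approach (hypothetical names, not the attached patch): chunks go into a list as they arrive, and the single join only happens once the full payload is there, so the total copying is linear instead of quadratic.

```python
class ReadBuffer:
    # Accumulate network chunks in a list and join them once at the end:
    # O(total) bytes copied instead of O(total^2) with string appends.
    def __init__(self):
        self.chunks = []
        self.size = 0

    def add(self, data: bytes):
        self.chunks.append(data)
        self.size += len(data)

    def take(self) -> bytes:
        payload = b"".join(self.chunks)
        self.chunks = []
        self.size = 0
        return payload
```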
Contrary to what I expected, the performance improvement is marginal at best for the large packet case and the extra
if statements in the new code actually make the more common case (handling smaller packets) a little slower! It does reduce client CPU load though, which may still make this worth having.
See Concatenation Test Code for why that is. Quote:
With byte strings, concatenating with += is as fast as a .join. Since our strings are byte strings from the network layer, using join doesn't really help.
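A quick way to check this claim on your own interpreter (a sketch of the kind of test referred to; whether += can be resized in place rather than re-copied depends on the interpreter version):

```python
import timeit

CHUNK = b"x" * 8192   # one 8KB network read
N = 50                # ~400KB payload, as in the lz4 example above

def concat_inplace() -> bytes:
    buf = b""
    for _ in range(N):
        buf += CHUNK  # may or may not be optimized, depending on the interpreter
    return buf

def concat_join() -> bytes:
    return b"".join(CHUNK for _ in range(N))

if __name__ == "__main__":
    print("+=  :", timeit.timeit(concat_inplace, number=100))
    print("join:", timeit.timeit(concat_join, number=100))
```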
What seems to make more of a difference is the size of the receive buffer: bumping the size to 16KB makes a noticeable improvement in the high bandwidth case. Interestingly, lz4 compression saves a huge amount of bandwidth without really costing much in terms of the number of frames / pixels sent.
As can be seen on the newly added performance charts (rgb-nocompress), disabling compression of RGB pixels can increase bandwidth consumption 100-fold, peaking above 100MB/s. The largest improvement comes from bumping the size of the network receive buffer to 64KB, done in r5276.
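The spirit of that change can be sketched as follows (hypothetical names; the actual r5276 diff may differ):

```python
import socket

# Read up to 64KB per recv() call instead of 8KB: a large frame arrives
# in far fewer chunks, so the reassembly loop runs far fewer times.
READ_BUFFER_SIZE = 64 * 1024

def read_chunk(sock: socket.socket) -> bytes:
    return sock.recv(READ_BUFFER_SIZE)
```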
Finally, the protocol speed charts comparing the old code with small variations of the new code, with and without 64KB receive buffers, show that there isn't much to gain from the new code.
Not applying and closing.
raw CSV data used to generate the graphs
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/502