
Opened 4 weeks ago

Closed 4 weeks ago

Last modified 4 weeks ago

#1749 closed defect (fixed)

crash in xor

Reported by: Mike
Owned by: Mike
Priority: major
Milestone: 2.3
Component: encodings
Version: 2.2.x
Keywords: crash xor 64bit
Cc:

Description

I have two 64-bit Linux installations of Xpra. Both crash in xor after r17513, which changed the xor code from 8-bit to 64-bit.

Specifically, I took 2.2.x (r18057) and it crashes a few seconds after a client attaches.

Thread 3 "xpra" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f14cb52a700 (LWP 19210)]
0x00007f14ddecb625 in __pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str () from /usr/lib64/python2.7/site-packages/xpra/codecs/xor/cyxor.so

When I reverse-patch r17560, r17518 and r17513, it doesn't crash.

When I change the new xor routine from 64-bit to 32-bit it also doesn't crash (see 32bit.patch).

Is it possible that not all buffers are aligned on 8-byte boundaries?

Reading the new xor implementation, I also think this is an error:

for 0 <= i < steps:
    obuf[i] = abuf[i] ^ bbuf[i]

If the buffer is smaller than 8 bytes then 'steps' will be 0, but the 64-bit xor accesses 8 bytes of the buffer (which may, for instance, only hold 2 bytes).

A debug build also doesn't crash!

python -c 'import struct;print( 8 * struct.calcsize("P"))'

prints '64' (i.e., Python is 64-bit).

Attachments (3)

32bit.patch (1.9 KB) - added by Mike 4 weeks ago.
machine1 (7.4 KB) - added by Mike 4 weeks ago.
machine2 (3.6 KB) - added by Mike 4 weeks ago.

Change History (15)

Changed 4 weeks ago by Mike

Attachment: 32bit.patch added

comment:1 Changed 4 weeks ago by Antoine Martin

Owner: changed from Antoine Martin to Mike

Is it possible that not all buffers are aligned on 8-byte boundaries?

It is possible, though unlikely. In any case, all modern CPUs can handle unaligned accesses.

If the buffer is smaller than 8 bytes then 'steps' will be 0

Are you sure that this loop runs at all when steps=0?
I thought that this Cython for-loop syntax was equivalent to:

for i in range(steps):

Which definitely does not run if steps=0.
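
A quick plain-Python check of the range semantics (just to illustrate the claim; this is not the Cython code itself):

steps = 0
ran = False
for i in range(steps):
    ran = True
print(ran)   # False: the loop body is never entered when steps == 0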

Does r18096 fix things for you?

Last edited 4 weeks ago by Antoine Martin

comment:2 Changed 4 weeks ago by Mike

No, r18096 doesn't change it. I had already tested that before opening the ticket.

Only reverting to the old implementation, switching to 32-bit xor, or using a debug build avoids the crash.

comment:3 Changed 4 weeks ago by Antoine Martin

Only reverting to the old implementation, switching to 32-bit xor, or using a debug build avoids the crash.

What sort of CPU are you using?

cat /proc/cpuinfo

Changed 4 weeks ago by Mike

Attachment: machine1 added

Changed 4 weeks ago by Mike

Attachment: machine2 added

comment:4 Changed 4 weeks ago by Mike

Added machine1 and machine2 with the output of

cat /proc/cpuinfo

from both machines.

comment:5 Changed 4 weeks ago by Antoine Martin

The code definitely looks correct and I don't have time to figure out why this crashes on some CPUs. You may want to run it in gdb to see what the real problem is.
In the meantime, does r18097 fix things?
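
For the gdb part, something along these lines should capture a useful backtrace once it crashes (standard gdb usage, nothing xpra-specific):

gdb -p $(pidof xpra)        # attach to the crashing xpra process
(gdb) continue              # let it run until the SIGSEGV is raised
(gdb) thread apply all bt   # backtraces for all threads
(gdb) disassemble           # instructions around the faulting address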

comment:6 Changed 4 weeks ago by Mike

r18097 doesn't compile (it is missing the uint64_t used for the out buffer).

comment:7 Changed 4 weeks ago by Antoine Martin

Oops, sorry. Compile tested this time: r18098.

comment:8 Changed 4 weeks ago by Mike

Now it compiles and runs without crashing.

comment:9 Changed 4 weeks ago by Antoine Martin

Milestone: 2.3
Resolution: fixed
Status: new → closed

Backported to v2.2.x in r18099.

I'd love to know why those CPUs struggle with a simple 64-bit XOR but I don't have the time to look into it. Feel free to post the GDB backtrace and more details though.

comment:10 Changed 4 weeks ago by Mike

I had a look at the assembly level.
The compiler (gcc) generates code which uses the SSE instruction pxor for the 128-bit xor.
pxor needs data aligned on 16-byte boundaries; there is no "unaligned" version of this instruction. Loading the data from memory and storing it back is done via 16-byte transfers, but using the "unaligned" versions of the load/store instructions, so no crash happens there.

When casting to uint64_t* the compiler seems to assume at least 8-byte alignment, which would make sense. In that case there is therefore no pre-check of the alignment, and the pxor instruction raises an exception when the data is unaligned.

Now the good news: when casting to uint32_t* the compiler also uses the 128-bit xor, but now assumes only 4-byte alignment of the buffers. So it generates code that handles data not aligned on 16-byte boundaries, ensuring the 128-bit pxor still gets data on a 16-byte boundary.
In the end there is no performance difference between the uint32_t* and the uint64_t* cast, and the current solution is absolutely fine.

I guess that the buffers allocated by python/cython will still be 4-byte aligned, but that is still only a guess.
It would be best to add assertions to check the 4-byte alignment.
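
In Python terms such an assertion boils down to checking the data pointer of each buffer, e.g. (a rough, CPython-specific sketch with an illustrative helper name; the real check would of course be done in cyxor.pyx on the C pointers):

import ctypes

def data_address(buf):
    # CPython-specific: address of the first byte of a writable buffer
    return ctypes.addressof((ctypes.c_char * len(buf)).from_buffer(buf))

buf = bytearray(4096)
assert data_address(buf) % 4 == 0, "buffer is not 4-byte aligned"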

This is the 128-bit main xor loop (the '>' marks the faulting instruction):

   0x7f813205f61a <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+362>   movdqu 0x0(%r13,%rsi,1),%xmm0
   0x7f813205f621 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+369>   add    $0x1,%r10d
 > 0x7f813205f625 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+373>   pxor   (%r14,%rsi,1),%xmm0
   0x7f813205f62b <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+379>   movups %xmm0,(%rcx,%rsi,1)
   0x7f813205f62f <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+383>   add    $0x10,%rsi
   0x7f813205f633 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+387>   cmp    %r12d,%r10d
   0x7f813205f636 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+390>   jb     0x7f813205f61a <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+362>

So in the end it is not that the CPU has problems performing a simple 64-bit xor: the cast to uint64_t* is simply invalid.

comment:11 Changed 4 weeks ago by Antoine Martin

Excellent investigation. I'm pretty sure I've seen somewhere that newer CPU models can deal with 64-bit unaligned access, which would explain why I have not seen any problems at all on any of my test systems.

I guess that the buffers allocated by python/cython will still be 4-byte aligned, but that is still only a guess.

The buffers that we allocate explicitly are all memaligned on Linux (to 64 by default), aligned to 16-bit on macos and not aligned at all on win32... (for details see: browser/xpra/trunk/src/xpra/buffers/memalign.c)
In any case, the input to the xor function isn't always directly under our control and this is meant to be an opaque call anyway. FWIW: the python string buffers always seem to be 4-byte aligned, but we accept any kind of buffer as input, including memoryviews, and those can be sliced at any offset...

So r18124 adds a slow byte-at-a-time path for unaligned addresses - the improved unit test exercises it.
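
For illustration, a sliced memoryview is exactly the kind of input that ends up misaligned and would now take that slow path (a CPython-specific sketch, using the same kind of address check as sketched in comment 10):

import ctypes

def data_address(buf):
    # CPython-specific: address of the first byte of a writable buffer
    return ctypes.addressof((ctypes.c_char * len(buf)).from_buffer(buf))

data = bytearray(64)
view = memoryview(data)[1:]      # the view's data pointer starts 1 byte into the allocation
print(data_address(data) % 4)    # typically 0: the bytearray itself is at least 4-byte aligned
print(data_address(view) % 4)    # almost certainly not 0: the sliced view starts mid-word
# xor-ing such a view goes through the byte-at-a-time fallback instead of the word-sized loop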

@Mike: I think this covers everything?

comment:12 Changed 4 weeks ago by Mike

I've tested it and it no longer crashes.

Many thanks for your quick fix!
