xpra icon
Bug tracker and wiki

This bug tracker and wiki are being discontinued
please use https://github.com/Xpra-org/xpra instead.

Opened 18 months ago

Closed 18 months ago

Last modified 12 months ago

#2851 closed defect (worksforme)

segfault in motion.pyx:match_distance()

Reported by: Ian Collier Owned by: Ian Collier
Priority: critical Milestone: 4.1
Component: server Version: 4.0.x
Keywords: Cc:

Description (last modified by Antoine Martin)

I use xpra for several hours a day, and the server crashes maybe every 1-2 weeks. So this one is tricky to reproduce, and I think if I turn debugging on the log will fill up way before the crash happens. But still, when it does happen it's annoying. There also isn't anything particular I can do to trigger the crash, although my window manager (fvwm1) does weird stuff sometimes, and the crashes have tended to happen during WM operations (although one happened when I dismissed xscreensaver by pressing a key just as it was starting to engage).

The xpra version is 4.0.2 using the distro package from Fedora 32 (both client and server). The previous 3.x version of xpra seemed stable.

The crash is happening in the match_distance function of motion.pyx, here:

        for i1 in range(rstart, rend):
            i2 = i1+distance
            v = a1[i1]       <----CRASH

By analysing a core file I've managed to ascertain that 'distance' is zero, line_state is a list of 970 zero-bytes, and 'self' has the following value:

{ob_base = {ob_refcnt = 4,                                                         
    ob_type = 0x7f0f86a329e0 <__pyx_type_4xpra_6server_6window_6motion_ScrollData>}, __pyx_vtab =
    0x7f0f86a33040 <__pyx_vtable_4xpra_6server_6window_6motion_ScrollData>,                
  __weakref__ = 0x0, distances = 0x0, a1 = 0x0, a2 = 0x0, matched = 0 '\000',              
  x = 211, y = 44, width = 1542, height = 970}

So the crash is happening because self.a1 is null.

This is called from get_scroll_values() here:

            for i in reversed(sorted(scroll_hits.keys())):
                v = scroll_hits[i]
                for scroll in v:
                    #find matching lines:
                    line_defs = self.match_distance(line_state, scroll, MIN_LINE_COUNT)

It looks like 'i' is pointing to the 9th element of the sorted list when this happens, although this is difficult to say because of the various layers of translation and optimisation. The puzzle is that this function starts with

        if self.a1==NULL or self.a2==NULL:
            return None

and so self.a1 and self.a2 should be guaranteed non-null, right? So something must be nullifying them after the function has started. I have no clue as to what, because I don't actually understand the code. I am imagining that distance shouldn't really be zero here, but don't know whether that's anything to do with it. I guess the hack fix would be to return early from match_distance if a1 or a2 are null, but this wouldn't get to the root cause.

Attachments (1)

scrolldata-check-thread.patch (3.0 KB) - added by Antoine Martin 18 months ago.
debug patch: checks the current thread in all methods calls

Download all attachments as: .zip

Change History (5)

comment:1 Changed 18 months ago by Antoine Martin

Description: modified (diff)
Priority: majorcritical
Status: newassigned

Thanks for the detailed analysis.
It sounds like this may be a threading issue, perhaps we end up calling free on the ScrollData object whilst the encoding thread is still using it. (as it can be called from quite a few places)

I guess the hack fix would be to return early from match_distance if a1 or a2 are null, but this wouldn't get to the root cause

Yes. Another easy "fix" would be to add locking, but that's just lazy and might slow things down - I will try to locate the threading issue first. (starting with some thread checks first)

You may be able to avoid the crash for now by turning off scroll encoding. In v4, this can be done using XPRA_SCROLL_ENCODING=0 xpra ... (works for both client and server), in v4.1 you have to use --encodings= (see #2810).

This could also just be a memory corruption from somewhere else manifesting itself there.

I have no clue as to what, because I don't actually understand the code

Mile high view: ScrollData holds the xxhash checksums of every line of two screen updates of the same size:

  • update adds a new screen update and computes the new checksums
  • calculate finds the scrolling distances that can be used to generate the newest screen update from the previous one (including duplicates)
  • get_scroll_values finds the best matches (without duplicates) and uses match_distance to do this

There are a few docstrings in the source code, though they're probably not detailed enough.

comment:2 Changed 18 months ago by Antoine Martin

Owner: changed from Antoine Martin to Ian Collier
Status: assignednew

I'm pretty sure it's a threading issue: ScrollData should only be accessed from the encode thread, but free_scroll_data() was occasionally getting called from other threads via cancel_damage() and full_quality_refresh().

r27031 should fix this.

@imc: does this change prevent the segfault?

Below, I have also attached a patch which adds extra checks to all the method calls.

Last edited 18 months ago by Antoine Martin (previous) (diff)

Changed 18 months ago by Antoine Martin

debug patch: checks the current thread in all methods calls

comment:3 Changed 18 months ago by Antoine Martin

Resolution: worksforme
Status: newclosed

comment:4 Changed 12 months ago by migration script

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/2851

Note: See TracTickets for help on using tickets.