Gemini Lasswell writes:

> I set up a single-threaded situation where I could redefine a function
> while exec_byte_code was running it, and got a segfault. I've gained
> some insights from debugging this version of the bug which I will put
> into a separate email.

Here's a gdb transcript going through the single-threaded version of
this bug. In this transcript I use a file 'repro.el', which I've
attached to the end of this message and which is the same as the one in
my last message.

Start gdb with a breakpoint at Fredraw_display:

$ gdb --args ./emacs -Q
...
(gdb) b Fredraw_display
(gdb) r

In Emacs, find the file repro.el and load it with byte-compile-file,
then go back to *scratch* and run my-loop:

C-x C-f repro.el RET
C-u M-x byte-compile-file RET repro.el RET
C-x b RET
M-x my-loop RET

This gets me to the gdb prompt, at a point in execution where the next
function called will be my-loop-1, so I set a breakpoint in
funcall_lambda, where I can see the bytecode object for my-loop-1 (I
edited out the bytestring):

Thread 1 "emacs" hit Breakpoint 3, Fredraw_display () at dispnew.c:3027
3027 {
(gdb) br funcall_lambda
Breakpoint 4 at 0x5cdb00: file eval.c, line 3016.
(gdb) c
Continuing.

Thread 1 "emacs" hit Breakpoint 4, funcall_lambda (fun=XIL(0x31c0235),
    nargs=nargs@entry=0, arg_vector=arg_vector@entry=0x7fffffff01c0)
    at eval.c:3016
3016 {
(gdb) clear
Deleted breakpoint 4
(gdb) p fun
$1 = XIL(0x1630fc5)
(gdb) pr
#[0 "..." [my-var 0 "Now in recursive edit " recursive-edit format "Leaving recursive edit: %s " (a b c d e) message "foo: %s" last 1 "bar: %s" 2 "baz: %s" "bop: %s" mod 3] 6]

Then I skip ahead into exec_byte_code:

(gdb) br exec_byte_code
Breakpoint 5 at 0x611bb0: file bytecode.c, line 342.
(gdb) c
Continuing.

Thread 1 "emacs" hit Breakpoint 5, exec_byte_code (bytestr=XIL(0x3571d24),
    vector=XIL(0x31c0195), maxdepth=make_number(4),
    args_template=args_template@entry=XIL(0), nargs=nargs@entry=0,
    args=args@entry=0x0) at bytecode.c:342
342 {

Here's what's in the register $rbp, and the constants vector:

(gdb) clear
Deleted breakpoint 5
(gdb) p $rbp
$2 = (void *) 0xb0201
(gdb) pr
#
(gdb) p vector
$3 = XIL(0x1630f35)
(gdb) pr
[my-var 0 "Now in recursive edit " recursive-edit format "Leaving recursive edit: %s " (a b c d e) message "foo: %s" last 1 "bar: %s" 2 "baz: %s" "bop: %s" mod 3]

Skip ahead, to get to where exec_byte_code has a value for vectorp:

(gdb) n 12
366 USE_SAFE_ALLOCA;
(gdb) p vectorp
$4 = (Lisp_Object *) 0x1630f38
(gdb) p *vectorp
$5 = XIL(0x2327d80)
(gdb) pr
my-var
(gdb) break mark_vectorlike if ptr->contents == $4
Breakpoint 6 at 0x5ad400: file alloc.c, line 6036.
(gdb) c
Continuing.

The idea is to break when garbage collection finds the constants
vector. (I first tried setting a conditional breakpoint in mark_object,
which made garbage collection either hang or take more time than I had
patience for.)

In Emacs type C-x b RET. This causes a gc and a breakpoint hit:

Thread 1 "emacs" hit Breakpoint 6, mark_vectorlike (ptr=0x31c0190)
    at alloc.c:6036
6036 eassert (!VECTOR_MARKED_P (ptr));
(gdb) bt 20
#0  mark_vectorlike (ptr=0x1630f30 ) at alloc.c:6036
#1  0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#2  0x00000000005ad45e in mark_vectorlike ( ptr=0x1611fd0 ) at alloc.c:6046
#3  0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#4  0x00000000005acdf4 in mark_object (arg=...) at alloc.c:6477
#5  0x00000000005acae4 in mark_object (arg=...) at alloc.c:6434
#6  0x00000000005ad45e in mark_vectorlike ( ptr=0x15a8e00 ) at alloc.c:6046
#7  0x00000000005ad45e in mark_vectorlike ( ptr=0x15a9c30 ) at alloc.c:6046
#8  0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#9  0x00000000005ad45e in mark_vectorlike ( ptr=0x15a7c30 ) at alloc.c:6046
#10 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#11 0x00000000005ad45e in mark_vectorlike ( ptr=0x15a6e80 ) at alloc.c:6046
#12 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#13 0x00000000005acdf4 in mark_object (arg=...) at alloc.c:6477
#14 0x00000000005acaa5 in mark_object (arg=...) at alloc.c:6431
#15 0x00000000005ad45e in mark_vectorlike ( ptr=0x15fbed0 ) at alloc.c:6046
#16 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#17 0x00000000005ad45e in mark_vectorlike ( ptr=0x15fbf50 ) at alloc.c:6046
#18 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#19 0x00000000005ad45e in mark_vectorlike ( ptr=0x15fcc80 ) at alloc.c:6046
(More stack frames follow...)

Lisp Backtrace:
"Automatic GC" (0x0)
"eldoc-pre-command-refresh-echo-area" (0xfffefbb0)
"recursive-edit" (0xfffeffd8)
"my-loop-1" (0xffff0250)
"my-loop" (0xffff0650)
"funcall-interactively" (0xffff0648)
"call-interactively" (0xffff07d0)
"command-execute" (0xffff0ab8)
"execute-extended-command" (0xffff0ea0)
"funcall-interactively" (0xffff0e98)
"call-interactively" (0xffff11d0)
"command-execute" (0xffff1488)

There are 279 frames in the backtrace, and mark_stack and mark_memory
aren't there. So I'm guessing the constants vector is getting found via
the function definition of 'my-loop-1'. Keep going:

(gdb) c
Continuing.

Now in Emacs do this:

M-x eval-buffer RET
C-x b RET
M-x my-gc RET

Execution does not stop at the breakpoint. In Emacs type C-M-c. Result:

Thread 1 "emacs" received signal SIGSEGV, Segmentation fault.
0x00000000005bca1b in styled_format (nargs=2, args=0x7ffffffeffd8,
    message=) at editfns.c:3129
3129 unsigned char format_char = *format++;

What's happened to the constants vector and its contents?

(gdb) p $3
$6 = XIL(0x1630f35)
(gdb) pr
#
(gdb) p *$4
$7 = XIL(0x2327d80)
(gdb) pr
my-var
(gdb) p *($4+5)
$8 = XIL(0x359a6f4)
(gdb) pr
#
(gdb) p *($4+4)
$9 = XIL(0x6390)
(gdb) pr
format

Looks like the constants vector was freed, and its contents haven't
been overwritten (yet) but the format string has been freed leading to
the crash in styled_format.

While I was developing this method of reproducing this bug, I went
through this exercise without lexical-binding set in repro.el. In that
version, the register $rbp when exec_byte_code is called contains the
bytecode Lisp_Object (instead of the non-Lisp-object value it contains
in the transcript above), and the first thing exec_byte_code does is
save it on the stack (presumably because the System V AMD64 ABI calling
convention says that called functions which use $rbp should save and
restore it). Here's the beginning of the disassembly of exec_byte_code
from "objdump -S bytecode.o":

0000000000000020 :
   executing BYTESTR. */

Lisp_Object
exec_byte_code (Lisp_Object bytestr, Lisp_Object vector, Lisp_Object maxdepth,
                Lisp_Object args_template, ptrdiff_t nargs, Lisp_Object *args)
{
  20:   55                      push   %rbp
  21:   48 89 e5                mov    %rsp,%rbp
  24:   41 57                   push   %r15
  26:   41 56                   push   %r14
  28:   41 55                   push   %r13
  2a:   41 54                   push   %r12
  2c:   49 89 ce                mov    %rcx,%r14
  2f:   53                      push   %rbx

So in the non-lexical-binding case the bytecode Lisp_Object is written
to the stack by the first instruction in exec_byte_code, and then
during the execution of 'my-gc' the breakpoint in mark_vectorlike stops
at a point with a much shorter backtrace which includes mark_stack and
mark_memory, and mark_memory's pp is pointing to the location on the
stack where $rbp was written. The bytecode object and constants vector
are consequently not freed, and no segfault happens.

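To make the difference between the two cases concrete, here is a toy
Python model of mark-and-sweep with a conservatively scanned stack. It
is only a sketch under simplifying assumptions: the names Heap and
run_scenario and the heap addresses are made up for illustration, and
the model ignores Lisp_Object tagging and everything else the real
mark_memory does. The one value taken from the transcript is 0xb0201,
the non-Lisp-object content of $rbp in the lexical-binding case.

```python
"""Toy model: mark-and-sweep GC with a conservatively scanned stack."""


class Heap:
    def __init__(self):
        self.objects = {}        # address -> list of child addresses
        self.next_addr = 0x1000  # hypothetical addresses, not Emacs's

    def alloc(self, children=()):
        addr = self.next_addr
        self.next_addr += 0x10
        self.objects[addr] = list(children)
        return addr

    def collect(self, exact_roots, stack_words):
        """Mark from the exact roots plus every stack word that happens
        to equal the address of a live object, then sweep the rest."""
        pending = list(exact_roots)
        pending += [w for w in stack_words if w in self.objects]
        marked = set()
        while pending:
            addr = pending.pop()
            if addr in marked or addr not in self.objects:
                continue
            marked.add(addr)
            pending.extend(self.objects[addr])
        freed = set(self.objects) - marked
        for addr in freed:
            del self.objects[addr]
        return freed


def run_scenario(saved_register_holds_bytecode):
    """Return True if the old constants vector is freed by the GC."""
    heap = Heap()
    fmt_string = heap.alloc()
    constants = heap.alloc([fmt_string])
    old_bytecode = heap.alloc([constants])
    function_cells = {"my-loop-1": old_bytecode}

    # The word pushed by the callee when it saved the caller's register.
    if saved_register_holds_bytecode:
        stack_words = [old_bytecode]  # non-lexical-binding case
    else:
        stack_words = [0xB0201]       # lexical-binding case

    # Redefine my-loop-1 while the old definition is still "running".
    new_constants = heap.alloc([heap.alloc()])
    function_cells["my-loop-1"] = heap.alloc([new_constants])

    freed = heap.collect(function_cells.values(), stack_words)
    return constants in freed


if __name__ == "__main__":
    print("lexical-binding case, constants freed:", run_scenario(False))
    print("dynamic-binding case, constants freed:", run_scenario(True))
```

In the model, as in the transcript, the old constants vector survives
only when some stack word happens to hold the old bytecode object. Once
the function cell points at the new definition, nothing else keeps the
old one alive.
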
I don't follow everything going on in the disassembly of
funcall_lambda, but by comparison with a debug session in the
multithreaded situation I did figure out why $rbp holds different
values when funcall_lambda calls exec_byte_code. The value depends on
which code path follows the test of whether the first element of the
bytecode object vector (the "args template", as funcall_lambda's
comment calls it) is an integer, and that in turn depends on whether
my-loop-1 was compiled with lexical-binding on.

Here is 'repro.el':