unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#41357: 28.0.50; GC may miss to mark calle safe register content
@ 2020-05-17 12:42 Andrea Corallo
  2020-05-17 15:36 ` Eli Zaretskii
  2020-05-25  2:09 ` Tom Tromey
  0 siblings, 2 replies; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 12:42 UTC (permalink / raw)
  To: 41357; +Cc: Paul Eggert

[-- Attachment #1: Type: text/plain, Size: 2561 bytes --]

Hi all,

debugging the native compiler I've been chasing a bug in a configuration
where the .eln are compiled at speed 2 (-O2) and emacs-core is compiled
at -O0.

What is going on is that in a .eln in a function A a Lisp_Object is
hold in a register (r14).  Function A is calling other functions into
emacs-core till Garbage Collection is triggered.

Being emacs-core compiled with -O0 GCC is not selecting any callee safe
register and therefore these gets never pushed.  The value stays in r14
till we enter into 'flush_stack_call_func' where we have to push all
registers and identify the end of the stack for mark.

We correctly push callee safe register with __builtin_unwind_init () and
we identify the top (end) of the stack on my machine using
__builtin_frame_address (0).

Here I think raise the issue, __builtin_frame_address on GCC 7 and 10
for X86_64 is returning the base pointer and not the stack pointer [1].
As a consequence this is not including the callee safe registers that we
have just pushed.

In my case r14 gets pushed at address 0x7ffc47b95fa0 but in mark_stack
we are scanning the interval 0x7ffc47b95fb0 (end) 0x7ffc47b9a150
(bottom).  This because __builtin_frame_address returned ebp
(0x7ffc47b95fb0 in this case).

The consequence is that the object originally referenced by r14 is never
marked and this leads to have it freed and to a crash.

I think we would be interested into obtaining the stack pointer and not
the base pointer, unfortunately what __builtin_frame_address does is
appears not really portable:

https://gcc.gnu.org/onlinedocs/gcc/Return-Address.html

This bug is easy to observe in the native compiler with configurations
like this (speed2 for eln -O0 for core) but I believe can affect stock
Emacs too if any caller of flush_stack_call_func has a callee safe
register holding a reference to a live object not present into the
stack.  This can get trickier especially with LTO enabled.

For now I'm testing the simple attached patch that seams to do the job
for me.  It pushes the registers in 'flush_stack_call_func' and then
call 'flush_stack_call_func1' where now ebp must include the address
where those register got pushed.

I hope I'm not catastrophically wrong in this analysis, in case
I apologize for the noise.

Thanks

  Andrea

[1] Reduced example. GCC7 -O0

void *
foo (void)
{
  __builtin_unwind_init ();
  return __builtin_frame_address (0);
}

foo:
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	rax, rbp
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Fix-Garbage-Collector-for-missing-calle-safe-registe.patch --]
[-- Type: text/x-diff, Size: 1875 bytes --]

From f8a5d8a4a95182e91037d444d6d1f51f7dc7e467 Mon Sep 17 00:00:00 2001
From: Andrea Corallo <akrl@sdf.org>
Date: Sun, 17 May 2020 13:23:59 +0100
Subject: [PATCH] * Fix Garbage Collector for missing calle safe registers
 content

	* src/alloc.c (SET_STACK_TOP_ADDRESS): Do not call
	__builtin_unwind_init.
	(flush_stack_call_func1): Rename from 'flush_stack_call_func'.
	(flush_stack_call_func): New function to spill all registers
	before calling 'flush_stack_call_func1'.  This to make sure the
	top of the stack identified includes those registers.
---
 src/alloc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/alloc.c b/src/alloc.c
index f2b80fac88..1d2ee05481 100644
--- a/src/alloc.c
+++ b/src/alloc.c
@@ -4952,12 +4952,10 @@ test_setjmp (void)
 #ifdef HAVE___BUILTIN_UNWIND_INIT
 # define SET_STACK_TOP_ADDRESS(p)	\
    stacktop_sentry sentry;		\
-   __builtin_unwind_init ();		\
    *(p) = NEAR_STACK_TOP (&sentry)
 #else
 # define SET_STACK_TOP_ADDRESS(p)		\
    stacktop_sentry sentry;			\
-   __builtin_unwind_init ();			\
    test_setjmp ();				\
    sys_setjmp (sentry.j);			\
    *(p) = NEAR_STACK_TOP (&sentry + (stack_bottom < &sentry.c))
@@ -5033,7 +5031,7 @@ mark_stack (char const *bottom, char const *end)
    from FUNC.  */
 
 NO_INLINE void
-flush_stack_call_func (void (*func) (void *arg), void *arg)
+flush_stack_call_func1 (void (*func) (void *arg), void *arg)
 {
   void *end;
   struct thread_state *self = current_thread;
@@ -5043,6 +5041,13 @@ flush_stack_call_func (void (*func) (void *arg), void *arg)
   eassert (current_thread == self);
 }
 
+void
+flush_stack_call_func (void (*func) (void *arg), void *arg)
+{
+  __builtin_unwind_init ();
+  flush_stack_call_func1 (func, arg);
+}
+
 /* Determine whether it is safe to access memory at address P.  */
 static int
 valid_pointer_p (void *p)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 12:42 bug#41357: 28.0.50; GC may miss to mark calle safe register content Andrea Corallo
@ 2020-05-17 15:36 ` Eli Zaretskii
  2020-05-17 16:40   ` Andrea Corallo
  2020-05-25  2:09 ` Tom Tromey
  1 sibling, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 15:36 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: 41357, eggert

> From: Andrea Corallo <akrl@sdf.org>
> Cc: Paul Eggert <eggert@cs.ucla.edu>, Eli Zaretskii <eliz@gnu.org>
> Date: Sun, 17 May 2020 12:42:48 +0000
> 
> What is going on is that in a .eln in a function A a Lisp_Object is
> hold in a register (r14).  Function A is calling other functions into
> emacs-core till Garbage Collection is triggered.
> 
> Being emacs-core compiled with -O0 GCC is not selecting any callee safe
> register and therefore these gets never pushed.

Isn't this something for the infrastructure of calling
natively-compiled Lisp to solve?  The Emacs C code isn't prepared for
calling optimized C code when it calls Lisp, and I don't think it's
right for us to assume that, because it will make Emacs slower.  If
the natively-compiled Lisp needs some setup to be compatible with GC,
I think the calling framework should set that up.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 15:36 ` Eli Zaretskii
@ 2020-05-17 16:40   ` Andrea Corallo
  2020-05-17 16:46     ` Paul Eggert
  2020-05-17 17:04     ` Eli Zaretskii
  0 siblings, 2 replies; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 16:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, eggert

[-- Attachment #1: Type: text/plain, Size: 2058 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Andrea Corallo <akrl@sdf.org>
>> Cc: Paul Eggert <eggert@cs.ucla.edu>, Eli Zaretskii <eliz@gnu.org>
>> Date: Sun, 17 May 2020 12:42:48 +0000
>>
>> What is going on is that in a .eln in a function A a Lisp_Object is
>> hold in a register (r14).  Function A is calling other functions into
>> emacs-core till Garbage Collection is triggered.
>>
>> Being emacs-core compiled with -O0 GCC is not selecting any callee safe
>> register and therefore these gets never pushed.
>
> Isn't this something for the infrastructure of calling
> natively-compiled Lisp to solve?  The Emacs C code isn't prepared for
> calling optimized C code when it calls Lisp, and I don't think it's
> right for us to assume that, because it will make Emacs slower.  If
> the natively-compiled Lisp needs some setup to be compatible with GC,
> I think the calling framework should set that up.

Hi Eli,

I think this is a real bug that we have in the codebase (emacs-27
included).

Usually it works because having many big functions with high register
pressure that gets activated before reaching 'flush_stack_call_func'
statistically callee saved regs are very likely to be pushed.

But nothing prevents a caller of 'flush_stack_call_func' to store a lisp
object into a callee saved regs and trigger the bug.  This obviously
depends on the compiler, flags used etc, things we have no control over.

This could be also an explanation of instability for LTO configurations.
Given that callers of 'flush_stack_call_func' are more likely to be
inlined the exposed surface becomes considerably higher.

This bug should be also more likely to be observable if C files are
compiled with a mix of -O0 and -O2.

I think we should honor calling convention and make sure we garbage
collect also the content of callee saved registers, BTW I guess that's
the reason why we call '__builtin_unwind_init' isn't?

If we are concerned about performance the attached the attached patch
should be zero performance overhead.

Regards

  Andrea

--
akrl@sdf.org

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Fix-Garbage-Collector-for-missing-calle-safe-registe.patch --]
[-- Type: text/x-diff, Size: 2203 bytes --]

From ef923e029d5192b67ff7a0aede75a19c2c5327d4 Mon Sep 17 00:00:00 2001
From: Andrea Corallo <akrl@sdf.org>
Date: Sun, 17 May 2020 13:23:59 +0100
Subject: [PATCH] * Fix Garbage Collector for missing calle safe registers
 content

	* src/alloc.c (SET_STACK_TOP_ADDRESS): Do not call
	__builtin_unwind_init.
	(flush_stack_call_func1): Rename from 'flush_stack_call_func'.
	(flush_stack_call_func): New function to spill all registers
	before calling 'flush_stack_call_func1'.  This to make sure the
	top of the stack identified includes those registers.
---
 src/alloc.c | 4 +---
 src/lisp.h  | 9 ++++++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/src/alloc.c b/src/alloc.c
index f2b80fac88..d6ba4d9790 100644
--- a/src/alloc.c
+++ b/src/alloc.c
@@ -4952,12 +4952,10 @@ test_setjmp (void)
 #ifdef HAVE___BUILTIN_UNWIND_INIT
 # define SET_STACK_TOP_ADDRESS(p)	\
    stacktop_sentry sentry;		\
-   __builtin_unwind_init ();		\
    *(p) = NEAR_STACK_TOP (&sentry)
 #else
 # define SET_STACK_TOP_ADDRESS(p)		\
    stacktop_sentry sentry;			\
-   __builtin_unwind_init ();			\
    test_setjmp ();				\
    sys_setjmp (sentry.j);			\
    *(p) = NEAR_STACK_TOP (&sentry + (stack_bottom < &sentry.c))
@@ -5033,7 +5031,7 @@ mark_stack (char const *bottom, char const *end)
    from FUNC.  */
 
 NO_INLINE void
-flush_stack_call_func (void (*func) (void *arg), void *arg)
+flush_stack_call_func1 (void (*func) (void *arg), void *arg)
 {
   void *end;
   struct thread_state *self = current_thread;
diff --git a/src/lisp.h b/src/lisp.h
index 66a86ddadf..c7ce3918fc 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -3819,7 +3819,14 @@ verify (FLT_RADIX == 2 || FLT_RADIX == 16);
 extern void alloc_unexec_post (void);
 extern void mark_maybe_objects (Lisp_Object const *, ptrdiff_t);
 extern void mark_stack (char const *, char const *);
-extern void flush_stack_call_func (void (*func) (void *arg), void *arg);
+
+INLINE void
+flush_stack_call_func (void (*func) (void *arg), void *arg)
+{
+  __builtin_unwind_init ();
+  flush_stack_call_func1 (func, arg);
+}
+
 extern void garbage_collect (void);
 extern void maybe_garbage_collect (void);
 extern const char *pending_malloc_warning;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 16:40   ` Andrea Corallo
@ 2020-05-17 16:46     ` Paul Eggert
  2020-05-17 17:00       ` Pip Cet
                         ` (2 more replies)
  2020-05-17 17:04     ` Eli Zaretskii
  1 sibling, 3 replies; 29+ messages in thread
From: Paul Eggert @ 2020-05-17 16:46 UTC (permalink / raw)
  To: Andrea Corallo, Eli Zaretskii; +Cc: 41357

On 5/17/20 9:40 AM, Andrea Corallo wrote:
> I think this is a real bug that we have in the codebase (emacs-27
> included).

Thanks for all the detective work! Your analysis is correct and your patch looks
good. I've always been suspicious of that code, and it looks like you've
confirmed my suspicions.

The only question in my mind is whether to install the patch into the emacs-27
branch or the master branch. Given Eli's problems with stability in emacs-27
(see Bug#41321), I'm inclined to think the former, as the bug could explain the
problems Eli is observing.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 16:46     ` Paul Eggert
@ 2020-05-17 17:00       ` Pip Cet
  2020-05-17 19:01         ` Paul Eggert
  2020-05-17 17:04       ` Eli Zaretskii
  2020-05-17 21:47       ` Andrea Corallo
  2 siblings, 1 reply; 29+ messages in thread
From: Pip Cet @ 2020-05-17 17:00 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 41357, Andrea Corallo

On Sun, May 17, 2020 at 4:47 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 5/17/20 9:40 AM, Andrea Corallo wrote:
> > I think this is a real bug that we have in the codebase (emacs-27
> > included).
> Thanks for all the detective work! Your analysis is correct and your patch looks
> good.

That's my impression as well.

> The only question in my mind is whether to install the patch into the emacs-27
> branch or the master branch. Given Eli's problems with stability in emacs-27
> (see Bug#41321), I'm inclined to think the former, as the bug could explain the
> problems Eli is observing.

I don't think that platform even has callee-saved registers? But I
think the fix should go on the emacs-27 branch. It's a bad bug and
sheer luck that Fgarbage_collect on my platform (using this specific
compiler, etc.) pushes all callee-saved registers. We shouldn't rely
on such lucks on all platforms.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 16:40   ` Andrea Corallo
  2020-05-17 16:46     ` Paul Eggert
@ 2020-05-17 17:04     ` Eli Zaretskii
  2020-05-17 17:08       ` Paul Eggert
  2020-05-17 17:13       ` Andrea Corallo
  1 sibling, 2 replies; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 17:04 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: 41357, eggert

> From: Andrea Corallo <akrl@sdf.org>
> Cc: bug-gnu-emacs@gnu.org, eggert@cs.ucla.edu
> Date: Sun, 17 May 2020 16:40:09 +0000
> 
> I think this is a real bug that we have in the codebase (emacs-27
> included).

Maybe it's so, but your explanation makes sense only in the context of
calling a machine-language function.  When we call Lisp or bytecode,
the machine-level operation is very different, and I cannot easily
correlate your description of using registers with what happens when
we call Lisp or bytecode.  Sorry for my misunderstanding.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 16:46     ` Paul Eggert
  2020-05-17 17:00       ` Pip Cet
@ 2020-05-17 17:04       ` Eli Zaretskii
  2020-05-17 17:21         ` Andrea Corallo
  2020-05-17 21:47       ` Andrea Corallo
  2 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 17:04 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 41357, akrl

> Cc: bug-gnu-emacs@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 17 May 2020 09:46:23 -0700
> 
> The only question in my mind is whether to install the patch into the emacs-27
> branch or the master branch.

Definitely the master!





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:04     ` Eli Zaretskii
@ 2020-05-17 17:08       ` Paul Eggert
  2020-05-17 17:24         ` Eli Zaretskii
  2020-05-17 17:13       ` Andrea Corallo
  1 sibling, 1 reply; 29+ messages in thread
From: Paul Eggert @ 2020-05-17 17:08 UTC (permalink / raw)
  To: Eli Zaretskii, Andrea Corallo; +Cc: 41357

On 5/17/20 10:04 AM, Eli Zaretskii wrote:
> I cannot easily
> correlate your description of using registers with what happens when
> we call Lisp or bytecode.

His description is generic: it applies regardless of whether the garbage
collector is called from C code (in his branch, generated from Lisp code) or
from the interpreter (either in his branch or in the emacs-27 branch) as it is
executing Lisp code or bytecode.

It's a low-level problem in which the garbage collector is not seeing some
objects that it should see, because at the machine level the object addresses
are in registers that the garbage collector hasn't saved and thus won't see when
it scans memory.

A serious and insidious bug in our existing system, in other words.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:04     ` Eli Zaretskii
  2020-05-17 17:08       ` Paul Eggert
@ 2020-05-17 17:13       ` Andrea Corallo
  2020-05-17 17:22         ` Eli Zaretskii
  1 sibling, 1 reply; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 17:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, eggert

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Andrea Corallo <akrl@sdf.org>
>> Cc: bug-gnu-emacs@gnu.org, eggert@cs.ucla.edu
>> Date: Sun, 17 May 2020 16:40:09 +0000
>> 
>> I think this is a real bug that we have in the codebase (emacs-27
>> included).
>
> Maybe it's so, but your explanation makes sense only in the context of
> calling a machine-language function.  When we call Lisp or bytecode,
> the machine-level operation is very different, and I cannot easily
> correlate your description of using registers with what happens when
> we call Lisp or bytecode.  Sorry for my misunderstanding.

That is correct, but I don't think we need bytecode to come into play
here to have the problem.

If a C function caller of 'flush_stack_call_func' allocates a
Lisp_Object in a temp variable and the compiler decide to keep this in a
callee saved reg while 'flush_stack_call_func' is called this will be
garbage collected unexpectedly.

Am I wrong?

  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:04       ` Eli Zaretskii
@ 2020-05-17 17:21         ` Andrea Corallo
  2020-05-17 17:28           ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 17:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, eggert

Eli Zaretskii <eliz@gnu.org> writes:

>> Cc: bug-gnu-emacs@gnu.org
>> From: Paul Eggert <eggert@cs.ucla.edu>
>> Date: Sun, 17 May 2020 09:46:23 -0700
>> 
>> The only question in my mind is whether to install the patch into the emacs-27
>> branch or the master branch.
>
> Definitely the master!

Is not my responsability so I'll not insist.

But I just wanted to point out that I think this is clearly a bug in our
code, and it can trigger depending on decision of the tool-chain we have
no control over.

I think it should be trated as we would do if we discover and verify an
"access out of bounds" or "read after free".

My opinion :)

Regards

  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:13       ` Andrea Corallo
@ 2020-05-17 17:22         ` Eli Zaretskii
  2020-05-17 17:45           ` Andrea Corallo
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 17:22 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: 41357, eggert

> From: Andrea Corallo <akrl@sdf.org>
> Cc: bug-gnu-emacs@gnu.org, eggert@cs.ucla.edu
> Date: Sun, 17 May 2020 17:13:26 +0000
> 
> If a C function caller of 'flush_stack_call_func' allocates a
> Lisp_Object in a temp variable and the compiler decide to keep this in a
> callee saved reg while 'flush_stack_call_func' is called this will be
> garbage collected unexpectedly.

Can you show me an example of this (as skeleton C code)?

Thanks.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:08       ` Paul Eggert
@ 2020-05-17 17:24         ` Eli Zaretskii
  2020-05-17 19:05           ` Paul Eggert
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 17:24 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 41357, akrl

> Cc: bug-gnu-emacs@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 17 May 2020 10:08:00 -0700
> 
> It's a low-level problem in which the garbage collector is not seeing some
> objects that it should see, because at the machine level the object addresses
> are in registers that the garbage collector hasn't saved and thus won't see when
> it scans memory.

Since we write in C and in Lisp, not in assembly, I struggle to see
how a Lisp object could appear in a register without leaving any trace
on the stack.  I'm probably missing something.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:21         ` Andrea Corallo
@ 2020-05-17 17:28           ` Eli Zaretskii
  2020-05-17 21:27             ` Paul Eggert
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 17:28 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: 41357, eggert

> From: Andrea Corallo <akrl@sdf.org>
> Cc: Paul Eggert <eggert@cs.ucla.edu>, bug-gnu-emacs@gnu.org
> Date: Sun, 17 May 2020 17:21:54 +0000
> 
> But I just wanted to point out that I think this is clearly a bug in our
> code, and it can trigger depending on decision of the tool-chain we have
> no control over.

I'm not saying it is not a bug.  I'm saying that we've lived with this
bug for a very long time, so if it is real, it is definitely very-very
rare.

If we put every bugfix we could think of into the release branch, we
will never release Emacs 27.  Never.  Because there's always one more
bug.

> I think it should be trated as we would do if we discover and verify an
> "access out of bounds" or "read after free".

Depends on the circumstances.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:22         ` Eli Zaretskii
@ 2020-05-17 17:45           ` Andrea Corallo
  2020-05-17 17:57             ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 17:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, eggert

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Andrea Corallo <akrl@sdf.org>
>> Cc: bug-gnu-emacs@gnu.org, eggert@cs.ucla.edu
>> Date: Sun, 17 May 2020 17:13:26 +0000
>> 
>> If a C function caller of 'flush_stack_call_func' allocates a
>> Lisp_Object in a temp variable and the compiler decide to keep this in a
>> callee saved reg while 'flush_stack_call_func' is called this will be
>> garbage collected unexpectedly.
>
> Can you show me an example of this (as skeleton C code)?
>
> Thanks.

Sure, something like

=====

Lisp_Object
foo (void)
{
  /* 'res' goes in a callee saved reg  */
  Lisp_Object res = build_string ("bar");
  [...]
  /* LTO inline the following as "flush_stack_call_func (mark_threads_callback, NULL);" */
  mark_threads ();
  [...]
  gc_sweep ();

  /* The string pointed by 'res' was garbage collected.  */
  return res;
}

=====

I'm not sure this is the only possible scenarion tho.

  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:45           ` Andrea Corallo
@ 2020-05-17 17:57             ` Eli Zaretskii
  2020-05-17 18:16               ` Andrea Corallo
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 17:57 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: 41357, eggert

> From: Andrea Corallo <akrl@sdf.org>
> Cc: bug-gnu-emacs@gnu.org, eggert@cs.ucla.edu
> Date: Sun, 17 May 2020 17:45:28 +0000
> 
> Lisp_Object
> foo (void)
> {
>   /* 'res' goes in a callee saved reg  */
>   Lisp_Object res = build_string ("bar");
>   [...]
>   /* LTO inline the following as "flush_stack_call_func (mark_threads_callback, NULL);" */
>   mark_threads ();
>   [...]
>   gc_sweep ();
> 
>   /* The string pointed by 'res' was garbage collected.  */
>   return res;
> }

But mark_threads etc. (GC in general) isn't called from functions like
your 'foo.  It is more like this:

Lisp_Object
foo (void)
{
  /* 'res' goes in a callee saved reg  */
  Lisp_Object res = build_string ("bar");
  [...]
  call_something ();
  [...]

}

call_something (void)
{
  [...]
  garbage_collect ();
  [...]
}

Which is quite different, AFAIU, wrt stack usage.

Or maybe I don't understand how "callee saved registers" work.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:57             ` Eli Zaretskii
@ 2020-05-17 18:16               ` Andrea Corallo
  0 siblings, 0 replies; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 18:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, eggert

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Andrea Corallo <akrl@sdf.org>
>> Cc: bug-gnu-emacs@gnu.org, eggert@cs.ucla.edu
>> Date: Sun, 17 May 2020 17:45:28 +0000
>> 
>> Lisp_Object
>> foo (void)
>> {
>>   /* 'res' goes in a callee saved reg  */
>>   Lisp_Object res = build_string ("bar");
>>   [...]
>>   /* LTO inline the following as "flush_stack_call_func (mark_threads_callback, NULL);" */
>>   mark_threads ();
>>   [...]
>>   gc_sweep ();
>> 
>>   /* The string pointed by 'res' was garbage collected.  */
>>   return res;
>> }
>
> But mark_threads etc. (GC in general) isn't called from functions like
> your 'foo.  It is more like this:
>
> Lisp_Object
> foo (void)
> {
>   /* 'res' goes in a callee saved reg  */
>   Lisp_Object res = build_string ("bar");
>   [...]
>   call_something ();
>   [...]
>
> }
>
> call_something (void)
> {
>   [...]
>   garbage_collect ();
>   [...]
> }

Yes, my example was minimal your is certanly more realistic.

But also this can be critical.  We have to hope that in 'call_something'
or 'garbage_collect' there is sufficient register pressure to have the
register that is holding 'res' to be pushed.


  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:00       ` Pip Cet
@ 2020-05-17 19:01         ` Paul Eggert
  2020-05-17 19:19           ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Eggert @ 2020-05-17 19:01 UTC (permalink / raw)
  To: Pip Cet; +Cc: 41357, Andrea Corallo

On 5/17/20 10:00 AM, Pip Cet wrote:
> I don't think that platform even has callee-saved registers?

Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
callee-save by definition. So the problem could at least in theory be occurring
on his platform, depending on the compiler and its options.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:24         ` Eli Zaretskii
@ 2020-05-17 19:05           ` Paul Eggert
  2020-05-17 19:26             ` Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Eggert @ 2020-05-17 19:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, akrl

On 5/17/20 10:24 AM, Eli Zaretskii wrote:
> I struggle to see
> how a Lisp object could appear in a register without leaving any trace
> on the stack

Quite easily. It happens all the time. If I do something like this:

    Lisp_Object a = Fcons (b, c);
    f (x, y);
    return a;

The compiler might put 'a' into a callee-save register R, which means that while
f is running there's no trace of 'a' on the stack (unless f's code itself
decides to use R for whatever reason, but let's suppose it doesn't). This
situation can persist even if f calls g which calls h which calls the garbage
collector, and the garbage collector will then think the cons is garbage even
though it's not.

The proposed fix is harmless except it may execute a handful more instructions
per GC. So the cost of applying the fix is tiny, whereas the potential
reliability benefit is large.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 19:01         ` Paul Eggert
@ 2020-05-17 19:19           ` Eli Zaretskii
  2020-05-17 20:23             ` Andrea Corallo
  2020-05-17 21:03             ` Paul Eggert
  0 siblings, 2 replies; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 19:19 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 41357, pipcet, akrl

> Cc: Andrea Corallo <akrl@sdf.org>, Eli Zaretskii <eliz@gnu.org>,
>  41357@debbugs.gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 17 May 2020 12:01:40 -0700
> 
> On 5/17/20 10:00 AM, Pip Cet wrote:
> > I don't think that platform even has callee-saved registers?
> 
> Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
> registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
> callee-save by definition. So the problem could at least in theory be occurring
> on his platform, depending on the compiler and its options.

I've seen the same problem on 64-bit Windows as well, in Emacs
compiled with a different (newer) version of GCC.  I don't think this
has anything to do with how many registers are there.  I also never
before saw these problems, so this is most definitely due to some
recent changes, I just cannot yet figure out which ones.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 19:05           ` Paul Eggert
@ 2020-05-17 19:26             ` Eli Zaretskii
  2020-05-17 19:46               ` Andrea Corallo
  2020-05-17 21:21               ` Paul Eggert
  0 siblings, 2 replies; 29+ messages in thread
From: Eli Zaretskii @ 2020-05-17 19:26 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 41357, akrl

> Cc: akrl@sdf.org, bug-gnu-emacs@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 17 May 2020 12:05:25 -0700
> 
> On 5/17/20 10:24 AM, Eli Zaretskii wrote:
> > I struggle to see
> > how a Lisp object could appear in a register without leaving any trace
> > on the stack
> 
> Quite easily. It happens all the time. If I do something like this:
> 
>     Lisp_Object a = Fcons (b, c);
>     f (x, y);
>     return a;

And where's GC in this picture?  If it's called directly from 'f', can
you show me such code in Emacs?  Then we could disassembly it and see
what we've got.

Usually the code that calls GC is much deeper, and thus the chance of
that temporary to stay in a register is very small, to say the least.

> The proposed fix is harmless

Yeah, right.  Sorry, I don't buy this.  Too many gray hair from such
assumptions.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 19:26             ` Eli Zaretskii
@ 2020-05-17 19:46               ` Andrea Corallo
  2020-05-17 21:21               ` Paul Eggert
  1 sibling, 0 replies; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 19:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, eggert

Eli Zaretskii <eliz@gnu.org> writes:

>> Cc: akrl@sdf.org, bug-gnu-emacs@gnu.org
>> From: Paul Eggert <eggert@cs.ucla.edu>
>> Date: Sun, 17 May 2020 12:05:25 -0700
>> 
>> On 5/17/20 10:24 AM, Eli Zaretskii wrote:
>> > I struggle to see
>> > how a Lisp object could appear in a register without leaving any trace
>> > on the stack
>> 
>> Quite easily. It happens all the time. If I do something like this:
>> 
>>     Lisp_Object a = Fcons (b, c);
>>     f (x, y);
>>     return a;
>
> And where's GC in this picture?

GC can be triggered by f or any of his callee it does not matter.

> If it's called directly from 'f', can
> you show me such code in Emacs?  Then we could disassembly it and see
> what we've got.

I'm not sure what we can prove disassembling, that would be just the
result of a specific .c + toolchain + invocation.  I think we want to
have code that is sufficiently portable and safe because correct.

> Usually the code that calls GC is much deeper, and thus the chance of
> that temporary to stay in a register is very small, to say the least.

Probably yes, but I don't think we want to have code that works accidentally.

  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 19:19           ` Eli Zaretskii
@ 2020-05-17 20:23             ` Andrea Corallo
  2020-05-17 21:03             ` Paul Eggert
  1 sibling, 0 replies; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 20:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, pipcet, Paul Eggert

Eli Zaretskii <eliz@gnu.org> writes:

>> Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
>> registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
>> callee-save by definition. So the problem could at least in theory be occurring
>> on his platform, depending on the compiler and its options.
>
> I've seen the same problem on 64-bit Windows as well, in Emacs
> compiled with a different (newer) version of GCC.  I don't think this
> has anything to do with how many registers are there.  I also never
> before saw these problems, so this is most definitely due to some
> recent changes, I just cannot yet figure out which ones.

It's hard to say but I suspect the main gate that saved us till today is
'garbage_collect' that being quite big is likely to have calle-save regs
spilled.

You can disassemble your 'garbage_collect' and see if all of this regs
are spilled (in my case they are at -O2).  This could give an indication
on the correlation of the two bugs (but nothing more).

  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 19:19           ` Eli Zaretskii
  2020-05-17 20:23             ` Andrea Corallo
@ 2020-05-17 21:03             ` Paul Eggert
  1 sibling, 0 replies; 29+ messages in thread
From: Paul Eggert @ 2020-05-17 21:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, pipcet, akrl

>>> I don't think that platform even has callee-saved registers?

>> Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
>> registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
>> callee-save by definition. So the problem could at least in theory be occurring
>> on his platform, depending on the compiler and its options.

> I've seen the same problem on 64-bit Windows as well, in Emacs
> compiled with a different (newer) version of GCC.  I don't think this
> has anything to do with how many registers are there. 

You're right that the number of registers doesn't matter, in the sense that the
problem can occur if any registers are callee-save. I was responding to Pip
Cet's comment, where he said he thought your platform (W32 in the original bug
report) had zero callee-saved registers. That would have meant the problem
couldn't occur on your platform. However, because your platform does have
callee-save registers the problem can occur there.

64-bit Windows also has callee-save registers (rbx, rbp, rdi, rsi, r12, r13,
r14, r15) so it can also have the problem. Most platforms do have callee-save
these days.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 19:26             ` Eli Zaretskii
  2020-05-17 19:46               ` Andrea Corallo
@ 2020-05-17 21:21               ` Paul Eggert
  1 sibling, 0 replies; 29+ messages in thread
From: Paul Eggert @ 2020-05-17 21:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41357, akrl

On 5/17/20 12:26 PM, Eli Zaretskii wrote:
> And where's GC in this picture?  If it's called directly from 'f', can
> you show me such code in Emacs?  Then we could disassembly it and see
> what we've got.
> 
> Usually the code that calls GC is much deeper, and thus the chance of
> that temporary to stay in a register is very small, to say the least.

The probability is not that small, unfortunately. Compilers often have a habit
of running through the same set of callee-save registers in the same order.
Let's say you're on the x86 and your compiler consumes the four callee-save
registers in the order ebx, esi, edi, ebp. Then if we call f which calls g which
calls h which calls the GC, it's likely that f will save just ebx, then g will
save just ebx, esi, edi, then h will save just ebx and esi. Hence if the caller
has assigned a local variable to ebp, the GC won't see the variable's contents.

We should give Andrea a big round of applause for catching this bug.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 17:28           ` Eli Zaretskii
@ 2020-05-17 21:27             ` Paul Eggert
  0 siblings, 0 replies; 29+ messages in thread
From: Paul Eggert @ 2020-05-17 21:27 UTC (permalink / raw)
  To: Eli Zaretskii, Andrea Corallo; +Cc: 41357

On 5/17/20 10:28 AM, Eli Zaretskii wrote:
> I'm saying that we've lived with this
> bug for a very long time, so if it is real, it is definitely very-very
> rare.

Yes, it very much has the smell of a GC bug: rare and hard to reproduce, but
deadly when it occurs. I wouldn't be surprised if it's causing the rare and
hard-to-reproduce crashes you reported in Bug#41321. You might try installing
the patch into your copy of emacs-27 and see whether it affects Bug#41321.





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 16:46     ` Paul Eggert
  2020-05-17 17:00       ` Pip Cet
  2020-05-17 17:04       ` Eli Zaretskii
@ 2020-05-17 21:47       ` Andrea Corallo
  2 siblings, 0 replies; 29+ messages in thread
From: Andrea Corallo @ 2020-05-17 21:47 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 41357

Paul Eggert <eggert@cs.ucla.edu> writes:

> On 5/17/20 9:40 AM, Andrea Corallo wrote:
>> I think this is a real bug that we have in the codebase (emacs-27
>> included).
>
> Thanks for all the detective work! Your analysis is correct and your patch looks
> good. I've always been suspicious of that code, and it looks like you've
> confirmed my suspicions.
>
> The only question in my mind is whether to install the patch into the emacs-27
> branch or the master branch. Given Eli's problems with stability in emacs-27
> (see Bug#41321), I'm inclined to think the former, as the bug could explain the
> problems Eli is observing.

Hi Paul,

I'm glad you liked the investigation.

I've pushed the fix on master as Eli suggested, feel free to improve it
in case.

I hope a different agreement can be found for 27, in case I'll port it
there.

Thanks!

  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-17 12:42 bug#41357: 28.0.50; GC may miss to mark calle safe register content Andrea Corallo
  2020-05-17 15:36 ` Eli Zaretskii
@ 2020-05-25  2:09 ` Tom Tromey
  2020-05-25  8:37   ` Andrea Corallo
  1 sibling, 1 reply; 29+ messages in thread
From: Tom Tromey @ 2020-05-25  2:09 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: 41357, Paul Eggert

Andrea> For now I'm testing the simple attached patch that seams to do the job
Andrea> for me.

It looks like this patch was checked in, so I think this bug can be
closed.  I didn't want to do it without verifying with you first though.

Tom





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-25  2:09 ` Tom Tromey
@ 2020-05-25  8:37   ` Andrea Corallo
  2020-05-28 22:08     ` Paul Eggert
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Corallo @ 2020-05-25  8:37 UTC (permalink / raw)
  To: Tom Tromey; +Cc: 41357, Paul Eggert

Tom Tromey <tom@tromey.com> writes:

> Andrea> For now I'm testing the simple attached patch that seams to do the job
> Andrea> for me.
>
> It looks like this patch was checked in, so I think this bug can be
> closed.  I didn't want to do it without verifying with you first though.

Hi Tom,

thanks.  Yes the only left point was if to apply it on emacs-27 given
the bug is present there too, I think Eli prefers not to do that tho.

Not sure what should be the state of the bug then, feel free to close it
if that's the correct state.

Thanks

  Andrea

-- 
akrl@sdf.org





^ permalink raw reply	[flat|nested] 29+ messages in thread

* bug#41357: 28.0.50; GC may miss to mark calle safe register content
  2020-05-25  8:37   ` Andrea Corallo
@ 2020-05-28 22:08     ` Paul Eggert
  0 siblings, 0 replies; 29+ messages in thread
From: Paul Eggert @ 2020-05-28 22:08 UTC (permalink / raw)
  To: Andrea Corallo, Tom Tromey; +Cc: 41357-done

On 5/25/20 1:37 AM, Andrea Corallo wrote:

> Not sure what should be the state of the bug then, feel free to close it
> if that's the correct state.

"Fixed in master" is good enough to close a bug report, so I'm closing it.
Thanks again.





^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-05-28 22:08 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-17 12:42 bug#41357: 28.0.50; GC may miss to mark calle safe register content Andrea Corallo
2020-05-17 15:36 ` Eli Zaretskii
2020-05-17 16:40   ` Andrea Corallo
2020-05-17 16:46     ` Paul Eggert
2020-05-17 17:00       ` Pip Cet
2020-05-17 19:01         ` Paul Eggert
2020-05-17 19:19           ` Eli Zaretskii
2020-05-17 20:23             ` Andrea Corallo
2020-05-17 21:03             ` Paul Eggert
2020-05-17 17:04       ` Eli Zaretskii
2020-05-17 17:21         ` Andrea Corallo
2020-05-17 17:28           ` Eli Zaretskii
2020-05-17 21:27             ` Paul Eggert
2020-05-17 21:47       ` Andrea Corallo
2020-05-17 17:04     ` Eli Zaretskii
2020-05-17 17:08       ` Paul Eggert
2020-05-17 17:24         ` Eli Zaretskii
2020-05-17 19:05           ` Paul Eggert
2020-05-17 19:26             ` Eli Zaretskii
2020-05-17 19:46               ` Andrea Corallo
2020-05-17 21:21               ` Paul Eggert
2020-05-17 17:13       ` Andrea Corallo
2020-05-17 17:22         ` Eli Zaretskii
2020-05-17 17:45           ` Andrea Corallo
2020-05-17 17:57             ` Eli Zaretskii
2020-05-17 18:16               ` Andrea Corallo
2020-05-25  2:09 ` Tom Tromey
2020-05-25  8:37   ` Andrea Corallo
2020-05-28 22:08     ` Paul Eggert

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).