On 12/24/2015 09:36 AM, Eli Zaretskii wrote:
>> Cc: Emacs-devel@gnu.org
>> From: Daniel Colascione <dancol@dancol.org>
>> Date: Thu, 24 Dec 2015 09:04:49 -0800
>>
>> You'd prefer Emacs to lock up or corrupt data instead?
> 
> Instead of crashing and corrupting data?  What's the difference?
> 
> Of course, if it would do that all the time, or even most of the time,
> we'd consider the solution a bad one, and remove it or look for ways
> of improving it.  But we are not there; in most cases the recovery
> doesn't hang and doesn't corrupt any data.

How would we know? It's not as if we have telemetry from real users that
lets us quantitatively evaluate crash frequency. (Automatically sending
crash reports is something else we should do, although I suspect that's
going to be a very long discussion.)

In any case, I expect the undefined-behavior problem to be worse in a
modules-heavy system, since most of the Emacs core code is written to
use non-local control flow for error reporting already, and since it
uses the GC for resource cleanup. I expect module code to be written in
a style less tolerant of arbitrary non-local control flow.

>> Neither you nor Paul have addressed any of the alternatives to this
>> longjmp-from-anywhere behavior. You have not addressed the point that
>> Emacs can crash fatally in numerous ways having nothing to do with stack
>> overflow. You have not addressed the point that we already have robust
>> stack overflow protection at the Lisp level, and so don't need
>> additional workarounds at the C level. You have not even provided any
>> evidence that C-level stack overflow is a problem worth solving.
> 
> I think we did address those, you just didn't like the responses, so
> you don't accept them as responses.

I have seen no evidence that C stack overflow is a real problem that
justifies the risks inherent in the current error handling scheme.

>> All I see is an insistence that we keep the longjmp hack because
>> "Emacs must not crash", even though it demonstrably does crash in
>> numerous exciting ways, and won't stop any time soon, because real
>> programs always have bugs, and experience shows that failing quickly
>> (trying to preserve data) is better than trying to limp along, because
>> that just makes the situation worse.
> 
> Stack overflow recovery is an attempt to solve some of these crashes.
> Having it means that users will lose their work in a smaller number of
> use cases.  So it's an improvement, even if a small one.  I fail to
> see in it any cause for such excitement.

I've already outlined a scheme for preventing data loss in most fatal
crash instances, not just those arising from stack overflow.

>> I know the rebuttal to that last point is that the perfect shouldn't be
>> the enemy of the good: believe me, I've debugged enough crashes and
>> hangs caused by well-intentioned crash recovery code to know that
>> invoking undefined behavior to recover from a crash is far below "good"
>> on the scale of things you can do to improve program reliability.
> 
> I believe you.  Now please believe me and Paul who have slightly
> different experience and have come to slightly different conclusions.
> 
>> 1) Using some mechanism (alloca will work, although OS-specific options
>> exist), make sure you have X MB of address space dedicated to the main
>> thread on startup. At this point, we cannot lose data, and failing to
>> obtain this address space is both unlikely and as harmful as failing to
>> obtain space for Emacs BSS.
>>
>> 2) Now we know the addresses of the top and bottom of the stack.
>>
>> 3) Each time Lisp calls into C, each time a module calls into the
>> Emacs core, and on each QUIT, subtract the current stack pointer from
>> the top of the stack. The result is a lower bound on the amount of stack
>> space available. This computation is very cheap: it's one load from
>> global storage or TLS and a subtract instruction.
>>
>> 4) If the amount of stack space available is less than some threshold,
>> say Y, signal a stack exhaustion error.
>>
>> 5) Require that C code (modules included) not use more than Y MB of
>> stack space between QUITs or calls to the module API.
>>
>> 6) Set Y to a reasonable figure like 4MB. Third-party libraries must
>> already be able to run in bounded stack space because they're usually
>> designed to run off the main thread, and on both Windows and POSIX
>> systems, non-main thread stacks are sized on thread startup and cannot grow.
>>
>> I have no idea why we would prefer the SIGSEGV trap approach to
>> the scheme I just outlined.
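
To make steps 2 through 4 concrete, here is a rough sketch of the
check. Every name in it (stack_guard_init, stack_used_at,
stack_nearly_exhausted_at) is made up for illustration and none of it
is existing Emacs code. It assumes the base address is captured from a
local variable early in main, and it tracks bytes used rather than a
raw pointer difference so that it works whichever way the stack grows.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is existing Emacs code.  */

/* Address recorded near the base of the thread's stack at startup.  */
static uintptr_t stack_base;

/* X: bytes of stack reserved.  Y: headroom C code may use between checks.  */
static size_t stack_reserved;
static size_t stack_headroom;

/* Call once, early in main, passing the address of a local variable.  */
void
stack_guard_init (const void *approx_base, size_t reserved, size_t headroom)
{
  stack_base = (uintptr_t) approx_base;
  stack_reserved = reserved;
  stack_headroom = headroom;
}

/* Bytes between the recorded base and P, for either growth direction.  */
size_t
stack_used_at (const void *p)
{
  uintptr_t here = (uintptr_t) p;
  return here > stack_base ? here - stack_base : stack_base - here;
}

/* Approximate stack use at the point of the call.  */
size_t
stack_used (void)
{
  char marker;
  return stack_used_at (&marker);
}

/* True when fewer than Y bytes remain, i.e. when the caller should
   signal a stack exhaustion error instead of going deeper.  */
bool
stack_nearly_exhausted_at (const void *p)
{
  size_t used = stack_used_at (p);
  return used >= stack_reserved || stack_reserved - used < stack_headroom;
}
```

The per-call cost is one load of stack_base, a subtract and a compare,
which is the whole point: the check is cheap enough to run on every
Lisp-to-C transition and on every QUIT.
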
> 
> Your scheme has disadvantages as well.  Selecting a good value for Y
> is a hard problem.  Choose too much, and you will risk aborting valid
> programs; choose too little, and you will overflow the stack.  Making
> sure C doesn't use more than Y is also hard, especially for GC.

The GC stack use problem is a separate bug. The right fix there, I
think, is to use some data structure other than the C stack for keeping
track of the set of objects being marked.
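
As an illustration of what I mean, here is a toy marker that keeps its
pending set in a heap-allocated worklist instead of recursing. The
struct toy_obj type and toy_mark are invented for this sketch and bear
no relation to the real Lisp_Object layout or to mark_object; the only
point is that marking depth no longer translates into C stack depth.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy object graph node; not Emacs's object representation.  */
struct toy_obj
{
  bool marked;
  size_t nchildren;
  struct toy_obj **children;
};

/* Mark everything reachable from ROOT using an explicit worklist.
   Returns false if the worklist could not be allocated or grown, in
   which case the graph may be only partially marked.  */
bool
toy_mark (struct toy_obj *root)
{
  size_t cap = 16, len = 0;
  struct toy_obj **work = malloc (cap * sizeof *work);
  if (!work)
    return false;

  if (root && !root->marked)
    {
      root->marked = true;
      work[len++] = root;
    }

  while (len > 0)
    {
      struct toy_obj *o = work[--len];
      for (size_t i = 0; i < o->nchildren; i++)
        {
          struct toy_obj *c = o->children[i];
          if (!c || c->marked)
            continue;
          /* Mark on push so each object enters the worklist once.  */
          c->marked = true;
          if (len == cap)
            {
              struct toy_obj **grown
                = realloc (work, 2 * cap * sizeof *work);
              if (!grown)
                {
                  free (work);
                  return false;
                }
              work = grown;
              cap *= 2;
            }
          work[len++] = c;
        }
    }

  free (work);
  return true;
}
```

Allocation failure in the middle of a collection is its own problem,
of course, but it is an ordinary error return rather than a fault on a
guard page.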

Other VMs don't tend to have this problem: one common approach is to
allocate managed objects from a contiguous range of address space and
use a bit vector to remember all the object-start positions in this
range. Then, instead of recursively marking all objects, the GC can just
linearly scan from the start to the end of the heap, marking objects as
it goes. We can't do that because our backing store is malloc, not a
linear region we can annotate with a few bit vectors.

We might be able to use some kind of cursor into the now-mandatory
mem_node tree.

In any case, the possibility of the C stack overflowing during GC isn't
relevant to this discussion, since that case isn't covered by the current
logic anyway.

> It
> sounds like just making the stack larger is a better and easier
> solution.

I'd be perfectly happy deleting the stack overflow code entirely and
increasing the declared stack size (on platforms where we ask for it).

> Threads make this even more complicated.  At least on Windows, by
> default each thread gets the same amount of memory reserved for its
> stack as recorded by the linker in the program's header, i.e. 8MB in
> our case.  So several threads can easily eat up a large portion of the
> program's address space, and then the actual amount of stack is much
> smaller than you might think.

We don't have to run Emacs on the main thread. We could, instead, with
minimal code changes, call CreateThread on startup, supplying a larger
stack size that applies only to that thread. Or we can let X=8MB and
Y=2MB (the system default).
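
For what it's worth, here is roughly what that looks like with the
POSIX analogue of CreateThread, since that is easier to show in a
portable snippet. run_main_body and run_on_big_stack are hypothetical
names, and the body is a stand-in for the real program. The pthread
calls are the standard ones. On Windows the stack size would instead
be passed as an argument to CreateThread.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the real program body.  */
static void *
run_main_body (void *arg)
{
  int *status = arg;
  *status = 0;
  return NULL;
}

/* Run the program body on a new thread with STACK_BYTES of stack and
   wait for it to finish.  Returns 0 on success, or an error number if
   the stack size was rejected or the thread could not be created.  */
int
run_on_big_stack (size_t stack_bytes, int *status)
{
  pthread_attr_t attr;
  pthread_t tid;

  int err = pthread_attr_init (&attr);
  if (err)
    return err;

  err = pthread_attr_setstacksize (&attr, stack_bytes);
  if (!err)
    err = pthread_create (&tid, &attr, run_main_body, status);
  pthread_attr_destroy (&attr);
  if (err)
    return err;

  return pthread_join (tid, NULL);
}
```

Note that if the requested stack cannot be reserved, the result is an
error from pthread_create, not a thread that silently gets less stack
than it asked for.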

I'm not clear on what you mean by "stack is smaller than you might
think": on both POSIX systems and on Windows, thread stacks are address
space reservations made at thread creation time. If we can't fit another
thread stack in the current address space, the failure mode is thread
creation failing, not thread stacks being undersized.

> So on balance, I don't see how your proposal is better.

I'm really not sure what's balancing the risk of data corruption and
lockups caused by the stack overflow code. Emacs got along fine for
decades before Dmitry added the stack overflow check late last year.