On 01/03/2016 12:25 PM, John Wiegley wrote:
>>>>>> Daniel Colascione <dancol@dancol.org> writes:
> 
>> In practice, the Lisp stack depth limits provide enough protection, and the
>> risk of data corruption is too great. The existing auto-save logic is good
>> enough for data recovery, especially if we run the sigsegv handler on the
>> alternate signal stack (which we can make as large as we want) when
>> possible.
> 
> OK, I see we have two roads, and I see where your objection is coming from.
> 
> You say, "In practice". Can you expound on your practical experience? I'm
> curious if there's a real experience you've had that leads to such a strong
> objection.

I hate to use arguments from experience, but you asked: I worked on
crash reporting for Windows Phone, and I do significant work for crash
reporting on Messenger and Facebook for Android. I've worked extensively
with Breakpad, ACRA, multiplexed unix signal handlers, crash
classification, and so on. In my experience, attempts to recover from
crashes have almost always made problems worse: they obscure root causes
of important bugs by causing seemingly-impossible downstream crashes and
data corruption.

It's not just a theoretical problem: I've spent lots of late nights
staring at stack traces, trying to figure out how a certain deadlock
could be possible, only to realize that the program had already crashed
--- or would have, if a seldom-tested bit of code hadn't checked for
NULL and returned without releasing a lock, causing a hang half an hour
later. It's even worse with an SEH handler, which allows programmers
writing for Windows to do this:

  for (;;) {
    __try {
      DoSomething();
    } __except(1) {
      // LOL: silently ignore all stack overflow, NULL deref, etc.
    }
  }

The Emacs error recovery code is similar in spirit. Granted, it's not
silent, and we don't try to recover from *all* segfaults, but it's still
essentially ignoring a programming error and trying to continue.

It's because I've wasted so much time debugging these kinds of programs
that I strongly prefer failing fast when something goes wrong and relying
on automatic persistence mechanisms to preserve volatile data. I've seen
dozens of simple bugs (that could have been quickly fixed) turn into
monsters because someone tried to paper over them and keep a program
from crashing.
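
To make the lock-leak scenario above concrete, here is a minimal sketch in
C with pthreads. All names (state_lock, update_papered_over,
update_fail_fast, lock_is_stuck) are hypothetical and are not taken from
Emacs or any real code base; the point is only to contrast an early return
that hides a caller bug with a version that fails at the point of the
mistake.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared lock; not taken from any real code base. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/* "Defensive" version: a NULL argument is a caller bug, but instead of
   crashing we return early and forget to release the lock. */
int update_papered_over(const int *value)
{
  pthread_mutex_lock(&state_lock);
  if (value == NULL)
    return -1;                  /* BUG: state_lock is still held. */
  /* ... use *value to update shared state ... */
  pthread_mutex_unlock(&state_lock);
  return 0;
}

/* Fail-fast version: the caller bug aborts immediately (in builds
   without NDEBUG), so the crash report points at the real mistake. */
int update_fail_fast(const int *value)
{
  assert(value != NULL);
  pthread_mutex_lock(&state_lock);
  /* ... use *value to update shared state ... */
  pthread_mutex_unlock(&state_lock);
  return 0;
}

/* Returns true if state_lock is currently held by anyone. */
bool lock_is_stuck(void)
{
  if (pthread_mutex_trylock(&state_lock) == 0) {
    pthread_mutex_unlock(&state_lock);
    return false;
  }
  return true;
}
```

After update_papered_over(NULL) returns, lock_is_stuck() reports true, and
the next caller that blocks on state_lock hangs far away from the original
bug. The fail-fast variant stops at the bad call instead.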

In the context of stack overflow in Emacs, if we're getting to this code,
it's because we made a mistake [1] in the C core. There's no user
interaction that should cause us to overflow the stack. I'd rather know
about that mistake and get a user back into a working Emacs as soon as
possible.

[1] The GC tracing thing is concerning, but Paul's post actually gives
me an idea for fixing it without completely redoing marking: we can
reserve GC stack at the same time we allocate Lisp objects, say in 2MB
chunks, and then just switch stacks as we mark.
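
A rough sketch of only the "switch stacks" part of that idea, using the
ucontext functions (getcontext, makecontext, swapcontext) available in
glibc. This is an assumption about one possible mechanism, not how Emacs
marks objects: mark_recursively is a hypothetical stand-in for the real
marking recursion, the stack is allocated on demand rather than alongside
Lisp objects, and a single fixed 2MB chunk is used.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define MARK_STACK_SIZE (2u * 1024 * 1024)   /* one 2MB chunk */

static ucontext_t caller_ctx, mark_ctx;
static char *mark_stack;
static int requested_depth;
static int reached_depth;
static bool ran_on_mark_stack;

/* Hypothetical stand-in for recursive marking: recurse to the
   requested depth and check that each frame lives on mark_stack. */
static void mark_recursively(int depth)
{
  volatile char probe = 0;
  uintptr_t p = (uintptr_t) &probe;
  uintptr_t lo = (uintptr_t) mark_stack;

  if (p < lo || p >= lo + MARK_STACK_SIZE)
    ran_on_mark_stack = false;
  if (depth > reached_depth)
    reached_depth = depth;
  if (depth < requested_depth)
    mark_recursively(depth + 1);
  probe = 1;                    /* keeps the call above from being a tail call */
}

/* Entry point on the private stack; returning resumes uc_link. */
static void mark_entry(void)
{
  mark_recursively(1);
}

/* Runs the recursion on a separately allocated stack.  Returns the
   depth reached, or -1 on failure. */
int mark_on_private_stack(int depth)
{
  mark_stack = malloc(MARK_STACK_SIZE);
  if (mark_stack == NULL)
    return -1;

  requested_depth = depth;
  reached_depth = 0;
  ran_on_mark_stack = true;

  if (getcontext(&mark_ctx) < 0) {
    free(mark_stack);
    return -1;
  }
  mark_ctx.uc_stack.ss_sp = mark_stack;
  mark_ctx.uc_stack.ss_size = MARK_STACK_SIZE;
  mark_ctx.uc_link = &caller_ctx;         /* where to go when mark_entry returns */
  makecontext(&mark_ctx, mark_entry, 0);

  if (swapcontext(&caller_ctx, &mark_ctx) < 0) {
    free(mark_stack);
    return -1;
  }

  /* Back on the original stack here. */
  free(mark_stack);
  mark_stack = NULL;
  return ran_on_mark_stack ? reached_depth : -1;
}
```

Calling mark_on_private_stack(5000) runs all 5000 frames on the heap
allocation instead of the thread's own stack. A real version would have
to size and chain the chunks in proportion to the number of live objects,
which this sketch does not attempt.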

> Also, note that other cases of error recovery leading to undefined behavior
> exist in the wild: If a process uses too much memory, Linux's OOM killer will
> terminate arbitrary processes in an attempt to prevent system lockup. There
> are no guarantees that it will not kill something that leaves the system in an
> inconsistent or bad state, since the process it kills may have been in the
> middle of a critical process, and the author might not have written proper
> signal handlers.

Nit: unfortunately, it's not possible for normal processes to even
detect the Linux OOM killer's operation. Death comes quickly via
SIGKILL; if you want to recover, you need a watchdog.
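
A minimal sketch of such a watchdog, assuming a POSIX system: the parent
forks the real work into a child and inspects the wait status. The names
(watch_worker, worker_fate, die_by_sigkill, exit_normally) are
hypothetical, and raise(SIGKILL) merely stands in for the OOM killer. Note
that the wait status alone cannot tell an OOM kill from any other SIGKILL.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum worker_fate {
  WORKER_EXITED,        /* child called exit or returned */
  WORKER_KILLED,        /* child died from SIGKILL */
  WORKER_OTHER_SIGNAL,  /* child died from some other signal */
  WORKER_ERROR          /* fork or waitpid failed */
};

/* Run worker() in a child process and report how the child ended.
   The child never gets a chance to observe SIGKILL; only the parent
   can notice it, after the fact. */
enum worker_fate watch_worker(void (*worker)(void))
{
  pid_t pid = fork();
  if (pid < 0)
    return WORKER_ERROR;

  if (pid == 0) {
    worker();
    _exit(0);
  }

  int status;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR)
      return WORKER_ERROR;
  }

  if (WIFEXITED(status))
    return WORKER_EXITED;
  if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
    return WORKER_KILLED;
  if (WIFSIGNALED(status))
    return WORKER_OTHER_SIGNAL;
  return WORKER_ERROR;
}

/* Stand-in for an OOM kill: the process is gone before raise returns. */
void die_by_sigkill(void)
{
  raise(SIGKILL);
}

/* A worker that finishes without incident. */
void exit_normally(void)
{
}
```

A parent that gets WORKER_KILLED back from watch_worker(die_by_sigkill)
can then restart the worker or trigger whatever recovery it has; the
worker itself has no opportunity to clean up.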

You're right that under Linux, programs need to prepare for the
possibility that they might suddenly cease to exist. We're talking about
something different here, which is the possibility that a program can
*keep running*, but in a damaged and undefined state.

> I'm inclined to leave the stack overflow protection in until it bites us;
> because I know from personal evidence that having Emacs suddenly disappear
> DOES bite people. I'm less sure about "undefined behavior" that I haven't
> experienced yet...

I'm worried that it'll be hard to know if it bites us, particularly
since the problems I'm imagining are infrequent, unreproducible, and
carry no obvious signature that would show up in a user crash report.