On 01/03/2016 12:25 PM, John Wiegley wrote:
>>>>>> Daniel Colascione writes:
>
>> In practice, the Lisp stack depth limits provide enough protection, and the
>> risk of data corruption is too great. The existing auto-save logic is good
>> enough for data recovery, especially if we run the sigsegv handler on the
>> alternate signal stack (which we can make as large as we want) when
>> possible.
>
> OK, I see we have two roads, and I see where your objection is coming from.
>
> You say, "In practice". Can you expound on your practical experience? I'm
> curious if there's a real experience you've had that leads to such a strong
> objection.

I hate to use arguments from experience, but you asked: I worked on crash
reporting for Windows Phone, and I do significant work for crash reporting
on Messenger and Facebook for Android. I've worked extensively with
Breakpad, ACRA, multiplexed unix signal handlers, crash classification, and
so on.

In my experience, attempts to recover from crashes have almost always made
problems worse: they obscure root causes of important bugs by causing
seemingly-impossible downstream crashes and data corruption.

It's not just a theoretical problem: I've spent lots of late nights staring
at stack traces, trying to figure out how a certain deadlock could be
possible, only to realize that the program had already crashed (or would
have, if a seldom-tested bit of code hadn't checked for NULL and returned
without releasing a lock, causing a hang half an hour later).

It's even worse with an SEH handler, which allows programmers writing for
Windows to do this:

    for (;;) {
      __try {
        DoSomething();
      } __except (1) {
        // LOL: silently ignore all stack overflow, NULL deref, etc.
      }
    }

The Emacs error recovery code is similar in spirit. Granted, it's not
silent, and we don't try to recover from *all* segfaults, but it's still
essentially ignoring a programming error and trying to continue.

It's because I've wasted so much time debugging these kinds of programs
that I strongly prefer failing fast when something goes wrong and relying
on automatic persistence mechanisms to preserve volatile data. I've seen
dozens of simple bugs (that could have been quickly fixed) turn into
monsters because someone tried to paper over them and keep a program from
crashing.

In the context of stack overflow in Emacs, if we're getting to this code,
it's because we made a mistake [1] in the C core. There's no user
interaction that should cause us to overflow the stack. I'd rather know
about that mistake and get a user back into a working Emacs as soon as
possible.

[1] The GC tracing thing is concerning, but Paul's post actually gives me
an idea for fixing it without completely redoing marking: we can reserve GC
stack at the same time we allocate Lisp objects, say in 2MB chunks, and
then just switch stacks as we mark.

> Also, note that other cases of error recovery leading to undefined behavior
> exist in the wild: If a process uses too much memory, Linux's OOM killer will
> terminate arbitrary processes in an attempt to prevent system lockup. There
> are no guarantees that it will not kill something that leaves the system in an
> inconsistent or bad state, since the process it kills may have been in the
> middle of a critical process, and the author might not have written proper
> signal handlers.

Nit: unfortunately, it's not possible for normal processes to even detect
the Linux OOM killer's operation. Death comes quickly via SIGKILL; if you
want to recover, you need a watchdog.

You're right that under Linux, programs need to prepare for the possibility
that they might suddenly cease to exist. We're talking about something
different here, which is the possibility that a program can *keep running*,
but in a damaged and undefined state.
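
To illustrate the watchdog point: since SIGKILL cannot be caught, only
another process can notice the death. Here is a minimal sketch, assuming a
plain fork/waitpid supervisor. The names run_supervised, simulate_oom_kill,
and exit_cleanly are hypothetical, and the self-inflicted SIGKILL merely
stands in for the OOM killer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs work() in a child process and waits for it. Returns the number of
   the signal that killed the child, 0 if it exited on its own, or -1 if
   fork/waitpid failed. */
int run_supervised(void (*work)(void))
{
    assert(work != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        work();
        _exit(0);  /* child: skip the parent's atexit handlers */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;  /* retry only when interrupted by a signal */
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Stand-in for an OOM kill: the process dies without running any handler. */
void simulate_oom_kill(void)
{
    kill(getpid(), SIGKILL);
}

/* A worker that finishes normally. */
void exit_cleanly(void)
{
}
```

A real watchdog would restart the worker or kick off recovery when
run_supervised returns SIGKILL. Note that the wait status alone does not
distinguish an OOM kill from any other SIGKILL.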

> I'm inclined to leave the stack overflow protection in until it bites us;
> because I know from personal evidence that having Emacs suddenly disappear
> DOES bite people. I'm less sure about "undefined behavior" that I haven't
> experienced yet...

I'm worried that it'll be hard to know if it bites us, particularly since
the problems I'm imagining are infrequent, unreproducible, and carry no
obvious signature that would show up in a user crash report.