>>>>> Daniel Colascione writes:

> It's not just a theoretical problem: I've spent lots of late nights staring
> at stack traces, trying to figure out how a certain deadlock could be
> possible, only to realize that the program had already crashed --- or would
> have, if a seldom-tested bit of code hadn't checked for NULL and returned
> without releasing a lock, causing a hang half an hour later.

I see.  Isn't what you describe an argument against error handling in
general, though?  It too can mask the origin of serious problems.

What if we do this:

  1. When a serious error occurs that engages crash recovery, we pop up a
     window in Emacs describing that a serious error occurred that would
     have crashed Emacs -- and that *nothing* should be trusted now.  All
     the user should do is save critical buffers and exit immediately.

  2. When in such a state, M-x report-emacs-bug automatically includes a
     trace for the location where the crash occurred.

Of course, this assumes Emacs is still functional enough to send e-mail.

> You're right that under Linux, programs need to prepare for the possibility
> that they might suddenly cease to exist. We're talking about something
> different here, which is the possibility that a program can *keep running*,
> but in a damaged and undefined state.

I was thinking the system itself is now running in a damaged and undefined
state.  When that happens, I often reboot since I can't really trust it
anymore.

> I'm worried that it'll be hard to know if it bites us, particularly since
> the problems I'm imagining are infrequent, unreproducible, and carry no
> obvious signature that would show up in a user crash report.

If we use a window to pop up an alarm indicating, boldly, that Emacs is now
UNSTABLE and should only be used to save files and exit -- maybe even noting
how to abort Emacs to avoid typical cleanup actions -- we can start getting
feedback on whether this feature really helps or hurts.

I understand error handlers can mask problems, and that they've made your
life more difficult as an engineer concerned with uncovering such causes.
However, I'm disinclined to accept, a priori, that it will hurt before
trying it out.

When Emacs isn't being run under gdb (which it almost never is) it also
doesn't give much useful information about what happened, and loses data.
With the crash recovery logic, we should at least be able to provide a trace
of where we were when the crash was detected, plus give the user a chance of
reporting that data back to us.

I see this as possibly *increasing* the amount of error information we
receive, and not just masking or eliminating it.

--
John Wiegley                  GPG fingerprint = 4710 CF98 AF9B 327B B80F
http://newartisans.com                          60E1 46C4 BD1A 7AC1 4BA2