>>>>> Daniel Colascione <dancol@dancol.org> writes:

> It's not just a theoretical problem: I've spent lots of late nights staring
> at stack traces, trying to figure out how a certain deadlock could be
> possible, only to realize that the program had already crashed --- or would
> have, if a seldom-tested bit of code hadn't checked for NULL and returned
> without releasing a lock, causing a hang half an hour later.

I see. Isn't what you describe an argument against error handling in general,
though? It too can mask the origin of serious problems.

What if we do this:

  1. When a serious error occurs that engages crash recovery, we pop up a
     window in Emacs describing that a serious error occurred that would have
     crashed Emacs -- and that *nothing* should be trusted now. All the user
     should do is save critical buffers and exit immediately.

  2. When in such a state, M-x report-emacs-bug automatically includes a trace
     for the location where the crash occurred. Of course, this assumes Emacs
     is still functional enough to send e-mail.
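
To make the two steps concrete, here is a rough sketch of the control flow,
written in Python only for brevity. Nothing here is Emacs code: CrashState,
run_guarded, and bad_command are hypothetical names, and a Python exception
merely stands in for the fatal error that would engage crash recovery.

```python
import traceback
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CrashState:
    """Hypothetical record of whether crash recovery has been engaged."""
    unstable: bool = False
    trace: Optional[str] = None
    warnings: List[str] = field(default_factory=list)

    def engage(self, exc: BaseException) -> None:
        """Step 1: remember where we were and warn the user loudly."""
        if self.unstable:
            # Keep the trace of the *first* failure; later ones are suspect.
            return
        self.unstable = True
        self.trace = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__))
        self.warnings.append(
            "Emacs is now UNSTABLE: save critical buffers and exit immediately.")

    def bug_report(self, user_text: str) -> str:
        """Step 2: append the saved trace to any report sent while unstable."""
        if not self.unstable:
            return user_text
        return f"{user_text}\n\n--- crash recovery trace ---\n{self.trace}"


def run_guarded(state: CrashState, command: Callable[[], None]) -> None:
    """Run a command; an exception stands in for a would-be crash."""
    try:
        command()
    except Exception as exc:
        state.engage(exc)


def bad_command() -> None:
    """Hypothetical command that hits a serious error."""
    raise RuntimeError("simulated fatal error")
```

After run_guarded(state, bad_command), state.unstable is set, the warning is
queued for display, and state.bug_report("...") carries the saved trace. The
point of keeping only the first trace is that anything recorded after the
initial failure comes from a state that can no longer be trusted.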

> You're right that under Linux, programs need to prepare for the possibility
> that they might suddenly cease to exist. We're talking about something
> different here, which is the possibility that a program can *keep running*,
> but in a damaged and undefined state.

I was thinking the system itself is now running in a damaged and undefined
state. When that happens, I often reboot since I can't really trust it
anymore.

> I'm worried that it'll be hard to know if it bites us, particularly since
> the problems I'm imagining are infrequent, unreproducible, and carry no
> obvious signature that would show up in a user crash report.

If we use a window to pop up an alarm indicating, boldly, that Emacs is now
UNSTABLE and should only be used to save files and exit -- maybe even noting
how to abort Emacs to avoid typical cleanup actions -- we can start getting
feedback on whether this feature really helps or hurts.

I understand error handlers can mask problems, and that they've made your life
more difficult as an engineer concerned with uncovering such causes. However,
I'm disinclined to accept, a priori, that it will hurt before trying it out.

When Emacs isn't being run under gdb (which it almost never is), a crash
doesn't give much useful information about what happened, and it loses data.
With the crash recovery logic, we should at least be able to provide a trace
of where we were when the crash was detected, plus give the user a chance to
report that data back to us. I see this as possibly *increasing* the amount
of error information we receive, and not just masking or eliminating it.

-- 
John Wiegley                  GPG fingerprint = 4710 CF98 AF9B 327B B80F
http://newartisans.com                          60E1 46C4 BD1A 7AC1 4BA2