Mueen Nawaz writes:

> After a lot of poking around, I figured out the problem, and this may be
> of interest to the developers (although not sure if it is a xapian issue
> or a notmuch issue).
>
> Here's why it would freeze:
>
> I have a post-new hook that runs a Python script. Depending on whether
> the new email it is processing matches a rule I have, it will fire off
> an email to the sender using the SMTP library in Python.
>
> I had recently upgraded my MTA (PostFix), and it had a backward
> incompatible change that broke my config. I don't know why, but I could
> still send emails via Emacs, but when I tried to send them via Python,
> Postfix would log an error and it would not send. The Python statement
> would freeze (I guess Postfix doesn't return an appropriate response?
> Not sure why).
>
> I have a cron job to run "notmuch new" 3 times an hour. Since the hook
> was frozen, so was the notmuch new command. I had quite a lot of
> "notmuch new" processes. I assume this meant the DB was locked all this
> time for writing.

notmuch unlocks the database before running the hook, so I don't
understand how a hung hook results in a locked database. If it happens
again (or you're motivated to set up a testbed) I'd be interested in the
output of

    lsof ~/Maildir/.notmuch/xapian/flintlock

Also, is this by chance a network file system? Because those often break
locking.

> Now killing all those jobs did not fix the database. It was still
> broken. And as we saw the second time round, it was /really/ broken - it
> would not even open in read-only mode.

That seems like something the Xapian devs (in copy) might be interested
in fixing, if you could come up with a simple reproducer.

> It is scary that if a post-new hook freezes while the database is
> locked, it could (eventually) clobber the database. I don't know if
> notmuch can do anything to prevent this outcome?

notmuch could be cleverer about timing out on trying to acquire a lock.
I suspect it's a bit delicate to get that right, and I've been hoping
the underlying primitives would get a bit more flexible w.r.t. locking.

We could also potentially run hooks in the equivalent of "timeout", but
I don't know how much code that would be. A simpler option (once we
understand what the real problem is) would be to suggest that users use
timeout themselves in hooks to be run unattended.
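
For the "use timeout themselves" option, here is a rough sketch of what a
Python hook could do, assuming the hook starts the slow part (such as the
auto-reply script) as a separate process. The name run_with_timeout is
made up for illustration and is not part of notmuch. It mimics the exit
status 124 that coreutils timeout(1) reports when the time limit is hit.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd (a list of arguments) with a time limit.

    Returns the command's exit status, or 124 if it had to be killed.
    Hypothetical helper for illustration only.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        print(f"hook timed out after {seconds}s: {cmd}", file=sys.stderr)
        return 124  # status coreutils timeout(1) uses on a timeout
```

A hook might then call something like
run_with_timeout(["python3", "/path/to/auto-reply.py"], 60), where the
script path is a placeholder. Note that only the direct child is killed,
so anything it spawned itself may survive. If the hang is inside smtplib,
passing a timeout argument to smtplib.SMTP() is probably the more direct
fix.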