From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id C43276DE10BA
 for ; Mon, 10 Sep 2018 04:01:14 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: 0.002
X-Spam-Level:
X-Spam-Status: No, score=0.002 tagged_above=-999 required=5
 tests=[AWL=0.013, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01]
 autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id fZ69X0Hv1qFe for ; Mon, 10 Sep 2018 04:01:13 -0700 (PDT)
Received: from fethera.tethera.net (fethera.tethera.net [198.245.60.197])
 by arlo.cworth.org (Postfix) with ESMTPS id 1136F6DE10B8
 for ; Mon, 10 Sep 2018 04:01:12 -0700 (PDT)
Received: from remotemail by fethera.tethera.net with local (Exim 4.89)
 (envelope-from ) id 1fzJwK-0002cv-UA; Mon, 10 Sep 2018 07:01:08 -0400
Received: (nullmailer pid 8300 invoked by uid 1000);
 Mon, 10 Sep 2018 11:01:07 -0000
From: David Bremner
To: Mueen Nawaz , notmuch@notmuchmail.org
Cc: xapian-discuss@lists.xapian.org
Subject: Re: Notmuch DB Problems
In-Reply-To: <87tvn1e32k.fsf@nawaz.org>
References: <47409d4ed692a336458371102bcbcbd86ab4a067@webmail.nawaz.org>
 <874lf3es5a.fsf@nikula.org> <87tvn1e32k.fsf@nawaz.org>
X-List-To: notmuch
Date: Mon, 10 Sep 2018 08:01:06 -0300
Message-ID: <87a7opk45p.fsf@tethera.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
 micalg=pgp-sha256; protocol="application/pgp-signature"
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 10 Sep 2018 11:01:14 -0000

--=-=-=
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Mueen Nawaz writes:

> After a lot of poking around, I figured out the problem, and this may be
> of interest to the developers (although not sure if it is a xapian issue
> or a notmuch issue).
>
> Here's why it would freeze:
>
> I have a post-new hook that runs a Python script. Depending on whether
> the new email it is processing matches a rule I have, it will fire off
> an email to the sender using the SMTP library in Python.
>
> I had recently upgraded my MTA (PostFix), and it had a backward
> incompatible change that broke my config. I don't know why, but I could
> still send emails via Emacs, but when I tried to send them via Python,
> Postfix would log an error and it would not send. The Python statement
> would freeze (I guess Postfix doesn't return an appropriate response?
> Not sure why).
>
>
> I have a cron job to run "notmuch new" 3 times an hour. Since the hook
> was frozen, so was the notmuch new command. I had quite a lot of
> "notmuch new" processes. I assume this meant the DB was locked all this
> time for writing.

notmuch unlocks the database before running the hook, so I don't
understand how a hung hook results in a locked database. If it happens
again (or you're motivated to set up a testbed) I'd be interested in the
output of

      lsof ~/Maildir/.notmuch/xapian/flintlock

Also, is this by chance a network file system? Because those often break
locking.

> Now killing all those jobs did not fix the database. It was still
> broken. And as we saw the second time round, it was /really/ broken - it
> would not even open in read-only mode.

That seems like something the Xapian devs (in copy) might be interested
in fixing, if you could come up with a simple reproducer.

> It is scary that if a post-new hook freezes while the database is
> locked, it could (eventually) clobber the database. I don't know if
> notmuch can do anything to prevent this outcome?

notmuch could be cleverer about timing out on trying to acquire a
lock. I suspect it's a bit delicate to get that right, and I've been
hoping the underlying primitives would get a bit more flexible
w.r.t. locking. We could also potentially run hooks in the equivalent of
"timeout", but I don't know how much code that would be. A simpler
option (once we understand what the real problem is) would be to suggest
that users use timeout themselves in hooks to be run unattended.

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQGzBAEBCAAdFiEE3VS2dnyDRXKVCQCp8gKXHaSnniwFAluWTvIACgkQ8gKXHaSn
niwevwv/X3eYbvSWntX4Z/ng04J0+0wi9904YGUc+an281W8NTkQmlawJI9AM48X
+nizypStsOTeWsQzDJRDgIcS6LXHFDzhU/6IK6PD4p72+Tidg3xHEFNH0QikbkjP
Ihy/VGFHtT0aYHhYC285ryCM3PMDqY13K5tGNvenmgtUQyJj72cFMDHd7RrDe/Jl
gSerqRERLPKSY5oOtqKGHD49sNStLvwGH6IjQ7M9zgSJixzErLYcwq2nscbTX1/9
KLYO0LLIOyJwV1sY9jzzaDRroalVVXpxMDY+iCSoDtK5qk3VBcpoCEAEx0B8jVW3
6PcXavGq3jdoaNdqNIBbnQMoqi5k7t6cWW4bCgQtdpVumd0v+68Pc24MZc7rmEHE
91pVJWOkkkxFqHnzYI6yx5Ncu7EPG6y2oWxYSvz+QxYwpd3rbbJ+hZCcx4AOdAri
BkARNiHuS0iHMhCxT/pFR7l5PELJ8GFh5BSA0CaF0K7f50A/Aa5+kWaSGq+ybnqN
1dKx/Tn8
=ZRSk
-----END PGP SIGNATURE-----
--=-=-=--
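
The closing suggestion in the message, that users apply a timeout themselves
in hooks meant to run unattended, can be sketched in Python, the language of
the hook described in the thread. This is only an illustrative sketch under
stated assumptions, not notmuch behaviour and not the original poster's
script: the names run_with_timeout, send_reply and SMTP_TIMEOUT_SECONDS, the
30-second value, and the localhost:25 relay are all hypothetical. The timeout
arguments of subprocess.run and smtplib.SMTP are standard-library features;
returning 124 on expiry simply mirrors the exit status of GNU "timeout".

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Hypothetical limit; pick a value that suits the hook.
SMTP_TIMEOUT_SECONDS = 30


def run_with_timeout(cmd, seconds):
    """Run cmd and return its exit status, or 124 if it exceeds `seconds`.

    subprocess.run kills the child when the timeout expires and then raises
    TimeoutExpired, so a hung command cannot pile up across cron runs.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return 124


def send_reply(msg: EmailMessage, host="localhost", port=25):
    """Send msg through an SMTP relay, giving up if the server stalls.

    With timeout set, blocking socket operations raise an OSError subclass
    instead of waiting forever on an unresponsive MTA.
    """
    with smtplib.SMTP(host, port, timeout=SMTP_TIMEOUT_SECONDS) as smtp:
        smtp.send_message(msg)
```

A post-new hook could either call send_reply directly, so that the SMTP
conversation itself is bounded, or be launched from a small wrapper such as
run_with_timeout(["python3", "hook.py"], 60), where hook.py is a placeholder
name. Neither approach addresses the locking question raised earlier in the
message; both only keep a stalled hook from lingering indefinitely.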