unofficial mirror of notmuch@notmuchmail.org
* Notmuch DB Problems
@ 2018-09-05  1:00 mueen
  2018-09-05 18:05 ` Jani Nikula
  0 siblings, 1 reply; 10+ messages in thread
From: mueen @ 2018-09-05  1:00 UTC (permalink / raw)
  To: notmuch


Hi,

A few days ago I noticed notmuch new was no longer working (I have it
as a cron job so it took a while to figure it out).

It just freezes. I do have a Python hook, and it was freezing on the
line that opens the database.

I tried a notmuch dump. Same problem - freezes

Based on some earlier threads, I tried a notmuch compact. Same problem
- freezes.

All these freezes seem to use no memory/CPU. 

Interestingly, queries work fine - from both the command line and the
Emacs interface. So I can read old stuff just fine. But all the
commands above cause a freeze. 

Currently using notmuch-0.24.2. I tried notmuch-0.27 - same problem.

Results of a xapian check:

docdata:
blocksize=8K items=6 firstunused=3 revision=6442 levels=0 root=0
B-tree checked okay
docdata table structure checked OK

termlist:
blocksize=8K items=178562 firstunused=53441 revision=6442 levels=2
root=46086
/usr/bin/xapian-check: DatabaseError: 1 unused block(s) missing from
the free list, first is 0

What are my options? Unfortunately the last dump I have is many months
old, so I'm a bit wary of deleting the database and rebuilding. Given
that the show and search commands work, I was wondering if I can write
a script to get all the message/thread ID's for all the tags and store
them, and then rebuild the database and use that stored information to
retag all my messages (all without using the dump command)?
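
A minimal sketch of the script idea above, assuming `notmuch search` still answers read-only queries as described. It relies only on `notmuch search --output=tags` and `notmuch search --output=messages`; the function names, the JSON layout and the `run` parameter (there so a fake runner can be substituted) are illustrative assumptions, not part of notmuch.

```python
import json
import subprocess


def notmuch_lines(args, run=subprocess.run):
    """Run a read-only notmuch query and return its non-empty output lines."""
    result = run(["notmuch", *args], capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line]


def tag_query(tag):
    """Build a tag: query, quoting the tag so spaces survive the query parser."""
    return 'tag:"{}"'.format(tag.replace('"', '""'))


def collect_tags(run=subprocess.run):
    """Map each tag to the message IDs carrying it, using only `notmuch search`."""
    tags = notmuch_lines(["search", "--output=tags", "*"], run)
    return {
        tag: notmuch_lines(["search", "--output=messages", tag_query(tag)], run)
        for tag in tags
    }


def save_tags(path, run=subprocess.run):
    """Write the tag -> message-ID mapping to `path` as JSON (hypothetical layout)."""
    mapping = collect_tags(run)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2)
    return mapping
```

Calling `save_tags("tags.json")` before the rebuild would leave a file from which each tag could be reapplied with `notmuch tag +TAG -- id:...`. The sketch stores message IDs only, on the assumption that thread IDs are assigned at index time and may not survive a rebuild.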

Thanks!

Mueen



* Re: Notmuch DB Problems
  2018-09-05  1:00 Notmuch DB Problems mueen
@ 2018-09-05 18:05 ` Jani Nikula
  2018-09-07 15:32   ` Mueen Nawaz
  0 siblings, 1 reply; 10+ messages in thread
From: Jani Nikula @ 2018-09-05 18:05 UTC (permalink / raw)
  To: mueen, notmuch

On Tue, 04 Sep 2018, mueen@nawaz.org wrote:
> Hi,
>
> A few days ago I noticed notmuch new was no longer working (I have it
> as a cron job so it took a while to figure it out).
>
> It just freezes. I do have a Python hook, and it was freezing on the
> line that opens the database.
>
> I tried a notmuch dump. Same problem - freezes
>
> Based on some earlier threads, I tried a notmuch compact. Same problem
> - freezes.
>
> All these freezes seem to use no memory/CPU. 
>
> Interestingly, queries work fine - from both the command line and the
> Emacs interface. So I can read old stuff just fine. But all the
> commands above cause a freeze. 
>
> Currently using notmuch-0.24.2. I tried notmuch-0.27 - same problem.
>
> Results of a xapian check:
>
> docdata:
> blocksize=8K items=6 firstunused=3 revision=6442 levels=0 root=0
> B-tree checked okay
> docdata table structure checked OK
>
> termlist:
> blocksize=8K items=178562 firstunused=53441 revision=6442 levels=2
> root=46086
> /usr/bin/xapian-check: DatabaseError: 1 unused block(s) missing from
> the free list, first is 0
>
> What are my options? Unfortunately the last dump I have is many months
> old, so I'm a bit wary of deleting the database and rebuilding. Given
> that the show and search commands work, I was wondering if I can write
> a script to get all the message/thread ID's for all the tags and store
> them, and then rebuild the database and use that stored information to
> retag all my messages (all without using the dump command)?

It might be interesting to see an strace log to possibly get an idea
where it gets stuck.

Is the filesystem writable and working okay?

If search and show work, I'm guessing it gets stuck in trying to open
the database writable. One hackish idea is to patch notmuch dump to open
the database in read-only mode, and dump the tags. See below. The dump
command opens the database writable to prevent changes while
dumping. (Arguably this could be a command line option for cases like
yours.)

BR,
Jani.

diff --git a/notmuch-dump.c b/notmuch-dump.c
index ef2f02dfeb5c..d06dbcf50224 100644
--- a/notmuch-dump.c
+++ b/notmuch-dump.c
@@ -364,7 +364,7 @@ notmuch_dump_command (notmuch_config_t *config, int argc, char *argv[])
     int ret;
 
     if (notmuch_database_open (notmuch_config_get_database_path (config),
-			       NOTMUCH_DATABASE_MODE_READ_WRITE, &notmuch))
+			       NOTMUCH_DATABASE_MODE_READ_ONLY, &notmuch))
 	return EXIT_FAILURE;
 
     notmuch_exit_if_unmatched_db_uuid (notmuch);


* Re: Notmuch DB Problems
  2018-09-05 18:05 ` Jani Nikula
@ 2018-09-07 15:32   ` Mueen Nawaz
  2018-09-10 11:01     ` David Bremner
  0 siblings, 1 reply; 10+ messages in thread
From: Mueen Nawaz @ 2018-09-07 15:32 UTC (permalink / raw)
  To: Jani Nikula, notmuch

Jani Nikula <jani@nikula.org> writes:

> It might be interesting to see an strace log to possibly get an idea
> where it gets stuck.
>
> Is the filesystem writable and working okay?
>
> If search and show work, I'm guessing it gets stuck in trying to open
> the database writable. One hackish idea is to patch notmuch dump to open
> the database in read-only mode, and dump the tags. See below. The dump
> command opens the database writable to prevent changes while
> dumping. (Arguably this could be a command line option for cases like
> yours.)

Thanks - your patch worked. I dumped all the tags, deleted the database,
rebuilt it and restored the tags. All was well.

Until the following day, that is: at noon I noticed the problem was
back. By evening, I could not even do queries - it wouldn't open even in
read-only mode. The database was dead.

After a lot of poking around, I figured out the problem, and this may be
of interest to the developers (although not sure if it is a xapian issue
or a notmuch issue).

Here's why it would freeze:

I have a post-new hook that runs a Python script. Depending on whether
the new email it is processing matches a rule I have, it will fire off
an email to the sender using the SMTP library in Python.

I had recently upgraded my MTA (Postfix), and the upgrade included a
backward-incompatible change that broke my config. I don't know why, but
I could still send emails via Emacs; when I tried to send them via
Python, Postfix would log an error and not send. The Python statement
would freeze (I guess Postfix doesn't return an appropriate response?
Not sure why).
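
A minimal sketch of one way to keep a send like this from freezing, assuming the hook uses Python's standard `smtplib`. The function name, the default host and port, and the 30-second value are illustrative assumptions; the hook's actual code is not shown in this thread.

```python
import smtplib


def send_with_timeout(message, sender, recipient,
                      host="localhost", port=25, timeout=30.0):
    """Try to send one message; return False rather than waiting forever."""
    try:
        # `timeout` applies to the connection and to each blocking socket read.
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.sendmail(sender, [recipient], message)
        return True
    except OSError as exc:
        # smtplib's exceptions and socket timeouts both derive from OSError.
        print("mail not sent:", exc)
        return False
```

With a timeout set, a server that accepts the connection but never answers produces an error after `timeout` seconds, so the hook can log the failure and move on.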


I have a cron job to run "notmuch new" 3 times an hour. Since the hook
was frozen, so was the notmuch new command. I had quite a lot of
"notmuch new" processes. I assume this meant the DB was locked all this
time for writing.
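
A minimal sketch of a guard that would stop cron runs from piling up like this, assuming the job is started through a small wrapper. The lock-file path is up to the caller and the function name is hypothetical; this uses `flock` on a separate file and is unrelated to the database's own lock.

```python
import fcntl
import os


def try_exclusive(lock_path):
    """Return an fd holding an exclusive flock, or None if another run holds it."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of queueing behind
        # a run that is still stuck.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A wrapper would call `try_exclusive(...)`, exit quietly on `None`, and otherwise run `notmuch new` while keeping the descriptor open; the lock is released when the process exits.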

Now killing all those jobs did not fix the database. It was still
broken. And as we saw the second time round, it was /really/ broken - it
would not even open in read-only mode.

It is scary that if a post-new hook freezes while the database is
locked, it could (eventually) clobber the database. I don't know if
notmuch can do anything to prevent this outcome?

BTW, I think the DB would die only after a while. In my experiments, if
I killed the hook soon (e.g. under 1 minute), the database seemed fine. 

-- 
Don't use a big word where a diminutive one will suffice.


                    /\  /\               /\  /
                   /  \/  \ u e e n     /  \/  a w a z
                       >>>>>>mueen@nawaz.org<<<<<<
                                   anl

 


* Re: Notmuch DB Problems
  2018-09-07 15:32   ` Mueen Nawaz
@ 2018-09-10 11:01     ` David Bremner
  2018-09-10 14:21       ` Mueen Nawaz
  2018-09-10 21:24       ` Olly Betts
  0 siblings, 2 replies; 10+ messages in thread
From: David Bremner @ 2018-09-10 11:01 UTC (permalink / raw)
  To: Mueen Nawaz, notmuch; +Cc: xapian-discuss

Mueen Nawaz <mueen@nawaz.org> writes:


> After a lot of poking around, I figured out the problem, and this may be
> of interest to the developers (although not sure if it is a xapian issue
> or a notmuch issue).
>
> Here's why it would freeze:
>
> I have a post-new hook that runs a Python script. Depending on whether
> the new email it is processing matches a rule I have, it will fire off
> an email to the sender using the SMTP library in Python.
>
> I had recently upgraded my MTA (PostFix), and it had a backward
> incompatible change that broke my config. I don't know why, but I could
> still send emails via Emacs, but when I tried to send them via Python,
> Postfix would log an error and it would not send. The Python statement
> would freeze (I guess Postfix doesn't return an appropriate response?
> Not sure why). 
>
>
> I have a cron job to run "notmuch new" 3 times an hour. Since the hook
> was frozen, so was the notmuch new command. I had quite a lot of
> "notmuch new" processes. I assume this meant the DB was locked all this
> time for writing.

notmuch unlocks the database before running the hook, so I don't
understand how a hung hook results in a locked database. If it happens
again (or you're motivated to set up a testbed) I'd be interested in the
output of

           lsof ~/Maildir/.notmuch/xapian/flintlock

Also, is this by chance a network file system? Because those often
break locking.

> Now killing all those jobs did not fix the database. It was still
> broken. And as we saw the second time round, it was /really/ broken - it
> would not even open in read-only mode.

That seems like something the Xapian devs (in copy) might be interested
in fixing, if you could come up with a simple reproducer.

> It is scary that if a post-new hook freezes while the database is
> locked, it could (eventually) clobber the database. I don't know if
> notmuch can do anything to prevent this outcome?

notmuch could be cleverer about timing out on trying to acquire a
lock. I suspect it's a bit delicate to get that right, and I've been
hoping the underlying primitives would get a bit more flexible
w.r.t. locking.

We could also potentially run hooks in the equivalent of "timeout", but
I don't know how much code that would be.  A simpler option (once we
understand what the real problem is) would be to suggest that users use
timeout themselves in hooks to be run unattended.
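
A minimal sketch of the "timeout in the hook" idea for a Python hook, assuming the slow step can be run as a separate command. The function name is hypothetical and the limit is whatever the caller chooses.

```python
import subprocess


def run_with_timeout(command, seconds):
    """Run `command`; return its exit status, or None if it overran and was killed."""
    try:
        return subprocess.run(command, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left
        # hanging here (grandchildren started by the command are not covered).
        return None
```

The same effect is available from a shell hook by prefixing the command with coreutils `timeout`.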




* Re: Notmuch DB Problems
  2018-09-10 11:01     ` David Bremner
@ 2018-09-10 14:21       ` Mueen Nawaz
  2018-09-10 15:08         ` David Bremner
  2018-09-10 21:24       ` Olly Betts
  1 sibling, 1 reply; 10+ messages in thread
From: Mueen Nawaz @ 2018-09-10 14:21 UTC (permalink / raw)
  To: David Bremner, xapian-discuss, notmuch

David Bremner <david@tethera.net> writes:

>> Here's why it would freeze: 
>> 
>> I have a post-new hook that runs a Python script. Depending on 
>> whether the new email it is processing matches a rule I have, 
>> it will fire off an email to the sender using the SMTP library 
>> in Python. 
>> 
>> I had recently upgraded my MTA (PostFix), and it had a backward 
>> incompatible change that broke my config. I don't know why, but 
>> I could still send emails via Emacs, but when I tried to send 
>> them via Python, Postfix would log an error and it would not 
>> send. The Python statement would freeze (I guess Postfix 
>> doesn't return an appropriate response?  Not sure why).  
>>  
>> I have a cron job to run "notmuch new" 3 times an hour. Since 
>> the hook was frozen, so was the notmuch new command. I had 
>> quite a lot of "notmuch new" processes. I assume this meant the 
>> DB was locked all this time for writing. 
> 
> notmuch unlocks the database before running the hook, so I don't 
> understand how a hung hook results in a locked database. If it 
> happens again (or you're motivated to set up a testbed) I'd be 
> interested in the output of 

Well, it results in a locked database because I have this in the 
(Python) hook:

DATABASE = notmuch.Database(mode=notmuch.Database.MODE.READ_WRITE)

Soon after that I freeze the new messages. And at the end I thaw 
them out. The hang occurs in between the two, I think.

> Also, is this by chance a network file system? Because those 
> often break locking. 

No - regular hard drive.

>> Now killing all those jobs did not fix the database. It was 
>> still broken. And as we saw the second time round, it was 
>> /really/ broken - it would not even open in read-only mode. 
> 
> That seems like something the Xapian devs (in copy) might be 
> interested in fixing, if you could come up with a simple 
> reproducer. 

I can think of two experiments:

1. Write a hook that opens the database as above, and then just 
does nothing (e.g. while True). Let it run, say, for 24 
hours. (Not sure if the "freeze" part is relevant.)

2. Same as the above, but have a cron job that fires "notmuch new" 
every 20 minutes. This will freeze on the database line above (all 
except the first invocation which will be stuck at while True).

After a day of this, check if you can open the database in 
READ_WRITE mode. 

> notmuch could be cleverer about timing out on trying to acquire 
> a lock. I suspect it's a bit delicate to get that right, and 
> I've been hoping the underlying primitives would get a bit more 
> flexible w.r.t. locking. 

I agree having notmuch handle it is not ideal. I was originally 
thinking there should be a default timeout that one can adjust as 
needed. However, when someone does "notmuch new" to build a new 
database, that can take several minutes. And others may have flows 
very different from mine.

At the very least, we probably should know why the DB gets 
clobbered at all.

-- 
Don't take life so seriously.  It won't last.


                    /\  /\               /\  /
                   /  \/  \ u e e n     /  \/  a w a z
                       >>>>>>mueen@nawaz.org<<<<<<
                                   anl

 


* Re: Notmuch DB Problems
  2018-09-10 14:21       ` Mueen Nawaz
@ 2018-09-10 15:08         ` David Bremner
  2018-09-10 15:19           ` Mueen Nawaz
  0 siblings, 1 reply; 10+ messages in thread
From: David Bremner @ 2018-09-10 15:08 UTC (permalink / raw)
  To: Mueen Nawaz, xapian-discuss, notmuch

Mueen Nawaz <mueen@nawaz.org> writes:

>
> DATABASE = notmuch.Database(mode=notmuch.Database.MODE.READ_WRITE)

OK. So your code is locking the database, and never unlocking it
(because of the hang). So that part is at least not mysterious.

> I can think of two experiments:

I was thinking more along the lines of something that could be part of
the notmuch test suite, i.e. run in a few seconds. Or at worst in 10
minutes or so to be usable to debug.

d


* Re: Notmuch DB Problems
  2018-09-10 15:08         ` David Bremner
@ 2018-09-10 15:19           ` Mueen Nawaz
  0 siblings, 0 replies; 10+ messages in thread
From: Mueen Nawaz @ 2018-09-10 15:19 UTC (permalink / raw)
  To: David Bremner, xapian-discuss, notmuch, mueen

David Bremner <david@tethera.net> writes:
>> I can think of two experiments: 
> 
> I was thinking more along the lines of something that could be 
> part of the notmuch test suite, i.e. run in a few seconds. Or at 
> worst in 10 minutes or so to be usable to debug. 

I don't know if this can reliably be done. As I pointed out in my 
earlier post, hanging for a short time (1-2 minutes) did not seem 
to clobber the database. I don't know what the threshold is, and 
whether the threshold is constant. 

-- 
Don't take life so seriously.  It won't last.


                    /\  /\               /\  /
                   /  \/  \ u e e n     /  \/  a w a z
                       >>>>>>mueen@nawaz.org<<<<<<
                                   anl

 


* Re: Notmuch DB Problems
  2018-09-10 11:01     ` David Bremner
  2018-09-10 14:21       ` Mueen Nawaz
@ 2018-09-10 21:24       ` Olly Betts
  2018-09-10 23:45         ` David Bremner
  2018-09-12  2:41         ` Mueen Nawaz
  1 sibling, 2 replies; 10+ messages in thread
From: Olly Betts @ 2018-09-10 21:24 UTC (permalink / raw)
  To: David Bremner; +Cc: Mueen Nawaz, notmuch, xapian-discuss

On Mon, Sep 10, 2018 at 08:01:06AM -0300, David Bremner wrote:
> Mueen Nawaz <mueen@nawaz.org> writes:
> > Now killing all those jobs did not fix the database. It was still
> > broken. And as we saw the second time round, it was /really/ broken - it
> > would not even open in read-only mode.
> 
> That seems like something the Xapian devs (in copy) might be interested
> in fixing, if you could come up with a simple reproducer.

I'm certainly happy to investigate if someone can provide a way for
me to make it happen on demand.

It doesn't make much sense to me that holding the lock alone could be
causing any sort of corruption - that's just an fcntl() lock.
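
A generic sketch (not Xapian's code) of what an advisory fcntl() lock implies: while one process holds the write lock, a second process cannot take it but can still read the file. That would be consistent with searches working while writers stalled. The helper name and the child script are hypothetical.

```python
import fcntl
import subprocess
import sys

# Script run in a second process: read the file, then try a non-blocking
# exclusive lock on it and report the outcome.
CHILD = """
import fcntl, sys
with open(sys.argv[1], "r+") as handle:
    data = handle.read()          # plain reads ignore advisory locks
    try:
        fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("locked", data)
    except OSError:
        print("busy", data)
"""


def probe_lock(path):
    """Ask a separate process to read `path` and try to write-lock it."""
    out = subprocess.run([sys.executable, "-c", CHILD, path],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()
```

If the calling process holds `fcntl.lockf(f, fcntl.LOCK_EX)` on the file, `probe_lock` reports `busy` together with the file's contents; once the lock is dropped it reports `locked`. The lock only serializes writers and does not touch the data.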

I would suggest making sure you're running Xapian 1.4.7 as that fixed a
cursor handling bug which affected notmuch.  I didn't find a way to make
it corrupt on-disk data, but it's hard to be completely certain that it
couldn't ever do that, so ruling out that as a cause would be good.

> notmuch could be cleverer about timing out on trying to acquire a
> lock. I suspect it's a bit delicate to get that right, and I've been
> hoping the underlying primitives would get a bit more flexible
> w.r.t. locking.

You mean in Xapian?  If so, a wishlist bug saying what you're hoping
for might help it happen.

Cheers,
    Olly


* Re: Notmuch DB Problems
  2018-09-10 21:24       ` Olly Betts
@ 2018-09-10 23:45         ` David Bremner
  2018-09-12  2:41         ` Mueen Nawaz
  1 sibling, 0 replies; 10+ messages in thread
From: David Bremner @ 2018-09-10 23:45 UTC (permalink / raw)
  To: Olly Betts; +Cc: notmuch

Olly Betts <olly@survex.com> writes:
>
> You mean in Xapian?  If so, a wishlist bug saying what you're hoping
> for might help it happen.
>
> Cheers,
>     Olly

I filed

  https://trac.xapian.org/ticket/769#ticket

Maybe I'm overthinking it and I should just implement some kind of loop
around a try/catch block to do the open.  For some reason I had the
impression that was a bit tricky to get right.
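
A minimal sketch of that loop, written in Python for illustration only. `open_fn` stands in for whatever call opens the database, and `retry_on` for the lock error it raises (in notmuch's C++ code that would presumably be `Xapian::DatabaseLockError`); the name and defaults are assumptions.

```python
import time


def open_with_retry(open_fn, timeout=10.0, delay=0.5, retry_on=(OSError,)):
    """Call `open_fn` until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return open_fn()
        except retry_on:
            # Give up once another sleep would overrun the deadline, and
            # re-raise the last lock error for the caller to report.
            if time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
```

The delicate parts are probably the choice of `timeout` (a first `notmuch new` can legitimately hold the lock for minutes) and the race between the failed attempt and the next one, neither of which this sketch settles.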


* Re: Notmuch DB Problems
  2018-09-10 21:24       ` Olly Betts
  2018-09-10 23:45         ` David Bremner
@ 2018-09-12  2:41         ` Mueen Nawaz
  1 sibling, 0 replies; 10+ messages in thread
From: Mueen Nawaz @ 2018-09-12  2:41 UTC (permalink / raw)
  To: Olly Betts, David Bremner; +Cc: notmuch, xapian-discuss

Olly Betts <olly@survex.com> writes:

> It doesn't make much sense to me that holding the lock alone 
> could be causing any sort of corruption - that's just an fcntl() 
> lock. 
> 
> I would suggest making sure you're running Xapian 1.4.7 as that 
> fixed a cursor handling bug which affected notmuch.  I didn't 
> find a way to make it corrupt on-disk data, but it's hard to be 
> completely certain that it couldn't ever do that, so ruling out 
> that as a cause would be good. 

I was running 1.4.5 - maybe that's the cause?


-- 
Don't take life so seriously.  It won't last.


                    /\  /\               /\  /
                   /  \/  \ u e e n     /  \/  a w a z
                       >>>>>>mueen@nawaz.org<<<<<<
                                   anl

 
