unofficial mirror of notmuch@notmuchmail.org
* notmuch new: Memory problem
@ 2009-11-20  8:56 Dominik Epple
  2009-11-20 16:01 ` Carl Worth
  2009-11-20 20:56 ` Carl Worth
  0 siblings, 2 replies; 12+ messages in thread
From: Dominik Epple @ 2009-11-20  8:56 UTC (permalink / raw)
  To: notmuch

Hi,

I am strongly interested in giving notmuch a try, but I am failing to
set it up. The problem is that during "notmuch new", memory consumption
and system load increase to values that make my system unusable. I
then killed "notmuch new" at a memory consumption of 2.7G and at a
system load of 7.

After hitting Ctrl-C, it says "Stopping" but does not stop. I then
killed "notmuch new" after some minutes with signal KILL.

Is there a problem with the number of my mails? I currently have over
40,000 mails. They currently live in mbox files; I created a Maildir
from them with mb2md-3.20.pl.

OS is SuSE Linux 11.1, kernel 2.6.27.29-0.1-default, notmuch pulled
today from git, compiled manually, dependencies also downloaded and
installed manually, in the following versions:

gmime-2.4.11.tar.bz2
talloc-2.0.0.tar.gz
xapian-core-1.0.17.tar.gz

Any help?

Thanks
Dominik


* Re: notmuch new: Memory problem
  2009-11-20  8:56 notmuch new: Memory problem Dominik Epple
@ 2009-11-20 16:01 ` Carl Worth
  2009-11-23 15:30   ` Dominik Epple
  2009-11-20 20:56 ` Carl Worth
  1 sibling, 1 reply; 12+ messages in thread
From: Carl Worth @ 2009-11-20 16:01 UTC (permalink / raw)
  To: Dominik Epple, notmuch

On Fri, 20 Nov 2009 09:56:50 +0100, Dominik Epple <dominik.epple@googlemail.com> wrote:
> I am strongly interested in giving notmuch a try.

Welcome to notmuch, Dominik! I'm sorry your initial attempt to use it
hasn't been quite as smooth as we might like.

>                                                   But I fail setting
> it up. The problem is that during "notmuch new", memory consumption
> and system load increases to values that make my system unusable. I
> then killed "notmuch new" at a memory consumption of 2.7G and at a
> system load of 7.

Yikes. That really sounds like something ran out of control consuming
memory. I certainly haven't seen anything like that before.

> After hitting Ctrl-C, it says "Stopping" but does not stop. I then
> killed "notmuch new" after some minutes with signal KILL.

After "Stopping" gets printed, the notmuch code won't be doing any more
work. It is expected that it will take some time after that message is
printed before notmuch will actually exit. The extra time is to wait for
Xapian to flush out to disk data that notmuch has already provided to
it.

I'm curious how big your .notmuch directory ended up after this
operation. (And how that compares in size to the total size of your
collection of mail.)

> Is there a problem with the number of my mails? I currently have over
> 40.000 Mails... they live currently in mbox files, I created a Maildir
> with mb2md-3.20.pl.

That's definitely not too much mail. I think you should expect "notmuch
new" currently to index on the order of 10 - 100 messages/sec.

Your "notmuch new" process should have been reporting a count once per
second as it progressed, (at least until things went wrong). How far did
you see that go?

I'm wondering if there's a particular file (or files) that are
triggering the bad behavior. Maybe we need a debug option for "notmuch
new" to print the filenames of messages as they are being processed.
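
One way to look for such files without patching notmuch is to list the
largest files in the mail store. What follows is only a minimal sketch in
POSIX shell, not part of notmuch: the function name largest_files is made
up for illustration, and it assumes file names without embedded newlines.

```shell
#!/bin/sh
# largest_files DIR [N]: print "size-in-bytes path" for the N largest
# regular files under DIR, biggest first (N defaults to 10).
# Hypothetical helper for illustration; not a notmuch command.
largest_files() {
    dir=$1
    n=${2:-10}
    find "$dir" -type f | while IFS= read -r f; do
        # wc -c may pad its output with spaces on some systems.
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -rn | head -n "$n"
}
```

Running something like "largest_files Maildir 20" before "notmuch new"
should show whether a handful of files dominate the collection.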

-Carl


* Re: notmuch new: Memory problem
  2009-11-20  8:56 notmuch new: Memory problem Dominik Epple
  2009-11-20 16:01 ` Carl Worth
@ 2009-11-20 20:56 ` Carl Worth
  2009-11-23 16:26   ` Dominik Epple
  1 sibling, 1 reply; 12+ messages in thread
From: Carl Worth @ 2009-11-20 20:56 UTC (permalink / raw)
  To: Dominik Epple, notmuch

On Fri, 20 Nov 2009 09:56:50 +0100, Dominik Epple <dominik.epple@googlemail.com> wrote:
> Is there a problem with the number of my mails? I currently have over
> 40.000 Mails... they live currently in mbox files, I created a Maildir
> with mb2md-3.20.pl.

I'm suspecting that you have some big files in there, (such as indexes
from some other mail program). We had code in notmuch to detect and
ignore these, but a recent bug had broken that.

I just fixed this code as of the below commit. So please update and try
again and let us know if things work any better.

Thanks for your patience!

-Carl

commit 3ae12b1e286d1c0041a2e3957cb01daa2981dad9
Author: Carl Worth <cworth@cworth.org>
Date:   Fri Nov 20 21:46:37 2009 +0100

    add_message: Re-fix handling of non-mail files.
    
    More fallout from _get_header now returning "" for missing headers.
    
    The bug here is that we would no longer detect that a file is not an
    email message and give up on it like we should.
    
    And this time, I actually audited all callers to
    notmuch_message_get_header, so hopefully we're done fixing this
    bug over and over.
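
The commit message describes giving up on files that are not email
messages because the expected headers are missing. The sketch below shows
that idea from the outside, in POSIX shell. It is not the notmuch
implementation: the function name looks_like_mail and the choice of
From:, To: and Subject: as the headers to test are assumptions made for
illustration.

```shell
#!/bin/sh
# looks_like_mail FILE: succeed if the header block (everything before the
# first blank line) contains a From:, To: or Subject: header.
# Assumption: a file with none of these headers is treated as "not mail".
looks_like_mail() {
    awk '
        /^\r?$/ { exit }                  # a blank line ends the headers
        tolower($0) ~ /^(from|to|subject):/ { found = 1; exit }
        END { exit (found ? 0 : 1) }      # exit status 0 means "mail"
    ' "$1"
}
```

An index or cache file left behind by another mail program would normally
fail this test, since it has no header block at all.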


* Re: notmuch new: Memory problem
  2009-11-20 16:01 ` Carl Worth
@ 2009-11-23 15:30   ` Dominik Epple
  0 siblings, 0 replies; 12+ messages in thread
From: Dominik Epple @ 2009-11-23 15:30 UTC (permalink / raw)
  To: Carl Worth; +Cc: notmuch

Hi,

Thanks for your help. Here is the information you requested:

2009/11/20 Carl Worth <cworth@cworth.org>:
> I'm curious how big your .notmuch directory ended up after this
> operation. (And how that compares in size to the total size of your
> collection of mail.)

I guess you mean these directories:

$ du -sh Maildir
2,8G	Maildir
$ cd Maildir
$ du -sh .notmuch
1,1G	.notmuch

> That's definitely not too much mail. I think you should expect "notmuch
> new" currently to index on the order of 10 - 100 messages/sec.
>
> Your "notmuch new" process should have been reporting a count once per
> second as it progressed, (at least until things went wrong). How far did
> you see that go?

It started quickly, but its speed decreased, and I interrupted it at
some 4000 messages, if I remember correctly.

Regards
Dominik


* Re: notmuch new: Memory problem
  2009-11-20 20:56 ` Carl Worth
@ 2009-11-23 16:26   ` Dominik Epple
  2009-11-25  9:39     ` Dominik Epple
  0 siblings, 1 reply; 12+ messages in thread
From: Dominik Epple @ 2009-11-23 16:26 UTC (permalink / raw)
  To: Carl Worth; +Cc: notmuch

Hi,

2009/11/20 Carl Worth <cworth@cworth.org>:
> On Fri, 20 Nov 2009 09:56:50 +0100, Dominik Epple <dominik.epple@googlemail.com> wrote:
>> Is there a problem with the number of my mails? I currently have over
>> 40.000 Mails... they live currently in mbox files, I created a Maildir
>> with mb2md-3.20.pl.
>
> I'm suspecting that you have some big files in there, (such as indexes
> from some other mail program). We had code in notmuch to detect and
> ignore these, but a recent bug had broken that.
>
> I just fixed this code as of the below commit. So please update and try
> again and let us know if things work any better.

Ok, one of the problems seems to be solved. The info: output shows that
the code now actually ignores non-email data. These files are small
fragments of real mail. Obviously the mb2md code made errors there.

But I ran into a different issue. I have a lot of files in the Maildir
which contain base64 encoded binary data. (Some remote site sends me
its daily backup logs.) Those files are all 2.4 megabytes in size.
By adding some debug code to notmuch-new.c, I found out that the
program becomes very slow and consumes a lot of memory when adding
these files. I just killed it when it consumed 2 GByte again.

So as you suspected, the problem seems to stem from large files. But
those large files are not indices or stuff like that from different
mail programs, but they are valid emails which contain a lot of
(encoded) binary data.

Perhaps we should be able to configure notmuch so that it ignores
all mails that match a specific pattern (like "Subject: Backup logs
from.*").

Regards
Dominik


* Re: notmuch new: Memory problem
  2009-11-23 16:26   ` Dominik Epple
@ 2009-11-25  9:39     ` Dominik Epple
  2009-11-26 18:46       ` Carl Worth
  0 siblings, 1 reply; 12+ messages in thread
From: Dominik Epple @ 2009-11-25  9:39 UTC (permalink / raw)
  To: Carl Worth; +Cc: notmuch

Hello,

I repeated the procedure (mb2md, notmuch new), but beforehand I saved all
those large emails with backup logs into a separate folder, which I
deleted before "notmuch new". Then "notmuch new" worked as expected.
So the problem does indeed stem from too many too-large files being
present. (I actually found some as large as 40M, not just 2.4M,
as written in previous mails.)
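
The manual step described here (setting aside the backup-log mails before
indexing) can be sketched in POSIX shell. This is an illustration only:
subject_matches is a made-up helper, it does not handle folded Subject
headers, and it assumes file names without embedded newlines.

```shell
#!/bin/sh
# subject_matches DIR REGEX: print every file under DIR whose header block
# has a Subject: line matching the extended regular expression REGEX.
# Hypothetical helper for illustration; not a notmuch feature.
subject_matches() {
    find "$1" -type f | while IFS= read -r f; do
        if awk -v re="$2" '
            /^\r?$/ { exit }              # stop at the end of the headers
            /^[Ss]ubject:/ && $0 ~ re { found = 1; exit }
            END { exit (found ? 0 : 1) }
        ' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

The printed paths could then be reviewed and moved to a folder outside the
Maildir before running "notmuch new", for example with
subject_matches Maildir 'Backup logs from'.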

Regards
Dominik



* Re: notmuch new: Memory problem
  2009-11-25  9:39     ` Dominik Epple
@ 2009-11-26 18:46       ` Carl Worth
  2009-11-26 19:16         ` Carl Worth
  0 siblings, 1 reply; 12+ messages in thread
From: Carl Worth @ 2009-11-26 18:46 UTC (permalink / raw)
  To: Dominik Epple; +Cc: notmuch

On Wed, 25 Nov 2009 10:39:57 +0100, Dominik Epple <dominik.epple@googlemail.com> wrote:
> So the problem stems indeed from too many too large files being
> present. (I actually found some being as large as 40M, not just 2.4M,
> as written in previous mails.)

That's very good to know.

And I'm glad you at least have things working smoothly now.

So perhaps the new configuration option we want is a limit on message
size? Rather than ignoring large files entirely, notmuch could just stop
indexing messages past the configured limit?

-Carl


* Re: notmuch new: Memory problem
  2009-11-26 18:46       ` Carl Worth
@ 2009-11-26 19:16         ` Carl Worth
  2010-02-05 18:59           ` notmuch new: Memory problem (with uuencoded content) Carl Worth
  0 siblings, 1 reply; 12+ messages in thread
From: Carl Worth @ 2009-11-26 19:16 UTC (permalink / raw)
  To: Dominik Epple; +Cc: notmuch

On Thu, 26 Nov 2009 10:46:54 -0800, Carl Worth <cworth@cworth.org> wrote:
> So perhaps the new configuration option we want is a limit on message
> size? Rather than ignoring large files entirely, notmuch could just stop
> indexing messages past the configured limit?

Having just written that, I don't think it's actually an interesting
option.

Instead of working around the bug, we should just find out what the bug
actually is. It could be that Xapian's TermGenerator is just going nuts
here. Or it could be that Xapian is just trying to hold too much data in
memory instead of flushing it out to disk.

Currently, notmuch doesn't ever call any explicit Xapian flush. Instead,
we rely on the default behavior which is that Xapian will flush to disk
after every batch of 10000 documents added. So it's possible that all
that's actually needed here is for notmuch to notice that it just
indexed a huge file, and then explicitly flush to avoid Xapian using too
much memory. Or, perhaps better, Xapian could be fixed to automatically
flush if its memory usage gets "too big", (if the missing flush is
actually what's needed here).

Clearly, some experimenting is needed. Dominik, if you can share the
large file, (with either me alone or with the whole list), a pointer to
where we could download it would be appreciated.

-Carl


* Re: notmuch new: Memory problem (with uuencoded content)
  2009-11-26 19:16         ` Carl Worth
@ 2010-02-05 18:59           ` Carl Worth
  2010-02-06 10:40             ` Michal Sojka
  0 siblings, 1 reply; 12+ messages in thread
From: Carl Worth @ 2010-02-05 18:59 UTC (permalink / raw)
  To: Dominik Epple; +Cc: notmuch


On Thu, 26 Nov 2009 11:16:21 -0800, Carl Worth <cworth@cworth.org> wrote:
> Clearly, some experimenting is needed. Dominik, if you can share the
> large file, (with either me alone or with the whole list), a pointer to
> where we could download it would be appreciated.

Dominik replied to me privately and described a way for me to create a
file that replicates the bug. Here's a recipe I came up with from his
description:

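	mkdir tmp
	cd tmp/
	# Minimal config pointing notmuch at the test mail directory.
	printf '[database]\npath=mail\n' > notmuch-config
	mkdir mail
	# Message headers, then the blank line that ends the header block.
	printf 'From: Me\nTo: You\nSubject: uuencode\n\n' > mail/msg
	# 10MB of random data, appended to the message as a uuencoded body.
	dd if=/dev/urandom of=blob bs=1024 count=10240
	uuencode blob < blob >> mail/msg
	NOTMUCH_CONFIG=notmuch-config notmuch new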

So that's a 10MB blob of random data which uuencodes to a ~14MB mail
file. And notmuch (before a patch I just pushed) chews on it for quite a
while, consuming several hundred MB of memory and resulting finally in a
76MB Xapian database (with chert).
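
The ~14MB figure can be checked with a little arithmetic. The sketch below
assumes the traditional uuencode layout of 45 input bytes per output line
(one length character, 60 data characters, one newline) and ignores the
"begin" and "end" lines; uu_body_bytes is a made-up name for illustration.

```shell
#!/bin/sh
# uu_body_bytes N: size in bytes of the uuencoded body lines for N input
# bytes, assuming 45 input bytes per line. Excludes "begin"/"end" lines.
uu_body_bytes() {
    full=$(( $1 / 45 ))          # number of complete 45-byte lines
    rest=$(( $1 % 45 ))          # input bytes left for a final short line
    total=$(( full * 62 ))       # 1 length char + 60 data chars + newline
    if [ "$rest" -gt 0 ]; then
        # length char + 4 output chars per started 3-byte group + newline
        total=$(( total + 1 + 4 * ((rest + 2) / 3) + 1 ))
    fi
    echo "$total"
}

uu_body_bytes 10485760    # prints 14447050
```

So 10240 blocks of 1024 bytes come to roughly 14.4 million bytes once
encoded, which is consistent with the size quoted above.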

I'm not sure if there is a Xapian bug there or not, (or perhaps a bug in
how notmuch is using Xapian to generate the terms for this large of an
email message).

But the thing that's obvious to me is that indexing encoded data like
this doesn't make any sense at all. So I've just pushed a set of patches
to notmuch to make it detect uuencoded data within a mail message and
ignore it.
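
The idea of skipping uuencoded data can be illustrated outside notmuch
with a small filter. This is a sketch under stated assumptions, not the
patch that was pushed: strip_uuencoded is a made-up name, and the filter
simply treats everything from a "begin MODE NAME" line through the next
"end" line as encoded data.

```shell
#!/bin/sh
# strip_uuencoded: copy stdin to stdout, dropping everything from a
# "begin MODE NAME" line through the following "end" line.
# Hypothetical filter for illustration; not the notmuch implementation.
strip_uuencoded() {
    awk '
        !skip && /^begin [0-7][0-7][0-7][0-7]? ./ { skip = 1; next }
        skip && /^end\r?$/ { skip = 0; next }
        !skip { print }
    '
}
```

A plain-text line that happens to look like a "begin" line would make this
filter drop the rest of the message, so a real implementation would
probably want a stricter check on the encoded lines themselves.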

Of course, I also pushed a set of tests to the test suite for this, (and
some new "notmuch search" tests while I was at it).

-Carl



* Re: notmuch new: Memory problem (with uuencoded content)
  2010-02-05 18:59           ` notmuch new: Memory problem (with uuencoded content) Carl Worth
@ 2010-02-06 10:40             ` Michal Sojka
  2010-02-06 21:45               ` Carl Worth
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Sojka @ 2010-02-06 10:40 UTC (permalink / raw)
  To: notmuch

On Friday 05 of February 2010 19:59:12 Carl Worth wrote:
> Of course, I also pushed a set of tests to the test suite for this, (and
> some new "notmuch search" tests while I was at it).

Hi Carl,

I've just looked at your notmuch-test commits. Did you notice my patches
which port Git's test framework for use with notmuch? That framework has the
same spirit as yours (shell scripting, easy to use) but compared to your
current test script it has some nice features:

- The test suite is split into several files. Therefore you do not need to run the
whole test suite when you are working in one area of notmuch.
- If some test fails, the executed commands are automatically displayed from 
which you can immediately see what was the problem.
- Working directory for each test has a fixed name based on the name of the 
script (no $$) so you know where to look if some test fails.
- You can decide whether you want to stop on the first failure or complete the 
whole test suite.
- At the end the results are summarized so you do not need to watch the output 
of the test suite.

It is straightforward to convert your current test script to Git's framework. 
If you are interested I'll do it.

Michal


* Re: notmuch new: Memory problem (with uuencoded content)
  2010-02-06 10:40             ` Michal Sojka
@ 2010-02-06 21:45               ` Carl Worth
  2010-02-08 14:19                 ` Michal Sojka
  0 siblings, 1 reply; 12+ messages in thread
From: Carl Worth @ 2010-02-06 21:45 UTC (permalink / raw)
  To: Michal Sojka, notmuch


On Sat, 6 Feb 2010 11:40:18 +0100, Michal Sojka <sojkam1@fel.cvut.cz> wrote:
> I've just looked at your notmuch-test commits. Did you noticed my patches 
> which port Git's test framework for use with notmuch?

Hi Michal,

Ah, my mistake!

That's what I get for working through my backlog chronologically. ;-)

> That framework has the 
> same spirit as yours (shell scripting, easy to use) but compared to your 
> current test script it has some nice features:

All of these features do sound very nice.

> It is straightforward to convert your current test script to Git's framework. 
> If you are interested I'll do it.

Yes, I'd be quite interested in seeing that. Thanks for your
contributions, and sorry I missed (or haven't yet gotten to) the patch
you sent earlier.

-Carl



* Re: notmuch new: Memory problem (with uuencoded content)
  2010-02-06 21:45               ` Carl Worth
@ 2010-02-08 14:19                 ` Michal Sojka
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Sojka @ 2010-02-08 14:19 UTC (permalink / raw)
  To: Carl Worth; +Cc: notmuch

On Saturday 06 of February 2010 22:45:32 Carl Worth wrote:
> On Sat, 6 Feb 2010 11:40:18 +0100, Michal Sojka <sojkam1@fel.cvut.cz> wrote:
> > It is straightforward to convert your current test script to Git's
> > framework. If you are interested I'll do it.
> 
> Yes, I'd be quite interested in seeing that. Thanks for your
> contributions, and sorry I missed (or haven't yet gotten to) the patch
> you sent earlier.

Hi Carl,

I did the conversion of the test script. I'll post it to thread 
id:87ljf8pvxx.fsf@yoom.home.cworth.org, where it is more appropriate.

Michal



Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/
