unofficial mirror of notmuch@notmuchmail.org
* notmuch ignoring alot of emails
@ 2019-03-23  6:45 Alexei Gilchrist
  2019-03-30 11:29 ` David Bremner
  2019-06-28 17:16 ` Alvaro Herrera
  0 siblings, 2 replies; 19+ messages in thread
From: Alexei Gilchrist @ 2019-03-23  6:45 UTC (permalink / raw)
  To: notmuch

Hi

When I run notmuch I get a bunch (hundreds) of emails that are ignored 
with:

Note: Ignoring non-mail file: ...

The files are valid maildir files but have a paragraph somewhere in the 
body where someone has written "From ".

Is there a fix to force the recognition of maildir files in this case? I 
thought this was a solved problem with gmime since 2.6.7.

Sorry for the pun in the subject but I am using alot and I only see the 
messages notmuch sees, neomutt has no issues seeing these messages but I 
want a tighter integration with notmuch.

I'm on a mac and compiled notmuch-0.28.3; installed gmime 3.2.3 with 
brew, and verified notmuch was linking against it:

≻ otool -L /usr/local/bin/notmuch
/usr/local/bin/notmuch:
	/usr/local/lib/libnotmuch.5.dylib (compatibility version 5.2.0, current version 5.2.0)
	/usr/local/opt/gmime/lib/libgmime-3.0.0.dylib (compatibility version 202.0.0, current version 202.2.0)
	/usr/local/opt/glib/lib/libgio-2.0.0.dylib (compatibility version 6001.0.0, current version 6001.0.0)
	/usr/local/opt/glib/lib/libgobject-2.0.0.dylib (compatibility version 6001.0.0, current version 6001.0.0)
	/usr/local/opt/glib/lib/libglib-2.0.0.dylib (compatibility version 6001.0.0, current version 6001.0.0)
	/usr/local/opt/gettext/lib/libintl.8.dylib (compatibility version 10.0.0, current version 10.5.0)
	/usr/local/opt/talloc/lib/libtalloc.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
	/usr/local/opt/xapian/lib/libxapian.30.dylib (compatibility version 36.0.0, current version 36.1.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.4)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.200.5)

≻ ls -l /usr/local/opt/gmime
lrwxr-xr-x  1 alexei  admin  21 23 Mar 12:09 /usr/local/opt/gmime -> ../Cellar/gmime/3.2.3

≻ ls -l /usr/local/Cellar/gmime/3.2.3/lib/
total 2280
drwxr-xr-x  3 alexei  staff      96 27 Nov 11:09 girepository-1.0
-rw-r--r--  1 alexei  staff  444500 23 Mar 12:09 libgmime-3.0.0.dylib
-r--r--r--  1 alexei  staff  720504 27 Nov 11:09 libgmime-3.0.a
lrwxr-xr-x  1 alexei  staff      20 27 Nov 11:09 libgmime-3.0.dylib -> libgmime-3.0.0.dylib
drwxr-xr-x  3 alexei  staff      96 23 Mar 12:09 pkgconfig

Any ideas for a fix?

cheers,

Alexei


* Re: notmuch ignoring alot of emails
  2019-03-23  6:45 notmuch ignoring alot of emails Alexei Gilchrist
@ 2019-03-30 11:29 ` David Bremner
  2019-03-30 23:53   ` Alexei Gilchrist
  2019-06-28 17:16 ` Alvaro Herrera
  1 sibling, 1 reply; 19+ messages in thread
From: David Bremner @ 2019-03-30 11:29 UTC (permalink / raw)
  To: Alexei Gilchrist, notmuch

"Alexei Gilchrist" <te100@runbox.com> writes:

> Hi
>
> When I run notmuch I get a bunch (hundreds) of emails that are ignored 
> with:
>
> Note: Ignoring non-mail file: ...
>
> The files are valid maildir files but have a paragraph somewhere in the 
> body where someone has written "From ".
>

And do they also have a line starting with "From " as the first
line? That makes them mbox files. The second "From " makes them mbox
files with multiple messages. Notmuch thinks your MDA (the thing that
made those files) is misconfigured, assuming my guess about the format
is correct.

> Is there a fix to force the recognition of maildir files in this case? I 
> thought this was a solved problem with gmime since 2.6.7.

There is not currently a way to do that. It's not a GMime problem, it's
a design choice of notmuch to avoid parsing multiple-message
mboxes. That was originally added as a safety feature, and I think it
should probably stay the default. If someone wants to work on adding a
configuration switch I can point them in the right direction.
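
The behaviour described in this message can be illustrated with a small
sketch. This is not notmuch's code (the real check lives in
lib/message-file.c and relies on GMime); the function name and the
returned labels are made up for illustration, and the rule is simply the
one stated above: a leading "From " line means mbox, and any further
"From " line means more than one message.

```python
def classify(data: bytes) -> str:
    """Classify file content following the rule described in the thread.

    Hypothetical helper for illustration only; notmuch itself does this
    through GMime, not with a line scan like this one.
    """
    lines = data.splitlines()
    if not lines or not lines[0].startswith(b"From "):
        # No mbox separator on the first line: treated as a plain message.
        return "message"
    # The file starts with an mbox separator; look for further ones.
    extra = sum(1 for line in lines[1:] if line.startswith(b"From "))
    if extra == 0:
        return "single-message mbox"
    return "multi-message mbox (ignored)"
```

With this rule, a body line such as "From bad parsing comes chaos" is
harmless in a file that starts with "Received:", but gets the file
rejected when the MDA has written an initial "From " line.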


* Re: notmuch ignoring alot of emails
  2019-03-30 11:29 ` David Bremner
@ 2019-03-30 23:53   ` Alexei Gilchrist
  2019-03-31  4:06     ` David Bremner
                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Alexei Gilchrist @ 2019-03-30 23:53 UTC (permalink / raw)
  To: David Bremner; +Cc: notmuch


>> When I run notmuch I get a bunch (hundreds) of emails that are ignored
>> with:
>>
>> Note: Ignoring non-mail file: ...
>>
>> The files are valid maildir files but have a paragraph somewhere in the
>> body where someone has written "From ".
>>
>
> And do they also have a line starting with "From " as the first
> line? This makes them mbox files. The second "From " makes them mbox
> files with multiple messages. Notmuch thinks your MDA (the thing that
> made those files) is misconfigured, assuming my guess about the format
> is correct.

Every message file begins with “From ”. This is true of all messages
downloaded by both offlineimap (with type = Maildir) and mbsync.
neomutt has no issues dealing with these files as maildir and mu has no
issues indexing them either. I’m assuming that starting with “From ”
is part of the maildir spec.

The problem occurs specifically with notmuch. If someone sends a message
with a line that begins with “From ” in the *body* then it confuses
notmuch.

mu can correctly index these messages but my mu is linked against 
libgmime-2.6, my notmuch (0.28.3) is linked against libgmime-3.0.


>> Is there a fix to force the recognition of maildir files in this case? I
>> thought this was a solved problem with gmime since 2.6.7.
>
> There is not currently a way to do that. It's not a GMime problem, it's
> a design choice of notmuch to avoid parsing multiple message
> mbox's. That was originally added as a safety feature, and I think it
> should probably stay the default. If someone wants to work on adding a
> configuration switch I can point them in the right direction.

This is a poor design decision. It means anyone on the internet can
break your mail setup simply by sending a message with a line starting
with “From ” (and using the usual quoted-printable
Content-Transfer-Encoding).

Try it. Send yourself a message with the line “From bad parsing comes 
chaos” and see if your notmuch can find it. My version can’t.


* Re: notmuch ignoring alot of emails
  2019-03-30 23:53   ` Alexei Gilchrist
@ 2019-03-31  4:06     ` David Bremner
  2019-03-31  8:52     ` Tomi Ollila
  2019-03-31 11:00     ` Tomas Nordin
  2 siblings, 0 replies; 19+ messages in thread
From: David Bremner @ 2019-03-31  4:06 UTC (permalink / raw)
  To: Alexei Gilchrist; +Cc: notmuch

"Alexei Gilchrist" <te100@runbox.com> writes:

>
> Try it. Send yourself a message with the line “From bad parsing comes 
> chaos” and see if your notmuch can find it. My version can’t.

It's not that simple. My MDA is configured not to add the initial mbox
"From " line to files in maildirs.

d
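
Whether a given maildir is affected can be checked by looking at the
first bytes of each delivered file. The following is a minimal sketch,
assuming the usual cur/ and new/ subdirectories; the function name is
hypothetical and nothing here is part of notmuch.

```python
import os


def files_with_mbox_from_line(maildir: str) -> list:
    """Return paths of maildir files whose first line starts with "From ".

    Illustrative helper only: it inspects cur/ and new/ and reads just
    the first five bytes of each regular file.
    """
    hits = []
    for sub in ("cur", "new"):
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                # An mbox separator line begins with the bytes "From ".
                if fh.read(5) == b"From ":
                    hits.append(path)
    return hits
```

An empty result would suggest the MDA behaves like the ones described in
this message; a non-empty one points at files that notmuch will treat as
mboxes.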


* Re: notmuch ignoring alot of emails
  2019-03-30 23:53   ` Alexei Gilchrist
  2019-03-31  4:06     ` David Bremner
@ 2019-03-31  8:52     ` Tomi Ollila
  2019-03-31 11:00     ` Tomas Nordin
  2 siblings, 0 replies; 19+ messages in thread
From: Tomi Ollila @ 2019-03-31  8:52 UTC (permalink / raw)
  To: Alexei Gilchrist, David Bremner; +Cc: notmuch

On Sun, Mar 31 2019, Alexei Gilchrist wrote:

>>> When I run notmuch I get a bunch (hundreds) of emails that are ignored
>>> with:
>>>
>>> Note: Ignoring non-mail file: ...
>>>
>>> The files are valid maildir files but have a paragraph somewhere in the
>>> body where someone has written "From ".
>>>
>>
>> And do they also have a line starting with "From " as the first
>> line? This makes them mbox files. The second "From " makes them mbox
>> files with multiple messages. Notmuch thinks your MDA (the thing that
>> made those files) is misconfigured, assuming my guess about the format
>> is correct.
>
> Every message file begins with “From ”. This is true of all messages
> downloaded by both offlineimap (with type = Maildir) and mbsync.
> neomutt has no issues dealing with these files as maildir and mu has no
> issues indexing them either. I’m assuming that starting with “From ”
> is part of the maildir spec.
>
> The problem occurs specifically with notmuch. If someone sends a message
> with a line that begins with “From ” in the *body* then it confuses
> notmuch.
>
> mu can correctly index these messages but my mu is linked against
> libgmime-2.6, my notmuch (0.28.3) is linked against libgmime-3.0.
>
>
>>> Is there a fix to force the recognition of maildir files in this case? I
>>> thought this was a solved problem with gmime since 2.6.7.
>>
>> There is not currently a way to do that. It's not a GMime problem, it's
>> a design choice of notmuch to avoid parsing multiple message
>> mbox's. That was originally added as a safety feature, and I think it
>> should probably stay the default. If someone wants to work on adding a
>> configuration switch I can point them in the right direction.
>
> This is a poor design decision. It means anyone on the internet can
> break your mail setup simply by sending a message with a line starting
> with “From ” (and using the usual quoted-printable
> Content-Transfer-Encoding).

There are a few things to remember in notmuch development:

- notmuch is more of an evolution than intelligent design. It is hard to
  do any long-planned design when writing email software...

- we all welcome people to do SMOP with notmuch, and tolerate patches with
  good commit messages and elegant content.

- it may take some time to get changes reviewed...

In this particular case it would be nice if someone(tm) investigated how
mu and neomutt handle these emails -- and how broken (if at all) they get
if they are given a large mbox file... was it so that both of those can
read mbox files...
(which notmuch doesn't (but one can always use mboxvievfs! >;)))?

> Try it. Send yourself a message with the line “From bad parsing comes 
> chaos” and see if your notmuch can find it. My version can’t.

My MDA (md5mda.sh) does not add 'From ' at the beginning of the first
line of my delivered emails (i.e. it works the same way in this respect
as David's MDA).

Tomi


* Re: notmuch ignoring alot of emails
  2019-03-30 23:53   ` Alexei Gilchrist
  2019-03-31  4:06     ` David Bremner
  2019-03-31  8:52     ` Tomi Ollila
@ 2019-03-31 11:00     ` Tomas Nordin
  2019-03-31 22:02       ` Alexei Gilchrist
  2 siblings, 1 reply; 19+ messages in thread
From: Tomas Nordin @ 2019-03-31 11:00 UTC (permalink / raw)
  To: Alexei Gilchrist, David Bremner; +Cc: notmuch

Alexei Gilchrist <te100@runbox.com> writes:

> Every message file begins with “From ”. This is true of all messages
> downloaded by both offlineimap (with type = Maildir) and mbsync.
> neomutt has no issues dealing with these files as maildir and mu has no
> issues indexing them either. I’m assuming that starting with “From ”
> is part of the maildir spec.

FWIW, I use Offlineimap and files retrieved with it here do not begin
with "From". I see things like "Received: from..." or "Return-Path:..."
as the beginning of the first line.

> Try it. Send yourself a message with the line “From bad parsing comes 
> chaos” and see if your notmuch can find it. My version can’t.

I tried that and found the messages as expected. I mean, the message I
sent and this thread.

Best regards
--
Tomas


* Re: notmuch ignoring alot of emails
  2019-03-31 11:00     ` Tomas Nordin
@ 2019-03-31 22:02       ` Alexei Gilchrist
  2019-03-31 23:27         ` David Bremner
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Gilchrist @ 2019-03-31 22:02 UTC (permalink / raw)
  To: Tomas Nordin; +Cc: David Bremner, notmuch

That’s interesting. Do you know a link to the file spec for maildir 
file content? All I can find is information about the directory 
structure and file naming, not the file content.

mbsync, which specialises in maildir, also had an initial “From ” line
for me, and they are independently configured. I’ll try out a couple
of different mail hosts to see if it’s that.

I can imagine that mutt just assumes they are maildir files once
configured that way, but mu also assumes the files are maildir and also
uses gmime to parse. However the current version on Homebrew (Mac) is
linked to a version of gmime which was fixed to accommodate multiple
“From ” lines I believe, though I haven’t dug through the source
yet.

Cheers,

Alexei

On 31 Mar 2019, at 22:00, Tomas Nordin wrote:

> Alexei Gilchrist <te100@runbox.com> writes:
>
>> Every message file begins with “From ”. This is true of all messages
>> downloaded by both offlineimap (with type = Maildir) and mbsync.
>> neomutt has no issues dealing with these files as maildir and mu has no
>> issues indexing them either. I’m assuming that starting with “From ”
>> is part of the maildir spec.
>
> FWIW, I use Offlineimap and files retrieved with it here do not begin
> with "From". I see things like "Received: from..." or "Return-Path:..."
> as the beginning of the first line.
>
>> Try it. Send yourself a message with the line “From bad parsing comes
>> chaos” and see if your notmuch can find it. My version can’t.
>
> I tried that and find messages as expected. I mean, the message I sent
> and this thread.
>
> Best regards
> --
> Tomas


* Re: notmuch ignoring alot of emails
  2019-03-31 22:02       ` Alexei Gilchrist
@ 2019-03-31 23:27         ` David Bremner
  0 siblings, 0 replies; 19+ messages in thread
From: David Bremner @ 2019-03-31 23:27 UTC (permalink / raw)
  To: Alexei Gilchrist, Tomas Nordin; +Cc: notmuch

"Alexei Gilchrist" <te100@runbox.com> writes:

> That’s interesting. Do you know a link to the file spec for maildir 
> file content? All I can find is information about the directory 
> structure and file naming, not the file content.

As far as I know, this is specified by RFC 5322. 

> mbsync which specialises in maildir also had an initial “From ” line
> for me, and they are independently configured. I’ll try out a couple
> of different mail hosts to see if it’s that.

Yes, it could well be determined by how the messages are delivered on the
server.

> I can imagine that mutt just assumes they are maildir files once
> configured that way, but mu also assumes the files are maildir and also
> uses gmime to parse. However the current version on Homebrew (Mac) is
> linked to a version of gmime which was fixed to accommodate multiple
> “From ” lines I believe, though I haven’t dug through the source
> yet.

As I mentioned above, it's not really related to the version of GMime,
it's about how GMime is called, and whether the client wishes to parse
mbox files containing more than one message. Or to ignore the "From "
line at the beginning.


* Re: notmuch ignoring alot of emails
  2019-03-23  6:45 notmuch ignoring alot of emails Alexei Gilchrist
  2019-03-30 11:29 ` David Bremner
@ 2019-06-28 17:16 ` Alvaro Herrera
  2019-06-28 20:11   ` Alvaro Herrera
  1 sibling, 1 reply; 19+ messages in thread
From: Alvaro Herrera @ 2019-06-28 17:16 UTC (permalink / raw)
  To: Alexei Gilchrist; +Cc: notmuch

On 2019-Mar-23, Alexei Gilchrist wrote:

> When I run notmuch I get a bunch (hundreds) of emails that are ignored with:
> 
> Note: Ignoring non-mail file: ...
> 
> The files are valid maildir files but have a paragraph somewhere in the body
> where someone has written "From ".

Yeah, that happens too when you attach patches generated with git
format-patch as plain text; this is extremely common in the
pgsql-hackers@lists.postgresql.org mailing list (you can download an
mbox from there for any month, convert it to a maildir, and give the
resulting maildir to notmuch -- you'll likely find a few dozen emails
that fail parsing).  This is a very annoying problem for me, see
201901181607.4rba4c5uyimv@alvherre.pgsql in this list earlier this year.

I worked around it by patching _notmuch_message_file_parse in
lib/message-file.c to set is_mbox = false unconditionally; but that's
not a real solution (and hence I didn't post as a patch here), and it
explodes real good if you have an actual mbox in the directory where the
mail is (since after the hack it won't skip it anymore).

I think a real solution is to parse the message header, look for the
Content-Length, and determine mbox-ness by looking for "From" only past
that many bytes; that seems to match what other mail parsing tools do.
However, I haven't gotten around to doing that.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)


* Re: notmuch ignoring alot of emails
  2019-06-28 17:16 ` Alvaro Herrera
@ 2019-06-28 20:11   ` Alvaro Herrera
  2019-06-29 19:03     ` David Bremner
  2019-06-30 17:29     ` Tomi Ollila
  0 siblings, 2 replies; 19+ messages in thread
From: Alvaro Herrera @ 2019-06-28 20:11 UTC (permalink / raw)
  To: Alexei Gilchrist; +Cc: notmuch

On 2019-Jun-28, Alvaro Herrera wrote:

> I think a real solution is to parse the message header, look for the
> Content-Length, and determine mbox-ness by looking for "From" only past
> that many bytes; that seems to match what other mail parsing tools do.

Sorry, I misspoke: there's no such thing as Content-Length.
It's Content-Type/boundary that needs to be watched for.  Only consider
that the file is an mbox if a "^From " line appears after the boundary
end marker (which seems to be defined as "the boundary string followed
by two dashes --").

Here's a sample message, BTW:
https://www.postgresql.org/message-id/raw/3ad5ba71-d200-96da-f903-7e3b16416140@lab.ntt.co.jp
(username "archives", password "antispam").
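
The boundary idea can be sketched roughly as follows. This is an
illustration under stated assumptions, not a proposed patch: it uses
Python's standard email package rather than GMime, the function name is
hypothetical, it only looks at the top-level multipart boundary, and it
gives up (returns False) on non-multipart messages.

```python
from email.parser import BytesParser


def from_line_after_mime_end(data: bytes) -> bool:
    """True if a "From " line follows the closing top-level MIME boundary.

    Hypothetical helper: "From " lines inside the MIME parts (for example
    in an attached patch) are deliberately not counted.
    """
    # Skip the initial mbox separator line, if the file has one.
    if data.startswith(b"From "):
        _, _, data = data.partition(b"\n")
    headers = BytesParser().parsebytes(data, headersonly=True)
    boundary = headers.get_boundary()
    if boundary is None:
        return False  # not multipart: this sketch cannot decide
    closing = b"--" + boundary.encode("ascii", "replace") + b"--"
    seen_close = False
    for line in data.splitlines():
        if not seen_close:
            # The end marker is the boundary string followed by two dashes.
            if line.rstrip() == closing:
                seen_close = True
        elif line.startswith(b"From "):
            return True
    return False
```

Only the first message has to be scanned up to its end marker, so the
cost does not grow with the number of messages in a large mbox when a
second separator follows immediately.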

-- 
Álvaro Herrera       Valdivia, Chile


* Re: notmuch ignoring alot of emails
  2019-06-28 20:11   ` Alvaro Herrera
@ 2019-06-29 19:03     ` David Bremner
  2019-06-29 19:09       ` David Bremner
  2019-06-30 17:29     ` Tomi Ollila
  1 sibling, 1 reply; 19+ messages in thread
From: David Bremner @ 2019-06-29 19:03 UTC (permalink / raw)
  To: Alvaro Herrera, Alexei Gilchrist; +Cc: notmuch

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2019-Jun-28, Alvaro Herrera wrote:
>
>> I think a real solution is to parse the message header, look for the
>> Content-Length, and determine mbox-ness by looking for "From" only past
>> that many bytes; that seems to match what other mail parsing tools do.
>
> Sorry, I misspoke: there's no such thing as Content-Length.
> It's Content-Type/boundary that needs to be watched for.  Only consider
> that the file is an mbox if a "^From " line appears after the boundary
> end marker (which seems to be defined as "the boundary string followed
> by two dashes --").
>
> Here's a sample message, BTW:
> https://www.postgresql.org/message-id/raw/3ad5ba71-d200-96da-f903-7e3b16416140@lab.ntt.co.jp
> (username "archives", password "antispam").

I'm not keen on writing (more) ad hoc MIME parsing code, so if you can
phrase this in terms of GMime API (or at least MIME parts) it would be
great.

d


* Re: notmuch ignoring alot of emails
  2019-06-29 19:03     ` David Bremner
@ 2019-06-29 19:09       ` David Bremner
  2019-07-01 15:26         ` Alvaro Herrera
  0 siblings, 1 reply; 19+ messages in thread
From: David Bremner @ 2019-06-29 19:09 UTC (permalink / raw)
  To: Alvaro Herrera, Alexei Gilchrist; +Cc: notmuch

David Bremner <david@tethera.net> writes:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>
>> On 2019-Jun-28, Alvaro Herrera wrote:
>>
>>> I think a real solution is to parse the message header, look for the
>>> Content-Length, and determine mbox-ness by looking for "From" only past
>>> that many bytes; that seems to match what other mail parsing tools do.
>>
>> Sorry, I misspoke: there's no such thing as Content-Length.
>> It's Content-Type/boundary that needs to be watched for.  Only consider
>> that the file is an mbox if a "^From " line appears after the boundary
>> end marker (which seems to be defined as "the boundary string followed
>> by two dashes --").
>>
>> Here's a sample message, BTW:
>> https://www.postgresql.org/message-id/raw/3ad5ba71-d200-96da-f903-7e3b16416140@lab.ntt.co.jp
>> (username "archives", password "antispam").
>
> I'm not keen on writing (more) ad hoc MIME parsing code, so if you can
> phrase this in terms of GMime API (or at least MIME parts) it would be
> great.
>
> d

On second thought, I guess it might not be practical to use GMime to parse
the file, since that might perform badly on large mboxes.

d


* Re: notmuch ignoring alot of emails
  2019-06-28 20:11   ` Alvaro Herrera
  2019-06-29 19:03     ` David Bremner
@ 2019-06-30 17:29     ` Tomi Ollila
  2019-07-01 15:36       ` Alvaro Herrera
  1 sibling, 1 reply; 19+ messages in thread
From: Tomi Ollila @ 2019-06-30 17:29 UTC (permalink / raw)
  Cc: notmuch

On Fri, Jun 28 2019, Alvaro Herrera wrote:

> On 2019-Jun-28, Alvaro Herrera wrote:
>
>> I think a real solution is to parse the message header, look for the
>> Content-Length, and determine mbox-ness by looking for "From" only past
>> that many bytes; that seems to match what other mail parsing tools do.
>
> Sorry, I misspoke: there's no such thing as Content-Length.
> It's Content-Type/boundary that needs to be watched for.  Only consider
> that the file is an mbox if a "^From " line appears after the boundary
> end marker (which seems to be defined as "the boundary string followed
> by two dashes --").

Just checking for a line starting with 'From ' would be pretty naïve, since
From may be the first word of any line in the text body.

If we had to do content scanning, then at least an empty line before
From would be required, and the next lines would have to start like
Received: someone@not.an.example
Date: a date
From: someone

(and then empty line... ;)

all this checking would be required and it could still fail (perhaps
this content gets modified on the fly, but then a signature check, if
this mail had one, could fail...)

If there were a header that tells the length of the body, then things
could be easier...
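
A rough sketch of the content-scanning check described above, for
illustration only. The names and the "two header-like lines" threshold
are assumptions, not anything notmuch or GMime does, and mail-like
content inside a body (an attached patch, say) would still be counted as
a separator.

```python
import re

# A header field name is printable ASCII without ':' and is followed by ':'.
HEADER_RE = re.compile(rb"^[!-9;-~]+:")


def looks_like_mbox_separator(lines, i, min_headers=2):
    """Apply the stricter test to lines[i] (hypothetical helper)."""
    if not lines[i].startswith(b"From "):
        return False
    # Except at the very start of the file, require a preceding empty line.
    if i > 0 and lines[i - 1].strip() != b"":
        return False
    # Require the next lines to look like mail headers.
    following = lines[i + 1 : i + 1 + min_headers]
    return len(following) == min_headers and all(
        HEADER_RE.match(line) for line in following
    )


def count_separators(data: bytes) -> int:
    """Count lines that pass the stricter separator test."""
    lines = data.splitlines()
    return sum(looks_like_mbox_separator(lines, i) for i in range(len(lines)))
```

A count above one would mark the file as a multi-message mbox; a plain
sentence starting with "From " no longer does.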

Tomi

>
> -- 
> Álvaro Herrera       Valdivia, Chile


* Re: notmuch ignoring alot of emails
  2019-06-29 19:09       ` David Bremner
@ 2019-07-01 15:26         ` Alvaro Herrera
  0 siblings, 0 replies; 19+ messages in thread
From: Alvaro Herrera @ 2019-07-01 15:26 UTC (permalink / raw)
  To: David Bremner; +Cc: Alexei Gilchrist, notmuch

On 2019-Jun-29, David Bremner wrote:

> David Bremner <david@tethera.net> writes:
> 
> > Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> >> It's Content-Type/boundary that needs to be watched for.  Only consider
> >> that the file is an mbox if a "^From " line appears after the boundary
> >> end marker (which seems to be defined as "the boundary string followed
> >> by two dashes --").

> > I'm not keen on writing (more) ad hoc MIME parsing code, so if you can
> > phrase this in terms of GMime API (or at least MIME parts) it would be
> > great.

Yeah, I was having a look at the GMime API last week to have a think
about how to do it with that.

> On second thought, I guess it might not be practical to use GMime to parse
> the file, since that might perform badly on large mboxes.

I think we only need to search for the first end boundary; if there's
anything beyond that, return is_mbox true.  So we only need to fully
process the first email, and we can stop searching at that point.

-- 
Álvaro Herrera                                http://www.twitter.com/alvherre
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"


* Re: notmuch ignoring alot of emails
  2019-06-30 17:29     ` Tomi Ollila
@ 2019-07-01 15:36       ` Alvaro Herrera
  2019-11-16 17:40         ` David Bremner
  0 siblings, 1 reply; 19+ messages in thread
From: Alvaro Herrera @ 2019-07-01 15:36 UTC (permalink / raw)
  To: Tomi Ollila; +Cc: notmuch

On 2019-Jun-30, Tomi Ollila wrote:

> Just checking for a line starting with 'From ' would be pretty naïve, since
> From may be the first word of any line in the text body.

Even so, early mail systems relied on there not being any such lines,
and they escaped those lines to be ">From" or to use quoted-printable
encoding.  GMime has bespoke code to do this, in fact.  Mail systems
stopped doing this escaping after MIME boundaries got more widely used,
I suppose.
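
For reference, the ">From" escaping mentioned above can be sketched like
this. It follows the reversible variant usually called mboxrd, where
already-quoted lines get one more ">"; that choice and the function names
are assumptions for illustration, and this is not GMime's code.

```python
import re

# Matches "From " at the start of a line, with any number of leading ">".
_FROM_RE = re.compile(rb"^(>*From )", re.MULTILINE)
_QUOTED_FROM_RE = re.compile(rb"^>(>*From )", re.MULTILINE)


def escape_from_lines(body: bytes) -> bytes:
    """Prefix every (possibly already quoted) "From " line with ">"."""
    return _FROM_RE.sub(rb">\1", body)


def unescape_from_lines(body: bytes) -> bytes:
    """Undo escape_from_lines by removing one level of ">" quoting."""
    return _QUOTED_FROM_RE.sub(rb"\1", body)
```

A body escaped this way never contains a bare "From " at the start of a
line, which is what lets a reader treat every such line as a message
separator.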

I think NNTP used content length much more extensively than email.  Of
course, NNTP has almost disappeared now ...

> If we had to do content scanning, then at least an empty line before
> From would be required, and the next lines would have to start like
> Received: someone@not.an.example
> Date: a date
> From: someone
> 
> (and then empty line... ;)
> 
> all this checking would be required and it could still fail (perhaps
> this content gets modified on the fly, but then a signature check, if
> this mail had one, could fail...)

This logic still fails if you have mail-like content in the mail, such
as attachments produced by "git format-patch".  Many open source lists
don't have this problem because they use "git send-email" instead, but
this is not universal.

> If there were a header that tells the length of the body, then things
> could be easier...

Early emails had Content-Length as a header, but it was not universal,
and nowadays it seems to have been abandoned as a practice; the MIME
content boundary is used universally (or at least I cannot find any
recent divergence from this practice.)

-- 
Álvaro Herrera                                http://www.twitter.com/alvherre


* Re: notmuch ignoring alot of emails
  2019-07-01 15:36       ` Alvaro Herrera
@ 2019-11-16 17:40         ` David Bremner
       [not found]           ` <87eey7szz6.fsf@eirikba.org>
  0 siblings, 1 reply; 19+ messages in thread
From: David Bremner @ 2019-11-16 17:40 UTC (permalink / raw)
  To: Alvaro Herrera, Tomi Ollila; +Cc: notmuch

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2019-Jun-30, Tomi Ollila wrote:
>
>> Just checking for a line starting with 'From ' would be pretty naïve, since
>> From may be the first word of any line in the text body.
>
> Even so, early mail systems relied on there not being any such lines,
> and they escaped those lines to be ">From" or to use quoted-printable
> encoding.  GMime has bespoke code to do this, in fact.  Mail systems
> stopped doing this escaping after MIME boundaries got more widely used,
> I suppose.

As far as I know this is still correct for mbox files in general.
In general confusion arises because notmuch has a strict idea of what an
mbox is (file starts with "From "), while other software takes a more
relaxed approach.


* Re: notmuch ignoring alot of emails
       [not found]           ` <87eey7szz6.fsf@eirikba.org>
@ 2019-11-17 12:31             ` David Bremner
  2019-11-17 13:46               ` David Bremner
  0 siblings, 1 reply; 19+ messages in thread
From: David Bremner @ 2019-11-17 12:31 UTC (permalink / raw)
  To: Eirik Byrkjeflot Anonsen, Alvaro Herrera, Tomi Ollila; +Cc: notmuch

Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:

>
> Or, notmuch could just look at the first line of the file. If it starts
> with "From ", it is an mbox. If it starts with a reasonable mail header,
> it is not an mbox. If it is neither, fall back to the old heuristics.
>

FTR, this is what happens now (although iirc the actual check is done by
GMime). So maybe I'm missing some context here.


* Re: notmuch ignoring alot of emails
  2019-11-17 12:31             ` David Bremner
@ 2019-11-17 13:46               ` David Bremner
       [not found]                 ` <87blt9tdjj.fsf@eirikba.org>
  0 siblings, 1 reply; 19+ messages in thread
From: David Bremner @ 2019-11-17 13:46 UTC (permalink / raw)
  To: Eirik Byrkjeflot Anonsen, Alvaro Herrera, Tomi Ollila; +Cc: notmuch

David Bremner <david@tethera.net> writes:

> Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
>
>>
>> Or, notmuch could just look at the first line of the file. If it starts
>> with "From ", it is an mbox. If it starts with a reasonable mail header,
>> it is not an mbox. If it is neither, fall back to the old heuristics.
>>
>
> FTR, this is what happens now (although iirc the actual check is done by
> GMime). So maybe I'm missing some context here.

Re-reading, I guess your point is maybe that we should ignore successive
unescaped "From "?  We're really trapped a bit by wanting to support
single-message mboxes. We tried to remove this support at one point, but
this caused so many problems we put it back.

d


* Re: notmuch ignoring alot of emails
       [not found]                 ` <87blt9tdjj.fsf@eirikba.org>
@ 2019-11-19 20:18                   ` David Bremner
  0 siblings, 0 replies; 19+ messages in thread
From: David Bremner @ 2019-11-19 20:18 UTC (permalink / raw)
  To: Eirik Byrkjeflot Anonsen, Alvaro Herrera, Tomi Ollila; +Cc: notmuch

Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:

> Then I can really only see three alternatives:
>
> 1. Ignore any "From " lines that aren't followed by something that looks
>    like it could reasonably be a mail header (as Tomi suggested). My
>    suspicion is that this would eliminate almost all false positives.
>    (Outside of mailing lists discussing mboxes, at least.)

This seems more hopeful to me than relying on Content-Length. I tried
(but failed) to quickly understand what GMime is doing to decide if
something is an mbox, but it seems possible that Jeff S (GMime
maintainer) might be receptive to something along those lines.

There is a GMIME_FORMAT_MBOX, but maybe something like
GMIME_FORMAT_RFC4155 could specify a stricter mbox format where each
"message" should be roughly RFC2822 formatted.
