unofficial mirror of notmuch@notmuchmail.org
* [PATCH] restore: Be more liberal in which data to accept.
@ 2011-10-29 10:40 Thomas Schwinge
  2011-11-15 13:52 ` David Bremner
  2012-01-07 13:37 ` David Bremner
  0 siblings, 2 replies; 4+ messages in thread
From: Thomas Schwinge @ 2011-10-29 10:40 UTC
  To: notmuch

From: Thomas Schwinge <thomas@schwinge.name>

There are ``Message-ID''s out in the wild that contain spaces.

---


Hi!

Carl, the main question for you is: does this break sup-import
operability?


Spammers are quite inventive at creating ``interesting'' Message-IDs.
Apparently, notmuch handles these fine internally, but they break a
dump/restore cycle:

    $ notmuch restore < ~/tmp/Mail-notmuch_dump/dump
    No filename given. Reading dump from stdin.
    Warning: Ignoring invalid input line: 3791856948.991306994491@m0.net Received:fromdialup-62.215.274.4.dial1.stamford([62.215.274.4]  ([...])
    Warning: Ignoring invalid input line: PM200010:29:54 AM ([...])
    Warning: Ignoring invalid input line: PM200010:51:48 AM ([...])
    Warning: Ignoring invalid input line: PM200011:47:35 AM ([...])
    Warning: Ignoring invalid input line: PM200011:48:46 AM ([...])
    Warning: Ignoring invalid input line: PM200011:50:10 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:05 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:17 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:18 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:32 AM ([...])
    Warning: Ignoring invalid input line: PM20001:48:38 PM ([...])
    Warning: Ignoring invalid input line: PM20001:53:07 PM ([...])
    Warning: Ignoring invalid input line: PM20004:01:48 AM ([...])
    Warning: Ignoring invalid input line: PM20004:01:59 AM ([...])
    Warning: Ignoring invalid input line: PM20004:10:44 AM ([...])
    Warning: Ignoring invalid input line: PM20004:20:00 AM ([...])
    Warning: Ignoring invalid input line: PM20005:06:50 PM ([...])
    Warning: Ignoring invalid input line: PM20005:14:17 AM ([...])
    Warning: Ignoring invalid input line: PM20005:32:15 PM ([...])
    Warning: Ignoring invalid input line: PM20005:32:22 PM ([...])
    Warning: Ignoring invalid input line: PM20005:33:05 PM ([...])
    Warning: Ignoring invalid input line: PM20005:33:57 AM ([...])
    Warning: Ignoring invalid input line: PM20006:24:12 AM ([...])
    Warning: Ignoring invalid input line: PM20006:25:04 AM ([...])
    Warning: Ignoring invalid input line: PM20006:25:49 AM ([...])
    Warning: Ignoring invalid input line: PM20006:26:11 AM ([...])
    Warning: Ignoring invalid input line: PM20007:05:34 PM ([...])
    Warning: Ignoring invalid input line: PM2000PM 04:09:15 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÀü 11:07:41 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 12:42:47 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 12:42:48 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 5:58:28 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 6:30:51 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:04 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:09 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:11 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:12 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:45 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:38:10 ([...])

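Thus, dump; remove all tags; restore is not nullipotent (that is, it does
not leave the tag state unchanged), which it should be: the messages on
the ignored lines lose their tags.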

Especially noteworthy is probably the first one: it happens to have
gotten a Received line mangled into the Message-ID, and it ends with a
space character.

Some more from the freak show:

    $MESSAGE_ID ([...])
    %CUSTOM_CHAR[8-10]$%CUSTOM_CHAR[8-10]$%CUSTOM_CHAR[8-10]@%CUSTOM_DOMAIN.msn.com ([...])
    %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110%RNDLCCHAR13@ ([...])
    %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110ucp@yahoo.com ([...])
    %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110vs@yahoo.com ([...])
    %RNDDIGIT27eq52md1$9rg57p%RNDDIGIT14$277ts40lsh@%RNDWORD13ivo4068 ([...])
    %RNDDIGIT27g10u874$3cqh62f%RNDDIGIT14$7fgo121wnwt@%RNDWORD13quw32712 ([...])
    %RNDDIGIT27mog75vx711$541xqm480xc%RNDDIGIT14$031nq1pk@%RNDWORD13av2979 ([...])
    %RNDDIGIT27nqf761drk7$7l4mza%RNDDIGIT14$96ijq17zq@%RNDWORD13b1779 ([...])
    %RNDDIGIT27q0tcg10$94pcn1mw%RNDDIGIT14$7x77pztx@%RNDWORD13ny7619 ([...])
    %RNDDIGIT27uiw866tv49$5c3rg%RNDDIGIT14$6jl43vv@%RNDWORD13uwh17820 ([...])
    %RNDDIGIT27x966lug3$0pr016r%RNDDIGIT14$8ye15k@%RNDWORD13qps90907 ([...])
    %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@ ([...])
    %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@bambi ([...])
    %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@wheelchair ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-%RNDLCCHAR13%RNDDIGIT13. ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-hi3.yahoo.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-xz24.yahoo.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@lutanist-%RNDLCCHAR13%RNDDIGIT13.msn.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@millipede-jfq402.yahoo.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@referenda-sgw04.yahoo.com ([...])
    %RNDDIGIT715.h8OheY%RNDDIGIT28@proffer5.o'brien%RNDDIGIT2yahoo.com ([...])
    %RNDDIGIT715.jt36NNBvbF%RNDDIGIT28@schematic5.myers%RNDDIGIT2yahoo.com ([...])
    %RNDDIGIT715.wz394MICrdY%RNDDIGIT28@agriculture6.city%RNDDIGIT2yahoo.com ([...])
    %RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13-%RNDDIGIT520-%RNDDIGIT1035@%RNDDIGIT13 ([...])
    %RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13-%RNDDIGIT520-%RNDDIGIT1035@pontiac%RNDDIGIT13 ([...])

Someone needs to improve their scripting language abilities...  But on
the other hand:

    $ notmuch search --output=files -- 'id:"$MESSAGE_ID"' | wc -l
    25

This goes along the lines of ``notmuch as a spam filter'': these are
different spam messages, but due to notmuch's Message-ID-based keying,
they are all coalesced into one.  ;-)

    000010ff21d1$00005c94$000024ca@smtp.mail.gr^M ([...])
    0000247e7459$0000617b$000030b1@mx1.777.net.cn^M ([...])
    200107261918.PAA15837@unix.harrisondigital.com^M ([...])
    20050131113558.GB4396@dragonfly.hU^S@hU^S@ ([...])
    5614105.1027079773228.JavaMail.à^U±@à^U± ([...])
    6428921.1027079772968.JavaMail.à^U±@à^U± ([...])
    6864195.1027080005012.JavaMail.à^U±@à^U± ([...])

Yes, these are really embedded control characters: carriage returns (^M),
and also ^S (DC3) and ^U (NAK).  These are handled fine.  (Replaced in
this text by their ^x representation.)

    1IO\225y@-00094R-XB@BSN-77-184-114.dsl.siol.net ([...])
    1IP\225o@-000C29-BR@shcn-4.unm.edu ([...])
    SAK.2002.05.10.kmfogibc@\212ù\222è ([...])
    SAK.2002.05.11.ckbbpbpe@\212ù\222è ([...])
    SAK.2002.05.11.qmgoaoai@\212ù\222è ([...])
    SAK.2002.05.12.cfolrrgc@\212ù\222è ([...])
    SAK.2002.05.12.chpbngla@\212ù\222è ([...])
    SAK.2002.05.12.cooajnlj@\212ù\222è ([...])
    SAK.2002.05.12.folfrldb@\212ù\222è ([...])
    SAK.2002.05.12.ncphnarn@\212ù\222è ([...])
    SAK.2002.05.12.tcjbjsoo@\212ù\222è ([...])

Embedded non-ASCII characters \212, \222, \225.  These are handled fine.
(Replaced in this text by their octal \xxx representation.)


Another approach would be to detect invalid Message-IDs (only allow valid
ones as per the standard) at notmuch new time, and replace these with a
generated Message-ID (as if it were missing completely).  But I don't think
we should generate a Message-ID unless we really need to.
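
For the sake of discussion, a check of this kind could look roughly like
the following.  This is only a sketch: the helper name
message_id_is_plausible is hypothetical and does not exist in notmuch, and
the rule it applies (printable ASCII only, no whitespace, exactly one '@')
is an assumption that is much cruder than the actual grammar in the
standard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of notmuch.  Returns 1 if `id` looks
 * like a sane Message-ID under a deliberately crude rule (an assumption
 * for illustration, not the grammar from the standard): non-empty,
 * printable ASCII only, no whitespace, and exactly one '@'. */
int
message_id_is_plausible (const char *id)
{
    size_t at_signs = 0;
    const unsigned char *p;

    assert (id != NULL);

    if (*id == '\0')
        return 0;

    for (p = (const unsigned char *) id; *p != '\0'; p++) {
        /* Reject spaces, control characters, DEL and non-ASCII bytes. */
        if (*p <= 0x20 || *p >= 0x7f)
            return 0;
        if (*p == '@')
            at_signs++;
    }

    return at_signs == 1;
}
```

With this rule, an ID with a trailing carriage return, the ``Prospect
Mailer'' IDs, and $MESSAGE_ID would all be rejected and would need a
generated replacement, which is exactly the cost mentioned above.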


Grüße,
 Thomas


---

 notmuch-restore.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/notmuch-restore.c b/notmuch-restore.c
index e4a5355..122c3e7 100644
--- a/notmuch-restore.c
+++ b/notmuch-restore.c
@@ -56,12 +56,11 @@ notmuch_restore_command (unused (void *ctx), int argc, char *argv[])
 	input = stdin;
     }
 
-    /* Dump output is one line per message. We match a sequence of
-     * non-space characters for the message-id, then one or more
-     * spaces, then a list of space-separated tags as a sequence of
+    /* The input data is one line per message.  First comes the message-id,
+     * then one space, then a list of space-separated tags as a sequence of
      * characters within literal '(' and ')'. */
     xregcomp (&regex,
-	      "^([^ ]+) \\(([^)]*)\\)$",
+	      "^(.+) \\(([^)]*)\\)$",
 	      REG_EXTENDED);
 
     while ((line_len = getline (&line, &line_size, input)) != -1) {
-- 
tg: (3bafdfc..) t/restore_liberal_regex (depends on: baseline)
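
To see the effect of the regex change outside of notmuch, here is a
minimal sketch that uses the POSIX regcomp/regexec interface directly.
The two patterns are the ones from the diff; the helper name
match_dump_line and its buffer handling are made up for illustration and
are not part of notmuch-restore.c.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Illustrative helper (not from notmuch).  Returns 1 if `line` matches
 * `pattern` (POSIX extended syntax), 0 otherwise.  On a match, the first
 * capture group (the message-id) is copied into `id`, truncated to fit. */
int
match_dump_line (const char *pattern, const char *line,
                 char *id, size_t id_size)
{
    regex_t regex;
    regmatch_t match[3];
    int matched;

    assert (id_size > 0);

    if (regcomp (&regex, pattern, REG_EXTENDED) != 0)
        return 0;

    matched = (regexec (&regex, line, 3, match, 0) == 0);
    if (matched) {
        size_t len = (size_t) (match[1].rm_eo - match[1].rm_so);

        if (len >= id_size)
            len = id_size - 1;
        memcpy (id, line + match[1].rm_so, len);
        id[len] = '\0';
    }

    regfree (&regex);
    return matched;
}
```

Take a line such as "Prospect Mailer 20000:37:04 (inbox unread)", where
the tags are invented for the example.  The old pattern
"^([^ ]+) \\(([^)]*)\\)$" does not match it, because [^ ]+ stops at the
first space and the next character is not '('.  The new pattern
"^(.+) \\(([^)]*)\\)$" matches and captures the whole
"Prospect Mailer 20000:37:04" as the message-id.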


* Re: [PATCH] restore: Be more liberal in which data to accept.
  2011-10-29 10:40 [PATCH] restore: Be more liberal in which data to accept Thomas Schwinge
@ 2011-11-15 13:52 ` David Bremner
  2012-01-07 13:37 ` David Bremner
  1 sibling, 0 replies; 4+ messages in thread
From: David Bremner @ 2011-11-15 13:52 UTC
  To: Thomas Schwinge, notmuch

On Sat, 29 Oct 2011 12:40:07 +0200, Thomas Schwinge <thomas@schwinge.name> wrote:

> From: Thomas Schwinge <thomas@schwinge.name>
> 
> There are ``Message-ID''s out in the wild that contain spaces.
> 
> ---

> Carl, the main question for you is: does this break sup-import
> operability?


Any other sup users care to comment?


* Re: [PATCH] restore: Be more liberal in which data to accept.
  2011-10-29 10:40 [PATCH] restore: Be more liberal in which data to accept Thomas Schwinge
  2011-11-15 13:52 ` David Bremner
@ 2012-01-07 13:37 ` David Bremner
  2012-01-07 18:20   ` Jameson Graef Rollins
  1 sibling, 1 reply; 4+ messages in thread
From: David Bremner @ 2012-01-07 13:37 UTC
  To: Thomas Schwinge, notmuch

On Sat, 29 Oct 2011 12:40:07 +0200, Thomas Schwinge <thomas@schwinge.name> wrote:
> From: Thomas Schwinge <thomas@schwinge.name>
> 
> There are ``Message-ID''s out in the wild that contain spaces.
> 

> Spammers are quite inventive at creating ``interesting'' Message-IDs.
> Apparently, notmuch handles these fine internally, but they break a
> dump/restore cycle:

Two questions.

1) Do you think we should change the current regex in addition to providing
a new space-tolerant format
(id:"1324214111-32079-1-git-send-email-david@tethera.net")? I guess it
doesn't really hurt. In your case, notmuch is probably already creating
dump files that sup can't read.

2) Can you share (private email is fine) a few of those really
   problematic messages so I can use them to test the new dump/restore code?


* Re: [PATCH] restore: Be more liberal in which data to accept.
  2012-01-07 13:37 ` David Bremner
@ 2012-01-07 18:20   ` Jameson Graef Rollins
  0 siblings, 0 replies; 4+ messages in thread
From: Jameson Graef Rollins @ 2012-01-07 18:20 UTC
  To: David Bremner, Thomas Schwinge, notmuch


On Sat, 07 Jan 2012 09:37:20 -0400, David Bremner <david@tethera.net> wrote:
> 2) Can you share (private email is fine) a few of those really
>    problematic messages so I can use them to test the new dump/restore code?

I think the better option would be to create a test that runs through
all the horrible message ids we can dream up.
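
As a rough idea of what such a test could check, here is a sketch of a
round-trip over a small table of IDs taken from this thread.  It assumes
the one-line "<id> (<tags>)" dump format and the liberal regex from the
patch; the function names, the "inbox unread" tags, and the selection of
IDs are made up for illustration, and a real test would presumably live in
the notmuch test suite rather than in standalone C.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Formats one dump line as "<id> (<tags>)", parses it back with the
 * regex from the patch, and returns 1 if both fields survive unchanged. */
int
round_trips (const char *id, const char *tags)
{
    char line[1024];
    regex_t regex;
    regmatch_t match[3];
    int n, ok = 0;

    n = snprintf (line, sizeof line, "%s (%s)", id, tags);
    assert (n >= 0 && (size_t) n < sizeof line);

    if (regcomp (&regex, "^(.+) \\(([^)]*)\\)$", REG_EXTENDED) != 0)
        return 0;

    if (regexec (&regex, line, 3, match, 0) == 0) {
        size_t id_len = (size_t) (match[1].rm_eo - match[1].rm_so);
        size_t tags_len = (size_t) (match[2].rm_eo - match[2].rm_so);

        ok = id_len == strlen (id)
            && tags_len == strlen (tags)
            && memcmp (line + match[1].rm_so, id, id_len) == 0
            && memcmp (line + match[2].rm_so, tags, tags_len) == 0;
    }

    regfree (&regex);
    return ok;
}

/* Returns how many of the sample IDs fail to round-trip. */
int
count_round_trip_failures (void)
{
    /* A few ASCII-only samples from this thread: embedded spaces, a
     * trailing carriage return, and an unexpanded template variable. */
    static const char *ids[] = {
        "Prospect Mailer 20000:37:04",
        "PM2000PM 04:09:15",
        "200107261918.PAA15837@unix.harrisondigital.com\r",
        "$MESSAGE_ID",
    };
    size_t i;
    int failures = 0;

    for (i = 0; i < sizeof ids / sizeof ids[0]; i++)
        if (! round_trips (ids[i], "inbox unread"))
            failures++;

    return failures;
}
```

The table deliberately leaves out the non-ASCII IDs, since how '.' treats
such bytes can depend on the locale and the regex implementation; those
would need separate attention in a real test.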

jamie.
