From: Thomas Schwinge
To: notmuch@notmuchmail.org
Subject: [PATCH] restore: Be more liberal in which data to accept.
Date: Sat, 29 Oct 2011 12:40:07 +0200
Message-Id: <1319884807-7206-1-git-send-email-thomas@schwinge.name>

From: Thomas Schwinge

There are ``Message-ID''s out in the wild that contain spaces.
---

Hi!

Carl, the main question for you is: does this break sup-import operability?

Spammers are quite inventive in creating ``interesting'' Message-IDs.
Apparently, notmuch handles these fine internally, but they break a
dump/restore cycle:

    $ notmuch restore < ~/tmp/Mail-notmuch_dump/dump
    No filename given. Reading dump from stdin.
    Warning: Ignoring invalid input line: 3791856948.991306994491@m0.net Received:fromdialup-62.215.274.4.dial1.stamford([62.215.274.4] ([...])
    Warning: Ignoring invalid input line: PM200010:29:54 AM ([...])
    Warning: Ignoring invalid input line: PM200010:51:48 AM ([...])
    Warning: Ignoring invalid input line: PM200011:47:35 AM ([...])
    Warning: Ignoring invalid input line: PM200011:48:46 AM ([...])
    Warning: Ignoring invalid input line: PM200011:50:10 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:05 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:17 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:18 AM ([...])
    Warning: Ignoring invalid input line: PM200012:21:32 AM ([...])
    Warning: Ignoring invalid input line: PM20001:48:38 PM ([...])
    Warning: Ignoring invalid input line: PM20001:53:07 PM ([...])
    Warning: Ignoring invalid input line: PM20004:01:48 AM ([...])
    Warning: Ignoring invalid input line: PM20004:01:59 AM ([...])
    Warning: Ignoring invalid input line: PM20004:10:44 AM ([...])
    Warning: Ignoring invalid input line: PM20004:20:00 AM ([...])
    Warning: Ignoring invalid input line: PM20005:06:50 PM ([...])
    Warning: Ignoring invalid input line: PM20005:14:17 AM ([...])
    Warning: Ignoring invalid input line: PM20005:32:15 PM ([...])
    Warning: Ignoring invalid input line: PM20005:32:22 PM ([...])
    Warning: Ignoring invalid input line: PM20005:33:05 PM ([...])
    Warning: Ignoring invalid input line: PM20005:33:57 AM ([...])
    Warning: Ignoring invalid input line: PM20006:24:12 AM ([...])
    Warning: Ignoring invalid input line: PM20006:25:04 AM ([...])
    Warning: Ignoring invalid input line: PM20006:25:49 AM ([...])
    Warning: Ignoring invalid input line: PM20006:26:11 AM ([...])
    Warning: Ignoring invalid input line: PM20007:05:34 PM ([...])
    Warning: Ignoring invalid input line: PM2000PM 04:09:15 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÀü 11:07:41 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 12:42:47 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 12:42:48 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 5:58:28 ([...])
    Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 6:30:51 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:04 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:09 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:11 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:12 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:37:45 ([...])
    Warning: Ignoring invalid input line: Prospect Mailer 20000:38:10 ([...])

Thus, the sequence ``dump; remove all tags; restore'' is not nullipotent,
which it should be.

Especially noteworthy is probably the first one: it happens to have gotten
a Received line mangled into the Message-ID, and it ends with a space
character.
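
To illustrate why these lines are rejected, here is a minimal stand-alone
sketch that applies the two patterns from the patch below to a single dump
line.  It is not notmuch code: parse_dump_line is a hypothetical helper
introduced only for this illustration, and it calls plain POSIX regcomp
instead of notmuch's xregcomp wrapper.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Pattern used before the patch: the message-id must not contain spaces. */
#define STRICT_PATTERN  "^([^ ]+) \\(([^)]*)\\)$"
/* Pattern proposed by the patch: the message-id is any non-empty string. */
#define LIBERAL_PATTERN "^(.+) \\(([^)]*)\\)$"

/* Hypothetical helper (not part of notmuch): match one dump line against
 * PATTERN.  On a match, copy the message-id (first subexpression) into ID,
 * truncated to ID_SIZE - 1 bytes, and return 1.  Return 0 if the line does
 * not match, and -1 if the pattern does not compile or ID_SIZE is 0. */
static int
parse_dump_line (const char *pattern, const char *line,
                 char *id, size_t id_size)
{
    regex_t regex;
    regmatch_t match[3];
    size_t len;
    int rc;

    if (id_size == 0)
        return -1;
    if (regcomp (&regex, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec (&regex, line, 3, match, 0);
    regfree (&regex);
    if (rc != 0)
        return 0;

    /* match[1] holds the byte offsets of the message-id within LINE. */
    len = (size_t) (match[1].rm_eo - match[1].rm_so);
    if (len >= id_size)
        len = id_size - 1;
    memcpy (id, line + match[1].rm_so, len);
    id[len] = '\0';
    return 1;
}
```

With STRICT_PATTERN, a line whose message-id contains a space (such as the
``Prospect Mailer'' ones above) does not match at all, which is what
triggers the warning.  With LIBERAL_PATTERN, the same line matches and
everything before the final " (...)" group is taken as the message-id,
including a trailing space.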

Some more from the freak show:

    $MESSAGE_ID ([...])
    %CUSTOM_CHAR[8-10]$%CUSTOM_CHAR[8-10]$%CUSTOM_CHAR[8-10]@%CUSTOM_DOMAIN.msn.com ([...])
    %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110%RNDLCCHAR13@ ([...])
    %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110ucp@yahoo.com ([...])
    %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110vs@yahoo.com ([...])
    %RNDDIGIT27eq52md1$9rg57p%RNDDIGIT14$277ts40lsh@%RNDWORD13ivo4068 ([...])
    %RNDDIGIT27g10u874$3cqh62f%RNDDIGIT14$7fgo121wnwt@%RNDWORD13quw32712 ([...])
    %RNDDIGIT27mog75vx711$541xqm480xc%RNDDIGIT14$031nq1pk@%RNDWORD13av2979 ([...])
    %RNDDIGIT27nqf761drk7$7l4mza%RNDDIGIT14$96ijq17zq@%RNDWORD13b1779 ([...])
    %RNDDIGIT27q0tcg10$94pcn1mw%RNDDIGIT14$7x77pztx@%RNDWORD13ny7619 ([...])
    %RNDDIGIT27uiw866tv49$5c3rg%RNDDIGIT14$6jl43vv@%RNDWORD13uwh17820 ([...])
    %RNDDIGIT27x966lug3$0pr016r%RNDDIGIT14$8ye15k@%RNDWORD13qps90907 ([...])
    %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@ ([...])
    %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@bambi ([...])
    %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@wheelchair ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-%RNDLCCHAR13%RNDDIGIT13. ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-hi3.yahoo.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-xz24.yahoo.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@lutanist-%RNDLCCHAR13%RNDDIGIT13.msn.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@millipede-jfq402.yahoo.com ([...])
    %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@referenda-sgw04.yahoo.com ([...])
    %RNDDIGIT715.h8OheY%RNDDIGIT28@proffer5.o'brien%RNDDIGIT2yahoo.com ([...])
    %RNDDIGIT715.jt36NNBvbF%RNDDIGIT28@schematic5.myers%RNDDIGIT2yahoo.com ([...])
    %RNDDIGIT715.wz394MICrdY%RNDDIGIT28@agriculture6.city%RNDDIGIT2yahoo.com ([...])
    %RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13-%RNDDIGIT520-%RNDDIGIT1035@%RNDDIGIT13 ([...])
    %RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13-%RNDDIGIT520-%RNDDIGIT1035@pontiac%RNDDIGIT13 ([...])

Someone needs to improve their scripting language abilities...  But on the
other hand:

    $ notmuch search --output=files -- 'id:"$MESSAGE_ID"' | wc -l
    25

This goes along the lines of ``notmuch as a spam filter'': these are
different spam messages, but due to notmuch's Message-ID-based keying, they
are all coalesced into one.  ;-)

    000010ff21d1$00005c94$000024ca@smtp.mail.gr^M ([...])
    0000247e7459$0000617b$000030b1@mx1.777.net.cn^M ([...])
    200107261918.PAA15837@unix.harrisondigital.com^M ([...])
    20050131113558.GB4396@dragonfly.hU^S@hU^S@ ([...])
    5614105.1027079773228.JavaMail.à^U±@à^U± ([...])
    6428921.1027079772968.JavaMail.à^U±@à^U± ([...])
    6864195.1027080005012.JavaMail.à^U±@à^U± ([...])

Yes, these are really embedded carriage returns (^M; and whatever ^S and ^U
are).  These are handled fine.  (Replaced in this text by their ^x
representation.)

    1IO\225y@-00094R-XB@BSN-77-184-114.dsl.siol.net ([...])
    1IP\225o@-000C29-BR@shcn-4.unm.edu ([...])
    SAK.2002.05.10.kmfogibc@\212ù\222è ([...])
    SAK.2002.05.11.ckbbpbpe@\212ù\222è ([...])
    SAK.2002.05.11.qmgoaoai@\212ù\222è ([...])
    SAK.2002.05.12.cfolrrgc@\212ù\222è ([...])
    SAK.2002.05.12.chpbngla@\212ù\222è ([...])
    SAK.2002.05.12.cooajnlj@\212ù\222è ([...])
    SAK.2002.05.12.folfrldb@\212ù\222è ([...])
    SAK.2002.05.12.ncphnarn@\212ù\222è ([...])
    SAK.2002.05.12.tcjbjsoo@\212ù\222è ([...])

Embedded non-ASCII characters \212, \222, \225.  These are handled fine.
(Replaced in this text by their octal \xxx representation.)

Another approach would be to detect invalid Message-IDs (only allowing
valid ones as per the standard) at notmuch new time, and to replace these
with a generated Message-ID (as if it were missing completely).  But I
don't think we should generate a Message-ID unless we really need to.

Regards,
 Thomas

---
 notmuch-restore.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/notmuch-restore.c b/notmuch-restore.c
index e4a5355..122c3e7 100644
--- a/notmuch-restore.c
+++ b/notmuch-restore.c
@@ -56,12 +56,11 @@ notmuch_restore_command (unused (void *ctx), int argc, char *argv[])
 	input = stdin;
     }
 
-    /* Dump output is one line per message. We match a sequence of
-     * non-space characters for the message-id, then one or more
-     * spaces, then a list of space-separated tags as a sequence of
+    /* The input data is one line per message. First comes the message-id,
+     * then one space, then a list of space-separated tags as a sequence of
      * characters within literal '(' and ')'. */
     xregcomp (&regex,
-	      "^([^ ]+) \\(([^)]*)\\)$",
+	      "^(.+) \\(([^)]*)\\)$",
 	      REG_EXTENDED);
 
     while ((line_len = getline (&line, &line_size, input)) != -1) {
-- 
tg: (3bafdfc..) t/restore_liberal_regex (depends on: baseline)