From: Servilio Afre Puentes
To: Daniel Kahn Gillmor, notmuch@notmuchmail.org
Subject: Re: multilingual notmuch (and Content-Language)
In-Reply-To: <87d101v33s.fsf@fifthhorseman.net>
References: <87d101v33s.fsf@fifthhorseman.net>
Date: Wed, 21 Mar 2018 11:05:58 -0400
Message-ID: <87muz1v57t.fsf@mcmaster.ca>
Content-Type: text/plain
List-Id: "Use and development of the notmuch mail system."

On Sun, Mar 18 2018, Daniel Kahn Gillmor wrote:
> https://tools.ietf.org/html/rfc3282 describes a Content-Language:
> header. https://tools.ietf.org/html/rfc8255 describes
> a multipart/multilingual Content-Type.
>
> notmuch currently uses xapian with a hard-coded English stemmer which
> works great for me as a monolingual American, but limits the
> applicability of notmuch to Anglophones (people who speak English).
> That makes me sad.
>
> AIUI, xapian is pretty much committed to being a single-language
> indexer.

Have you seen the different stemmers it already has?
Reference: https://xapian.org/docs/sourcedoc/html/dir_430c089e7e18d7ac6ff937a35cc3312c.html

> But i just wanted to point out that it's possible that we
> could be smarter about this in notmuch, and wanted to make a space for
> possible design discussion.
>
> a few concrete suggestions (intended as brainstorming, feedback welcome):
>
>  * if we know our index expects english, and we have a message part that
>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>    that part.

I'd prefer leaving the choice of default stemmer to the user.

>  * during indexing, we could add a property to each message when we
>    discover a Content-Language header. this would let you do something
>    like "notmuch search property:lang=es" to find all messages
>    explicitly tagged as spanish.
>
>  * (pretty crazy) If we're willing to search in another language we
>    could add an additional xapian database configured for that language,
>    and we could index identified parts in that language.

Do we need a separate DB if we can use different stemmers dynamically?

>  * for text parts without a Content-Language: header, we could do some
>    concrete heuristics to guess the language. For example, choose the
>    1000 most popular words for each language we might know about, and
>    look for their presence in the text. Choose the language that is
>    most heavily represented, and store it in the index as a property.
>    this could be combined with the suggestions above.

+1 for heuristics.

> what do you think? what ideas are missing from the brainstorm above? I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

As an actively bilingual person (English and Spanish), I love this idea.

Servilio

-- 
Servilio Afre Puentes
Programmer/Analyst, SHARCNET project
RHPCS | http://www.rhpcs.mcmaster.ca
SHARCNET | https://sharcnet.ca
Compute Ontario | http://computeontario.ca
Compute/Calcul Canada | http://computecanada.ca
905-525-9140, x22540
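
A rough sketch of the word-list heuristic discussed above, combined with the
Content-Language check, might look like the following. It only uses Python's
standard library and is not notmuch or xapian code. The names COMMON_WORDS,
guess_language and message_language are hypothetical, and the ten-word lists
are tiny placeholders standing in for the "1000 most popular words" per
language. The returned value is the kind of thing that could be stored as the
lang property mentioned in the thread.

```python
import re
from email import message_from_string

# Hypothetical placeholder vocabularies; a real list would be much larger.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "that", "with", "for", "this", "it"},
    "es": {"el", "la", "los", "que", "de", "y", "es", "para", "con", "una"},
}


def guess_language(text):
    """Return the language whose common words occur most often in text.

    Returns None when no known word is found or when the top score is tied.
    """
    # Runs of letters only (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    top = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == top]
    return winners[0] if top > 0 and len(winners) == 1 else None


def message_language(raw_message):
    """Prefer an explicit Content-Language header, else guess from the body."""
    msg = message_from_string(raw_message)
    declared = msg.get("Content-Language")
    if declared:
        # "es", "es-MX" or "en, fr": keep the primary subtag of the first entry.
        return declared.split(",")[0].strip().split("-")[0].lower()
    body = msg.get_payload()
    # Multipart messages return a list here; this sketch skips them.
    return guess_language(body) if isinstance(body, str) else None
```

For example, message_language("Subject: hola\n\nEl gato es de la casa\n")
falls back to the word count because there is no Content-Language header.
Returning None on a tie or an empty score is one possible design choice; it
avoids tagging a message with a language the heuristic cannot justify.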