From: Daniel Kahn Gillmor
To: notmuch@notmuchmail.org
Subject: multilingual notmuch (and Content-Language)
Date: Sun, 18 Mar 2018 15:02:31 +0000
Message-ID: <87d101v33s.fsf@fifthhorseman.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha512; protocol="application/pgp-signature"

--=-=-=
Content-Type: text/plain

https://tools.ietf.org/html/rfc3282 describes a Content-Language: header.

https://tools.ietf.org/html/rfc8255 describes a multipart/multilingual Content-Type.

notmuch currently uses xapian with a hard-coded English stemmer, which works great for me as a monolingual American, but limits the applicability of notmuch to Anglophones (people who speak English). That makes me sad.

AIUI, xapian is pretty much committed to being a single-language indexer. But i just wanted to point out that we could possibly be smarter about this in notmuch, and wanted to make a space for design discussion.

A few concrete suggestions (intended as brainstorming, feedback welcome):

 * if we know our index expects English, and we have a message part that *is not* English (e.g. Content-Language: es), we could avoid indexing that part.

 * during indexing, we could add a property to each message when we discover a Content-Language header. This would let you do something like "notmuch search property:lang=es" to find all messages explicitly tagged as Spanish.

 * (pretty crazy) if we're willing to search in another language, we could add an additional xapian database configured for that language, and we could index the parts identified as being in that language there.

 * for text parts without a Content-Language: header, we could use some concrete heuristics to guess the language. For example, choose the 1000 most popular words for each language we might know about, and look for their presence in the text. Choose the language that is most heavily represented, and store it in the index as a property. This could be combined with the suggestions above.

What do you think? What ideas are missing from the brainstorm above?

I'd love to hear from people with multilingual mailboxes about how we might be able to make notmuch work better for them.

Regards,

        --dkg

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQTTaP514aqS9uSbmdJsHx7ezFD6UwUCWq5/hwAKCRBsHx7ezFD6
U4rKAQChnnujlE6zbbXD6bynI0ffhkz6yrQmJC/zE6eNxsiRJQD7BBevkL1TXkc8
hUmFyl2FlrRSzWAeCSRI0nGWgCnPTwY=
=VQmi
-----END PGP SIGNATURE-----
--=-=-=--
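
To make the Content-Language and word-frequency suggestions above a bit more tangible, here is a rough Python sketch. It is not notmuch code: the function names (guess_language, part_language) and the tiny word lists are made up for illustration, and a real list would hold something like the 1000 most common words per language. It prefers an explicit Content-Language header and only falls back to guessing when the header is absent.

```python
import re
from email import message_from_string

# Hypothetical, deliberately tiny word lists; a real implementation would
# load roughly the 1000 most common words for each supported language.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no known word was seen at all, decline to guess.
    return best if scores[best] > 0 else None


def part_language(part):
    """Prefer a declared Content-Language; otherwise guess from the text."""
    declared = part.get("Content-Language")
    if declared:
        # RFC 3282 allows a comma-separated list of tags; this sketch takes
        # the first one and does not handle parenthesized comments.
        return declared.split(",")[0].strip().lower()
    return guess_language(part.get_payload())


if __name__ == "__main__":
    msg = message_from_string("Content-Type: text/plain\n\nel gato es de la casa")
    print(part_language(msg))
```

The returned value is what might end up stored as the "lang" property from the second suggestion. Ties and very short texts would need more thought than this sketch gives them.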