From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jani Nikula
To: Daniel Kahn Gillmor, notmuch@notmuchmail.org
Subject: Re: multilingual notmuch (and Content-Language)
In-Reply-To: <87d101v33s.fsf@fifthhorseman.net>
References: <87d101v33s.fsf@fifthhorseman.net>
Date: Sun, 18 Mar 2018 21:32:35 +0200
Message-ID: <87bmflrxgs.fsf@nikula.org>
MIME-Version: 1.0
Content-Type: text/plain
List-Id: "Use and development of the notmuch mail system."

On Sun, 18 Mar 2018, Daniel Kahn Gillmor wrote:
> * if we know our index expects english, and we have a message part that
> *is not* english (e.g. Content-Language: es), we could avoid indexing
> that part.

Why would we do that? Search mostly works just fine for non-English
languages, it's just that the *stemming* is not right.

> what do you think? what ideas are missing from the branstorm above? I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

With my limited understanding of this, stemming happens both at indexing
and searching. Basically at indexing, the term generator indexes both
the full and the stemmed version of words.

I'm wondering if we could look at Content-Language (and missing that,
heuristics), and (if the user so desires) use multiple term generators
with different stemmers on a per document basis. Or, use non-stemming
indexing for unidentified or unsupported languages. How far would that
take us?

Then, perhaps, we could also perform language specific queries?

I don't know how feasible that is, or if it would require Xapian
changes.

BR,
Jani.
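
To make the idea above a bit more concrete, here is a rough toy sketch in
plain Python. It is not notmuch or Xapian code. Everything in it is an
assumption made up for illustration: the suffix tables stand in for real
stemmers, the `Z<lang>:` term prefix is a hypothetical way to keep stemmed
terms from different languages apart, and Content-Language is treated as a
single tag (a real header may list several). It only shows the shape of
"index full and stemmed forms, pick the stemmer per document, skip stemming
when the language is unknown, and stem the query in a chosen language".

```python
from collections import defaultdict

# Toy suffix-stripping rules. These are stand-ins for real stemmers and are
# deliberately crude; they exist only to make the sketch runnable.
TOY_SUFFIXES = {
    "en": ("ing", "ed", "es", "s"),
    "es": ("ando", "iendo", "ar", "es", "s"),
}


def toy_stem(word, lang):
    """Strip the first matching suffix for lang, keeping at least 3 letters."""
    for suffix in TOY_SUFFIXES.get(lang, ()):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text, content_language=None):
    """Return the terms to index for one message part.

    The unstemmed word is always indexed. A stemmed term is added only when
    the (hypothetical) per-document language has a stemmer available.
    """
    # Reduce a tag such as "es-MX" to its primary language subtag.
    lang = (content_language or "").split("-")[0].lower()
    terms = set()
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        terms.add(word)
        if lang in TOY_SUFFIXES:
            # Hypothetical prefix so stems of different languages never collide.
            terms.add("Z" + lang + ":" + toy_stem(word, lang))
    return terms


def build_index(docs):
    """docs maps doc id -> (content_language or None, text)."""
    index = defaultdict(set)
    for doc_id, (lang, text) in docs.items():
        for term in index_terms(text, lang):
            index[term].add(doc_id)
    return index


def search(index, word, lang=None):
    """Match the exact word, plus its stem when a query language is given."""
    word = word.lower()
    hits = set(index.get(word, ()))
    if lang in TOY_SUFFIXES:
        hits |= index.get("Z" + lang + ":" + toy_stem(word, lang), set())
    return hits


docs = {
    1: ("en", "Indexing messages"),
    2: ("es-MX", "Buscando mensajes"),
    3: (None, "Viestit haussa"),  # unidentified language: no stemming
}
index = build_index(docs)
print(search(index, "indexed", "en"))  # stemmed match on document 1
print(search(index, "buscar", "es"))   # stemmed match on document 2
print(search(index, "viestit"))        # exact match only, document 3
```

In this sketch a query stemmed as English never matches the Spanish stems,
which is roughly what a language specific query would mean. Whether the same
separation can be had in Xapian without changes is exactly the open question
above.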