From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: Jani Nikula <jani@nikula.org>,
	notmuch@notmuchmail.org, Olly Betts <olly@survex.com>
Subject: Re: multilingual notmuch (and Content-Language)
Date: Mon, 19 Mar 2018 07:38:07 +0000
Message-ID: <87vadstt0g.fsf@fifthhorseman.net>
In-Reply-To: <87bmflrxgs.fsf@nikula.org>

On Sun 2018-03-18 21:32:35 +0200, Jani Nikula wrote:
> On Sun, 18 Mar 2018, Daniel Kahn Gillmor <dkg@fifthhorseman.net> wrote:
>>  * if we know our index expects english, and we have a message part that
>>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>>    that part.
>
> Why would we do that? Search mostly works just fine for non-English
> languages, it's just that the *stemming* is not right.
>
>> what do you think?  what ideas are missing from the brainstorm above?  I'd
>> love to hear from people with multilingual mailboxes about how we might
>> be able to make notmuch work better for them.
>
> With my limited understanding of this, stemming happens both at indexing
> and searching. Basically at indexing, the term generator indexes both
> the full and the stemmed version of words. I'm wondering if we could
> look at Content-Language (and missing that, heuristics), and (if the
> user so desires) use multiple term generators with different stemmers on
> a per document basis. Or, use non-stemming indexing for unidentified or
> unsupported languages. How far would that take us? Then, perhaps, we
> could also perform language specific queries?
>
> I don't know how feasible that is, or if it would require Xapian
> changes.

thanks, this is exactly the kind of promising idea i was hoping my dumb
questions and half-baked suggestions would provoke :)

Maybe Olly or someone else with deeper knowledge of Xapian can weigh in
about the feasibility of this proposal?
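
To make the proposal concrete, here is a rough sketch of the per-document
idea in plain Python. It does not use the Xapian API at all: the toy
suffix-stripping "stemmers", the TOY_SUFFIXES table, and the function
names are all hypothetical stand-ins for real per-language Snowball
stemmers. The only detail borrowed from Xapian is the convention of
storing stemmed terms with a "Z" prefix next to the unstemmed ones. Parts
with a missing or unsupported Content-Language get unstemmed terms only.

```python
import re

# Toy suffix tables: hypothetical stand-ins for real per-language
# stemmers. They are NOT linguistically correct.
TOY_SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "es": ("ando", "iendo", "es", "s"),
}


def toy_stem(word, lang):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES.get(lang, ()):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text, content_language=None):
    """Return the set of terms to index for one message part.

    Every word is indexed unstemmed. If the part's language has a
    stemmer, the stemmed form is added as well, with a "Z" prefix.
    Anything else (no header, unsupported language, or a header that
    lists several languages) falls back to unstemmed terms only.
    """
    # Reduce a tag such as "es-MX" to its primary subtag "es".
    lang = (content_language or "").split("-")[0].strip().lower()
    terms = set()
    for word in re.findall(r"\w+", text.lower()):
        terms.add(word)
        if lang in TOY_SUFFIXES:
            terms.add("Z" + toy_stem(word, lang))
    return terms


english = index_terms("indexing messages", "en")
spanish = index_terms("buscando mensajes", "es-MX")
unknown = index_terms("hyvää päivää", "fi")
```

A query would then have to be stemmed with the same language to hit the
"Z" terms, which is presumably where the language-specific queries Jani
mentions would come in.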

          --dkg
