unofficial mirror of notmuch@notmuchmail.org
From: Servilio Afre Puentes <afrepues@mcmaster.ca>
To: Daniel Kahn Gillmor <dkg@fifthhorseman.net>, notmuch@notmuchmail.org
Subject: Re: multilingual notmuch (and Content-Language)
Date: Wed, 21 Mar 2018 11:05:58 -0400
Message-ID: <87muz1v57t.fsf@mcmaster.ca>
In-Reply-To: <87d101v33s.fsf@fifthhorseman.net>

On Sun, Mar 18 2018, Daniel Kahn Gillmor wrote:

> https://tools.ietf.org/html/rfc3282 describes a Content-Language:
> header.  https://tools.ietf.org/html/rfc8255 describes
> a multipart/multilingual Content-Type.
>
> notmuch currently uses xapian with a hard-coded English stemmer which
> works great for me as a monolingual American, but limits the
> applicability of notmuch to Anglophones (people who speak English).
> That makes me sad.
>
> AIUI, xapian is pretty much committed to being a single-language
> indexer.

Have you seen the different stemmers it already has? Reference:

 https://xapian.org/docs/sourcedoc/html/dir_430c089e7e18d7ac6ff937a35cc3312c.html

> But i just wanted to point out that it's possible that we
> could be smarter about this in notmuch, and wanted to make a space for
> possible design discussion.
>
> a few concrete suggestions (intended as brainstorming, feedback welcome):
>
>  * if we know our index expects english, and we have a message part that
>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>    that part.

I'd prefer leaving the choice of default stemmer to the user.

>  * during indexing, we could add a property to each message when we
>    discover a Content-Language header.  this would let you do something
>    like "notmuch search property:lang=es" to find all messages
>    explicitly tagged as spanish.
>
>  * (pretty crazy) If we're willing to search in another language we
>    could add an additional xapian database configured for that language, and
>    we could index identified parts in that language.

Do we need a separate DB if we can use different stemmers dynamically?
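
To make the question concrete, here is a rough sketch of picking a
stemmer per text part from its Content-Language header. It uses only
Python's standard email module and is not notmuch code. The
STEMMER_FOR_LANG table, part_languages() and stemmer_name() are
hypothetical names made up for illustration; the stemmer names are
meant to match Xapian's English-language stemmer names, but how (or
whether) notmuch could switch stemmers per part is an open question.

```python
import email

# Hypothetical mapping from a primary language subtag to a stemmer name.
# This table is an assumption for illustration, not notmuch behaviour.
STEMMER_FOR_LANG = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
}


def part_languages(raw_message):
    """Yield (content_type, primary_language_or_None) for each text part."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        header = part.get("Content-Language")
        lang = None
        if header:
            # RFC 3282 allows a comma-separated list of tags; take the
            # first one and keep only its primary subtag ("es-MX" -> "es").
            first = header.split(",")[0].strip().lower()
            lang = first.split("-")[0] or None
        yield part.get_content_type(), lang


def stemmer_name(lang, default="english"):
    """Return the stemmer name for a language, or the user's default."""
    return STEMMER_FOR_LANG.get(lang, default)
```

The default argument is where a user-configured default stemmer would
plug in for parts that carry no Content-Language header.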

>  * for text parts without a Content-Language: header, we could do some
>    concrete heuristics to guess the language.  For example, choose the
>    1000 most popular words for each language we might know about, and
>    look for their presence in the text.  Choose the language that is
>    most heavily represented, and store it in the index as a property.
>    this could be combined with the suggestions above.

+1 for heuristics.
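
A minimal sketch of the common-word heuristic described above, in
plain Python. The word lists here are tiny placeholders I made up (a
real list would hold something like the 1000 most popular words per
language), and guess_language() and min_hits are hypothetical names,
not anything in notmuch.

```python
import re

# Tiny illustrative word lists; placeholders, not real frequency data.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "that", "this", "with", "for", "not"},
    "es": {"el", "la", "de", "que", "y", "los", "es", "por", "con", "para", "una"},
}


def guess_language(text, min_hits=3):
    """Return the best-represented language code, or None if unclear."""
    # Split into runs of letters (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in common)
        for lang, common in COMMON_WORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_lang, best_hits = ranked[0]
    # Refuse to guess on too little evidence or on a tie.
    if best_hits < min_hits:
        return None
    if len(ranked) > 1 and ranked[1][1] == best_hits:
        return None
    return best_lang
```

Returning None on weak evidence or a tie leaves room to fall back to
the default stemmer instead of storing a wrong language property.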

> what do you think?  what ideas are missing from the brainstorm above?  I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

As an actively bilingual person (English and Spanish), I love this idea.

Servilio

-- 

Servilio Afre Puentes
Programmer/Analyst, SHARCNET project
RHPCS | http://www.rhpcs.mcmaster.ca
SHARCNET | https://sharcnet.ca
Compute Ontario | http://computeontario.ca
Compute/Calcul Canada | http://computecanada.ca

905-525-9140, x22540
