unofficial mirror of notmuch@notmuchmail.org
* multilingual notmuch (and Content-Language)
@ 2018-03-18 15:02 Daniel Kahn Gillmor
  2018-03-18 18:22 ` David Bremner
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Daniel Kahn Gillmor @ 2018-03-18 15:02 UTC (permalink / raw)
  To: notmuch

https://tools.ietf.org/html/rfc3282 describes a Content-Language:
header.  https://tools.ietf.org/html/rfc8255 describes
a multipart/multilingual Content-Type.
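
As a rough illustration of reading that header, here is a minimal sketch
using Python's standard email package. The helper name
content_languages and the sample message are made up for this example,
and parenthesised comments inside the header value are not handled.

```python
from email import message_from_string
from email.policy import default


def content_languages(part):
    """Return the lowercased language tags from a part's Content-Language header."""
    raw = part.get("Content-Language", "")
    # The header may carry a comma-separated list of tags, e.g. "es, en".
    return [tag.strip().lower() for tag in str(raw).split(",") if tag.strip()]


# Hypothetical sample message, only for demonstration.
RAW = """\
From: alice@example.org
To: bob@example.org
Subject: hola
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: es, en

Hola, mundo.
"""

msg = message_from_string(RAW, policy=default)
for part in msg.walk():
    # Only text parts are interesting for indexing purposes.
    if part.get_content_maintype() == "text":
        print(part.get_content_type(), content_languages(part))
```

A part without the header yields an empty list, which a caller could
treat as "language unknown".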

notmuch currently uses xapian with a hard-coded English stemmer, which
works great for me as a monolingual American, but limits the
applicability of notmuch to Anglophones (people who speak English).
That makes me sad.

AIUI, xapian is pretty much committed to being a single-language
indexer.  But i just wanted to point out that it's possible that we
could be smarter about this in notmuch, and wanted to make a space for
possible design discussion.

a few concrete suggestions (intended as brainstorming, feedback welcome):

 * if we know our index expects english, and we have a message part that
   *is not* english (e.g. Content-Language: es), we could avoid indexing
   that part.

 * during indexing, we could add a property to each message when we
   discover a Content-Language header.  this would let you do something
   like "notmuch search property:lang=es" to find all messages
   explicitly tagged as spanish.

 * (pretty crazy) If we're willing to search in another language we
   could add an additional xapian database configured for that language,
   and we could index identified parts in that language.

 * for text parts without a Content-Language: header, we could do some
   concrete heuristics to guess the language.  For example, choose the
   1000 most popular words for each language we might know about, and
   look for their presence in the text.  Choose the language that is
   most heavily represented, and store it in the index as a property.
   this could be combined with the suggestions above.
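
The word-frequency heuristic in the last item could be sketched roughly
as follows. This is only an illustration: the word lists are tiny
stand-ins for the "1000 most popular words" mentioned above, and the
names COMMON_WORDS, guess_language and min_hits are invented for this
example.

```python
import re
from collections import Counter

# Tiny illustrative lists; a real table would hold far more words per language.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es", "por", "con"},
}


def guess_language(text, min_hits=3):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    hits = Counter()
    for lang, common in COMMON_WORDS.items():
        hits[lang] = sum(1 for word in words if word in common)
    lang, count = hits.most_common(1)[0]
    # Refuse to guess when there is too little evidence.
    return lang if count >= min_hits else None


print(guess_language("the cat is in the house and it is happy"))
print(guess_language("el gato está en la casa y es feliz con los niños"))
print(guess_language("xyzzy plugh"))
```

Returning None for weak evidence leaves room for the caller to fall back
to a default, rather than storing a wrong language property.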

what do you think?  what ideas are missing from the brainstorm above?  I'd
love to hear from people with multilingual mailboxes about how we might
be able to make notmuch work better for them.

Regards,

        --dkg

* Re: multilingual notmuch (and Content-Language)
  2018-03-18 15:02 multilingual notmuch (and Content-Language) Daniel Kahn Gillmor
@ 2018-03-18 18:22 ` David Bremner
  2018-03-18 19:32 ` Jani Nikula
  2018-03-21 15:05 ` Servilio Afre Puentes
  2 siblings, 0 replies; 5+ messages in thread
From: David Bremner @ 2018-03-18 18:22 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, notmuch

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> AIUI, xapian is pretty much committed to being a single-language
> indexer.  But i just wanted to point out that it's possible that we
> could be smarter about this in notmuch, and wanted to make a space for
> possible design discussion.
>

More precisely, it uses a single _stemmer_ when generating terms and
when parsing queries. Nothing says that the stemmer has to correspond to
a single human language. The stemmer is also configured at runtime, so
it could in principle be configurable per database. I mention the
possibility of a custom stemmer because that also seems like a natural
place to put things like Unicode normalization and accent removal.
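
For reference, Unicode normalization plus accent removal can be sketched
in a few lines of Python. This is a standalone text-folding function,
not a Xapian stemmer, and the name fold is made up for the example.

```python
import unicodedata


def fold(text):
    """Normalize text, drop combining marks (accents), and casefold."""
    # NFKD splits precomposed characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains and casefold for caseless matching.
    return unicodedata.normalize("NFC", stripped).casefold()


# Precomposed and decomposed spellings fold to the same string.
print(fold("café") == fold("cafe\u0301"))
print(fold("Búsqueda"))
```

Note that folding is lossy: it also merges words that differ only by an
accent (for example Spanish "año" and "ano"), so whether it is wanted
probably depends on the language.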

d

* Re: multilingual notmuch (and Content-Language)
  2018-03-18 15:02 multilingual notmuch (and Content-Language) Daniel Kahn Gillmor
  2018-03-18 18:22 ` David Bremner
@ 2018-03-18 19:32 ` Jani Nikula
  2018-03-19  7:38   ` Daniel Kahn Gillmor
  2018-03-21 15:05 ` Servilio Afre Puentes
  2 siblings, 1 reply; 5+ messages in thread
From: Jani Nikula @ 2018-03-18 19:32 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, notmuch

On Sun, 18 Mar 2018, Daniel Kahn Gillmor <dkg@fifthhorseman.net> wrote:
>  * if we know our index expects english, and we have a message part that
>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>    that part.

Why would we do that? Search mostly works just fine for non-English
languages, it's just that the *stemming* is not right.

> what do you think?  what ideas are missing from the brainstorm above?  I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

With my limited understanding of this, stemming happens both at indexing
and searching. Basically at indexing, the term generator indexes both
the full and the stemmed version of words. I'm wondering if we could
look at Content-Language (and failing that, heuristics), and (if the
user so desires) use multiple term generators with different stemmers on
a per-document basis. Or, use non-stemming indexing for unidentified or
unsupported languages. How far would that take us? Then, perhaps, we
could also perform language-specific queries?

I don't know how feasible that is, or if it would require Xapian
changes.
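
To make the selection step concrete, here is a minimal Python sketch of
picking a stemmer per document from its language tags, falling back to
no stemming. It does not call Xapian at all. The table, the function
name stemmer_for and the "none" fallback are assumptions for
illustration; the stemmer names are assumed to match those Xapian
accepts.

```python
# Hypothetical mapping from primary language subtag to a stemmer name.
STEMMER_FOR_TAG = {
    "en": "english",
    "es": "spanish",
    "fi": "finnish",
    "fr": "french",
    "de": "german",
}

# Assumed name for "do not stem"; used for unknown or unsupported languages.
NO_STEMMING = "none"


def stemmer_for(tags, enabled=frozenset(STEMMER_FOR_TAG)):
    """Pick a stemmer name from language tags, honouring the user's enabled set."""
    for tag in tags:
        # "es-MX" and "es" should select the same stemmer.
        primary = tag.lower().split("-", 1)[0]
        if primary in enabled and primary in STEMMER_FOR_TAG:
            return STEMMER_FOR_TAG[primary]
    return NO_STEMMING


print(stemmer_for(["es-MX"]))
print(stemmer_for(["ja", "en-US"]))
print(stemmer_for([]))
```

The enabled parameter stands in for the "if the user so desires" part:
a user who only wants English stemming would pass {"en"} and everything
else would be indexed unstemmed.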

BR,
Jani.

* Re: multilingual notmuch (and Content-Language)
  2018-03-18 19:32 ` Jani Nikula
@ 2018-03-19  7:38   ` Daniel Kahn Gillmor
  0 siblings, 0 replies; 5+ messages in thread
From: Daniel Kahn Gillmor @ 2018-03-19  7:38 UTC (permalink / raw)
  To: Jani Nikula, notmuch, Olly Betts

On Sun 2018-03-18 21:32:35 +0200, Jani Nikula wrote:
> On Sun, 18 Mar 2018, Daniel Kahn Gillmor <dkg@fifthhorseman.net> wrote:
>>  * if we know our index expects english, and we have a message part that
>>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>>    that part.
>
> Why would we do that? Search mostly works just fine for non-English
> languages, it's just that the *stemming* is not right.
>
>> what do you think?  what ideas are missing from the brainstorm above?  I'd
>> love to hear from people with multilingual mailboxes about how we might
>> be able to make notmuch work better for them.
>
> With my limited understanding of this, stemming happens both at indexing
> and searching. Basically at indexing, the term generator indexes both
> the full and the stemmed version of words. I'm wondering if we could
> look at Content-Language (and missing that, heuristics), and (if the
> user so desires) use multiple term generators with different stemmers on
> a per document basis. Or, use non-stemming indexing for unidentified or
> unsupported languages. How far would that take us? Then, perhaps, we
> could also perform language specific queries?
>
> I don't know how feasible that is, or if it would require Xapian
> changes.

thanks, this is exactly the kind of promising idea i was hoping my dumb
questions and half-baked suggestions would provoke :)

Maybe Olly or someone else with deeper knowledge of xapian can weigh in
about the feasibility of this proposal?

          --dkg

* Re: multilingual notmuch (and Content-Language)
  2018-03-18 15:02 multilingual notmuch (and Content-Language) Daniel Kahn Gillmor
  2018-03-18 18:22 ` David Bremner
  2018-03-18 19:32 ` Jani Nikula
@ 2018-03-21 15:05 ` Servilio Afre Puentes
  2 siblings, 0 replies; 5+ messages in thread
From: Servilio Afre Puentes @ 2018-03-21 15:05 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, notmuch

On Sun, Mar 18 2018, Daniel Kahn Gillmor wrote:

> https://tools.ietf.org/html/rfc3282 describes a Content-Language:
> header.  https://tools.ietf.org/html/rfc8255 describes
> a multipart/multilingual Content-Type.
>
> notmuch currently uses xapian with a hard-coded English stemmer, which
> works great for me as a monolingual American, but limits the
> applicability of notmuch to Anglophones (people who speak English).
> That makes me sad.
>
> AIUI, xapian is pretty much committed to being a single-language
> indexer.

Have you seen the different stemmers it already has? Reference:

 https://xapian.org/docs/sourcedoc/html/dir_430c089e7e18d7ac6ff937a35cc3312c.html

> But i just wanted to point out that it's possible that we
> could be smarter about this in notmuch, and wanted to make a space for
> possible design discussion.
>
> a few concrete suggestions (intended as brainstorming, feedback welcome):
>
>  * if we know our index expects english, and we have a message part that
>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>    that part.

I'd prefer leaving the choice of default stemmer to the user.

>  * during indexing, we could add a property to each message when we
>    discover a Content-Language header.  this would let you do something
>    like "notmuch search property:lang=es" to find all messages
>    explicitly tagged as spanish.
>
>  * (pretty crazy) If we're willing to search in another language we
>    could add an additional xapian database configured for that language,
>    and we could index identified parts in that language.

Do we need a separate DB if we can switch stemmers dynamically?

>  * for text parts without a Content-Language: header, we could do some
>    concrete heuristics to guess the language.  For example, choose the
>    1000 most popular words for each language we might know about, and
>    look for their presence in the text.  Choose the language that is
>    most heavily represented, and store it in the index as a property.
>    this could be combined with the suggestions above.

+1 for heuristics.

> what do you think?  what ideas are missing from the brainstorm above?  I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

As an actively bilingual person (English and Spanish), I love this idea.

Servilio

-- 

Servilio Afre Puentes
Programmer/Analyst, SHARCNET project
RHPCS | http://www.rhpcs.mcmaster.ca
SHARCNET | https://sharcnet.ca
Compute Ontario | http://computeontario.ca
Compute/Calcul Canada | http://computecanada.ca

905-525-9140, x22540
