From: David Bremner <david@tethera.net>
To: Bruno Deremble <bruno.deremble@ens.fr>, notmuch@notmuchmail.org
Cc: Stefano Zacchiroli <zack@debian.org>
Subject: Re: accented characters
Date: Mon, 13 Nov 2017 09:22:36 -0400
Message-ID: <87efp2b9er.fsf@tethera.net>
In-Reply-To: <87h8tz8b2v.fsf@ens.fr>

Bruno Deremble <bruno.deremble@ens.fr> writes:

> A way to handle this could be to only index non accented words which
> requires to add a filter before the indexing process. I looked at the code
> and it seems that this should be handled by gmime? 
> there are also libraries that are supposed to do that such as 'unac'.
>
> Is it something that you have been exploring already?

We have discussed it a bit (with another francophone, in copy ;) ), but
I think no-one got very far.

I guess the ideal case would be to support both accented and
accent-free search. That would require adding some more terms to the
index (both the accented and unaccented versions). It's not clear to me
yet what kind of performance impact that would have.
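
As a rough illustration of what "both versions" would mean for the
index, here is a minimal Python sketch. It is not notmuch or Xapian
code; `strip_accents` and `index_terms` are hypothetical names, and the
tokenizer is just whitespace splitting. It only uses the standard
`unicodedata` module.

```python
import unicodedata


def strip_accents(word):
    """Return `word` with combining marks (accents) removed."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Collect the terms to index: each word plus its unaccented form."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)                 # accented (original) form
        terms.add(strip_accents(word))  # accent-free form
    return terms


if __name__ == "__main__":
    for term in sorted(index_terms("Le café est fermé")):
        print(term)
```

Only words that actually contain accents contribute a second term, so
the growth of the index would depend on how much accented text is in
the mailstore.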

Xapian already has something called "stemmers" (in xapian-core/languages
in the source tree), which do, among other things, strip accents. Those
are generally targeted at a single language, which I suspect is not
very useful for notmuch (even I, as a mostly-unilingual person, have a
fair amount of English, French, and German in my mailstore). Nonetheless,
a custom stemmer might be the right way to go, since that step is
happening anyway. Or perhaps people would be happy enough with being
able to set the stemmer (currently it is hardcoded to English). That
would be a relatively easy change to notmuch, but I don't know how many
people would find it a good tradeoff to lose English stemming
(i.e. searches for 'stem' and 'stemming' being equivalent) for
de-accenting.

I'm not sure if the query language would need to support the
distinction between accented and unaccented searches. I imagine that
people naturally type the non-accented versions in a search, but I do
wonder about cases like (German) München. Should that be stemmed to
Munchen or Muenchen?
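
To make the two options concrete, here is a small Python sketch. The
names `fold_strip`, `fold_german` and the `GERMAN_TRANSLIT` table are
hypothetical and only for illustration; the table covers just the
common German umlauts and ß, which is an assumption about what a
German-aware folding would do.

```python
import unicodedata

# Hypothetical transliteration table, German conventions only.
GERMAN_TRANSLIT = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold_strip(word):
    """Language-neutral folding: drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_german(word):
    """German-style folding: transliterate umlauts, then strip the rest."""
    # NFC first, so that "u" + combining diaeresis becomes a single "ü".
    composed = unicodedata.normalize("NFC", word)
    replaced = "".join(GERMAN_TRANSLIT.get(ch, ch) for ch in composed)
    return fold_strip(replaced)


if __name__ == "__main__":
    print(fold_strip("München"))   # plain accent stripping
    print(fold_german("München"))  # German transliteration
```

The two functions give different terms for the same word, so whichever
one is used at index time would also have to be used for queries.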

The other thing I don't know is how many people would be happy with just
stripping all accents. That could be done in a gmime filter, as you
suggest. That would be more likely to require changes to the query
language. Offhand I don't know how to transparently de-accent all query
words.
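
The reason the query side matters can be seen in a toy model: if only
unaccented terms are indexed, an accented query word finds nothing
unless it goes through the same folding. The sketch below is a plain
Python dictionary index, not notmuch's query parser or a gmime filter;
`fold`, `build_index` and `search` are hypothetical names, and it says
nothing about how to hook this into the real query language.

```python
import unicodedata


def fold(text):
    """Strip accents and lowercase; used for both indexing and queries."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def build_index(messages):
    """Map each folded word to the set of message ids containing it."""
    index = {}
    for msg_id, body in messages.items():
        for word in fold(body).split():
            index.setdefault(word, set()).add(msg_id)
    return index


def search(index, query):
    """Return ids of messages containing every (folded) query word."""
    results = None
    for word in fold(query).split():
        hits = index.get(word, set())
        results = hits if results is None else results & hits
    return results or set()


if __name__ == "__main__":
    messages = {"m1": "Réunion à München", "m2": "reunion notes"}
    index = build_index(messages)
    print(sorted(search(index, "réunion")))
    print(sorted(search(index, "Munchen")))
```

Because `search` folds the query with the same function as
`build_index`, accented and unaccented spellings of a word match the
same messages; the price is that the two spellings can no longer be
told apart.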

d