unofficial mirror of notmuch@notmuchmail.org
* accented characters
@ 2017-11-12 21:02 Bruno Deremble
  2017-11-13 13:22 ` David Bremner
  0 siblings, 1 reply; 4+ messages in thread
From: Bruno Deremble @ 2017-11-12 21:02 UTC (permalink / raw)
  To: notmuch


Hi,
I am still new to notmuch and keep experimenting with it; it has a lot of
very interesting features.
I realized that searching for "été" and "ete" does not give the same
results, which may be confusing in some situations (for example, when a
sender has an accented name and may or may not sign emails with the
accented form).

A way to handle this could be to index only unaccented words, which
would require adding a filter before the indexing process. I looked at the
code and it seems that this should be handled by gmime?
There are also libraries that are supposed to do that, such as 'unac'.

Is it something that you have been exploring already?

thank you
bruno
 


* Re: accented characters
  2017-11-12 21:02 accented characters Bruno Deremble
@ 2017-11-13 13:22 ` David Bremner
  2017-11-13 14:35   ` Stefano Zacchiroli
  0 siblings, 1 reply; 4+ messages in thread
From: David Bremner @ 2017-11-13 13:22 UTC (permalink / raw)
  To: Bruno Deremble, notmuch; +Cc: Stefano Zacchiroli


Bruno Deremble <bruno.deremble@ens.fr> writes:

> A way to handle this could be to index only unaccented words, which
> would require adding a filter before the indexing process. I looked at the
> code and it seems that this should be handled by gmime?
> There are also libraries that are supposed to do that, such as 'unac'.
>
> Is it something that you have been exploring already?

We have discussed it a bit (with another francophone, in copy ;) ), but
I think no-one got very far.

I guess the ideal case would be to have the possibility of both
accented and accent-free search. That would require adding some more
terms to the index (both accented and unaccented versions). It's not
clear to me yet what kind of performance impact that would have.
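
A minimal sketch of what indexing both forms could look like, written in
Python rather than notmuch's own C/C++, with a toy in-memory index standing
in for Xapian. The names `strip_accents`, `index_terms`, and `build_index`
are hypothetical and are not notmuch or Xapian API:

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    # Decompose to NFD, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(word):
    # Return the accented term plus its unaccented variant.
    # For a word without accents the set holds a single term.
    term = unicodedata.normalize("NFC", word).lower()
    return {term, strip_accents(term)}


def build_index(docs):
    # Toy inverted index: term -> set of document ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            for term in index_terms(word):
                index[term].add(doc_id)
    return index


index = build_index({1: "un bel \u00e9t\u00e9", 2: "ete 2017"})
print(sorted(index["ete"]))            # [1, 2]
print(sorted(index["\u00e9t\u00e9"]))  # [1]
```

With this layout an accented query stays exact while an unaccented one
matches both spellings; the cost is at most one extra term per accented word.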

Xapian already has something called "stemmers" (in xapian-core/languages
in the source tree), which do, among other things, strip accents. Those
are generally targeted at a single language, which I suspect is not
very useful for notmuch (even I as a mostly-unilingual person have a
fair amount of English, French, and German in my mailstore). Nonetheless
a custom stemmer might be the right way to go, since that step is
happening anyway.  Or perhaps people would be happy enough with being
able to set the stemmer (currently it is hardcoded to English). That
would be a relatively easy change to notmuch, but I don't know how many
people would find it a good tradeoff to lose English stemming
(i.e. searches for 'stem' and 'stemming' being equivalent) for
de-accenting.

I'm not sure if the query language would need to support the
distinction between accented and unaccented searches. I imagine that
people naturally type the non-accented versions in a search, but I do
wonder about cases like (German) München. Should that be stemmed to
Munchen or Muenchen ?
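
To make the two candidate foldings concrete, here is a small Python sketch.
`strip_accents` relies on Unicode decomposition from the standard library;
`GERMAN_TRANSLIT` and `german_fold` are hypothetical names for a hand-written
German-specific table, not something Xapian or notmuch provides:

```python
import unicodedata


def strip_accents(text):
    # Language-neutral folding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical German transliteration table (umlaut -> vowel + "e").
GERMAN_TRANSLIT = str.maketrans({
    "\u00e4": "ae",  # a with diaeresis
    "\u00f6": "oe",  # o with diaeresis
    "\u00fc": "ue",  # u with diaeresis
    "\u00c4": "Ae",
    "\u00d6": "Oe",
    "\u00dc": "Ue",
})


def german_fold(text):
    # Normalize to NFC first so each umlaut is a single code point.
    return unicodedata.normalize("NFC", text).translate(GERMAN_TRANSLIT)


word = "M\u00fcnchen"
print(strip_accents(word))  # Munchen
print(german_fold(word))    # Muenchen
```

The two functions disagree on the same input, which is why a single
language-independent rule cannot give both answers; one option would be to
index both variants.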

The other thing I don't know is how many people would be happy with just
stripping all accents. That could be done in a gmime filter, as you
suggest. That would be more likely to require changes to the query
language. Off hand I don't know how to transparently de-accent all query
words.

d







* Re: accented characters
  2017-11-13 13:22 ` David Bremner
@ 2017-11-13 14:35   ` Stefano Zacchiroli
  2017-11-13 17:47     ` David Bremner
  0 siblings, 1 reply; 4+ messages in thread
From: Stefano Zacchiroli @ 2017-11-13 14:35 UTC (permalink / raw)
  To: David Bremner; +Cc: Bruno Deremble, notmuch

On Mon, Nov 13, 2017 at 09:22:36AM -0400, David Bremner wrote:
> The other thing I don't know is how many people would be happy with just
> stripping all accents. That could be done in a gmime filter, as you
> suggest. That would be more likely to require changes to the query
> language. Off hand I don't know how to transparently de-accent all query
> words.

My gut feeling is that removing accents by default from both the terms
in the index and user queries would go a long way in addressing this
problem. Especially so if it's a boolean option in notmuch config (which
defaults to stripping accents).
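
As a rough sketch of that idea in Python: the same folding function is
applied to indexed words and to the query, controlled by one boolean. The
`strip_accents` flag and the `fold` and `matches` names are hypothetical;
notmuch has no such option that I know of:

```python
import unicodedata


def fold(term, strip_accents=True):
    # Case-fold, and optionally drop accents via NFD decomposition.
    term = unicodedata.normalize("NFC", term).casefold()
    if not strip_accents:
        return term
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query, text, strip_accents=True):
    # Fold both sides identically, then test for term membership.
    indexed = {fold(word, strip_accents) for word in text.split()}
    return fold(query, strip_accents) in indexed


print(matches("ete", "un bel \u00e9t\u00e9"))                       # True
print(matches("\u00e9t\u00e9", "ete 2017"))                         # True
print(matches("ete", "un bel \u00e9t\u00e9", strip_accents=False))  # False
```

Because the query goes through the same function as the indexed text, no
change to the query syntax is needed in this toy model.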

As a random example/data point, Chromium does that: when you search for
an unaccented string in a web page, it will find any combination of it with
accents. It is, by far, my best UX experience w.r.t. accents on GNU/Linux.

Unicode has a notion of canonical form that rearranges accented
characters into a sequence of non-accented characters + modifiers
https://en.wikipedia.org/wiki/Unicode_equivalence . A bunch of libraries
use that stuff to normalize-away accents in unicode strings. I'm aware
of a few in Python for instance, but not in C++ (which I believe is what
you'd be interested in).
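
For illustration, this is roughly what that looks like with Python's
standard `unicodedata` module (NFD is the canonical decomposition form).
The helper names `decompose` and `strip_accents` are made up for this sketch:

```python
import unicodedata


def decompose(text):
    # Names of the code points in the NFD (canonically decomposed) form.
    nfd = unicodedata.normalize("NFD", text)
    return [unicodedata.name(ch, hex(ord(ch))) for ch in nfd]


def strip_accents(text):
    # Drop the combining marks that NFD splits off from base letters.
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


print(decompose("\u00e9"))
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
print(strip_accents("\u00e9t\u00e9"))  # ete
```

Letters without a canonical decomposition, such as ø, pass through
unchanged, so this is not a complete transliteration scheme.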

HTH,
-- 
Stefano Zacchiroli . zack@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »


* Re: accented characters
  2017-11-13 14:35   ` Stefano Zacchiroli
@ 2017-11-13 17:47     ` David Bremner
  0 siblings, 0 replies; 4+ messages in thread
From: David Bremner @ 2017-11-13 17:47 UTC (permalink / raw)
  To: Stefano Zacchiroli; +Cc: Bruno Deremble, notmuch

Stefano Zacchiroli <zack@debian.org> writes:

>
> Unicode has a notion of canonical form that rearranges accented
> characters into a sequence of non-accented characters + modifiers
> https://en.wikipedia.org/wiki/Unicode_equivalence . A bunch of libraries
> use that stuff to normalize-away accents in unicode strings. I'm aware
> of a few in Python for instance, but not in C++ (which I believe is what
> you'd be interested in).
>

Apropos, Rob Browning started looking at canonicalization using glib in

        id:1440951676-17286-1-git-send-email-rlb@defaultvalue.org
        http://article.gmane.org/gmane.mail.notmuch.general/21004
