From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org
Subject: Re: locales and notmuch
Date: Wed, 19 Jun 2019 09:09:34 -0400
Message-ID: <87a7edzppt.fsf@fifthhorseman.net>
In-Reply-To: <8736ohard7.fsf@tethera.net>

(sorry for the late reply to this thread)

On Thu 2019-02-21 15:11:48 -0400, David Bremner wrote:
> to be unique case-insensitively, so I decided to convert them to lower
> case on input. This turns out to be "fun", if we try to handle things
> other than ASCII.  So one option is to just insist prefixes are ASCII.
>
> Otherwise we could insist they are UTF-8, ignoring the locale. The
> fullest generality (I think) is to first convert from the users locale
> to utf8, as in the attached sample program.

I don't think this discussion fully covers just how "fun" this
conversion is.

Even if we assume UTF-8 in the database (which i think we should),
making something all lower-case is locale-dependent.  The classic
example, iirc, is that in most UTF-8 locales, U+0049 LATIN CAPITAL
LETTER I downcases to U+0069 LATIN SMALL LETTER I, but in tr_TR
(Turkish), it downcases to U+0131 LATIN SMALL LETTER DOTLESS I.  (and
upper-casing U+0069 LATIN SMALL LETTER I in tr_TR yields U+0130 LATIN
CAPITAL LETTER I WITH DOT ABOVE)
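
To make the difference concrete, here is a minimal Python sketch (an
illustration only, not notmuch code).  Python's str.lower() and
str.upper() apply Unicode's default, locale-independent mappings; the
lower_turkish and upper_turkish helpers are hypothetical names that
hand-code just the two Turkish exceptions described above, rather than
relying on a tr_TR locale being installed.

```python
def lower_default(s: str) -> str:
    """Unicode default lower-casing, independent of the process locale."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Hypothetical helper: lower-casing with the Turkish I rules applied.

    U+0130 (capital I with dot above) becomes plain 'i', and
    U+0049 (capital I) becomes U+0131 (dotless i).
    """
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


def upper_turkish(s: str) -> str:
    """Hypothetical helper: upper-casing with the Turkish I rules applied.

    U+0069 (small i) becomes U+0130 (capital I with dot above).
    """
    return s.replace("i", "\u0130").upper()


if __name__ == "__main__":
    for word in ("I", "TITLE"):
        print(repr(lower_default(word)), repr(lower_turkish(word)))
    print(repr(upper_turkish("i")))
```

The same input string produces two different "lower-case" results
depending on which rule set is in force, which is the core of the
problem.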

Similarly, if there's anything that the DB cares about collation for,
that also varies dramatically across UTF-8 locales.

sigh.

I have no problem with asserting that all character strings in the
notmuch database are UTF-8.  That's just the only sane thing to do in
2019.  But if we build any feature into notmuch that makes assumptions
or requirements about upper-casing, lower-casing, or collating strings,
and that feature interacts between the currently-running locale and
whatever locale was used to store data in the database in the past,
and those locales can differ, we may be inflicting some subtle pain on
users.
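
A small sketch of that kind of pain, assuming a toy key/value store
keyed by case-folded prefix (the dict, the store/lookup functions, and
the "ListId" prefix are all made up for illustration, and the Turkish
folding rule is hand-coded instead of taken from a real locale):

```python
from typing import Callable, Dict, Optional

Fold = Callable[[str], str]


def fold_default(s: str) -> str:
    # Unicode default lower-casing, as used in most locales.
    return s.lower()


def fold_turkish(s: str) -> str:
    # Hand-coded stand-in for lower-casing under a Turkish locale.
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


def store(db: Dict[str, str], prefix: str, value: str, fold: Fold) -> None:
    # The key is folded with whatever rule is active at write time.
    db[fold(prefix)] = value


def lookup(db: Dict[str, str], prefix: str, fold: Fold) -> Optional[str]:
    # The key is folded again with whatever rule is active at read time.
    return db.get(fold(prefix))


if __name__ == "__main__":
    db: Dict[str, str] = {}
    store(db, "ListId", "List-Id", fold_turkish)   # written under one rule
    print(lookup(db, "ListId", fold_turkish))      # same rule: found
    print(lookup(db, "ListId", fold_default))      # different rule: missing
```

Nothing fails loudly here; the entry just silently stops being
reachable once the folding rule changes.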

(note that i'm assuming in this discussion that we're *just* talking
about metadata -- notmuch configuration options, explicit xapian terms,
etc, but *not* the indexed text of the messages, which is an entirely
different kettle of fish)

I see two protective approaches that would handle this simply while
being clear about our concerns.  Both methods introduce a clear dependency on some
UTF-8 locale, in the way that we also have clear dependencies on GMime
or Xapian.

 a) assert that all text strings in the notmuch db's metadata are
    C.UTF-8, and enforce this explicitly in the codebase.

or,

 b) upon database initialization, select a UTF-8 locale (probably based
    on the user's locale during "notmuch setup") and store it in the
    database (perhaps reporting and displaying it via a "notmuch config"
    value).  If any locale-dependent function is used against
    in-database metadata while a *different* locale is active in the
    environment, warn that this mismatch is happening, and prefer the
    locale stored in the db.
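
As a rough sketch of what these two guardrails might look like (in
Python for brevity; every name below is hypothetical, and none of this
reflects existing notmuch code or config keys): require_utf8 stands in
for (a) by rejecting metadata that is not valid UTF-8, and
effective_locale stands in for (b) by warning on a mismatch and
preferring the stored locale.

```python
import warnings
from typing import Optional


def require_utf8(raw: bytes) -> str:
    """Approach (a), sketched: accept metadata only if it is valid UTF-8.

    Raises UnicodeDecodeError for anything else, so a bad string is
    rejected at the boundary instead of being stored.
    """
    return raw.decode("utf-8")


def effective_locale(stored: Optional[str], env: str) -> str:
    """Approach (b), sketched: pick the locale for metadata operations.

    `stored` is the locale recorded in the database (None if the
    database is being initialized); `env` is the locale active in the
    current environment.
    """
    if stored is None:
        # First use: adopt the environment's locale; the caller would
        # be responsible for recording it in the database.
        return env
    if stored != env:
        warnings.warn(
            f"environment locale {env!r} differs from database locale "
            f"{stored!r}; using {stored!r} for metadata"
        )
    return stored


if __name__ == "__main__":
    print(require_utf8("caf\u00e9".encode("utf-8")))
    print(effective_locale(None, "tr_TR.UTF-8"))
    print(effective_locale("en_US.UTF-8", "tr_TR.UTF-8"))
```

Comparing locale names as plain strings is a simplification; a real
implementation would need to decide whether names like en_US.UTF-8 and
en_US.utf8 count as the same locale.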

I don't have the capacity to work on this kind of safeguard right now,
but someone who wants to learn more about locales and notmuch could try
to implement it and we could see what happens.  Being explicit about the
concern like this might help to raise the profile of the specific risky
codepaths, which in turn could prompt someone to make a more
sophisticated and useful fix than either of the guardrails described
above.

        --dkg
