From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 07E376DE12B0
 for ; Wed, 19 Jun 2019 06:10:03 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.301
X-Spam-Level:
X-Spam-Status: No, score=-0.301 tagged_above=-999 required=5
 tests=[AWL=-0.100, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
 DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001,
 SPF_PASS=-0.001] autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id frfdyAFCe8Y0 for ; Wed, 19 Jun 2019 06:09:58 -0700 (PDT)
Received: from che.mayfirst.org (che.mayfirst.org [162.247.75.118])
 by arlo.cworth.org (Postfix) with ESMTPS id 2638A6DE1060
 for ; Wed, 19 Jun 2019 06:09:58 -0700 (PDT)
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/simple;
 d=fifthhorseman.net; i=@fifthhorseman.net; q=dns/txt; s=2019;
 t=1560949796; h=from : to : subject : in-reply-to : references :
 date : message-id : mime-version : content-type : from;
 bh=ZPDLBLey6dcBR1oq+sDXQhHjkWWIyGo+ryR10+f/It8=;
 b=5fNcxxaB+45Xv6hUkealYTzvylCngwmkOCQ4XfdXdzF2TFPMFXEnJNxf
 MySnQfoyMX8Uc3En4WsW8blrg8hICA==
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=fifthhorseman.net; i=@fifthhorseman.net; q=dns/txt; s=2019rsa;
 t=1560949796; h=from : to : subject : in-reply-to : references :
 date : message-id : mime-version : content-type : from;
 bh=ZPDLBLey6dcBR1oq+sDXQhHjkWWIyGo+ryR10+f/It8=;
 b=ZrT8q2xWOGfFP/bz9ePIT05oCIji1frthrmVjwCViXHZDhGZzcWdTBXL
 Qj6K0KziwQKnPk2tQJLB2lNAAYfsemsP194pAU5Xut9RbKO/3ehuCiX+k+
 NT2F+cRpujZhTCvqNNMaQ3SBXTqW+5FMQ+cQ3twpTcqHv+drYjtlfjyle1
 4pg4WBfrdX7qpCJQlDFUD6kRlYdM3TvN5bBbaLzudpgxWseJZ4YBbkU/It
 AM0iaUiYAReFgRkKsJf2MFCVWqUIrYndRY7msOZ1ZwyVoKGktbatJ12c/z
 feWFAmUVDbbn4wvcIr5wc4nlZX7QTKo+ww+0UO6ONlyoPwid1MYAXw==
Received: from fifthhorseman.net (unknown
 [IPv6:2001:470:1f07:60d:5cf3:eff:fee2:4b88])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by che.mayfirst.org (Postfix) with ESMTPSA id A3468F99D;
 Wed, 19 Jun 2019 09:09:54 -0400 (EDT)
Received: by fifthhorseman.net (Postfix, from userid 1000)
 id 1948D202A7; Wed, 19 Jun 2019 09:09:35 -0400 (EDT)
From: Daniel Kahn Gillmor
To: David Bremner , notmuch@notmuchmail.org
Subject: Re: locales and notmuch
In-Reply-To: <8736ohard7.fsf@tethera.net>
References: <8736ohard7.fsf@tethera.net>
Autocrypt: addr=dkg@fifthhorseman.net; prefer-encrypt=mutual; keydata=
 mDMEXEK/AhYJKwYBBAHaRw8BAQdAr/gSROcn+6m8ijTN0DV9AahoHGafy52RRkhCZVwxhEe0K0Rh
 bmllbCBLYWhuIEdpbGxtb3IgPGRrZ0BmaWZ0aGhvcnNlbWFuLm5ldD6ImQQTFggAQQIbAQUJA8Jn
 AAULCQgHAgYVCgkICwIEFgIDAQIeAQIXgBYhBMS8Lds4zOlkhevpwvIGkReQOOXGBQJcQsbzAhkB
 AAoJEPIGkReQOOXG4fkBAO1joRxqAZY57PjdzGieXLpluk9RkWa3ufkt3YUVEpH/AP9c+pgIxtyW
 +FwMQRjlqljuj8amdN4zuEqaCy4hhz/1DbgzBFxCv4sWCSsGAQQB2kcPAQEHQERSZxSPmgtdw6nN
 u7uxY7bzb9TnPrGAOp9kClBLRwGfiPUEGBYIACYWIQTEvC3bOMzpZIXr6cLyBpEXkDjlxgUCXEK/
 iwIbAgUJAeEzgACBCRDyBpEXkDjlxnYgBBkWCAAdFiEEyQ5tNiAKG5IqFQnndhgZZSmuX/gFAlxC
 v4sACgkQdhgZZSmuX/iVWgD/fCU4ONzgy8w8UCHGmrmIZfDvdhg512NIBfx+Mz9ls5kA/Rq97vz4
 z48MFuBdCuu0W/fVqVjnY7LN5n+CQJwGC0MIA7QA/RyY7Sz2gFIOcrns0RpoHr+3WI+won3xCD8+
 sVXSHZvCAP98HCjDnw/b0lGuCR7coTXKLIM44/LFWgXAdZjm1wjODbg4BFxCv50SCisGAQQBl1UB
 BQEBB0BG4iXnHX/fs35NWKMWQTQoRI7oiAUt0wJHFFJbomxXbAMBCAeIfgQYFggAJhYhBMS8Lds4
 zOlkhevpwvIGkReQOOXGBQJcQr+dAhsMBQkB4TOAAAoJEPIGkReQOOXGe/cBAPlek5d9xzcXUn/D
 kY6jKmxe26CTws3ZkbK6Aa5Ey/qKAP0VuPQSCRxA7RKfcB/XrEphfUFkraL06Xn/xGwJ+D0hCw==
Date: Wed, 19 Jun 2019 09:09:34 -0400
Message-ID: <87a7edzppt.fsf@fifthhorseman.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
 micalg=pgp-sha512; protocol="application/pgp-signature"
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 19 Jun 2019 13:10:03 -0000

--=-=-=
Content-Type: text/plain

(sorry for the late reply to this thread)

On Thu 2019-02-21 15:11:48 -0400, David Bremner wrote:
> to be unique case-insensitively, so I decided to convert them to lower
> case on input. This turns out to be "fun", if we try to handle things
> other than ASCII. So one option is to just insist prefixes are ASCII.
>
> Otherwise we could insist they are UTF-8, ignoring the locale. The
> fullest generality (I think) is to first convert from the users locale
> to utf8, as in the attached sample program.

I don't think this discussion fully covers just how "fun" this
conversion is. Even if we assume UTF-8 in the database (which i think
we should), making something all lower-case is locale-dependent.

The classic example, iirc, is that in most UTF-8 locales, U+0049 LATIN
CAPITAL LETTER I downcases to U+0069 LATIN SMALL LETTER I, but in tr_TR
(Turkish), it downcases to U+0131 LATIN SMALL LETTER DOTLESS I. (and
upper-casing U+0069 LATIN SMALL LETTER I in tr_TR yields U+0130 LATIN
CAPITAL LETTER I WITH DOT ABOVE)

Similarly, if there's anything that the DB cares about collation for,
that also varies dramatically across UTF-8 locales. sigh.

I have no problem with asserting that all character strings in the
notmuch database are UTF-8. That's just the only sane thing to do in
2019.

But if we build any feature into notmuch that makes assumptions or
requirements about upper-casing, lower-casing, or collating strings,
and that feature interacts between the currently-running locale and
whatever locale was used to store data in the database in the past,
and those locales can differ, we may be inflicting some subtle pain on
users.
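
To make the hazard above concrete, here is a minimal Python sketch. It
is only an illustration, not notmuch code: the three helper names are
hypothetical, and lower_turkish() is a hand-rolled approximation of
tr_TR lowercasing (Python's str.lower() ignores the process locale), so
it stands in for what a locale-aware C routine might do under a Turkish
locale. The point is that a key stored under one convention no longer
matches when it is lowercased under another, whereas an ASCII-only
mapping gives the same answer everywhere.

```python
def lower_default(s: str) -> str:
    """Locale-independent Unicode lowercasing, as str.lower() does."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Approximate tr_TR lowercasing (assumption: only the two Turkish
    special cases matter here): U+0130 -> U+0069 and U+0049 -> U+0131."""
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


def ascii_lower(s: str) -> str:
    """Lowercase only A-Z and leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


if __name__ == "__main__":
    stored = lower_default("TITLE")   # key written under a non-Turkish locale
    lookup = lower_turkish("TITLE")   # same input lowercased the Turkish way
    print(stored == lookup)           # False: 'title' vs 't\u0131tle'
    print(ascii_lower("TITLE") == stored)  # True
```

Restricting prefixes to ASCII, as in the quoted suggestion, corresponds
to ascii_lower(): it never consults the locale, at the cost of leaving
non-ASCII letters unfolded.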

(note that i'm assuming in this discussion that we're *just* talking
about metadata -- notmuch configuration options, explicit xapian terms,
etc, but *not* the indexed text of the messages, which is an entirely
different kettle of fish)

I see two protective approaches for handling this simply yet being
clear about our concerns. Both methods introduce a clear dependency on
some UTF-8 locale, in the way that we also have clear dependencies on
GMime or Xapian.

 a) assert that all text strings in the notmuch db's metadata are
    C.UTF-8, and enforce this explicitly in the codebase.

or,

 b) upon database initialization, select a UTF-8 locale (probably based
    on the user's locale during "notmuch setup") and store it in the
    database (perhaps reporting and displaying it via a "notmuch
    config" value). If any locale-dependent function is used against
    in-database metadata while a *different* locale is active in the
    environment, warn that this mismatch is happening, and prefer the
    locale stored in the db.

I don't have the capacity to work on this kind of safeguard right now,
but someone who wants to learn more about locales and notmuch could try
to implement it and we could see what happens.

Being explicit about the concern like this might help to raise the
profile of the specific risky codepaths, which in turn could prompt
someone to make a more sophisticated and useful fix than either of the
guardrails described above.

        --dkg

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQTJDm02IAobkioVCed2GBllKa5f+AUCXQo0DgAKCRB2GBllKa5f
+FRiAP9wcoWvFM6zN8KEwhDffiFu8tFjL9gql0ZCokURt2CWEgEA4kraNv5VbAoZ
bMBgIKsTOUkKKe5Qp/FeU2xXkGSFdwU=
=+9AM
-----END PGP SIGNATURE-----
--=-=-=--