From: David Bremner
To: Matt Armstrong, notmuch@notmuchmail.org
Subject: Re: locales and notmuch
References: <8736ohard7.fsf@tethera.net> <87r2c0ap8v.fsf@tethera.net>
Date: Sat, 23 Feb 2019 07:43:58 -0400
Message-ID: <87va1alog1.fsf@tethera.net>
List-Id: "Use and development of the notmuch mail system."

Matt Armstrong writes:

> Notmuch should probably adopt a coherent strategy with respect to
> character set encodings, rather than do something ad-hoc for the
> feature.  Most systems I have worked with normalize to UTF-8 at the
> edges and do all work using that encoding.

You're probably correct. On the other hand, lack of locale handling is
not something that people actually complain about very much.
So if we do decide to "Do the right thing", then I'd probably just
continue ignoring the problem, rather than block working on things
that do annoy people.

> It is an interesting question: what encoding does .notmuch-config use?
> UTF-8?  User's choice?

It's loaded by g_key_file_load_from_data; I suspect that does no
conversion.

> Similarly, what is the encoding of notmuch's
> command line args?

There is no conversion done. In both these cases it probably works
mostly OK for people (at least nobody complained) because user values
are treated as opaque null terminated byte sequences.

> I was just reading https://xapian.org/features and Xapian seems to store
> text in UTF-8.  If this is the case, where is the code that does the
> charset conversions between the email messages and UTF-8?

I'd have to double check the code to be sure, but I suspect this is done
by GMime when parsing the files.

> How about
> between the command line args to UTF-8?

AFAIR, there is no conversion, and search terms are passed straight to
Xapian. This probably doesn't work well for people with non-UTF-8
locales.
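
To make the last point concrete, here is a minimal sketch of what
"normalize to UTF-8 at the edges" could look like for a search term.
It is Python for brevity, not notmuch code (notmuch is written in C),
and the helper name to_utf8 and the sample term are hypothetical. It
only shows that the bytes a Latin-1 locale produces for a term differ
from the UTF-8 bytes an index would hold, and that a conversion at the
boundary reconciles the two.

```python
import codecs


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from source_encoding into UTF-8.

    Hypothetical helper: stands in for a conversion step applied to
    command line arguments before they reach the search backend.
    """
    # Fail early with LookupError if the encoding name is unknown.
    codecs.lookup(source_encoding)
    return raw.decode(source_encoding).encode("utf-8")


# The same search term as typed in a Latin-1 locale and in a UTF-8 locale.
term = "caf\u00e9"
latin1_bytes = term.encode("latin-1")
utf8_bytes = term.encode("utf-8")

# Passed through unconverted, the Latin-1 bytes do not match the UTF-8 form.
print(latin1_bytes == utf8_bytes)

# Converted at the edge, they do.
print(to_utf8(latin1_bytes, "latin-1") == utf8_bytes)
```

In a real program the source encoding would presumably come from the
user's locale rather than being passed in by hand; that choice is left
open here, as it is in the thread.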