From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sun, 27 Sep 2015 11:55:36 +0300
Message-ID: <83fv20fdh3.fsf@gnu.org>
In-reply-to: <5607A758.4020205@cs.ucla.edu>
To: Paul Eggert
Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org

> Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert
> Date: Sun, 27 Sep 2015 01:22:48 -0700
>
> Eli Zaretskii wrote:
> > I've also looked at the *.po files in the latest releases of GNU Make,
> > Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
> > such files still use non-UTF-8 encodings.
>
> Yes, and those files are a pain to look at with Emacs now, since it typically
> misguesses their encodings.  Presumably Emacs should be looking at .po files'
> charset= decorations.

You need to install the po-mode.

But anyway, that's not the issue at hand.  I just used those files as
indicators of preferences of some locales.

> > while I agree with you that UTF-8 encoded files are the majority
> > among non-ASCII files (and Emacs development aligns itself with that
> > fact very well), the non-UTF-8 minority, even in the Posix world, is
> > still significant enough, and we cannot possibly ignore it.
>
> Naturally we cannot ignore it.  All I'm suggesting is that we change the default
> behavior so that it's more UTF-8 friendly, since that's the way the world is
> going.  The old Emacs behavior should still be available, for people who need it.

You use "default" here in a sense that is different from what the Mule
stuff does.  Since Emacs attempts to support i18n, not just l10n, it
cannot ask users to modify their defaults whenever they meet a file
that's decoded incorrectly.  Emacs uses the defaults in this area as
the last resort, when no other information is available in the file
itself or its accompanying meta-data.  That default is already as
friendly to UTF-8 as possible: UTF-8 is used in any locale where
that's the default.  Going further, i.e. preferring UTF-8 in locales
whose preferences are different, will simply bring back the old bugs
and misfeatures of Emacs 20 and 21 which we worked so hard to
eradicate.

IMO, the _only_ sane way forward is to introduce more reliable ways of
detecting the encoding, whether by using some new kinds of meta-data
or by more extensive analysis of the text itself.  (The latter
solution will probably have difficulties with decoding sub-process
output, but it could be very efficient with disk files and large
bodies of text made available to Emacs at once.)

IOW, I don't think we will be able to change our locale-derived
defaults any time soon.  What we can do is minimize the probability of
having to fall back on those defaults.  But this requires that
Someone™ volunteers to revamp our detect_coding_* implementations in
that direction.
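
To make the ordering concrete, here is a rough Python sketch of the
fallback chain described above: meta-data found in the text first,
then analysis of the text itself, and the locale-derived default only
as the last resort.  It is an illustration under stated assumptions,
not how detect_coding_* works.  The function names, the two regular
expressions, the 4096-byte window, and the "valid UTF-8" test standing
in for "more extensive analysis" are all hypothetical.

```python
import codecs
import locale
import re

# Assumed pattern for a one-line "-*- coding: NAME -*-" cookie.
COOKIE = re.compile(rb"-\*-.*?coding:\s*([-\w.]+).*?-\*-")
# Assumed pattern for the charset= field in a .po header entry.
PO_CHARSET = re.compile(rb"Content-Type:[^\n]*charset=([-\w.]+)")


def declared_encoding(data):
    """Return an encoding named by meta-data inside the text, or None."""
    head = data[:4096]  # arbitrary window, chosen for this sketch
    for pattern in (COOKIE, PO_CHARSET):
        match = pattern.search(head)
        if match:
            name = match.group(1).decode("ascii")
            try:
                return codecs.lookup(name).name
            except LookupError:
                # Names Python does not know (e.g. Emacs-only ones) are skipped.
                continue
    return None


def looks_like_utf8(data):
    """True if data contains non-ASCII bytes and decodes cleanly as UTF-8."""
    if data.isascii():
        return False  # pure ASCII tells us nothing about the encoding
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(data, locale_default=None):
    """Return (encoding, reason); the locale default is used only last."""
    declared = declared_encoding(data)
    if declared:
        return declared, "declared"
    if looks_like_utf8(data):
        return "utf-8", "content"
    if locale_default is None:
        locale_default = locale.getpreferredencoding(False)
    return codecs.lookup(locale_default).name, "locale"
```

The more often the first two steps give an answer, the less often the
third one is reached, which is the point of the argument.  A stream of
sub-process output would reach guess_encoding in pieces, so the
content check would see only partial data; that is the difficulty
mentioned above.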