From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sun, 27 Sep 2015 10:27:58 +0300
Message-ID: <83pp14fhj5.fsf@gnu.org>
In-reply-to: <560700E1.4010403@cs.ucla.edu>
To: Paul Eggert
Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org

> Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert
> Date: Sat, 26 Sep 2015 13:32:33 -0700
>
> Eli Zaretskii wrote:
> > The relevant statistics for Emacs is of source files, not of HTML
> > pages.
>
> Sure, and source files are how this thread got started: nowadays in GNU projects
> they're typically UTF-8 regardless of system locale settings, and Emacs should
> be better about supporting this typical situation. UTF-8 is common partly
> because source files are shared widely via the Internet, on sites like Savannah.
>
> The days of lonely hackers writing code in their own private Shift-JIS
> directories are largely over. Of course Emacs can still support such users, but
> the default should be tailored to what's more typical nowadays.

Emacs supports the typical situation quite well already, definitely so
in a typical (i.e. UTF-8) locale.
The issue at hand is not how to support the typical situation; it's
whether that typical situation is the _only_ situation that matters,
so much so that we can ignore the locale-derived defaults.

In any case, I said we needed _statistics_, i.e. numbers, not just
impressions and opinions. I don't know how to find a representative
set of C sources, not even for European locales. I looked at the C
files of GNU projects from recent years on my main development system,
which is probably not very representative. There are more than
142,000 C files there. Using the 'file' utility, I found that about
1.8% of them are UTF-8 encoded and about 0.2% are ISO-8859 encoded
(the vast majority are US-ASCII, of course). That's still more than
250 ISO-8859 encoded files.

I've also looked at the *.po files in the latest releases of GNU Make,
Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
such files still use non-UTF-8 encodings. I see similar figures for
the txi-*.tex files that came with Texinfo 6.0. Presumably, that
follows the default conventions of the respective locales.

So, while I agree with you that UTF-8 encoded files are the majority
among non-ASCII files (and Emacs development aligns itself with that
fact very well), the non-UTF-8 minority, even in the POSIX world, is
still significant enough that we cannot possibly ignore it.
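
For anyone who wants to repeat this kind of count on their own source
tree, here is a rough sketch in Python. It is not the procedure used
for the numbers above (those came from the 'file' utility); it only
approximates it by trying to decode each file's bytes. The names
classify_bytes and survey, and the default ".c" suffix, are made up
for this illustration. Everything that is neither ASCII nor valid
UTF-8 is lumped into "other", so it will not separate ISO-8859 from,
say, Shift-JIS the way 'file' attempts to.

```python
import os
from collections import Counter


def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a file's raw bytes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: could be ISO-8859-x, Shift-JIS, and so on.
        # Assumption: this sketch does not try to tell those apart.
        return "other"


def survey(root: str, suffix: str = ".c") -> Counter:
    """Count files under ROOT ending in SUFFIX, grouped by encoding class."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    counts[classify_bytes(f.read())] += 1
            except OSError:
                # Unreadable files are tallied separately, not skipped silently.
                counts["unreadable"] += 1
    return counts
```

Calling survey on a directory returns a Counter keyed by "ascii",
"utf-8", "other", and possibly "unreadable"; dividing each count by
the total gives percentages comparable in spirit, though not in
method, to the ones quoted above. Passing suffix=".po" would do the
same for message catalogs.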