From: Eli Zaretskii
To: Paul Eggert
Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sat, 26 Sep 2015 20:25:19 +0300
Message-ID: <83wpvdf5z4.fsf@gnu.org>
In-reply-to: <5606C140.6090309@cs.ucla.edu>

> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert
> Date: Sat, 26 Sep 2015 09:01:04 -0700
>
> Eli Zaretskii wrote:
> > So you are, in effect, saying that it is incorrect to derive the
> > default encodings from the locale's codeset?
>
> Yes, for Emacs developers.  And come to think of it, for most Emacs users.
> Nowadays in my experience most non-ASCII text files use UTF-8, regardless of
> locale.

Are you sure your experience isn't biased by the fact you mostly work
in UTF-8 locales?

> The old days of having to guess encoding from the locale are passing
> away.  This is partly due to UTF-8 being the encoding of choice for
> HTML and XML, where UTF-8 overtook the older 8-bit encodings in 2008
> and now is by far the dominant encoding.

We already DTRT with XML files, and should be doing TRT with any file
format that includes the specification of the encoding in it.

The problem, IMO, is not only with disk files.
It is also with email messages, output from processes, etc.  E.g., I
routinely get Latin-1 encoded email from people whose platform is
GNU/Linux.  IOW, non-UTF encodings are far from being dead yet.

Using UTF-8 by default is certainly wrong on MS-Windows.

> One way to accommodate the new reality would be to change Emacs so that by
> default the system locale does not affect Emacs's guess of a file's encoding if
> the file's initial sample is valid UTF-8.  Users could set a variable to
> re-enable the old behavior.

The problem with this line of thought is the "initial sample" part --
how far into the file should we look, and how far is far enough?  E.g.,
tips.texi has its first non-ASCII character at character position
25353.  We've been there before, and found this not reliable enough.

Anyway, doesn't "(prefer-coding-system 'utf-8)" already do what you
want us to offer?
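
To illustrate the "initial sample" concern, here is a minimal Python sketch.  It is not how Emacs detects encodings; the function name, the 1024-byte sample size, and the synthetic data are all assumptions made up for the illustration.  The data merely mimics a file whose first non-ASCII byte sits at position 25353 and is Latin-1, not UTF-8.

```python
import codecs


def sample_is_valid_utf8(data: bytes, sample_size: int) -> bool:
    """Return True if the first sample_size bytes look like valid UTF-8.

    Hypothetical helper for illustration only.  An incremental decoder
    is used so that a multibyte sequence cut off by the sample boundary
    is not counted as an error.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data[:sample_size], final=False)
    except UnicodeDecodeError:
        return False
    return True


# Synthetic stand-in for a mostly-ASCII file: 25352 ASCII bytes, then a
# Latin-1 encoded "e with acute accent" (byte 0xE9) at position 25353.
latin1_data = b"a" * 25352 + "\u00e9".encode("latin-1") + b" tail"

# A small sample sees only ASCII, so it wrongly looks like UTF-8.
small_sample_ok = sample_is_valid_utf8(latin1_data, 1024)

# Only a check that reaches past position 25353 sees the invalid byte.
full_check_ok = sample_is_valid_utf8(latin1_data, len(latin1_data))

print(small_sample_ok, full_check_ok)
```

Under these assumptions, any sample shorter than the distance to the first non-ASCII byte cannot tell a Latin-1 file from a UTF-8 one, which is the reliability problem described above.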