From: Paul Eggert
To: stephen@xemacs.org
Cc: emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sun, 27 Sep 2015 01:34:39 -0700
Organization: UCLA Computer Science Department
Message-ID: <5607AA1F.4030508@cs.ucla.edu>

stephen@xemacs.org wrote:

> Perhaps most
> recently authored pages are UTF-8. But the data sets themselves are
> typically flat files, either CSV or plaintext. The explanatory pages,
> even if in HTML, often haven't been revised in decades.

Yes, that's pretty much my experience.
In Japan older stuff is mostly Shift-JIS, EUC, or maybe ISO-2022-JP. New stuff is mostly UTF-8. People using old email software send old encodings because that's what they've been doing for decades. Normally it works, because the email envelope tells you the encoding. But sometimes people screw up and you get mojibake.

But this situation is not an argument for having the locale determine encoding when visiting random imported files that lack envelopes. For such files, it often doesn't work to set LC_ALL=ja_JP.ujis and expect Emacs to get things right. (This is one of the things that Eli has noted multiple times, and he's right.)

Of course if one is working in a conservative Japanese government ministry that standardized on Shift-JIS back in 1992 and hasn't changed since then, then things are different, and Emacs should support such users. But typical Emacs users are not in this situation, and the Emacs default should cater to the more-typical case today.

To narrow things down a bit I briefly looked for .jp websites that talk about Emacs. Google reported the following first page's worth of hits (I list year of composition, encoding, and URL). Again, the new stuff is mostly UTF-8, and the old stuff is a mishmash, so it's another data point suggesting that defaulting to UTF-8 would not be such a bad thing for editing today's text.

2002 Shift-JIS   http://www.rsch.tuis.ac.jp/~ohmi/literacy/emacs/quick.html
2008 ISO-2022-JP http://www.wakayama-u.ac.jp/~takehiko/webprg/03.html
2015 EUC-JP      http://d.hatena.ne.jp/tarao/20150221/1424518030
2015 UTF-8       http://uguisu.skr.jp/Windows/emacs.html
2015 UTF-8       http://www.amazon.co.jp/Emacs%E5%AE%9F%E8%B7%B5%E5%85%A5%E9%96%80-%EF%BD%9E%E6%80%9D%E8%80%83%E3%82%92%E7%9B%B4%E6%84%9F%E7%9A%84%E3%81%AB%E3%82%B3%E3%83%BC%E3%83%89%E5%8C%96%E3%81%97%E3%80%81%E9%96%8B%E7%99%BA%E3%82%92%E5%8A%A0%E9%80%9F%E3%81%99%E3%82%8B-WEB-DB-PRESS-plus/dp/4774150029
2015 UTF-8       http://www.sigasi.jp/better-emacs-vhdl-mode
2006 Shift-JIS   http://www.math.kobe-u.ac.jp/icms2006/icms2006-video/slides/grayson/share/doc/Macaulay2/Macaulay2/html/_teaching_spemacs_sphow_spto_spfind_sp__M2.html
2015 UTF-8       https://osdn.jp/projects/gnupack/
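
To make the point about envelope-less files concrete, here is a minimal Python sketch (not Emacs code, and not how Emacs actually detects encodings). It contrasts decoding with one fixed, locale-chosen encoding against trying strict UTF-8 first and falling back to legacy Japanese encodings. The function names, the sample string, and the fallback order are all assumptions made for illustration.

```python
# Illustrative sketch only: the helpers and fallback order below are
# hypothetical, chosen to show why a locale-driven default can misfire.

SAMPLE = "日本語のテキスト"  # sample text, an assumption for the demo


def decode_with_locale_guess(data: bytes, locale_encoding: str) -> str:
    """Decode the way a locale-driven default would: one fixed encoding.

    Undecodable bytes are replaced with U+FFFD instead of raising,
    which is roughly how mojibake ends up on screen.
    """
    return data.decode(locale_encoding, errors="replace")


def decode_utf8_first(data: bytes, fallbacks=("euc_jp", "shift_jis")):
    """Try strict UTF-8 first, then each fallback encoding in order.

    Returns (text, encoding_name). The fallback order is a guess;
    legacy encodings can be ambiguous with one another.
    """
    for encoding in ("utf-8", *fallbacks):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to lossy UTF-8.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


if __name__ == "__main__":
    for name in ("utf-8", "shift_jis", "euc_jp"):
        raw = SAMPLE.encode(name)
        guessed = decode_with_locale_guess(raw, "euc_jp")
        text, detected = decode_utf8_first(raw)
        print(f"{name:10} locale guess ok: {guessed == SAMPLE}  "
              f"utf8-first detected: {detected}")
```

In this sketch the fixed EUC-JP guess only recovers the EUC-JP input, while the UTF-8-first approach recovers all three samples. Strict UTF-8 validation works reasonably well as a first test because text in the legacy encodings seldom happens to be valid UTF-8; telling the legacy encodings apart from one another is less reliable, so the fallback step should be read as a heuristic.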