From mboxrd@z Thu Jan 1 00:00:00 1970
From: stephen@xemacs.org
Newsgroups: gmane.emacs.devel
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sun, 27 Sep 2015 15:20:56 +0900
References: <20150921165211.20434.28114@vcs.savannah.gnu.org> <83fv27mt7r.fsf@gnu.org> <83wpvfix7i.fsf@gnu.org> <83fv23hr0z.fsf@gnu.org> <5605CB6B.4000102@cs.ucla.edu> <83twqhhf0g.fsf@gnu.org> <5606AC48.7090801@cs.ucla.edu> <83zj09fbzp.fsf@gnu.org> <5606C140.6090309@cs.ucla.edu> <56077431.7010906@cs.ucla.edu>
To: Paul Eggert
Cc: emacs-devel@gnu.org
In-Reply-To: <56077431.7010906@cs.ucla.edu>

Paul Eggert writes:

> I think your information is out of date.

Rather, I think that yours is superficial. Really, you should listen to those of us who live and work outside of the ASCII hemisphere.

I live and teach in Japan (a stone's throw from ETL, as it happens), and most of the students I supervise are Chinese. I regularly need to access Chinese and Japanese government and corporate data, and retrieve preprints and data (and sometimes code) from the personal pages of other scholars.

Mojibake in the HTML pages is frequent, in both Firefox and Chrome (of course it's almost always easy to guess the actual coded character set in use, but it is still mojibake). A frequent cause is a webserver configured to send "Content-Type: text/html; charset=utf-8" when the page is actually encoded in something else.

> Yes, ten years ago there was a lot of non-UTF-8 out there, but
> nowadays they've largely moved on to UTF-8.

"Beauty is only skin-deep."
The *top* pages, and some whole sites, have moved on, because having beautiful (if mostly useless) top pages is a matter of "face", so they buy new ones from companies with fancy up-to-date web design software every couple of years. Perhaps most recently authored pages are UTF-8. But the data sets themselves are typically flat files, either CSV or plaintext. The explanatory pages, even if in HTML, often haven't been revised in decades. Such useful content is typically in a national standard coded character set rather than Unicode.

And Emacs is hardly limited to the web. In practice, almost all mail I receive from Chinese correspondents (even when it is in English or Japanese) is labelled GB2312, GBK, or GB18030. The great majority of Japanese mail is either Shift JIS or ISO 2022 JP (sometimes with "OEM characters" that even today aren't in Unicode because they're not in JIS).

> Of course one can still find a few web sites using other encodings,
> but like it or not, UTF-8 dominates now.

What's not to like about UTF-8?! I *wish* non-UTF-8 were a matter of information archaeology and Buddhist scholarship! I'm sad to say, it is not: GB variants, Big5, and JIS variants are the *majority* of the non-ASCII data I handle every day in my Emacs. (It's not the "great majority" only because about 30% of the non-ASCII text I handle in Emacs is authored by me, in UTF-8, of course.)

Regards,