From: Paul Eggert
Newsgroups: gmane.emacs.devel
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sat, 26 Sep 2015 11:53:09 -0700
Organization: UCLA Computer Science Department
Message-ID: <5606E995.2000102@cs.ucla.edu>
In-Reply-To: <878u7trwlb.fsf@fencepost.gnu.org>
To: David Kastrup
Cc: Eli Zaretskii, monnier@iro.umontreal.ca, emacs-devel@gnu.org

David Kastrup wrote:
> How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
> Korean texts? How relevant is your experience?

Hebrew, not so much -- Eli has far more experience with that.
Arabic I was just reading last week (not natively; I use a translator). This week I was reading a lot of Turkish. In all cases I was looking at text prepared by others. In all cases my sources used UTF-8 -- not due to my influence, but because that's what's typical these days.

In my previous job I routinely had to deal with CJK text, and did so with lots of different encodings, including monstrosities such as DBCS-Host that Emacs doesn't even support. So my experience is reasonably good in this area -- better than the average random hacker anyway.

If you go back 20 years, non-UTF-8 encodings such as Shift-JIS and EUC were by far the most popular in Japan. Nowadays? Sure, Shift-JIS and EUC are still used, but they're going downhill. Of the top 20 web sites in Japan (according to Alexa), 18 use UTF-8, one uses Shift-JIS, and one uses EUC on their home pages. In the w3techs survey of world web sites, 85% use UTF-8; the second most-popular encoding, ISO-8859-1, is at only 7.5%, and it's that high only because the old HTML standard made ISO-8859-1 the default.

So in practice, defaulting to UTF-8 is quite a good choice nowadays. Of course if we can get the proper encoding from the document or its envelope we should prefer that, and that should let us deal with web documents and email.
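
The policy in that last paragraph can be sketched roughly as below. This is only an illustration in Python, not Emacs code: the name `pick_encoding`, its arguments, and the simplified `coding:` cookie pattern are all hypothetical, and letting the envelope take precedence over an in-document declaration is an assumption rather than something stated above.

```python
import re
from typing import Optional

# Hypothetical, simplified pattern for an in-document declaration such as
# "-*- coding: euc-jp -*-". Real cookie syntax is richer than this.
_COOKIE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def pick_encoding(data: bytes, envelope_charset: Optional[str] = None) -> str:
    """Return the name of the encoding to try first when decoding `data`.

    Assumed order of preference (not specified in the message):
      1. a charset given by the envelope, e.g. a MIME Content-Type header;
      2. a declaration found in the first two lines of the document;
      3. UTF-8 as the default.
    """
    if envelope_charset:
        return envelope_charset.lower()

    # Only the first two lines are searched for a declaration.
    head = b"\n".join(data.split(b"\n", 2)[:2])
    match = _COOKIE.search(head)
    if match:
        return match.group(1).decode("ascii").lower()

    return "utf-8"
```

For instance, `pick_encoding(b";; -*- coding: euc-jp -*-\n(foo)")` picks up the declared encoding, while undeclared input falls through to UTF-8.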