From: "Stephen J. Turnbull"
To: Paul Eggert
Cc: Eli Zaretskii, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs Lisp's future
Date: Wed, 15 Oct 2014 12:07:39 +0900
Message-ID: <87mw8ym3no.fsf@uwakimon.sk.tsukuba.ac.jp>
In-Reply-To: <543D8186.9000101@cs.ucla.edu>

Paul Eggert writes:
> On 10/14/2014 12:03 AM, Stephen J. Turnbull wrote:
> > in the Emacs tree "grep -r"
> > is probably just a bug.
>
> Although "grep -r" doesn't conform to POSIX, it is a handy GNU extension,

It's not a question of conformance, it's a question of GIGO. As you
yourself know:

> grep works reasonably well even with text files in the "wrong" encoding,
> and even with non-text files. I don't expect grep to match UTF-8
> patterns to the corresponding EUC-JP text, because I know it doesn't
> translate.
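
The "doesn't translate" point in the quoted text can be seen at the byte level. The following is only a minimal sketch in Python, not how grep or Emacs is implemented: `bytes_match` is a hypothetical helper that imitates a byte-oriented search, and the sample strings are made up for illustration.

```python
def bytes_match(pattern: str, text: str, pattern_enc: str, text_enc: str) -> bool:
    """Return True if the encoded pattern occurs byte-for-byte in the encoded text.

    This imitates a search tool that compares raw bytes and never
    converts between encodings (an assumption made for illustration).
    """
    return pattern.encode(pattern_enc) in text.encode(text_enc)


word = "しまった"
line = "あ、しまった!"

# Pattern and file in the same encoding: the byte sequences line up.
same = bytes_match(word, line, "utf-8", "utf-8")

# Pattern typed in a UTF-8 terminal, file stored as EUC-JP: no match,
# even though both contain the same characters.
mixed = bytes_match(word, line, "utf-8", "euc_jp")

print(same, mixed)
```

The search only succeeds when the pattern and the file happen to share an encoding, which is the situation a user in a mixed-encoding environment cannot take for granted.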
Oh, so you intentionally chose an example where you know it works, and
published that on a public mailing list, without warning the kids not
to try it at home? Do you realize that although all Japanese computer
users occasionally experience mojibake, only a few understand the
mechanism and its implications for "simple" operations like grep? I
suppose that goes in spades for the Chinese. Consider searching for
元気 to find HELLO, "knowing" that Emacs uses the UTF-8 encoding!

> Emacs's M-x grep command supports this usage well, and I don't see how
> it would be an improvement to call this usage a "bug" or for the Emacs
> (or grep) default to insist on strict coding correctness here.

Ah, so you've never lived anywhere but Kansas, Dorothy? There are 1.5
billion[1] Asians who disagree that "grep -r しまった" is
well-supported by Emacs or grep in an environment with multiple
encodings, which is most of them (except where they've consciously
instituted a program of converting legacy documents to a common
encoding).

That's why the "Japanese patch" is also "the patch that would not
die". But that patch is not in any mainline program that I know of,
because accurate auto-detection requires knowledge of the target
language so it doesn't generalize (the "Japanese patch" assumes that
the language is Japanese, so it must be facing ISO-2022-JP, Shift-JIS,
or EUC-JP, and relatively recent versions added UTF-8 and BOM
detection to that). The patch is not able to distinguish EUC-JP from
EUC-CN, for example, in typical use where the designation of character
sets to registers is implicit. (Distinguishing Shift-JIS from Big5 is
highly but not 100% reliable, and of course distinguishing the
language variants of ISO-2022-7 is trivial because the control
sequences specify character sets to be installed in the GL register.)

> Eli is correct that UTF-8 is the encoding typically used for text
> in the Emacs source code.
> For more about this, please see "Source
> file encoding" in admin/notes/unicode.

XEmacs made that decision in 1998 (only using ISO-2022-JP). I know
how this works. The only difference between us is that I live in
Tsukuba, and I've spoken to Handa and Tomita inter alia about these
issues over beers (in Japanese as well as in English), and I've read
the extremist anti-Unicode tirades (in Japanese). I don't know *why*
Dr. Handa sides with those maniacs (they claim that JIS incorporates a
mystic Yamato-damashii = "authentic Japanese spirit") although I
believe it's out of a genuine desire to support multiculturalism (via
his specialty of developing multilingual software).

However, like the Japanese patch, detecting culture and choosing font
for the same repertoire via encoding is a limited technique. It only
works well for Han-using languages. For example, the northern
European countries have different notions about positioning of
accents, which is apparently noticeable to non-native speakers with
umlauts. I suspect (though I haven't asked and don't have time to
search the library for worldwide newspapers) that the various
English-speaking cultures, the French, the Spanish, the Italians, and
the Germans have different notions of what constitutes readable or
beautiful typography -- it's definitely the case that the ASCII
characters in Japanese fonts "look Japanese" (to me, anyway). But
good luck choosing fonts based on distinguishing ISO-8859-1 from
ISO-8859-1! :-)

Dr. Handa's approach to multiculturalism, then, is fundamentally
different from that of the engineers and scholars who have evolved
Unicode (more precisely, universal coded character sets and the
related encoding mechanisms) over the last 30 years or so, not to
forget the W3C which has concluded that (as long as conventional
glyphs are available for the character repertoire) font choice is
purely a presentation issue, and should be handled by markup.
Unicode has even deprecated the use of "language tag" characters.
They do remain in the repertoire, so could be used to deal with the
issues we are discussing.

    http://www.unicode.org/faq/languagetagging.html
    http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf#G26419

Note that the language tags are isomorphic to control sequences as
used in ISO 2022 (except that being encoded in a block disjoint from
graphic characters, they're harder to screw up), so they introduce no
text handling issues for Emacs not already present in encodings using
ISO 2022 extension techniques.

So there you have it. There is *no barrier* to converting *all* files
to conformant UTF-8, except a couple hours' hacking to make
`help-with-tutorial' and `view-hello-file' recognize language
tags.[2] It might be preferable to use a different approach, more
conformant to the Unicode/W3C party line, though.

Thank you for your persistence. This discussion will greatly inform
my future work in XEmacs.

(I'm done discussing the issue for Emacs, because I don't expect
Dr. Handa -- who is more expert than I -- to change his approach after
all these years. This is all just IMHO FWIW YMMV -- and I suspect
Dr. Handa counts his "mileage" in kilometers. ;-)

Footnotes:
[1] I don't know about Indic languages. I'm under the impression
that these days they almost universally use Unicode in preference to
ISCII and such-like, so they may not have the issue. If that is
incorrect, then you can make that 2.5 billion Asians.

[2] Note that the limitation of the hack to those functions only is
consistent with the Unicode-recommended usage of language tags.
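
For concreteness, here is roughly what recognizing language tags might involve. This is a hedged sketch in Python rather than Emacs Lisp, and it is not a proposal for how `help-with-tutorial' or `view-hello-file' would actually be changed. It relies only on the layout of the Unicode tag block (U+E0001 LANGUAGE TAG, tag characters mirroring printable ASCII at offset U+E0000, U+E007F CANCEL TAG); the function names `make_language_tag` and `split_tagged` are hypothetical.

```python
from typing import List, Optional, Tuple

LANGUAGE_TAG = 0xE0001                    # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000                        # tag characters mirror ASCII at this offset
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E    # TAG SPACE .. TAG TILDE
CANCEL_TAG = 0xE007F                      # U+E007F CANCEL TAG


def make_language_tag(lang: str) -> str:
    """Encode an ASCII language identifier such as "ja" as tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language identifier must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def split_tagged(text: str) -> List[Tuple[Optional[str], str]]:
    """Split text into (language, run) pairs, dropping the tag characters."""
    runs: List[Tuple[Optional[str], str]] = []
    lang: Optional[str] = None
    buf: List[str] = []

    def flush() -> None:
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    i, n = 0, len(text)
    while i < n:
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            flush()                       # close the run under the old language
            i += 1
            tag = []
            while i < n and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                tag.append(chr(ord(text[i]) - TAG_BASE))
                i += 1
            lang = "".join(tag) or None
        elif cp == CANCEL_TAG:
            flush()
            lang = None                   # back to "no language specified"
            i += 1
        else:
            buf.append(text[i])
            i += 1
    flush()
    return runs


sample = (make_language_tag("ja") + "元気"
          + make_language_tag("zh") + "元气"
          + chr(CANCEL_TAG) + "hello")
print(split_tagged(sample))
```

A display routine could then pick a font per run from the language half of each pair, which is the same job the ISO 2022 control sequences do today.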