From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: eight-bit char handling in emacs-unicode
Date: Wed, 26 Nov 2003 09:07:47 +0900 (JST)
Message-ID: <200311260007.JAA26617@etlken.m17n.org>
References: <200311250107.KAA24646@etlken.m17n.org>
To: monnier@IRO.UMontreal.CA
Cc: jas@extundo.com, emacs-devel@gnu.org
In-reply-to: (message from Stefan Monnier on 25 Nov 2003 10:43:05 -0500)

Stefan Monnier writes:

>> It seems that you keep of saying that "A does B, thus it's
>> nonsense".  But, I'm arguing that "A does C".

> Well, the thing is: I still don't understand what is C.
> From what I understand, you say that C is "a conversion from multibyte
> to a sequence of code-points",

Yes, that's what I said.

> but since the output is a unibyte string,
> that restrict it to cases where the code-points can be encoded in 8 bits,
> thus it doesn't sound very generic

Yes.  But I thought whether it is generic or not is not the
point here.

> and I don't see any application for it
> (nor do I see any practical difference with using encode-coding-string
> since the output AFAIK would be the same).

My examples show that we can't use encode-coding-string.
How can we use encode-coding-string without knowing what
coding system to use?  I haven't heard your answer yet.

>> It doesn't make sense because you treat the result as "a
>> unibyte string encoded in Latin-1".

>> It makes sense if you treat the result as "a unibyte string
>> in which each byte represents a sequence of Unicode
>> code-points", doesn't it?

> But each byte can only represent the 0-255 subset of unicode code-points, in
> which case this is equivalent (practically speaking) to latin-1, isn't it ?

Yes.  And that covers all the characters the user uses in
this case.

>>> It'd make sense if the environment said "latin-1 when you can,
>>> utf-8 otherwise" or something like that, but then we would use
>>> encode-coding-string anyway.

>> It's itself nonsense to have such a coding system.

> I was not thinking of a coding-system, but just some encoding job,
> such as what is done when saving a buffer (where my .emacs does exactly
> that: try latin-1 first and utf-8 if that fails).

Ah, I see.  But, my understanding is that
string-make-unibyte/multibyte are designed not to change the
number of characters, so that the difference between unibyte
and multibyte is transparent in Lisp.  That restriction
leads to cases where non-supported characters are handled
incorrectly.  But, I think Richard's design policy was that
incorrect handling of non-supported characters is better
than a possibly more disastrous error caused by a change in
the number of characters.

>> Do you agree with having string-make-unibyte if it signals an error on
>> non-Latin-1 characters?

> Of course: that's pretty much what I suggested: make-string-unibyte only
> accepts multibyte chars that correspond to "bytes".

I agree with that.  But, it just changes the behaviour of
the function in the error case.  It doesn't change the
concept of what it does.

>>> I just don't know of a concrete case where it makes sense to use
>>> string-make-unibyte.

>> I'll paraphrase my previous example as this:
>>   It is perfectly possible to live in such an environment
>>   where only the characters U+0000..U+00FF of Unicode is
>>   used but only the coding system utf-8 is used.
>> But, I don't claim that the above is a realistic case.
>> Another non-realistic but concrete case is:
>>   Use only the charset iso-8859-5 and the encoding CTEXT.

> I don't see any use of string-make-unibyte in your two examples.

Again, I'd like to ask how to use encode-coding-string
without knowing the proper coding-system in each case.

> And "having string-make-unibyte if it signals an error on non-Latin-1
> characters" means that the second example can't be used any more.

In the second case, of course the "supported characters" are
those included in the charset iso-8859-5, and
string-make-unibyte should accept them.  Again, the result
is the same as encoding with the coding system iso-8859-5,
but we only know about the coding system CTEXT here.

---
Ken'ichi HANDA
handa@m17n.org
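
The first example in the thread (only U+0000..U+00FF in use, but utf-8 as the only known coding system) can be illustrated with a small sketch. This is a Python analogy, not Emacs Lisp and not how string-make-unibyte is implemented; the helper name `code_points_to_bytes` is hypothetical. It shows a conversion that maps each character to one byte holding its code point, needs no coding-system name, keeps the number of characters unchanged, and signals an error for anything that does not fit in a byte.

```python
def code_points_to_bytes(text: str) -> bytes:
    """Map each character to one byte holding its code point.

    Hypothetical helper for illustration only. Raises ValueError for any
    character above U+00FF, which corresponds to the "signal an error on
    non-supported characters" variant discussed in the thread.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            raise ValueError(f"U+{cp:04X} does not fit in one byte")
        out.append(cp)
    return bytes(out)


sample = "caf\u00e9"  # four characters, all within U+0000..U+00FF

raw = code_points_to_bytes(sample)

# Same bytes as an explicit Latin-1 encode, but no codec name was supplied.
same_as_latin1 = raw == sample.encode("latin-1")

# One byte per character, so the length is unchanged.
length_preserved = len(raw) == len(sample)

# The only codec the environment knows about (utf-8) spends two bytes
# on U+00E9, so its output has a different length.
utf8_length = len(sample.encode("utf-8"))
```

The result happens to equal the Latin-1 encoding, which is the equivalence Stefan points out, while the caller never had to know which coding system to name, which is the point argued in the replies above.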