From: "Stefan Monnier"
Newsgroups: gmane.emacs.devel
Subject: Re: unibyte<->multibyte conversion [Re: Emacs-diffs Digest, Vol 2, Issue 28]
Date: Tue, 21 Jan 2003 12:44:50 -0500
Message-ID: <200301211744.h0LHioH16311@rum.cs.yale.edu>
References: <3405-Sat18Jan2003154003+0200-eliz@is.elta.co.il>
 <200301200229.LAA16287@etlken.m17n.org>
 <6480-Mon20Jan2003214849+0200-eliz@is.elta.co.il>
 <200301210010.JAA17551@etlken.m17n.org>
 <200301210045.h0L0jS812745@rum.cs.yale.edu>
 <200301210804.RAA03089@etlken.m17n.org>
To: Kenichi Handa
Cc: emacs-devel@gnu.org

> In article <200301210045.h0L0jS812745@rum.cs.yale.edu>, "Stefan Monnier" writes:
> >> unibyte sequence (hex): 81 81 C0 C0
> >>                         result of conversion      display in multibyte buffer
> >> string-as-multibyte:    9E A1 81 C0 C0            \201À\300
> >> string-make-multibyte:  9E A1 9E A1 81 C0 81 C0   \201\201ÀÀ
> >> string-to-multibyte:    9E A1 9E A1 C0 C0         \201\201\300\300
>
> > I find the terminology and the concepts confusing.
>
> I agree that those names are not that intuitive, but the
> first two were there before I noticed it. :-p
> But, in what sense, the concepts are confusing?

The concept of string-as-multibyte made some sense in Emacs-20 when it
was really "look under the hood: take the same bytes but interpret them
differently".  In Emacs-21, this is not the case any more so I don't
really understand what's the intent behind it other than emacs-mule
decoding (that it might happen to come out of some other decoding step
rather than out of a file is not really relevant, I think).

I think what I find confusing is that the name of those functions
implicitly says "take the string and give me the same one, but just
multibyte instead of unibyte", even though there's no unambiguous way to
have "the same one".
So there has to be a choice of how the conversion between unibyte and
multibyte takes place, but this choice is not clearly described by the
functions' names.

> Please note that decode-coding-string also does eol
> conversion.  Using 'internal-unix, 'default-unix,

Sorry for my sloppiness.

> 'raw-text-unix will make them more equivalent.

This should probably be `no-conversion' (or `binary').  Admittedly, it's
the same, but I think it carries the intent a bit better.

> But, as we now have eight-bit-XXXX, I agree that
> string-as-multibyte is not that useful, string-to-multibyte
> is better.

But they do different things and the difference in names does not
clearly explain the subtle distinction, so I think it's more confusing
than anything else.

> > 2 - there is no `default' coding-system either.  Or maybe
> >     locale-coding-system is this default: if your locale is
> >     latin-1 then that's latin-1.
>
> If one does not do set-language-environment,
> locale-coding-system can be used as `default'.

And otherwise?  The mere fact that I don't know the answer to this
question seems like a good indication that pretty much nobody knows what
`string-make-multibyte' does, so anyone who uses it is most likely using
it wrong.  Luckily, it seems only ps-mule.el uses it (although much more
code uses the underlying nonascii-translation-table functionality).

> > 3 - when called with a `raw-text' coding-system, decode-coding-string
> >     returns a unibyte string, which is obviously not what we want here.
> >     It might make sense for internal operations to return unibyte
> >     strings for the `raw-text' case, but I was really surprised that
> >     decode-coding-string would ever return a unibyte string.
>
> I tend to agree that it is better that decode-coding-string
> always return a multibyte string now.

If it can be fixed, we can recommend (decode-coding-string str 'no-conversion)
rather than introducing a new function string-to-multibyte.
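
To make the comparison concrete, here is a minimal sketch (an
illustration, not code from the thread) that runs both spellings on the
sample bytes 81 81 C0 C0 quoted earlier.  It assumes an Emacs recent
enough to provide `unibyte-string' and `string-to-multibyte'; the
variable names are illustrative only.  No results are stated, because
whether `decode-coding-string' returns a multibyte string for
`no-conversion' is exactly the point in question and may depend on the
Emacs version.

```elisp
;; Sketch only: compare `string-to-multibyte' with decoding through
;; `no-conversion' on the sample bytes 81 81 C0 C0.
;; Assumes `unibyte-string' and `string-to-multibyte' are available.
(let* ((bytes (unibyte-string #x81 #x81 #xC0 #xC0))
       ;; Spelling 1: the proposed dedicated function.
       (via-to-multibyte (string-to-multibyte bytes))
       ;; Spelling 2: an explicit decoding step with a named coding system.
       (via-decode (decode-coding-string bytes 'no-conversion)))
  ;; Return a plist of observations instead of asserting any outcome.
  (list :input-multibyte-p  (multibyte-string-p bytes)
        :to-multibyte-p     (multibyte-string-p via-to-multibyte)
        :decode-multibyte-p (multibyte-string-p via-decode)
        :string=            (string= via-to-multibyte via-decode)))
```

Evaluating the form (for instance with M-: or in *scratch*) shows at a
glance whether the two spellings are interchangeable in a given Emacs.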

> I think string-FOO-multibyte (and also string-FOO-unibyte)
> are conceptually different from decoding (and encoding)
> operations.  It's difficult for me to explain it clearly,
> but I'll try.
>
> Decoding and encoding are interfaces between Emacs and the
> outer world.
>
> Decoding is for converting an external byte sequence
> (i.e. belonging to a world out of Emacs) into Emacs'
> representation.
>
> Encoding is for converting Emacs' representation to a byte
> sequence that is used out of Emacs.

But the `emacs-mule' coding-system is used both inside and outside, and
the same goes for `binary', so the distinction between inside and
outside is not very clear-cut.

I find it more helpful to think in terms of bytes and chars: unibyte
strings are sequences of bytes while multibyte strings are sequences of
chars.  Converting between bytes and chars is the purpose of
coding-systems.  In such a context, string-FOO-multibyte are obviously
just various forms of decoding, but the names don't give a good sense of
which decoding is used.

> And, if one wants to insert a result of encode-coding-string
> in a multibyte buffer (perhaps for some post-processing),
> what should he do?  If we have string-to-multibyte, we can
> do this:
>     (insert (string-to-multibyte
>               (encode-coding-string MULTIBYTE-STRING CODING)))
> If we don't have it, and provided that decode-coding-string
> always returns a multibyte string, we must do:
>     (insert (decode-coding-string
>               (encode-coding-string MULTIBYTE-STRING CODING) 'raw-text-unix))
> Isn't it very funny?

Obviously, I agree with Miles that the second is much clearer
(especially if you replace `raw-text-unix' with `no-conversion'; well, I
prefer `binary' myself, since `no-conversion' is also a misnomer given
that a conversion does take place).

> By the way, I think the culprit of the current problem is
> this Emacs' doctrine:
>     Do unibyte<->multibyte conversion by "MAKE" by default.

Since MAKE uses some kind of "default" related to the current language
environment, I think it's OK, except that it's not clear in what way
it's "related".

But of course, there should simply never be such a thing as "guess what
this unibyte stream translates into".  The coding-system used to decode
unibyte into multibyte should always be "clearly" defined (by the
process's coding-system, the keyboard's coding-system, ...).
I.e. it is simply a bug to insert a unibyte string into a multibyte
buffer (and vice versa).

As for inserting a char between 128 and 256 into a multibyte buffer...
it should ideally always be treated as an eight-bit-foo char, but I
think that making such a change right now would not be wise because
there is still too much code which forgets to decode its bytes into
chars (and instead relies on the MAKE default to turn those chars into
latin-1 chars).

> Although this doctrine surely works for handling unibyte and
> multibyte representation transparently, it makes Elisp
> programmers very very confused.  And it is useful only for
> people whose main charset is single-byte.
>
> I am seriously considering changing it in emacs-unicode.

Might be a good idea for emacs-unicode indeed.


	Stefan