From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: eight-bit char handling in emacs-unicode
Date: 18 Nov 2003 22:05:39 -0500
Original-To: Kenichi Handa
Cc: emacs-devel@gnu.org, jas@extundo.com
References: <200311130153.KAA04615@etlken.m17n.org>
	<200311130610.PAA04983@etlken.m17n.org>
	<200311130901.SAA05204@etlken.m17n.org>
	<200311140047.JAA06414@etlken.m17n.org>
	<200311180733.QAA13703@etlken.m17n.org>
	<200311190006.JAA14847@etlken.m17n.org>
In-Reply-To: <200311190006.JAA14847@etlken.m17n.org>

> I see. Apart from the design itself, I agree that it's difficult to
> introduce a new type. But, when I discussed with Richard about the
> Character type object a few year ago, he was not that negative provided
> that it gives sure improvement.

Sounds about right to me: we have one free tag that we could use for chars
(and that I currently use to boost the max buffer size from 256MB to 512MB
in my local code). But it needs to pay for itself.

> Then, we can't use make-string-unibyte for the current case
> because, in emacs-unicode, (concat '(?a 192)) returns a
> multibyte string whose second element is A-grave, not an
> eight-bit-char. Am I missing something?

Well, obviously we need to make it accept this case (i.e. accept both the
latin-1 192 and the eight-bit-char 192). I'm sure there'll be other issues.
I haven't had much time to think about it and you're obviously better placed
to foresee potential problems.

>> To do what your string-make-unibyte does you should use
>> `encode-coding-string' where the coding system is passed explicitly.

> Those are conceptually different things (I remember the
> similar discussion we had a while ago).

> encode-coding-string does:
>   char-sequence --CCS-set--> (CCS/codepoint-pair)-sequence
>                 --CES--> encoded-byte-sequence
> string-make-unibyte does:
>   char-sequence --CCS--> code-point-sequence
>                 --concat--> code-point-sequence

> These two yield the same result only when CCS support all
> chars in "char-sequence" and CES is stateless
> (e.g. iso-latin-1) and .

You lost me here (I'm a poor soul who doesn't know much outside of the
latin-1 world). I thought that string-make-unibyte only behaves
meaningfully for "normal 8bit coding-systems" such as latin-1.

>> I've changed my Emacs so that string-make-unibyte does the above
>> (i.e. signals an error if it encounters a non-byte char) and it works fairly
>> well, except for the few places where the elisp code is sloppy and needs to
>> be fixed.

> How did you change it? string-make-unibyte internally uses
> the function copy_text. Did you change it? But, then, each
> time you copy a multibyte string into a unibyte buffer, you
> should get an error.

Of course: it's an error. A unibyte buffer cannot represent multibyte
chars, so you need to encode them first (into a unibyte string).

Now to tell you the truth, my change had to accept a few (not so) special
cases and it took a bit of fiddling to make the code lenient enough to
accept elisp code I didn't feel like "fixing". I can't remember the details
off-hand, but I remember having problems with regexp matching functions
where multibyte regexps are used in unibyte buffers.

-- 
Stefan
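
The distinction discussed in this message, between encoding through an
explicitly named coding system and mapping each char straight to a byte,
can be illustrated outside Emacs. What follows is a minimal sketch in
Python, offered as an analogy only and not as Emacs code: `str.encode`
stands in for `encode-coding-string', and `to_bytes_strict` is a
hypothetical helper standing in for a string-make-unibyte that signals an
error on a non-byte char. Neither reflects how Emacs implements these
functions.

```python
def to_bytes_strict(text: str) -> bytes:
    """Map each character to one byte by code point (hypothetical helper).

    Raises ValueError for any character above 255, mirroring the idea of
    signalling an error when a non-byte char is encountered.
    """
    for index, char in enumerate(text):
        if ord(char) > 255:
            raise ValueError(
                f"character {index} ({char!r}) does not fit in one byte"
            )
    return bytes(ord(char) for char in text)


# Stateless 8-bit coding system: both routes produce the same bytes.
latin = "a\u00c0"  # 'a' followed by A-grave (code point 192)
assert to_bytes_strict(latin) == latin.encode("latin-1") == b"a\xc0"

# Stateful coding system: the encoder emits escape sequences, so the
# result is no longer one byte per character.
kana = "\u3042"  # HIRAGANA LETTER A
encoded = kana.encode("iso2022_jp")
assert encoded.startswith(b"\x1b") and len(encoded) > len(kana)

# The strict code-point route cannot represent that character at all.
try:
    to_bytes_strict(kana)
except ValueError as err:
    print("rejected:", err)
```

In the latin-1 case the two routes agree, which matches the observation
that they coincide only for a stateless coding system whose character set
covers every char in the string. With a stateful encoding such as
ISO-2022-JP they diverge, and the strict route refuses the input rather
than guessing.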