From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Stefan Monnier <monnier@IRO.UMontreal.CA>
Newsgroups: gmane.emacs.devel
Subject: Re: eight-bit char handling in emacs-unicode
Date: 18 Nov 2003 22:05:39 -0500
Sender: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Message-ID: <jwvptfp139w.fsf-monnier+emacs/devel@vor.iro.umontreal.ca>
References: <ilubrrha7oc.fsf@latte.josefsson.org>
	<200311130153.KAA04615@etlken.m17n.org>
	<ilur80c50uj.fsf@latte.josefsson.org>
	<200311130610.PAA04983@etlken.m17n.org>
	<iluekwcwyl8.fsf@latte.josefsson.org>
	<200311130901.SAA05204@etlken.m17n.org>
	<ilun0b08by1.fsf@latte.josefsson.org>
	<200311140047.JAA06414@etlken.m17n.org>
	<jwvhe12emr3.fsf-monnier+emacs/devel@vor.iro.umontreal.ca>
	<200311180733.QAA13703@etlken.m17n.org>
	<jwvn0atd38w.fsf-monnier+emacs/devel@vor.iro.umontreal.ca>
	<200311190006.JAA14847@etlken.m17n.org>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1069211289 8672 80.91.224.253 (19 Nov 2003 03:08:09 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Wed, 19 Nov 2003 03:08:09 +0000 (UTC)
Cc: emacs-devel@gnu.org, jas@extundo.com
Original-X-From: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org Wed Nov 19 04:08:06 2003
Return-path: <emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1AMIhK-0000vE-00
	for <emacs-devel@deer.gmane.org>; Wed, 19 Nov 2003 04:08:06 +0100
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by quimby.gnus.org with esmtp (Exim 3.35 #1 (Debian))
	id 1AMIhK-00083V-00
	for <emacs-devel@quimby.gnus.org>; Wed, 19 Nov 2003 04:08:06 +0100
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.24)
	id 1AMJdI-0007Zk-VK
	for emacs-devel@quimby.gnus.org; Tue, 18 Nov 2003 23:08:00 -0500
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.24)
	id 1AMJdD-0007ZM-UU
	for emacs-devel@gnu.org; Tue, 18 Nov 2003 23:07:55 -0500
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.24)
	id 1AMJch-0007WK-0y
	for emacs-devel@gnu.org; Tue, 18 Nov 2003 23:07:54 -0500
Original-Received: from [132.204.24.67] (helo=mercure.iro.umontreal.ca)
	by monty-python.gnu.org with esmtp (Exim 4.24) id 1AMJcg-0007WH-Ow
	for emacs-devel@gnu.org; Tue, 18 Nov 2003 23:07:22 -0500
Original-Received: from vor.iro.umontreal.ca (vor.iro.umontreal.ca [132.204.24.42])
	by mercure.iro.umontreal.ca (8.12.9/8.12.9) with ESMTP id
	hAJ35ebj019996; Tue, 18 Nov 2003 22:05:43 -0500
Original-Received: by vor.iro.umontreal.ca (Postfix, from userid 20848)
	id 0E7C73C63E; Tue, 18 Nov 2003 22:05:39 -0500 (EST)
Original-To: Kenichi Handa <handa@m17n.org>
In-Reply-To: <200311190006.JAA14847@etlken.m17n.org>
Original-Lines: 64
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3.50
X-DIRO-MailScanner: Found to be clean
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.2
Precedence: list
List-Id: Emacs development discussions.  <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Xref: main.gmane.org gmane.emacs.devel:17902
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:17902

> I see.  Apart from the design itself, I agree that it's difficult to
> introduce a new type.  But, when I discussed with Richard about the
> Character type object a few years ago, he was not that negative provided
> that it gives sure improvement.

Sounds about right to me: we have one free tag that we could use for chars
(and that I currently use to boost the max buffer size from 256MB to 512MB
in my local code).
But it needs to pay for itself.

> Then, we can't use make-string-unibyte for the current case
> because, in emacs-unicode, (concat '(?a 192)) returns a
> multibyte string whose second element is A-grave, not an
> eight-bit-char.  Am I missing something?

Well, obviously we need to make it accept this case (i.e. accept both the
latin-1 192 and the eight-bit-char 192).  I'm sure there'll be other issues.
I haven't had much time to think about it and you're obviously better
placed to foresee potential problems.
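
To make the "accept both" case concrete, here is a minimal sketch in Python
(a model, not Emacs code).  Characters are plain integers, and the offset
used for eight-bit chars is an assumption chosen for illustration; the names
`eight_bit_char` and `strict_make_unibyte` are hypothetical.  Rejecting every
other char is also an assumption of the sketch.

```python
# Hypothetical model: a "character" is an int.  Raw (eight-bit) bytes are
# modeled at a fixed offset; the value is an assumption for illustration.
EIGHT_BIT_OFFSET = 0x3FFF00


def eight_bit_char(byte):
    """Model the eight-bit char that stands for a raw byte 0x80..0xFF."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("not a raw 8-bit byte")
    return EIGHT_BIT_OFFSET + byte


def strict_make_unibyte(chars):
    """Convert a sequence of char codes to bytes, or raise ValueError."""
    out = bytearray()
    for c in chars:
        if 0 <= c <= 0xFF:
            # ASCII or a Latin-1 code point: use it as the byte value.
            out.append(c)
        elif EIGHT_BIT_OFFSET + 0x80 <= c <= EIGHT_BIT_OFFSET + 0xFF:
            # Eight-bit char: recover the raw byte it stands for.
            out.append(c - EIGHT_BIT_OFFSET)
        else:
            # Anything else has no single-byte meaning in this model.
            raise ValueError(f"non-byte char: {c:#x}")
    return bytes(out)
```

In this model both `[ord("a"), 192]` and `[ord("a"), eight_bit_char(192)]`
convert to the same two bytes, while a char such as 0x3042 is rejected.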

>> To do what your string-make-unibyte does you should use
>> `encode-coding-string' where the coding system is passed explicitly.

> Those are conceptually different things (I remember the
> similar discussion we had a while ago).

> encode-coding-string does:
> char-sequence --CCS-set--> (CCS/codepoint-pair)-sequence
>     --CES--> encoded-byte-sequence

> string-make-unibyte does:
> char-sequence --CCS--> code-point-sequence
>     --concat--> code-point-sequence

> These two yield the same result only when CCS support all
> chars in "char-sequence" and CES is stateless
> (e.g. iso-latin-1) and .

You lost me here (I'm a poor soul who doesn't know much outside of the
latin-1 world).
I thought that string-make-unibyte only behaves meaningfully for
"normal 8bit coding-systems" such as latin-1.

>> I've changed my Emacs so that string-make-unibyte does the above
>> (i.e. signals an error if it encounters a non-byte char) and it works fairly
>> well, except for the few places where the elisp code is sloppy and needs to
>> be fixed.

> How did you change it?  string-make-unibyte internally uses
> the function copy_text.  Did you change it?  But, then, each
> time you copy a multibyte string into a unibyte buffer, you
> should get an error.

Of course: it's an error.  A unibyte buffer cannot represent multibyte
chars, so you need to encode them first (into a unibyte string).
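
The difference between encoding with an explicit coding system and mapping
code points straight to bytes can be illustrated outside Emacs.  The sketch
below uses Python codecs as stand-ins for Emacs coding systems, which is an
assumption made only for the sake of the example; `code_points_as_bytes` is a
hypothetical helper.

```python
# Illustration only: Python codecs stand in for Emacs coding systems.

def code_points_as_bytes(text):
    """Map each char's code point straight to one byte (no coding system)."""
    return bytes(ord(ch) for ch in text)  # ValueError if a code point > 255


latin = "caf\u00e9"
# Latin-1 is stateless and its code points are the byte values, so the
# direct mapping and the explicit encoding agree.
same = code_points_as_bytes(latin) == latin.encode("latin-1")

# ISO-2022-JP is stateful: the encoder emits escape sequences to switch
# character sets, so the output is not one byte per character.
kana = "\u3042"
encoded = kana.encode("iso2022_jp")
stateful = encoded.startswith(b"\x1b") and len(encoded) > len(kana)

# The direct mapping has no answer at all for a char outside 0..255.
try:
    code_points_as_bytes(kana)
    direct_ok = True
except ValueError:
    direct_ok = False
```

The two approaches coincide for the Latin-1 string and diverge for the kana:
the explicit encoding succeeds with a multi-byte result, the direct mapping
fails.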

Now to tell you the truth, my change had to accept a few (not so) special
cases and it took a bit of fiddling to make the code lenient enough to
accept elisp code I didn't feel like "fixing".  I can't remember the details
off-hand, but I remember having problems with regexp matching functions
where multibyte regexps are used in unibyte buffers.


-- Stefan