From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte strings in Lisp data structures
Date: Tue, 13 Jul 2010 19:13:59 +0300
Message-ID: <837hkzs6k8.fsf@gnu.org>
References: <83aapvsbfh.fsf@gnu.org> <m3r5j7777p.fsf@hase.home>
Reply-To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org, handa@m17n.org
To: Andreas Schwab <schwab@linux-m68k.org>
In-reply-to: <m3r5j7777p.fsf@hase.home>
Xref: news.gmane.org gmane.emacs.devel:127193
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/127193>

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Kenichi Handa <handa@m17n.org>,  emacs-devel@gnu.org
> Date: Tue, 13 Jul 2010 17:05:30 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > What code decides that they should be unibyte, when Emacs reads
> > jka-cmpr-hook.el?
> 
> Strings are read as unibyte by default unless they contain non-ascii,
> non-8-bit characters. (See (elisp) Converting Representations::).

Thanks, but I'm not sure this is relevant.  The section you pointed to
deals with conversions and insertions, not with how strings are read
by the Lisp reader.

Note that in jka-cmpr-hook.el, these magic signatures are specified as
octal escapes:

    ["\\.g?z\\(~\\|\\.~[0-9]+~\\)?\\'"
     "compressing"        "gzip"         ("-c" "-q")
     "uncompressing"      "gzip"         ("-c" "-q" "-d")
     t t "\037\213"]

I think the relevant code is this fragment from lread.c:read_escape:

    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
      /* An octal escape, as in ANSI C.  */
      {
	register int i = c - '0';
	register int count = 0;
	while (++count < 3)
	  {
	    if ((c = READCHAR) >= '0' && c <= '7')
	      {
		i *= 8;
		i += c - '0';
	      }
	    else
	      {
		UNREAD (c);
		break;
	      }
	  }

	if (i >= 0x80 && i < 0x100)
	  i = BYTE8_TO_CHAR (i);
	return i;
      }

The BYTE8_TO_CHAR macro maps a raw eight-bit byte (0x80..0xFF) to the
character code that stands for that byte in the multibyte
representation.  Then, in read1, we do:

		if (CHAR_BYTE8_P (c))
		  force_singlebyte = 1;
		...
	else if (force_singlebyte)
	  {
	    nchars = str_as_unibyte (read_buffer, p - read_buffer);
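
To make the interaction of the two fragments concrete, here is a
minimal standalone sketch of the same logic.  It is not the lread.c
code: the sketch_* names are hypothetical, the two macros are my
assumption of what character.h defines (raw bytes mapped to
0x3FFF80..0x3FFFFF), and the scanner only handles ASCII input with
backslash escapes.  It reports whether a string literal's body would
set force_singlebyte.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: these mirror BYTE8_TO_CHAR and CHAR_BYTE8_P from
   character.h, where raw bytes 0x80..0xFF live at 0x3FFF80..0x3FFFFF.  */
#define SKETCH_BYTE8_TO_CHAR(b) ((b) + 0x3FFF00)
#define SKETCH_CHAR_BYTE8_P(c)  ((c) > 0x3FFF7F)

/* Read up to three octal digits starting at *p and advance *p past
   them.  Values in 0x80..0xFF are returned as raw-byte characters,
   as in the read_escape fragment quoted above.  */
int sketch_read_octal(const char **p)
{
    assert(p != NULL && **p >= '0' && **p <= '7');
    int i = *(*p)++ - '0';
    int count = 0;
    while (++count < 3 && **p >= '0' && **p <= '7')
        i = i * 8 + (*(*p)++ - '0');
    if (i >= 0x80 && i < 0x100)
        i = SKETCH_BYTE8_TO_CHAR(i);
    return i;
}

/* Scan the body of a string literal (the text between the quotes).
   Return 1 if some octal escape produced a raw byte, i.e. the
   condition under which read1 sets force_singlebyte.  */
int sketch_forces_unibyte(const char *body)
{
    int force_singlebyte = 0;
    while (*body != '\0') {
        if (body[0] == '\\' && body[1] != '\0') {
            body++;                       /* skip the backslash */
            if (*body >= '0' && *body <= '7') {
                if (SKETCH_CHAR_BYTE8_P(sketch_read_octal(&body)))
                    force_singlebyte = 1;
            } else {
                body++;                   /* other escapes are ignored here */
            }
        } else {
            body++;
        }
    }
    return force_singlebyte;
}
```

With this sketch, the gzip signature written as "\037\213" in the Lisp
source (i.e. the C string "\\037\\213") yields 1, because \213 is
0x8B, which falls in the raw-byte range; \037 alone does not.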

The question now is: will this rule remain stable long enough for us
to rely on it?  Or is it safer to convert both strings to the same
representation before comparing them?