From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte strings in Lisp data structures
Date: Tue, 13 Jul 2010 19:13:59 +0300
Message-ID: <837hkzs6k8.fsf@gnu.org>
References: <83aapvsbfh.fsf@gnu.org>
To: Andreas Schwab
Cc: emacs-devel@gnu.org, handa@m17n.org

> From: Andreas Schwab
> Cc: Kenichi Handa, emacs-devel@gnu.org
> Date: Tue, 13 Jul 2010 17:05:30 +0200
>
> Eli Zaretskii writes:
>
> > What code decides that they should be unibyte, when Emacs reads
> > jka-cmpr-hook.el?
>
> Strings are read as unibyte by default unless they contain non-ascii,
> non-8-bit characters.  (See (elisp) Converting Representations::).

Thanks, but I'm not sure this is relevant.  The section you pointed to
deals with conversions and insertions, not with how strings are read
by the Lisp reader.

Note that in jka-cmpr-hook.el, these magic signatures are specified as
octal escapes:

    ["\\.g?z\\(~\\|\\.~[0-9]+~\\)?\\'"
     "compressing"   "gzip" ("-c" "-q")
     "uncompressing" "gzip" ("-c" "-q" "-d")
     t t "\037\213"]

I think the relevant code is this fragment from lread.c:read_escape:

    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
      /* An octal escape, as in ANSI C.  */
      {
        register int i = c - '0';
        register int count = 0;
        while (++count < 3)
          {
            if ((c = READCHAR) >= '0' && c <= '7')
              {
                i *= 8;
                i += c - '0';
              }
            else
              {
                UNREAD (c);
                break;
              }
          }

        if (i >= 0x80 && i < 0x100)
          i = BYTE8_TO_CHAR (i);
        return i;
      }

The BYTE8_TO_CHAR macro returns the multibyte representation of an
eight-bit byte.  Then, in read1, we do:

        if (CHAR_BYTE8_P (c))
          force_singlebyte = 1;
        ...
        else if (force_singlebyte)
          {
            nchars = str_as_unibyte (read_buffer, p - read_buffer);

The question now is: will this rule remain stable long enough to rely
on?  Or is it safer to convert both strings to the same representation
before comparing them?
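
To make the reader behaviour described above easier to check by hand, here is a small standalone C model of it.  It is only a sketch, not the lread.c code: the model_* names are hypothetical, and the two constants (a raw byte B being represented as character code B + 0x3FFF00, with every code above 0x3FFF7F counting as an eight-bit character) reflect my reading of character.h and should be treated as assumptions.  It also ignores genuine multibyte characters, which would set force_multibyte instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed constants, modelled on character.h rather than copied from it.  */
#define MODEL_BYTE8_OFFSET    0x3FFF00
#define MODEL_MAX_5_BYTE_CHAR 0x3FFF7F

/* Stand-ins for BYTE8_TO_CHAR and CHAR_BYTE8_P.  */
static int model_byte8_to_char (int byte) { return byte + MODEL_BYTE8_OFFSET; }
static bool model_char_byte8_p (int c) { return c > MODEL_MAX_5_BYTE_CHAR; }

/* Parse one octal escape of up to three digits.  *PP points just past
   the backslash on entry and just past the last digit on return.  */
static int
model_read_octal_escape (const char **pp)
{
  const char *p = *pp;
  assert (p != NULL && *p >= '0' && *p <= '7');

  int i = *p++ - '0';
  int count = 0;
  while (++count < 3)
    {
      if (*p >= '0' && *p <= '7')
        i = i * 8 + (*p++ - '0');
      else
        break;
    }
  *pp = p;

  /* Values 0x80..0xFF become "eight-bit byte" characters.  */
  if (i >= 0x80 && i < 0x100)
    i = model_byte8_to_char (i);
  return i;
}

/* Scan the body of a string literal (without the surrounding quotes)
   and report whether the reader would set force_singlebyte.  */
static bool
model_forces_unibyte (const char *src)
{
  bool force_singlebyte = false;
  while (*src)
    {
      int c;
      if (src[0] == '\\' && src[1] >= '0' && src[1] <= '7')
        {
          src++;                       /* skip the backslash */
          c = model_read_octal_escape (&src);
        }
      else
        c = (unsigned char) *src++;    /* plain character, taken as-is */

      if (model_char_byte8_p (c))
        force_singlebyte = true;
    }
  return force_singlebyte;
}
```

Under these assumptions, feeding the model the literal text of the gzip signature (backslash 037 backslash 213) reports that the string is forced to unibyte, because \213 is 0x8B and falls in the 0x80..0xFF range, while \037 alone would not trigger it.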