From: Eli Zaretskii <eliz@gnu.org>
To: Andreas Schwab <schwab@linux-m68k.org>
Cc: emacs-devel@gnu.org, handa@m17n.org
Subject: Re: Unibyte strings in Lisp data structures
Date: Tue, 13 Jul 2010 19:13:59 +0300
Message-ID: <837hkzs6k8.fsf@gnu.org>
In-Reply-To: <m3r5j7777p.fsf@hase.home>

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Kenichi Handa <handa@m17n.org>,  emacs-devel@gnu.org
> Date: Tue, 13 Jul 2010 17:05:30 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > What code decides that they should be unibyte, when Emacs reads
> > jka-cmpr-hook.el?
> 
> Strings are read as unibyte by default unless they contain non-ascii,
> non-8-bit characters. (See (elisp) Converting Representations::).

Thanks, but I'm not sure this is relevant.  The section you pointed to
deals with conversions and insertions, not with how strings are read
by the Lisp reader.

Note that in jka-cmpr-hook.el, these magic signatures are specified as
octal escapes:

    ["\\.g?z\\(~\\|\\.~[0-9]+~\\)?\\'"
     "compressing"        "gzip"         ("-c" "-q")
     "uncompressing"      "gzip"         ("-c" "-q" "-d")
     t t "\037\213"]

I think the relevant code is this fragment from lread.c:read_escape:

    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
      /* An octal escape, as in ANSI C.  */
      {
	register int i = c - '0';
	register int count = 0;
	while (++count < 3)
	  {
	    if ((c = READCHAR) >= '0' && c <= '7')
	      {
		i *= 8;
		i += c - '0';
	      }
	    else
	      {
		UNREAD (c);
		break;
	      }
	  }

	if (i >= 0x80 && i < 0x100)
	  i = BYTE8_TO_CHAR (i);
	return i;
      }
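
To make the effect of this fragment concrete, here is a minimal
standalone sketch of the same logic.  It reads from a C string instead
of using READCHAR/UNREAD, and the names sketch_read_octal_escape and
SKETCH_BYTE8_OFFSET are made up for illustration.  The offset 0x3FFF00
is my recollection of how BYTE8_TO_CHAR is defined in character.h, so
treat it as an assumption rather than a quote from the sources.

```c
#include <stddef.h>

/* Assumption: BYTE8_TO_CHAR adds this offset to a raw byte 0x80..0xFF.  */
#define SKETCH_BYTE8_OFFSET 0x3FFF00

/* Parse one octal escape of up to three digits starting at *PP, which
   must point at a digit '0'..'7'.  Advance *PP past the digits consumed
   and return the resulting character code.  Hypothetical helper, not
   the real read_escape.  */
static int
sketch_read_octal_escape (const char **pp)
{
  const char *p = *pp;
  int i = *p++ - '0';		/* first digit */
  int count = 0;

  while (++count < 3)
    {
      if (*p >= '0' && *p <= '7')
	i = i * 8 + (*p++ - '0');
      else
	break;			/* leave the non-digit unread */
    }
  *pp = p;

  /* Values 0x80..0xFF become "raw byte" characters, not Latin-1.  */
  if (i >= 0x80 && i < 0x100)
    i += SKETCH_BYTE8_OFFSET;
  return i;
}
```

With this sketch, "\037" yields the plain code 037, while "\213" yields
a code above the normal character range, which is what lets later code
recognize it as a raw byte.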

The BYTE8_TO_CHAR macro returns the multibyte representation of an
eight-bit byte.  Then, in read1, we do:

		if (CHAR_BYTE8_P (c))
		  force_singlebyte = 1;
		...
	else if (force_singlebyte)
	  {
	    nchars = str_as_unibyte (read_buffer, p - read_buffer);
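
As a rough illustration of what that amounts to, the sketch below works
on an array of character codes rather than on the multibyte bytes in
read_buffer, so it is not what str_as_unibyte literally does.  The
names are hypothetical, the two constants are my recollection of
character.h (an assumption), and the branch elided above is not
reproduced: the sketch simply gives up with -1 on any character that
is neither ASCII nor a raw byte.

```c
#include <stddef.h>

/* Assumptions: raw-byte characters are exactly the codes above
   MAX_5_BYTE_CHAR, and BYTE8_TO_CHAR adds a fixed offset.  */
#define SKETCH_MAX_5_BYTE_CHAR 0x3FFF7F
#define SKETCH_BYTE8_OFFSET    0x3FFF00

static int
sketch_char_byte8_p (int c)
{
  return c > SKETCH_MAX_5_BYTE_CHAR;
}

/* Store in OUT one byte per code in CHARS[0..N-1].  Set *FORCED to 1 if
   any raw-byte character was seen, mirroring force_singlebyte.  Return
   N, or -1 if a character outside ASCII and the raw-byte range occurs
   (that case is out of scope for this sketch).  */
static long
sketch_collect_bytes (const int *chars, size_t n,
		      unsigned char *out, int *forced)
{
  *forced = 0;
  for (size_t k = 0; k < n; k++)
    {
      int c = chars[k];
      if (sketch_char_byte8_p (c))
	{
	  *forced = 1;
	  out[k] = (unsigned char) (c - SKETCH_BYTE8_OFFSET);
	}
      else if (c >= 0 && c < 0x80)
	out[k] = (unsigned char) c;
      else
	return -1;
    }
  return (long) n;
}
```

For the gzip signature, the two codes 037 and BYTE8_TO_CHAR (0213)
collapse back to the two bytes 0x1F 0x8B, with the flag set.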

The question now is: will this rule remain stable long enough for us
to rely on it?  Or is it safer to convert both strings to the same
representation before comparing them?
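
To show what I mean by the second option, here is a minimal sketch that
normalizes character codes to bytes before comparing them with a magic
signature.  The function names are hypothetical and the constants are
again my recollection of character.h, so this is an illustration of the
idea under those assumptions, not code from Emacs.

```c
#include <stddef.h>

/* Assumed values, see above.  */
#define SKETCH_MAX_5_BYTE_CHAR 0x3FFF7F
#define SKETCH_BYTE8_OFFSET    0x3FFF00

/* Map a character code to the byte it stands for: raw-byte characters
   map back to 0x80..0xFF, ASCII maps to itself, anything else has no
   single-byte equivalent and yields -1.  */
static int
sketch_char_to_byte (int c)
{
  if (c > SKETCH_MAX_5_BYTE_CHAR)
    return c - SKETCH_BYTE8_OFFSET;
  if (c >= 0 && c < 0x80)
    return c;
  return -1;
}

/* Return 1 if the first NMAGIC codes of CHARS denote the bytes MAGIC.  */
static int
sketch_matches_signature (const int *chars, size_t nchars,
			  const unsigned char *magic, size_t nmagic)
{
  if (nchars < nmagic)
    return 0;
  for (size_t k = 0; k < nmagic; k++)
    if (sketch_char_to_byte (chars[k]) != magic[k])
      return 0;
  return 1;
}
```

Note that a text whose second character is the ordinary character with
code 0x8B does not match here; only the raw byte 0x8B does.  That is
exactly the distinction a comparison across representations can get
wrong.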


