* Unibyte strings in Lisp data structures
From: Eli Zaretskii @ 2010-07-13 14:28 UTC
To: Kenichi Handa; +Cc: emacs-devel
Take a look at jka-compr-compression-info-list: each compression
method has a magic signature there, stored at index 9 of the
vector describing that compression method.
Now evaluate this:
    (multibyte-string-p (aref (car jka-compr-compression-info-list) 9))
      => nil
These magic signatures are unibyte strings. But why are they unibyte?
What code decides that they should be unibyte, when Emacs reads
jka-cmpr-hook.el? Can we rely on the fact that these strings will
always be unibyte?
I bumped into this while debugging a problem in rmailmm.el: saving
attachments whose file names end in .gz produces a file that is
gzip-compressed twice. I finally traced it to this fragment in
jka-compr.el:
    ;; If the contents to be written out
    ;; are properly compressed already,
    ;; don't try to compress them over again.
    (not (and magic
              (equal (if (stringp start)
                         (substring start 0 (min (length start)
                                                 (length magic)))
                       (let* ((from (or start (point-min)))
                              (to (min (or end (point-max))
                                       (+ from (length magic)))))
                         (buffer-substring from to)))
                     magic))))
This test failed, because `magic' is a unibyte string, while
buffer-substring was returning a multibyte string.
The fix seems to be easy: modify rmail-mime-save to make the temporary
buffer it uses be a unibyte buffer. But then I started to wonder how
come `magic' is a unibyte string, and can I rely on that?
There is, of course, the alternative to convert both strings to
unibyte and compare that. Still, I think it would be good to know how
come these strings are unibyte to begin with.
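To make the mismatch concrete, here is a minimal standalone C sketch of why the comparison fails and why converting to unibyte first would make it succeed. It is a model, not Emacs code: `struct lstring`, `lstring_equal` and `to_unibyte` are hypothetical names, the comparison rule (same character count, same byte count, same bytes) reflects my understanding of how `equal` treats strings, and the two-byte 0xC0/0xC1 encoding of raw bytes in multibyte text is an assumption about the internal representation.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a Lisp string (hypothetical, not the Emacs struct). */
struct lstring {
    size_t nchars;              /* length in characters */
    size_t nbytes;              /* length in bytes */
    const unsigned char *data;  /* internal representation */
};

/* Assumed model of `equal' on strings: character counts, byte counts
   and raw bytes must all match, so representation matters. */
int lstring_equal(const struct lstring *a, const struct lstring *b)
{
    return a->nchars == b->nchars
        && a->nbytes == b->nbytes
        && memcmp(a->data, b->data, a->nbytes) == 0;
}

/* Collapse multibyte data holding only ASCII and raw bytes into plain
   bytes.  dst must have room for n bytes.  Returns the number of bytes
   written, or -1 for any sequence this sketch does not model.
   Assumption: a raw byte is stored as a 0xC0/0xC1 lead byte plus one
   trailing byte. */
long to_unibyte(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = src[i];
        if (b < 0x80) {
            dst[out++] = b;
        } else if ((b == 0xC0 || b == 0xC1) && i + 1 < n) {
            dst[out++] = (unsigned char)(0x80 | ((b & 1) << 6)
                                         | (src[i + 1] & 0x3F));
            i++;  /* the trailing byte has been consumed */
        } else {
            return -1;
        }
    }
    return (long)out;
}
```

With the gzip signature as input, the unibyte form is the two bytes 0x1F 0x8B, while the assumed multibyte form is two characters in three bytes, so the direct comparison fails; after `to_unibyte` both sides are two bytes and compare equal.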
* Re: Unibyte strings in Lisp data structures
From: Andreas Schwab @ 2010-07-13 15:05 UTC
To: Eli Zaretskii; +Cc: emacs-devel, Kenichi Handa
Eli Zaretskii <eliz@gnu.org> writes:
> What code decides that they should be unibyte, when Emacs reads
> jka-cmpr-hook.el?
Strings are read as unibyte by default unless they contain non-ascii,
non-8-bit characters. (See (elisp) Converting Representations::).
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: Unibyte strings in Lisp data structures
From: Eli Zaretskii @ 2010-07-13 16:13 UTC
To: Andreas Schwab; +Cc: emacs-devel, handa
> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Kenichi Handa <handa@m17n.org>, emacs-devel@gnu.org
> Date: Tue, 13 Jul 2010 17:05:30 +0200
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > What code decides that they should be unibyte, when Emacs reads
> > jka-cmpr-hook.el?
>
> Strings are read as unibyte by default unless they contain non-ascii,
> non-8-bit characters. (See (elisp) Converting Representations::).
Thanks, but I'm not sure this is relevant. The section you pointed to
deals with conversions and insertions, not with how strings are read
by the Lisp reader.
Note that in jka-cmpr-hook.el, these magic signatures are specified as
octal escapes:
    ["\\.g?z\\(~\\|\\.~[0-9]+~\\)?\\'"
     "compressing" "gzip" ("-c" "-q")
     "uncompressing" "gzip" ("-c" "-q" "-d")
     t t "\037\213"]
I think the relevant code is this fragment from lread.c:read_escape:
    case '0':
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
      /* An octal escape, as in ANSI C. */
      {
        register int i = c - '0';
        register int count = 0;
        while (++count < 3)
          {
            if ((c = READCHAR) >= '0' && c <= '7')
              {
                i *= 8;
                i += c - '0';
              }
            else
              {
                UNREAD (c);
                break;
              }
          }
        if (i >= 0x80 && i < 0x100)
          i = BYTE8_TO_CHAR (i);
        return i;
      }
The BYTE8_TO_CHAR macro returns the multibyte representation of an
eight-bit byte. Then, in read1, we do:
        if (CHAR_BYTE8_P (c))
          force_singlebyte = 1;
    ...
      else if (force_singlebyte)
        {
          nchars = str_as_unibyte (read_buffer, p - read_buffer);
The question now is: will this rule remain stable long enough
to rely on it? Or is it safer to convert both strings to the same
representation for comparison?
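The rule described above can be sketched as a small standalone C model: an octal escape in the range 0x80..0xFF becomes a raw-byte character, and seeing one such character forces the whole string to unibyte. This is not the code from lread.c. The helper names `read_octal_escape` and `literal_forces_unibyte` are hypothetical, the macro values are my understanding of character.h and should be treated as assumptions, and only octal escapes are modeled (the real reader also has a multibyte-forcing path that is ignored here).

```c
#include <stddef.h>

/* Assumed values, modeled on Emacs's character.h. */
#define MAX_5_BYTE_CHAR 0x3FFF7F
#define BYTE8_TO_CHAR(byte) ((byte) + 0x3FFF00)
#define CHAR_BYTE8_P(c) ((c) > MAX_5_BYTE_CHAR)

/* Decode up to three octal digits starting at s (the text right after
   the backslash).  Stores the number of digits consumed in *used.
   Values 0x80..0xFF are mapped to raw-byte characters. */
int read_octal_escape(const char *s, size_t *used)
{
    int i = 0;
    size_t n = 0;
    while (n < 3 && s[n] >= '0' && s[n] <= '7') {
        i = i * 8 + (s[n] - '0');
        n++;
    }
    *used = n;
    if (i >= 0x80 && i < 0x100)
        i = BYTE8_TO_CHAR(i);
    return i;
}

/* Return 1 if the literal's source text (between the quotes) contains
   an octal escape that denotes a raw byte, i.e. the case in which this
   model would set force_singlebyte. */
int literal_forces_unibyte(const char *body)
{
    int force_singlebyte = 0;
    size_t k = 0;
    while (body[k] != '\0') {
        if (body[k] != '\\') {
            k++;                      /* ordinary character */
        } else if (body[k + 1] >= '0' && body[k + 1] <= '7') {
            size_t used;
            int c = read_octal_escape(body + k + 1, &used);
            if (CHAR_BYTE8_P(c))
                force_singlebyte = 1;
            k += 1 + used;
        } else if (body[k + 1] != '\0') {
            k += 2;                   /* some other escape: skip it */
        } else {
            k++;                      /* trailing backslash */
        }
    }
    return force_singlebyte;
}
```

For the gzip signature, whose source text is a backslash followed by 037 and a backslash followed by 213, the second escape decodes to 0x8B and therefore flags the string; a literal containing only the 037 escape would not be flagged by this model.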
* Re: Unibyte strings in Lisp data structures
From: Andreas Schwab @ 2010-07-13 18:40 UTC
To: Eli Zaretskii; +Cc: emacs-devel, handa
Eli Zaretskii <eliz@gnu.org> writes:
> The question now is: will this rule remain stable long enough
> to rely on it? Or is it safer to convert both strings to the same
> representation for comparison?
I think we should make it that (equal s (string-to-multibyte s)) is
always true.
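One way to picture such a representation-independent comparison is to compare character codes rather than stored bytes. The following is only a rough standalone C sketch under stated assumptions, not a proposed patch: `struct cursor`, `next_char` and `equal_by_chars` are hypothetical names, raw bytes are assumed to map to the code 0x3FFF00 + byte in both representations, and multibyte data is assumed to store a raw byte as a 0xC0/0xC1 lead byte plus one trailing byte. Other multibyte characters are out of scope.

```c
#include <stddef.h>

/* Hypothetical cursor over string data; `multibyte' selects decoding. */
struct cursor {
    const unsigned char *p;
    const unsigned char *end;
    int multibyte;
};

/* Return the next character code, -1 at end of data, or -2 for a
   sequence this sketch does not model. */
long next_char(struct cursor *c)
{
    if (c->p >= c->end)
        return -1;
    unsigned char b = *c->p++;
    if (b < 0x80)
        return b;                     /* ASCII is the same either way */
    if (!c->multibyte)
        return 0x3FFF00L + b;         /* unibyte: a raw byte */
    if ((b == 0xC0 || b == 0xC1) && c->p < c->end) {
        unsigned char t = *c->p++;
        return 0x3FFF00L + (0x80 | ((b & 1) << 6) | (t & 0x3F));
    }
    return -2;
}

/* Compare two strings character by character, ignoring how each one
   is stored.  Returns 1 if equal, 0 otherwise. */
int equal_by_chars(const unsigned char *a, size_t na, int a_multibyte,
                   const unsigned char *b, size_t nb, int b_multibyte)
{
    struct cursor ca = { a, a + na, a_multibyte };
    struct cursor cb = { b, b + nb, b_multibyte };
    for (;;) {
        long x = next_char(&ca);
        long y = next_char(&cb);
        if (x == -2 || y == -2)
            return 0;                 /* unsupported input */
        if (x != y)
            return 0;
        if (x == -1)
            return 1;                 /* both ended together */
    }
}
```

Under these assumptions the unibyte signature 0x1F 0x8B and its three-byte multibyte counterpart yield the same two character codes, so they compare equal, which is the behaviour the proposal asks of `equal`.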
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."