* bug#5700: emacs-23 and 8-bit characters in 128..255
@ 2010-03-09 19:51 Nelson H. F. Beebe
2010-03-09 22:02 ` Stefan Monnier
0 siblings, 1 reply; 4+ messages in thread
From: Nelson H. F. Beebe @ 2010-03-09 19:51 UTC (permalink / raw)
To: 5700; +Cc: beebe
When emacs-23 came out and I began to use it, I noticed problems in
some of my extensive locally-written emacs code.
I've been far too busy to try to track down why, and sometimes, the
problems were resolved simply by rerunning byte-compile-file.
This morning, I set out to track down the source of one of the
problems in a function that I use a lot, and eventually narrowed it to
the failure of functions like these:
(string-equal (buffer-substring (point) (1+ (point))) "\377")
(looking-at "\377")
In emacs-22 and earlier, if the character at point is octal 377
(decimal 255, hexadecimal 0xff), each of these expressions returns t.
In emacs-23, each returns nil. Further testing shows identical behavior
for characters in the decimal range 128--255 (octal \200--\377).
I suspect the reason is this comment in the NEWS file:
The internal encoding used for buffers and strings is now
Unicode-based and called `utf-8-emacs' (`emacs-internal' is an alias
for this). This encoding is backward-compatible with Unicode's UTF-8
encoding. The internal encoding previously used by Emacs,
`emacs-mule', is still available for reading and writing files.
The code in question uses the character ?\377 as a unique sentinel
that terminates the function's processing. It needs to be a
nonprintable character that is not used in normal text files, and I
found that changing it to ?\177 (ASCII DELete) made the code work
properly. That change is transparent to older emacs versions, so in
this case, it is harmless. Nevertheless, since the technique of using
data sentinels is an ancient practice in many programming languages, I
suspect that my own code is not the only Emacs Lisp code to be
affected by the change.
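As a minimal, self-contained sketch of the failure and the workaround described above, the following inserts a sentinel character into a temporary buffer and runs both tests against a string literal. The helper name `bug5700-match-at-point` is hypothetical and not part of the original code, and the sketch assumes Emacs 23 or later with a multibyte temporary buffer (the default):

```elisp
;; Hypothetical helper, for illustration only.
(defun bug5700-match-at-point (char literal)
  "Insert CHAR into a temporary buffer and test it against LITERAL.
Return a list (STRING-EQUAL-RESULT LOOKING-AT-RESULT)."
  (with-temp-buffer
    (insert char)
    (goto-char (point-min))
    (list (string-equal (buffer-substring (point) (1+ (point))) literal)
          ;; Quote LITERAL so it is matched as plain text, not as a regexp.
          (and (looking-at (regexp-quote literal)) t))))

;; Sentinel 255 written as an octal escape: the case reported as failing.
(bug5700-match-at-point ?\377 "\377")

;; Sentinel 127 (ASCII DEL): the workaround, pure ASCII on both sides.
(bug5700-match-at-point ?\177 "\177")
```

According to this report, the first call yields nil for both tests on emacs-23, whereas the DEL variant matches in both.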
The question for this list is this:
If UTF-8 is used internally in the buffer text, then why are
numeric representations of unprintable characters in search
strings apparently not translated the same way?
In all of my Emacs Lisp source code files, the character set is plain
ASCII, which is a proper subset of UTF-8, requiring only a single byte
per character.
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: beebe@math.utah.edu -
- 155 S 1400 E RM 233 beebe@acm.org beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------
* bug#5700: emacs-23 and 8-bit characters in 128..255
2010-03-09 19:51 bug#5700: emacs-23 and 8-bit characters in 128..255 Nelson H. F. Beebe
@ 2010-03-09 22:02 ` Stefan Monnier
2016-07-06 23:52 ` npostavs
0 siblings, 1 reply; 4+ messages in thread
From: Stefan Monnier @ 2010-03-09 22:02 UTC (permalink / raw)
To: Nelson H. F. Beebe; +Cc: 5700
> This morning, I set out to track down the source of one of the
> problems in a function that I use a lot, and eventually narrowed it to
> the failure of functions like these:
> (string-equal (buffer-substring (point) (1+ (point))) "\377")
Indeed, we have a problem:
(string-equal "\377" (string-to-multibyte "\377"))
returned t in Emacs-22 but returns nil in Emacs-23. Another (somewhat
related) problem is that under Emacs-22, we had:
"\377" prints as "\377"
"\xff" prints as "\xff"
(multibyte-string-p "\377") prints as "\xff"
which seems acceptable, whereas under Emacs-23 we have:
"\377" prints as "ÿ"
"\xff" prints as "ÿ"
(multibyte-string-p "\377") prints as "\377"
which looks rather confusing.
> (looking-at "\377")
This is probably a separate bug.
> The internal encoding used for buffers and strings is now
> Unicode-based and called `utf-8-emacs' (`emacs-internal' is an alias
It is related, but only to the extent that a lot of the code that
handles multibyte chars (and especially "eight-bit chars") was
completely rewritten, and this is a very delicate area.
Stefan
* bug#5700: emacs-23 and 8-bit characters in 128..255
2010-03-09 22:02 ` Stefan Monnier
@ 2016-07-06 23:52 ` npostavs
2016-07-07 16:21 ` Eli Zaretskii
0 siblings, 1 reply; 4+ messages in thread
From: npostavs @ 2016-07-06 23:52 UTC (permalink / raw)
To: Stefan Monnier; +Cc: Nelson H. F. Beebe, 5700
tags 5700 notabug
quit
With Emacs 24/25, using "\u00FF" works:
(string-equal (buffer-substring (point) (1+ (point))) "\u00FF")
(looking-at "\u00FF")
Seems to be another instance of the unibyte vs multibyte string escape syntax thing:
You can also use hexadecimal escape sequences (‘\xN’) and octal
escape sequences (‘\N’) in string constants. *But beware:* If a
string constant contains hexadecimal or octal escape sequences, and
these escape sequences all specify unibyte characters (i.e., less
than 256), and there are no other literal non-ASCII characters or
Unicode-style escape sequences in the string, then Emacs
automatically assumes that it is a unibyte string. That is to say,
it assumes that all non-ASCII characters occurring in the string are
8-bit raw bytes.
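A small sketch of what that manual passage means in practice: the reader decides between unibyte and multibyte from the escape syntax alone. The function name `bug5700-literal-kinds` is made up for this illustration:

```elisp
;; Hypothetical helper, for illustration only.
(defun bug5700-literal-kinds ()
  "Return an alist saying whether each literal is a multibyte string."
  (list (cons 'octal   (multibyte-string-p "\377"))     ; octal escape
        (cons 'hex     (multibyte-string-p "\xff"))     ; hexadecimal escape
        (cons 'unicode (multibyte-string-p "\u00FF")))) ; Unicode-style escape

(bug5700-literal-kinds)
;; → ((octal) (hex) (unicode . t))

;; The unibyte literal holds one raw byte; the multibyte one holds the
;; two-byte internal encoding of character 255.
(string-bytes "\377")    ; → 1
(string-bytes "\u00FF")  ; → 2
```

Only the Unicode-style escape produces a multibyte string, which is why it compares equal to text taken from a multibyte buffer.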
Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
> which seems acceptable, whereas under Emacs-23 we have:
>
[...]
> (multibyte-string-p "\377") prints as "\377"
In 23.4 it returns nil
* bug#5700: emacs-23 and 8-bit characters in 128..255
2016-07-06 23:52 ` npostavs
@ 2016-07-07 16:21 ` Eli Zaretskii
0 siblings, 0 replies; 4+ messages in thread
From: Eli Zaretskii @ 2016-07-07 16:21 UTC (permalink / raw)
To: npostavs; +Cc: beebe, monnier, 5700
> From: npostavs@users.sourceforge.net
> Date: Wed, 06 Jul 2016 19:52:16 -0400
> Cc: "Nelson H. F. Beebe" <beebe@math.utah.edu>, 5700@debbugs.gnu.org
>
> With Emacs 24/25, using "\u00FF" works:
>
> (string-equal (buffer-substring (point) (1+ (point))) "\u00FF")
> (looking-at "\u00FF")
>
> Seems to be another instance of the unibyte vs multibyte string escape syntax thing:
>
> You can also use hexadecimal escape sequences (‘\xN’) and octal
> escape sequences (‘\N’) in string constants. *But beware:* If a
> string constant contains hexadecimal or octal escape sequences, and
> these escape sequences all specify unibyte characters (i.e., less
> than 256), and there are no other literal non-ASCII characters or
> Unicode-style escape sequences in the string, then Emacs
> automatically assumes that it is a unibyte string. That is to say,
> it assumes that all non-ASCII characters occurring in the string are
> 8-bit raw bytes.
>
> Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
> > which seems acceptable, whereas under Emacs-23 we have:
> >
> [...]
> > (multibyte-string-p "\377") prints as "\377"
>
> In 23.4 it returns nil
Yes.
The other significant piece of the puzzle is described in this text
from the ELisp manual:
For technical reasons, a unibyte and a multibyte string are ‘equal’
if and only if they contain the same sequence of character codes
and all these codes are either in the range 0 through 127 (ASCII)
or 160 through 255 (‘eight-bit-graphic’). However, when a unibyte
string is converted to a multibyte string, all characters with
codes in the range 160 through 255 are converted to characters with
higher codes, whereas ASCII characters remain unchanged. Thus, a
unibyte string and its conversion to multibyte are only ‘equal’ if
the string is all ASCII. Character codes 160 through 255 are not
entirely proper in multibyte text, even though they can occur. As
a consequence, the situation where a unibyte and a multibyte string
are ‘equal’ without both being all ASCII is a technical oddity that
very few Emacs Lisp programmers ever get confronted with. *Note
Text Representations::.
This was one of the significant changes in Emacs 23, and I think it is
the main factor for the changed behavior reported by Nelson.
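To make the conversion concrete, here is a short sketch of how the character code changes when a unibyte string is converted, assuming Emacs 23 or later. The helper names `bug5700-codes` and `bug5700-survives-conversion-p` are hypothetical:

```elisp
;; Hypothetical helpers, for illustration only.
(defun bug5700-codes (string)
  "Return the list of character codes in STRING."
  (append string nil))

(defun bug5700-survives-conversion-p (string)
  "Return non-nil if unibyte STRING is `string-equal' to its multibyte form."
  (string-equal string (string-to-multibyte string)))

(bug5700-codes "\377")                        ; → (255)
(bug5700-codes (string-to-multibyte "\377"))  ; → (4194303), i.e. #x3FFFFF
(bug5700-codes "\u00FF")                      ; → (255)

(bug5700-survives-conversion-p "abc")   ; → t, all ASCII
(bug5700-survives-conversion-p "\377")  ; → nil
```

The raw byte 255 is mapped to a code well above the Unicode range, so it no longer compares equal to the original unibyte string, whereas ASCII text is unaffected.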