all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
* bug#5700: emacs-23 and 8-bit characters in 128..255
@ 2010-03-09 19:51 Nelson H. F. Beebe
  2010-03-09 22:02 ` Stefan Monnier
  0 siblings, 1 reply; 4+ messages in thread
From: Nelson H. F. Beebe @ 2010-03-09 19:51 UTC (permalink / raw)
  To: 5700; +Cc: beebe

When emacs-23 came out and I began to use it, I noticed problems in
some of my extensive locally-written emacs code.

I've been far too busy to try to track down why, and sometimes, the
problems were resolved simply by rerunning byte-compile-file.

This morning, I set out to track down the source of one of the
problems in a function that I use a lot, and eventually narrowed it to
the failure of functions like these:

    (string-equal (buffer-substring (point) (1+ (point))) "\377")
    (looking-at "\377")

In emacs-22 and earlier, if the character at point is octal 377
(decimal 255, hexadecimal 0xff), this function returns t.  In
emacs-23, it returns nil.  Further testing shows identical behavior
for characters in the decimal range 128--255 (octal \200--\377).

I suspect the reason is this comment in the NEWS file:

    The internal encoding used for buffers and strings is now
    Unicode-based and called `utf-8-emacs' (`emacs-internal' is an alias
    for this).  This encoding is backward-compatible with Unicode's UTF-8
    encoding.  The internal encoding previously used by Emacs,
    `emacs-mule', is still available for reading and writing files.

The code in question uses the character ?\377 as a unique sentinel
that terminates the function's processing.  It needs to be a
nonprintable character that is not use in normal text files, and I
found that changing it to ?\177 (ASCII DELete) made the code work
properly.  That change is transparent to older emacs versions, so in
this case, it is harmless.  Nevertheless, since the technique of using
data sentinels is an ancient practice in many programing languages, I
suspect that my own code is not the only Emacs Lisp code to be
affected by the change.

The question for this list is this:

    If UTF-8 is used internally in the buffer text, then why are
    numeric representations of unprintable characters in search
    strings apparently not translated the same way?

In all of my Emacs Lisp source code files, the character set is plain
ASCII, which is a proper subset of UTF-8, requiring only a single byte
per character.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------







^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-07-07 16:21 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-09 19:51 bug#5700: emacs-23 and 8-bit characters in 128..255 Nelson H. F. Beebe
2010-03-09 22:02 ` Stefan Monnier
2016-07-06 23:52   ` npostavs
2016-07-07 16:21     ` Eli Zaretskii

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.