bug#5700: emacs-23 and 8-bit characters in 128..255

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#5700: emacs-23 and 8-bit characters in 128..255
@ 2010-03-09 19:51 Nelson H. F. Beebe
  2010-03-09 22:02 ` Stefan Monnier
  0 siblings, 1 reply; 4+ messages in thread
From: Nelson H. F. Beebe @ 2010-03-09 19:51 UTC (permalink / raw)
  To: 5700; +Cc: beebe

When emacs-23 came out and I began to use it, I noticed problems in
some of my extensive locally-written emacs code.

I've been far too busy to try to track down why, and sometimes, the
problems were resolved simply by rerunning byte-compile-file.

This morning, I set out to track down the source of one of the
problems in a function that I use a lot, and eventually narrowed it to
the failure of functions like these:

    (string-equal (buffer-substring (point) (1+ (point))) "\377")
    (looking-at "\377")

In emacs-22 and earlier, if the character at point is octal 377
(decimal 255, hexadecimal 0xff), both expressions return t.  In
emacs-23, they return nil.  Further testing shows identical behavior
for characters in the decimal range 128--255 (octal \200--\377).
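
For readers who want to reproduce the comparison, here is a minimal
sketch.  It assumes Emacs 23 or later and a multibyte buffer (the
default for `with-temp-buffer').  The helper name
`bug5700-compare-at-point' is made up for this example and is not part
of the original code.  Going by the rest of this thread, the first
call should return nil and the second t.

```elisp
;; Hypothetical helper for illustration; not from the original report.
(defun bug5700-compare-at-point (literal)
  "Compare the character at point with LITERAL using `string-equal'.
Character code 255 is inserted into a temporary multibyte buffer
and the one-character substring at point is compared with LITERAL."
  (with-temp-buffer
    (insert ?\377)                      ; character code 255
    (goto-char (point-min))
    (string-equal (buffer-substring (point) (1+ (point))) literal)))

(bug5700-compare-at-point "\377")    ; unibyte literal, as in the report
(bug5700-compare-at-point "\u00FF")  ; multibyte literal
```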

I suspect the reason is this comment in the NEWS file:

    The internal encoding used for buffers and strings is now
    Unicode-based and called `utf-8-emacs' (`emacs-internal' is an alias
    for this).  This encoding is backward-compatible with Unicode's UTF-8
    encoding.  The internal encoding previously used by Emacs,
    `emacs-mule', is still available for reading and writing files.

The code in question uses the character ?\377 as a unique sentinel
that terminates the function's processing.  It needs to be a
nonprintable character that is not used in normal text files, and I
found that changing it to ?\177 (ASCII DELete) made the code work
properly.  That change is transparent to older emacs versions, so in
this case, it is harmless.  Nevertheless, since the technique of using
data sentinels is an ancient practice in many programming languages, I
suspect that my own code is not the only Emacs Lisp code to be
affected by the change.
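
The sentinel workaround can be sketched as follows.  This is only an
illustration under the assumption of Emacs 23 or later and a multibyte
buffer; `bug5700-sentinel-found-p' is a hypothetical name.  An ASCII
sentinel such as DEL has the same code in unibyte and multibyte text,
so the search pattern and the buffer text agree, whereas the "\377"
pattern is the case described as failing above.

```elisp
;; Hypothetical helper for illustration only.
(defun bug5700-sentinel-found-p (char pattern)
  "Return non-nil if PATTERN matches CHAR at the start of a buffer.
CHAR is inserted into a temporary multibyte buffer and PATTERN
is tested there with `looking-at'."
  (with-temp-buffer
    (insert char)
    (goto-char (point-min))
    (looking-at pattern)))

(bug5700-sentinel-found-p ?\177 "\177")  ; ASCII DEL sentinel
(bug5700-sentinel-found-p ?\377 "\377")  ; the sentinel from the report
```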

The question for this list is this:

    If UTF-8 is used internally in the buffer text, then why are
    numeric representations of unprintable characters in search
    strings apparently not translated the same way?

In all of my Emacs Lisp source code files, the character set is plain
ASCII, which is a proper subset of UTF-8, requiring only a single byte
per character.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------


* bug#5700: emacs-23 and 8-bit characters in 128..255
  2010-03-09 19:51 bug#5700: emacs-23 and 8-bit characters in 128..255 Nelson H. F. Beebe
@ 2010-03-09 22:02 ` Stefan Monnier
  2016-07-06 23:52   ` npostavs
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan Monnier @ 2010-03-09 22:02 UTC (permalink / raw)
  To: Nelson H. F. Beebe; +Cc: 5700

> This morning, I set out to track down the source of one of the
> problems in a function that I use a lot, and eventually narrowed it to
> the failure of functions like these:

>     (string-equal (buffer-substring (point) (1+ (point))) "\377")

Indeed, we have a problem:

  (string-equal "\377" (string-to-multibyte "\377"))

returned t in Emacs-22 but returns nil in Emacs-23.  Another (somewhat
related) problem is that under Emacs-22, we had:

  "\377"                        prints as    "\377"
  "\xff"                        prints as    "\xff"
  (multibyte-string-p "\377")   prints as    "\xff"

which seems acceptable, whereas under Emacs-23 we have:

  "\377"                        prints as    "ÿ"
  "\xff"                        prints as    "ÿ"
  (multibyte-string-p "\377")   prints as    "\377"

which looks rather confusing.
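
Since the printed forms are hard to tell apart, it may be easier to
look at the representation directly.  The sketch below assumes Emacs
23 or later; `bug5700-describe-string' is a name invented for this
example.  It reports whether a string is multibyte, plus its length in
characters and in bytes.

```elisp
;; Hypothetical helper; the name is chosen for this example.
(defun bug5700-describe-string (s)
  "Return a list (MULTIBYTE-P CHARS BYTES) describing string S."
  (list (multibyte-string-p s) (length s) (string-bytes s)))

(bug5700-describe-string "\377")                        ; unibyte literal
(bug5700-describe-string (string-to-multibyte "\377"))  ; converted copy
(string-equal "\377" (string-to-multibyte "\377"))      ; the comparison above
```

The two strings have the same length in characters but differ in
representation, which is consistent with `string-equal' returning nil.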

>     (looking-at "\377")

This is probably a separate bug.

>     The internal encoding used for buffers and strings is now
>     Unicode-based and called `utf-8-emacs' (`emacs-internal' is an alias

It is related, but only to the extent that a lot of the code that
handles multibyte chars (and especially "eight-bit chars") was
completely rewritten, and this is a very delicate area.


        Stefan







* bug#5700: emacs-23 and 8-bit characters in 128..255
  2010-03-09 22:02 ` Stefan Monnier
@ 2016-07-06 23:52   ` npostavs
  2016-07-07 16:21     ` Eli Zaretskii
  0 siblings, 1 reply; 4+ messages in thread
From: npostavs @ 2016-07-06 23:52 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Nelson H. F. Beebe, 5700

tags 5700 notabug
quit

With Emacs 24/25, using "\u00FF" works:

(string-equal (buffer-substring (point) (1+ (point))) "\u00FF")
(looking-at "\u00FF")

Seems to be another instance of the unibyte vs multibyte string escape syntax thing:

       You can also use hexadecimal escape sequences (‘\xN’) and octal
    escape sequences (‘\N’) in string constants.  *But beware:* If a
    string constant contains hexadecimal or octal escape sequences, and
    these escape sequences all specify unibyte characters (i.e., less
    than 256), and there are no other literal non-ASCII characters or
    Unicode-style escape sequences in the string, then Emacs
    automatically assumes that it is a unibyte string.  That is to say,
    it assumes that all non-ASCII characters occurring in the string are
    8-bit raw bytes.
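
A small sketch of what the manual text above means in practice,
assuming Emacs 23 or later.  The variable name
`bug5700-escape-samples' is hypothetical.  Each entry pairs a label
with the result of `multibyte-string-p' for a one-character string
constant written with a different escape syntax.

```elisp
;; Hypothetical variable for illustration only.
(defvar bug5700-escape-samples
  (list (cons "octal"   (multibyte-string-p "\377"))
        (cons "hex"     (multibyte-string-p "\xff"))
        (cons "unicode" (multibyte-string-p "\u00FF")))
  "Alist of (SYNTAX . MULTIBYTE-P) for three spellings of code 255.")

bug5700-escape-samples
```

Only the Unicode-style escape yields a multibyte string; the octal and
hexadecimal forms are read as unibyte strings holding a raw byte.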

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
> which seems acceptable, whereas under Emacs-23 we have:
>
[...]
>   (multibyte-string-p "\377")   prints as    "\377"

In 23.4 it returns nil






* bug#5700: emacs-23 and 8-bit characters in 128..255
  2016-07-06 23:52   ` npostavs
@ 2016-07-07 16:21     ` Eli Zaretskii
  0 siblings, 0 replies; 4+ messages in thread
From: Eli Zaretskii @ 2016-07-07 16:21 UTC (permalink / raw)
  To: npostavs; +Cc: beebe, monnier, 5700

> From: npostavs@users.sourceforge.net
> Date: Wed, 06 Jul 2016 19:52:16 -0400
> Cc: "Nelson H. F. Beebe" <beebe@math.utah.edu>, 5700@debbugs.gnu.org
> 
> With Emacs 24/25, using "\u00FF" works:
> 
> (string-equal (buffer-substring (point) (1+ (point))) "\u00FF")
> (looking-at "\u00FF")
> 
> Seems to be another instance of the unibyte vs multibyte string escape syntax thing:
> 
>        You can also use hexadecimal escape sequences (‘\xN’) and octal
>     escape sequences (‘\N’) in string constants.  *But beware:* If a
>     string constant contains hexadecimal or octal escape sequences, and
>     these escape sequences all specify unibyte characters (i.e., less
>     than 256), and there are no other literal non-ASCII characters or
>     Unicode-style escape sequences in the string, then Emacs
>     automatically assumes that it is a unibyte string.  That is to say,
>     it assumes that all non-ASCII characters occurring in the string are
>     8-bit raw bytes.
> 
> Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
> > which seems acceptable, whereas under Emacs-23 we have:
> >
> [...]
> >   (multibyte-string-p "\377")   prints as    "\377"
> 
> In 23.4 it returns nil

Yes.

The other significant piece of the puzzle is described in this text
from the ELisp manual:

     For technical reasons, a unibyte and a multibyte string are ‘equal’
     if and only if they contain the same sequence of character codes
     and all these codes are either in the range 0 through 127 (ASCII)
     or 160 through 255 (‘eight-bit-graphic’).  However, when a unibyte
     string is converted to a multibyte string, all characters with
     codes in the range 160 through 255 are converted to characters with
     higher codes, whereas ASCII characters remain unchanged.  Thus, a
     unibyte string and its conversion to multibyte are only ‘equal’ if
     the string is all ASCII.  Character codes 160 through 255 are not
     entirely proper in multibyte text, even though they can occur.  As
     a consequence, the situation where a unibyte and a multibyte string
     are ‘equal’ without both being all ASCII is a technical oddity that
     very few Emacs Lisp programmers ever get confronted with.  *Note
     Text Representations::.

This was one of the significant changes in Emacs 23, and I think it is
the main factor for the changed behavior reported by Nelson.
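
To make the conversion concrete, here is a hedged sketch assuming
Emacs 23 or later.  The helper `bug5700-latin1' is hypothetical, and
treating the byte as Latin-1 is an assumption about what the code
intends.  As far as I understand it, `string-to-multibyte' maps the
byte 255 to a raw-byte character with a much higher code, while
decoding the same byte as Latin-1 gives the character U+00FF.

```elisp
;; Hypothetical helper; assumes the byte is meant as Latin-1 text.
(defun bug5700-latin1 (bytes)
  "Decode unibyte string BYTES as Latin-1 into a multibyte string."
  (decode-coding-string bytes 'latin-1))

(string-to-char "\377")                          ; byte value in a unibyte string
(string-to-char (string-to-multibyte "\377"))    ; raw-byte character, code above 255
(string-to-char (bug5700-latin1 "\377"))         ; character U+00FF
(string-equal (bug5700-latin1 "\377") "\u00FF")  ; both multibyte, same character
```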




