* bug#5797: 23.1; search-forward in unibyte buffer for \377
@ 2010-03-29 15:09 rasmith
  0 siblings, 0 replies; only message in thread
From: rasmith @ 2010-03-29 15:09 UTC (permalink / raw)
  To: 5797


search-forward fails to find a unibyte \377 in a raw unibyte buffer.
I use "cgreek", a package written by Naoto Takahashi for handling
polytonic (ancient, fully accented) Greek.  It includes a file,
cgreek-tlg.el, for processing the files in the Thesaurus Linguae
Graecae, which have their own unique formats.  In these files, the
byte \377 is used as a string terminator.  Prior to Emacs 23, these
files could be processed by reading the file in with
insert-file-contents-literally, making the buffer unibyte with
(set-buffer-multibyte nil), and searching for the string terminator
with (search-forward (char-to-string ?\xff)).  However, that search
now fails to find a single byte \377 and instead matches on the
two-byte sequence \231\277.  

Changing the search function to (search-forward (unibyte-string ?\377))
has the same result.  

After further investigation, I'm not certain it's a bug: it may be an
intentional part of the modifications to accommodate utf-8.  Here are
the details:

In a multibyte buffer (set-buffer-multibyte t):
   
(search-forward (char-to-string ?\xff)) matches utf-8 "ÿ" (i.e. \303\277)
(search-forward (char-to-string ?\377)) matches utf-8 "ÿ"
(search-forward (unibyte-string ?\377)) matches byte \377

In a unibyte buffer (set-buffer-multibyte nil):

(search-forward (char-to-string ?\xff)) matches \231\277
(search-forward (char-to-string ?\377)) matches \231\277
(search-forward (unibyte-string ?\377)) matches \231\277

In other words, search-forward cannot find byte \377 when searching in
a *unibyte* buffer, but it can find that same byte if the buffer is
changed to multibyte.  The reason is that in a unibyte buffer,
search-forward apparently changes byte \377 to a two-byte
representation (but not to utf-8, which would be \303\277).  
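
For anyone who wants to repeat the comparison above, here is a minimal
sketch.  The helper name my-search-for-ff is made up for illustration,
and the scratch buffer holding the three bytes a, \377, b is an
assumption standing in for a TLG file read with
insert-file-contents-literally.  What it reports will likely depend on
the Emacs version.

```elisp
;; Hypothetical helper for illustration; not part of Emacs or cgreek.
(defun my-search-for-ff (needle multibyte)
  "Search for NEEDLE in a scratch buffer holding the bytes a, \\377, b.
If MULTIBYTE is non-nil, convert the buffer to multibyte before searching.
Return the position after the match, or nil if nothing matched."
  (with-temp-buffer
    (set-buffer-multibyte nil)             ; start as a raw unibyte buffer
    (insert (unibyte-string ?a #xff ?b))   ; a, byte \377, b
    (when multibyte
      (set-buffer-multibyte t))
    (goto-char (point-min))
    (search-forward needle nil t)))        ; t: return nil instead of signaling

;; The three search strings from the report, tried in both buffer modes.
(dolist (needle (list (char-to-string ?\xff)
                      (char-to-string ?\377)
                      (unibyte-string ?\377)))
  (message "%S  unibyte: %S  multibyte: %S"
           needle
           (my-search-for-ff needle nil)
           (my-search-for-ff needle t)))
```

Each line in *Messages* shows the search string followed by the two
results; nil means that search found nothing in that buffer mode.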

This may be exactly the intended behavior of search-forward, but it
breaks scripts expecting search-forward to be able to find a single
high 8-bit byte in a unibyte buffer.  In context, changing the buffer
to multibyte is not a solution.

The code in which I found this error can be fixed by replacing
    (search-forward (char-to-string ?\xff))
with
    (skip-chars-forward "^\377")
    (forward-char 1)
(fix provided by Naoto Takahashi)
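
The replacement can be wrapped in a small function so that call sites
stay readable.  This is only a sketch: the name
my-goto-after-terminator is hypothetical and appears in neither cgreek
nor Emacs, and it assumes the current buffer is unibyte.

```elisp
;; Hypothetical wrapper around the skip-chars-forward workaround.
(defun my-goto-after-terminator ()
  "Move point just past the next \\377 byte in a unibyte buffer.
Return the new position, or nil if no \\377 follows point."
  (skip-chars-forward "^\377")   ; stop at the next \377 or at end of buffer
  (unless (eobp)
    (forward-char 1)             ; step over the terminator itself
    (point)))
```

One difference from search-forward is worth keeping in mind: when no
terminator is left, skip-chars-forward quietly stops at the end of the
buffer, and a bare (forward-char 1) would then signal end-of-buffer.
The eobp check above turns that case into a nil return instead.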

However, that means that scripts counting on the old behavior of
search-forward will have to be modified. 


In GNU Emacs 23.1.1 (amd64-portbld-freebsd8.0, GTK+ Version 2.18.7)
 of 2010-03-25 on aristotle.tamu.edu
Windowing system distributor `The X.Org Foundation', version 11.0.10605000
configured using `configure  '--with-x-toolkit=gtk' '--x-libraries=/usr/local/lib' '--x-includes=/usr/local/include' '--prefix=/usr/local' '--mandir=/usr/local/man' '--infodir=/usr/local/info/' '--build=amd64-portbld-freebsd8.0' 'build_alias=amd64-portbld-freebsd8.0' 'CC=cc' 'CFLAGS=-O2 -pipe -fno-strict-aliasing' 'LDFLAGS=-L/usr/local/lib -lintl' 'CPPFLAGS=-I/usr/local/include''

Important settings:
  value of $LC_ALL: en_US.UTF-8
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
o <down> <down> <down> <return> C-q 0 0 0 <return> 
C-q 3 7 7 <return> <up> <up> <up> <left> <up> C-x C-e 
C-x o <down> <down> <down> <down> <backspace> <backspace> 
C-q 2 3 1 <return> ] <backspace> C-q 2 7 7 <return> 
<up> <up> <up> <up> C-e C-x C-e <up> <up> <left> C-x 
C-e <up> <up> <switch-frame> <down-mouse-1> <mouse-movement> 
<switch-frame> <mouse-1> <help-echo> <switch-frame> 
<switch-frame> <switch-frame> <switch-frame> <switch-frame> 
<switch-frame> <switch-frame> <switch-frame> <help-echo> 
<up> <up> <left> <up> <right> C-k C-y <return> C-y 
<left> <backspace> <backspace> <backspace> t <right> 
C-x C-e <down> <right> <right> <right> <right> <right> 
<right> <right> <right> <right> <right> <right> <right> 
<right> <right> <right> C-x C-e C-x o <down> C-x C-e 
<up> <up> <up> <left> <left> <left> <left> <return> 
<up> ( s e a r c h - f o r w a r d SPC ( c h a r - 
t o - s t r i o n g <backspace> <backspace> <backspace> 
g <backspace> g SPC <backspace> <backspace> n g SPC 
? \ x f f ) ) C-x C-e C-x o <up> <up> <down> <up> C-x 
C-e <down> <down> C-e C-x C-e <up> <up> <up> <up> C-e 
C-x C-e <up> <up> <left> C-x C-e <up> <up> <up> <up> 
<up> <up> C-e C-x C-e <down> C-e C-x C-e C-x o <down> 
<down> <down> <down> <down> <down> <return> C-q 3 7 
7 <return> <up> <up> <up> <up> <up> <up> <left> <left> 
C-x C-e <up> <up> <up> <up> <up> <up> <down> <left> 
<left> C-x C-e <up> <up> <up> <up> <left> C-x C-e <up> 
<up> <up> <up> <up> <left> <left> <left> <left> <left> 
C-x C-e <down> <down> C-e C-x C-e <up> <up> <up> <up> 
C-e C-x C-e <up> <up> <up> C-e C-x C-e <down> <switch-frame> 
<switch-frame> <help-echo> <help-echo> <help-echo> 
M-x r e p o r t <tab> b <tab> <return>

Recent messages:
Entering debugger...
326
Entering debugger...
nil
369 [3 times]
t
Entering debugger...
374 [2 times]
366
nil
369 [3 times]
