From: rasmith@tamu.edu
To: 5797@debbugs.gnu.org
Subject: bug#5797: 23.1; search-forward in unibyte buffer for \377
Date: Mon, 29 Mar 2010 10:09:19 -0500 (CDT)
Message-ID: <20100329.100919.319083499807539873.rasmith@aristotle.tamu.edu>


Please write in English if possible, because the Emacs maintainers
usually do not have translators to read other languages for them.

Your bug report will be posted to the bug-gnu-emacs@gnu.org mailing list,
and to the gnu.emacs.bug news group.

Please describe exactly what actions triggered the bug
and the precise symptoms of the bug:

search-forward fails to find a unibyte \377 in a raw unibyte buffer.
I use "cgreek", a package written by Naoto Takahashi for handling
polytonic (ancient, fully accented) Greek.  It includes a file,
cgreek-tlg.el, for processing the files in the Thesaurus Linguae
Graecae, which have their own unique formats.  In these files, the
byte \377 is used as a string terminator.  Prior to Emacs 23, these
files could be processed by reading the file in with
insert-file-contents-literally, making the buffer unibyte with
(set-buffer-multibyte nil), and searching for the string terminator
with (search-forward (char-to-string ?\xff)).  However, that search
now fails to find a single byte \377 and instead matches on the
two-byte sequence \231\277.  

Changing the search function to (search-forward (unibyte-string ?\377))
has the same result.  
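
A minimal sketch for reproducing this without the TLG files.  The
helper name my/search-byte-377 is made up for illustration and is not
part of cgreek; the four-byte buffer stands in for a file read with
insert-file-contents-literally.  If the behavior reported here is
present, the calls return nil; a successful match would return 4, the
position just after the third byte.

```elisp
;; Hypothetical helper for illustration; not part of cgreek-tlg.el.
(defun my/search-byte-377 (needle)
  "Search for NEEDLE in a small unibyte buffer holding one byte 255.
Return the position after the match, or nil if the search fails."
  (with-temp-buffer
    (set-buffer-multibyte nil)                ; unibyte, as in the TLG case
    (insert (unibyte-string ?a ?b #o377 ?c))  ; bytes: a b \377 c
    (goto-char (point-min))
    (search-forward needle nil t)))           ; t: return nil, do not signal

;; Try both ways of building the search string.
(list (my/search-byte-377 (char-to-string ?\xff))
      (my/search-byte-377 (unibyte-string ?\377)))
```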

After further investigation, I am not certain this is a bug: it may be
an intentional part of the changes made to accommodate utf-8.  Here
are the details:

In a multibyte buffer (set-buffer-multibyte t):

(search-forward (char-to-string ?\xff)) matches utf-8 "ÿ" (i.e. \303\277)
(search-forward (char-to-string ?\377)) matches utf-8 "ÿ"
(search-forward (unibyte-string ?\377)) matches byte \377

In a unibyte buffer (set-buffer-multibyte nil):

(search-forward (char-to-string ?\xff)) matches \231\277
(search-forward (char-to-string ?\377)) matches \231\277
(search-forward (unibyte-string ?\377)) matches \231\277

In other words, search-forward cannot find byte \377 when searching in
a *unibyte* buffer, but it can find that same byte if the buffer is
changed to multibyte.  The reason is that in a unibyte buffer,
search-forward apparently changes byte \377 to a two-byte
representation (but not to utf-8, which would be \303\277).  
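
To make the distinction between the two kinds of search string
concrete, here is a small sketch that inspects them directly.  The
function name my/describe-needles is hypothetical, and the values in
the comments assume Emacs 23 or later, where char-to-string returns a
multibyte string for character 255.  It only shows how the strings
differ; it does not explain what search-forward does with them
internally.

```elisp
;; Hypothetical helper for illustration only.
(defun my/describe-needles ()
  "Return facts about the two ways of writing the terminator string."
  (let ((multi (char-to-string ?\xff))   ; the character U+00FF
        (raw   (unibyte-string ?\377)))  ; one raw byte, value 255
    (list (multibyte-string-p multi)     ; t
          (multibyte-string-p raw)       ; nil
          ;; UTF-8 bytes of U+00FF: (195 191), i.e. \303\277
          (string-to-list (encode-coding-string multi 'utf-8))
          ;; the raw string holds the single byte: (255)
          (string-to-list raw))))

(my/describe-needles)
```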

This may be exactly the intended behavior of search-forward, but it
breaks scripts expecting search-forward to be able to find a single
high 8-bit byte in a unibyte buffer.  In context, changing the buffer
to multibyte is not a solution.

The code in which I found this error can be fixed by replacing
    (search-forward (char-to-string ?\xff))
with
    (skip-chars-forward "^\377")
    (forward-char 1)
(fix provided by Naoto Takahashi)

However, that means that scripts counting on the old behavior of
search-forward will have to be modified. 
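
For scripts that need modifying, the workaround above could be wrapped
in one function, roughly as follows.  This is only a sketch: the name
my/forward-past-byte-377 is made up, and unlike search-forward it
leaves point at the end of the buffer when no terminator is found.

```elisp
;; Hypothetical wrapper around the skip-chars-forward workaround.
(defun my/forward-past-byte-377 ()
  "Move point just past the next byte 255 in the current unibyte buffer.
Return the new position, or nil if no such byte follows point."
  (skip-chars-forward "^\377")  ; stop in front of the first \377, if any
  (unless (eobp)
    (forward-char 1)            ; step over the terminator itself
    (point)))

;; Usage on a four-byte unibyte buffer: a b \377 c
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert (unibyte-string ?a ?b #o377 ?c))
  (goto-char (point-min))
  (my/forward-past-byte-377))   ; point ends up after the \377
```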

If Emacs crashed, and you have the Emacs process in the gdb debugger,
please include the output from the following gdb commands:
    `bt full' and `xbacktrace'.
If you would like to further debug the crash, please read the file
/usr/local/share/emacs/23.1/etc/DEBUG for instructions.


In GNU Emacs 23.1.1 (amd64-portbld-freebsd8.0, GTK+ Version 2.18.7)
 of 2010-03-25 on aristotle.tamu.edu
Windowing system distributor `The X.Org Foundation', version 11.0.10605000
configured using `configure  '--with-x-toolkit=gtk' '--x-libraries=/usr/local/lib' '--x-includes=/usr/local/include' '--prefix=/usr/local' '--mandir=/usr/local/man' '--infodir=/usr/local/info/' '--build=amd64-portbld-freebsd8.0' 'build_alias=amd64-portbld-freebsd8.0' 'CC=cc' 'CFLAGS=-O2 -pipe -fno-strict-aliasing' 'LDFLAGS=-L/usr/local/lib -lintl' 'CPPFLAGS=-I/usr/local/include''

Important settings:
  value of $LC_ALL: en_US.UTF-8
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
o <down> <down> <down> <return> C-q 0 0 0 <return> 
C-q 3 7 7 <return> <up> <up> <up> <left> <up> C-x C-e 
C-x o <down> <down> <down> <down> <backspace> <backspace> 
C-q 2 3 1 <return> ] <backspace> C-q 2 7 7 <return> 
<up> <up> <up> <up> C-e C-x C-e <up> <up> <left> C-x 
C-e <up> <up> <switch-frame> <down-mouse-1> <mouse-movement> 
<switch-frame> <mouse-1> <help-echo> <switch-frame> 
<switch-frame> <switch-frame> <switch-frame> <switch-frame> 
<switch-frame> <switch-frame> <switch-frame> <help-echo> 
<up> <up> <left> <up> <right> C-k C-y <return> C-y 
<left> <backspace> <backspace> <backspace> t <right> 
C-x C-e <down> <right> <right> <right> <right> <right> 
<right> <right> <right> <right> <right> <right> <right> 
<right> <right> <right> C-x C-e C-x o <down> C-x C-e 
<up> <up> <up> <left> <left> <left> <left> <return> 
<up> ( s e a r c h - f o r w a r d SPC ( c h a r - 
t o - s t r i o n g <backspace> <backspace> <backspace> 
g <backspace> g SPC <backspace> <backspace> n g SPC 
? \ x f f ) ) C-x C-e C-x o <up> <up> <down> <up> C-x 
C-e <down> <down> C-e C-x C-e <up> <up> <up> <up> C-e 
C-x C-e <up> <up> <left> C-x C-e <up> <up> <up> <up> 
<up> <up> C-e C-x C-e <down> C-e C-x C-e C-x o <down> 
<down> <down> <down> <down> <down> <return> C-q 3 7 
7 <return> <up> <up> <up> <up> <up> <up> <left> <left> 
C-x C-e <up> <up> <up> <up> <up> <up> <down> <left> 
<left> C-x C-e <up> <up> <up> <up> <left> C-x C-e <up> 
<up> <up> <up> <up> <left> <left> <left> <left> <left> 
C-x C-e <down> <down> C-e C-x C-e <up> <up> <up> <up> 
C-e C-x C-e <up> <up> <up> C-e C-x C-e <down> <switch-frame> 
<switch-frame> <help-echo> <help-echo> <help-echo> 
M-x r e p o r t <tab> b <tab> <return>

Recent messages:
Entering debugger...
326
Entering debugger...
nil
369 [3 times]
t
Entering debugger...
374 [2 times]
366
nil
369 [3 times]






