* search-forward in emacs23 lisp
@ 2010-03-27 20:31 rasmith
  2010-03-28 16:39 ` rasmith
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: rasmith @ 2010-03-27 20:31 UTC (permalink / raw)
  To: help-gnu-emacs

The behavior of the search-forward function in emacs-lisp has changed
in emacs23 in a way that breaks some scripts I use, in particular
cgreek-tlg.el from Naoto Takahashi's cgreek package.  This package
includes facilities for reading files in the Thesaurus Linguae Graecae
(TLG) containing both Greek texts and data about those texts, each in
a format unique to the TLG.  Parsing those files requires reading them
into a buffer literally and searching for strings terminated by \xff
(byte 255).  Under emacs22, this only required 
    (search-forward (char-to-string ?\xff))
However, under emacs23, char-to-string with an 8-bit argument (128
through 255) now returns a two-byte string (\x00\xff).  So, these
searches fail.  I tried changing to unibyte-string.  In fact, 
    (unibyte-string ?\377) 
does return a string containing just one byte (255), as I've verified
with what-cursor-position.  However, 
    (search-forward (unibyte-string ?\377)) 
doesn't match an occurrence of 255.  Instead, it matches on the two-byte
string \231\277 (\x99bf).  That two-byte sequence doesn't appear to me
to be a possible Unicode character (I thought the utf-8 representation
of 255 would be \0xc1\0x3f).  Perhaps this is something peculiar to
utf-8-emacs? 
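
For reference, the two strings can be inspected directly rather than through what-cursor-position. This is only a sketch, assuming Emacs 23 or later; the function name my-describe-byte-255 is hypothetical and not part of cgreek. It reports whether each string is multibyte, how many bytes it occupies internally, and what the UTF-8 encoding of character 255 looks like.

```elisp
;; Hypothetical helper for illustration only (not from cgreek-tlg.el).
(defun my-describe-byte-255 ()
  "Return facts about the two ways of building a string from 255."
  (let ((multi (char-to-string #xff))    ; the character U+00FF
        (uni   (unibyte-string #xff)))   ; the raw byte 255
    (list (multibyte-string-p multi)     ; is the first string multibyte?
          (string-bytes multi)           ; bytes it occupies internally
          (multibyte-string-p uni)       ; is the second string multibyte?
          (string-bytes uni)             ; bytes it occupies internally
          ;; UTF-8 encoding of U+00FF, returned as a unibyte string
          (encode-coding-string multi 'utf-8))))

(my-describe-byte-255)
```

Evaluating the last form (for example with C-x C-e) shows the difference between a one-character multibyte string and a one-byte unibyte string, which is the distinction the searches above depend on.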

If I move to the buffer that contains the data to be parsed (which has
its multibyte flag set to nil), then 
(search-forward (unibyte-string ?\377)) behaves as above.  However, in
that same buffer, a keyboard isearch-forward for \377 finds a \377
with no problem.  

So, what I need to know is: is there a way to make search-forward find
a single 8-bit byte between 128 and 255?


Robin Smith
Department of Philosophy           rasmith@tamu.edu
Texas A&M University               http://aristotle.tamu.edu/~rasmith/
4237 TAMU                          Voice +1 979 845 5679
College Station, TX 77843-4237     FAX +1 979 845 0458





* Re: search-forward in emacs23 lisp
  2010-03-27 20:31 search-forward in emacs23 lisp rasmith
@ 2010-03-28 16:39 ` rasmith
  2010-03-28 16:50   ` Lennart Borgman
  2010-03-28 21:45 ` Peter Dyballa
  2010-03-28 23:00 ` Johan Bockgård
  2 siblings, 1 reply; 14+ messages in thread
From: rasmith @ 2010-03-28 16:39 UTC (permalink / raw)
  To: help-gnu-emacs

From: rasmith@tamu.edu
Subject: search-forward in emacs23 lisp
Date: Sat, 27 Mar 2010 15:31:48 -0500 (CDT)

Sorry to reply to my own post, but the following rather ugly solution
solves the problem of finding a single FF byte:
      (while (/= (char-after) ?\377)
        (forward-char 1))
      (forward-char 1)
This replaces 
      (search-forward (unibyte-string ?\377))
which, in emacs23, no matter what I do, insists on turning the byte
into the two-byte string \231\277 before searching.

But surely there's a better way?

Robin Smith





* Re: search-forward in emacs23 lisp
  2010-03-28 16:39 ` rasmith
@ 2010-03-28 16:50   ` Lennart Borgman
  2010-03-28 17:04     ` rasmith
  0 siblings, 1 reply; 14+ messages in thread
From: Lennart Borgman @ 2010-03-28 16:50 UTC (permalink / raw)
  To: rasmith; +Cc: help-gnu-emacs

On Sun, Mar 28, 2010 at 6:39 PM,  <rasmith@tamu.edu> wrote:
> Sorry to reply to my own post, but the following rather ugly solution
> solves the problem of finding a single FF byte:
>      (while (/= (char-after) ?\377)
>        (forward-char 1)
>        )
>      (forward-char 1)
> This replaces
>      (search-forward (unibyte-string ?\377))
> which, in emacs23, no matter what I do, insists on turning the byte
> into the two-byte string \231\277 before searching.
>
> But surely there's a better way?

Hi Robin,

Someone else knows this much better than me and can explain the
details, but I believe that unibyte-string is a low level function
that you do not need here.

How about just

   (search-forward (char-to-string ?\377))
   or (search-forward (char-to-string 255))

Does that work for you?





* Re: search-forward in emacs23 lisp
  2010-03-28 16:50   ` Lennart Borgman
@ 2010-03-28 17:04     ` rasmith
  2010-03-28 17:10       ` Lennart Borgman
  0 siblings, 1 reply; 14+ messages in thread
From: rasmith @ 2010-03-28 17:04 UTC (permalink / raw)
  To: lennart.borgman; +Cc: help-gnu-emacs

From: Lennart Borgman <lennart.borgman@gmail.com>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 18:50:46 +0200

> On Sun, Mar 28, 2010 at 6:39 PM,  <rasmith@tamu.edu> wrote:
>> Sorry to reply to my own post, but the following rather ugly solution
>> solves the problem of finding a single FF byte:
>>      (while (/= (char-after) ?\377)
>>        (forward-char 1)
>>        )
>>      (forward-char 1)
>> This replaces
>>      (search-forward (unibyte-string ?\377))
>> which, in emacs23, no matter what I do, insists on turning the byte
>> into the two-byte string \231\277 before searching.
>>
>> But surely there's a better way?
> 
> Hi Robin,
> 
> Someone else knows this much better than me and can explain the
> details, but I believe that unibyte-string is a low level function
> that you do not need here.
> 
> How about just
> 
>    (search-forward (char-to-string ?\377))
>    or (search-forward (char-to-string 255))
> 
> Does that work for you?

Nope.  That's exactly what caused the original problem (that is, the
code that broke was exactly what you suggest).  Using either one of
these, what search-forward looks for is a two-byte string.  In other
words, it undertakes to convert the high 8-bit character into
something like a utf-8 representation of it (\377 can't occur as the
first byte of a utf-8 character, which is probably what triggers
this).

Robin Smith





* Re: search-forward in emacs23 lisp
  2010-03-28 17:04     ` rasmith
@ 2010-03-28 17:10       ` Lennart Borgman
  2010-03-28 17:56         ` rasmith
  2010-03-28 17:59         ` rasmith
  0 siblings, 2 replies; 14+ messages in thread
From: Lennart Borgman @ 2010-03-28 17:10 UTC (permalink / raw)
  To: rasmith; +Cc: help-gnu-emacs

On Sun, Mar 28, 2010 at 7:04 PM,  <rasmith@tamu.edu> wrote:
> From: Lennart Borgman <lennart.borgman@gmail.com>
> Subject: Re: search-forward in emacs23 lisp
> Date: Sun, 28 Mar 2010 18:50:46 +0200
>
>> On Sun, Mar 28, 2010 at 6:39 PM,  <rasmith@tamu.edu> wrote:
>>> Sorry to reply to my own post, but the following rather ugly solution
>>> solves the problem of finding a single FF byte:
>>>      (while (/= (char-after) ?\377)
>>>        (forward-char 1)
>>>        )
>>>      (forward-char 1)
>>> This replaces
>>>      (search-forward (unibyte-string ?\377))
>>> which, in emacs23, no matter what I do, insists on turning the byte
>>> into the two-byte string \231\277 before searching.
>>>
>>> But surely there's a better way?
>>
>> Hi Robin,
>>
>> Someone else knows this much better than me and can explain the
>> details, but I believe that unibyte-string is a low level function
>> that you do not need here.
>>
>> How about just
>>
>>    (search-forward (char-to-string ?\377))
>>    or (search-forward (char-to-string 255))
>>
>> Does that work for you?
>
> Nope.  That's exactly what caused the original problem (that is, the
> code that broke was exactly what you suggest).  Using either one of
> these, what search-forward will look for is a two-byte string (in
> other words, it undertakes to convert the high 8-bit character into
> something like a utf-8 representation of it (\377 can't occur as the
> first byte of a utf-8 character, which is probably what triggers
> this).


Oh, sorry. I read your first message now. It looks like you have found
a problem with search-forward in this case and a bug in isearch. I
suggest that you file a bug report.





* Re: search-forward in emacs23 lisp
  2010-03-28 17:10       ` Lennart Borgman
@ 2010-03-28 17:56         ` rasmith
  2010-03-28 17:59         ` rasmith
  1 sibling, 0 replies; 14+ messages in thread
From: rasmith @ 2010-03-28 17:56 UTC (permalink / raw)
  To: lennart.borgman; +Cc: help-gnu-emacs

From: Lennart Borgman <lennart.borgman@gmail.com>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 19:10:59 +0200



>> Nope.  That's exactly what caused the original problem (that is, the
>> code that broke was exactly what you suggest).  Using either one of
>> these, what search-forward will look for is a two-byte string (in
>> other words, it undertakes to convert the high 8-bit character into
>> something like a utf-8 representation of it (\377 can't occur as the
>> first byte of a utf-8 character, which is probably what triggers
>> this).
> 
> 
> Oh, sorry. I read your first message now. It looks like you have found
> a problem with search-forward in this case and a bug in isearch. I
> suggest that you file a bug report.

I'll do that.  To say a little more about the problem:
   (char-to-string ?\xff)
produces a *two-byte* string, \0x00\0xff, while 
   (unibyte-string ?\377) 
produces a *one-byte* string, as it should.  However, when *either*
of these is given as an argument to search-forward, what it actually
searches for is the *two-byte* string \231\277.  I don't really see
where that's coming from, since I thought the utf-8 representation of
\377 was \303\077 (\xc33f).  I know that emacs23 uses a default
internal format with the name utf-8-emacs for buffers, but I don't
know its details.

Robin Smith





* Re: search-forward in emacs23 lisp
  2010-03-28 17:10       ` Lennart Borgman
  2010-03-28 17:56         ` rasmith
@ 2010-03-28 17:59         ` rasmith
  2010-03-28 18:22           ` Lennart Borgman
  1 sibling, 1 reply; 14+ messages in thread
From: rasmith @ 2010-03-28 17:59 UTC (permalink / raw)
  To: lennart.borgman; +Cc: help-gnu-emacs

From: Lennart Borgman <lennart.borgman@gmail.com>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 19:10:59 +0200


> Oh, sorry. I read your first message now. It looks like you have found
> a problem with search-forward in this case and a bug in isearch. I
> suggest that you file a bug report.

And as one last addition, I don't think there's any problem with
isearch: C-s C-q 3 7 7 finds byte 255 with no problem at all.  The bug
(if that's what it is) is in search-forward.  

Robin Smith





* Re: search-forward in emacs23 lisp
  2010-03-28 17:59         ` rasmith
@ 2010-03-28 18:22           ` Lennart Borgman
  0 siblings, 0 replies; 14+ messages in thread
From: Lennart Borgman @ 2010-03-28 18:22 UTC (permalink / raw)
  To: rasmith; +Cc: help-gnu-emacs

On Sun, Mar 28, 2010 at 7:59 PM,  <rasmith@tamu.edu> wrote:
> From: Lennart Borgman <lennart.borgman@gmail.com>
> Subject: Re: search-forward in emacs23 lisp
> Date: Sun, 28 Mar 2010 19:10:59 +0200
>
>
>> Oh, sorry. I read your first message now. It looks like you have found
>> a problem with search-forward in this case and a bug in isearch. I
>> suggest that you file a bug report.
>
> And as one last addition, I don't think there's any problem with
> isearch: C-s C-q 3 7 7 finds byte 255 with no problem at all.  The bug
> (if that's what it is) is in search-forward.

The isearch-forward doc string says

   Type C-q to quote control character to search for it.

Here you are searching for a byte, not a character. So I think it is a bug.

But maybe you do not want to file a bug report for this?





* Re: search-forward in emacs23 lisp
  2010-03-27 20:31 search-forward in emacs23 lisp rasmith
  2010-03-28 16:39 ` rasmith
@ 2010-03-28 21:45 ` Peter Dyballa
  2010-03-29  0:44   ` rasmith
  2010-03-28 23:00 ` Johan Bockgård
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Dyballa @ 2010-03-28 21:45 UTC (permalink / raw)
  To: rasmith; +Cc: help-gnu-emacs


On 27.03.2010 at 21:31, rasmith wrote:

> The behavior of the search-forward function in emacs-lisp has changed
> in emacs23 in a way that breaks some scripts I use, in particular
> cgreek-tlg.el from Naoto Takahashi's cgreek package.


Maybe the problem is simply that the buffer is in UTF-8. Then it
really makes no sense to search for that byte, because it does not
exist on its own, like a quark (although baryons and mesons are built
from them); only the two-byte sequence \xc3\xbf exists (standing for
ÿ, LATIN SMALL LETTER Y WITH DIAERESIS). Clearly, you can't search for
what does not exist, unless you're Lancelot.

Which coding is used in the buffer? Can you switch to a (raw)
byte-based encoding and test in this state?

--
Greetings

   Pete

I wouldn't recommend sex, drugs or insanity for everyone, but they've  
always worked for me.
				– Hunter S. Thompson






* Re: search-forward in emacs23 lisp
  2010-03-27 20:31 search-forward in emacs23 lisp rasmith
  2010-03-28 16:39 ` rasmith
  2010-03-28 21:45 ` Peter Dyballa
@ 2010-03-28 23:00 ` Johan Bockgård
  2010-03-29  6:51   ` Eli Zaretskii
  2 siblings, 1 reply; 14+ messages in thread
From: Johan Bockgård @ 2010-03-28 23:00 UTC (permalink / raw)
  To: help-gnu-emacs

rasmith@tamu.edu writes:

> If I move to the buffer that contains the data to be parsed (which has
> its multibyte flag set to nil), then 
> (search-forward (unibyte-string ?\377)) behaves as above.  However, in
> that same buffer, a keyboard isearch-forward for \377 finds a \377
> with no problem.

There does seem to be a bug regarding search in unibyte buffers,

    ;; This works
    (let ((case-fold-search nil)) (search-forward "\377"))

    ;; This actually matches \277 instead!
    (let ((case-fold-search t)) (search-forward "\377"))


Isearch works, by luck, since it binds case-fold-search to nil because
of this strange behavior of `downcase' in a unibyte context,

    (let ((default-enable-multibyte-characters nil))
      (with-temp-buffer
        (downcase 255)))  ; worked correctly in Emacs 22
    => 4194303
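
The first observation above suggests a small wrapper. What follows is only a sketch under stated assumptions: the name my-search-forward-byte is hypothetical (it is not in Emacs or cgreek), the buffer is assumed to be unibyte, and the only technique used is the one shown above, binding case-fold-search to nil around search-forward. The last form relates the number 4194303 to byte 255: it is #x3FFFFF, the character code Emacs 23 and later use for the raw byte 255.

```elisp
;; Hypothetical helper, not part of Emacs or cgreek-tlg.el.
(defun my-search-forward-byte (byte &optional bound)
  "Move point just past the next occurrence of BYTE (0-255).
Assumes the current buffer is unibyte.  Return the new position,
or nil (leaving point unchanged) if BYTE is not found before BOUND."
  (let ((case-fold-search nil))          ; avoid the case-folding path
    (search-forward (unibyte-string byte) bound t)))

;; Usage in a scratch unibyte buffer holding one \377 terminator.
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "abc\377def")
  (goto-char (point-min))
  (my-search-forward-byte #xff))

;; The raw byte 255 viewed as a multibyte character code.
(= (unibyte-char-to-multibyte #xff) 4194303)
```

Binding the variable inside the helper keeps the caller's own case-fold-search setting untouched, so the rest of the parsing code does not need to change.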





* Re: search-forward in emacs23 lisp
  2010-03-28 21:45 ` Peter Dyballa
@ 2010-03-29  0:44   ` rasmith
  0 siblings, 0 replies; 14+ messages in thread
From: rasmith @ 2010-03-29  0:44 UTC (permalink / raw)
  To: Peter_Dyballa; +Cc: help-gnu-emacs

From: Peter Dyballa <Peter_Dyballa@Web.DE>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 23:45:26 +0200

> 
> On 27.03.2010 at 21:31, rasmith wrote:
> 
>> The behavior of the search-forward function in emacs-lisp has changed
>> in emacs23 in a way that breaks some scripts I use, in particular
>> cgreek-tlg.el from Naoto Takahashi's cgreek package.
> 
> 
> Maybe the problem is simply that, that the buffer is in UTF-8. Then is
> makes really no sense to search for that byte because it does not
> exist, like a quark (although baryons and mesons are built from them),
> there only exists the two-byte word \xc3\xbf (standing for ÿ, LATIN
> SMALL LETTER Y WITH DIAERESIS). Clearly, you can't search what does
> not exist – except you're Lancelot.
> 
> Which coding is used in the buffer? Can you switch to a (raw)
> byte-based encoding and test in this state?
> 

No, the buffer's not in utf-8.  The file was read in with
insert-file-contents literally, and (set-buffer raw) 
and (set-buffer-multibyte nil) were executed just before that.
When I run the function containing the problem code, sometimes it just
returns a not found: "\377" and stops, and sometimes it returns an
error message indicating that it's not looking at what it expects (the
actual message is "Unexpected author description introducer" followed
by a pair of bytes in hex).  I can then switch into that buffer, and
in the latter case what I find is that the point is sitting just after
a pair of bytes, specifically \231\277 (this is where 
(search-forward (char-to-string ?\xff)) stopped).  This is well beyond
an earlier occurrence of \377 in the buffer (I won't explain the
rather complicated format of the files in question, but in them \377
is used as a string terminator--and don't ask me to change that, since
the whole purpose of the code is to process files having this
format). While visiting that buffer, it's pretty obvious that it's in
raw mode (all high bytes display in octal, and what-cursor-position
identifies everything you look at as an 8-bit byte, never a utf-8
multibyte character).  

Within that buffer, an isearch for \377 finds a 255 byte
with no problem.  The problem is entirely in the search-forward
function.  I tried inserting (search-forward (unibyte-string ?\377))
in the buffer and executing it from there; when I do that, it skips
right over \377 but stops instead at \231\277 (which as I pointed out
is not the utf-8 version of \377).  This result happens with all the
possible arguments I've come up with for search-forward, such as:
(unibyte-string ?\377) 
(string-to-unibyte (unibyte-string ?\377))
"ÿ"
"\377"
"\xff" (this is even worse: it's translated to two bytes \x00ff)

I've verified that (unibyte-string ?\377) returns exactly what it
should: a string containing just the 8-bit byte \377.  However, when 
search-forward gets that argument, running from a raw buffer with
multibyte turned off, it first turns it into the two-byte string
\231\277 and then matches on that.  If there's a way to keep it from
doing that, I'd like to know.

As I said in a reply to myself, I found a workaround:

      (while (/= (char-after) ?\377)
        (forward-char 1))
      (forward-char 1)

But it would be nice to know exactly what it is that search-forward is
doing here.  My knowledge of emacs-lisp is pretty rudimentary, so if
I'm missing something obvious, please let me know.

Thanks,

Robin Smith


* Re: search-forward in emacs23 lisp
  2010-03-28 23:00 ` Johan Bockgård
@ 2010-03-29  6:51   ` Eli Zaretskii
  2010-03-29 15:01     ` rasmith
  0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2010-03-29  6:51 UTC (permalink / raw)
  To: help-gnu-emacs

> From: bojohan@gnu.org (Johan =?utf-8?Q?Bockg=C3=A5rd?=)
> Date: Mon, 29 Mar 2010 01:00:45 +0200
> Cc: 
> 
> There does seem to be a bug regarding search in unibyte buffers,

Please report this ASAP to the Emacs bug-tracker.  Emacs 23.2 is in
the last stages of pretest, and so we should not waste any time
discussing bugs here, if we want them to be fixed in the next release.

Thanks.





* Re: search-forward in emacs23 lisp
  2010-03-29  6:51   ` Eli Zaretskii
@ 2010-03-29 15:01     ` rasmith
  2010-03-29 15:17       ` Eli Zaretskii
  0 siblings, 1 reply; 14+ messages in thread
From: rasmith @ 2010-03-29 15:01 UTC (permalink / raw)
  To: eliz; +Cc: help-gnu-emacs

From: Eli Zaretskii <eliz@gnu.org>
Subject: Re: search-forward in emacs23 lisp
Date: Mon, 29 Mar 2010 09:51:07 +0300

>> From: bojohan@gnu.org (Johan =?utf-8?Q?Bockg=C3=A5rd?=)
>> Date: Mon, 29 Mar 2010 01:00:45 +0200
>> Cc: 
>> 
>> There does seem to be a bug regarding search in unibyte buffers,
> 
> Please report this ASAP to the Emacs bug-tracker.  Emacs 23.2 is in
> the last stages of pretest, and so we should not waste any time
> discussing bugs here, if we want them to be fixed in the next release.
> 

After further investigation, I'm not certain it's a bug: it may be an
intentional part of the modifications to accommodate utf-8.  Here are
the details:

In a multibyte-buffer (set-buffer-multibyte t), 
   
(search-forward (char-to-string ?\xff)) matches utf-8 "ÿ" (i.e. \303\277)
(search-forward (char-to-string ?\377)) matches utf-8 "ÿ"
(search-forward (unibyte-string ?\377)) matches byte \377

In a unibyte buffer (set-buffer-multibyte nil)

(search-forward (char-to-string ?\xff)) matches \231\277
(search-forward (char-to-string ?\377)) matches \231\277
(search-forward (unibyte-string ?\377)) matches \231\277

In other words, search-forward cannot find byte \377 when searching in
a *unibyte* buffer, but it can find that same byte if the buffer is
changed to multibyte.  The reason is that in a unibyte buffer,
search-forward apparently changes byte \377 to a two-byte
representation (but not to utf-8, which would be \303\277).  

The code I had a problem with can be fixed by using char-after
(or more elegantly, I've now learned, using skip-chars-forward).
However, there's probably other code out there that's now broken
because of this.  Is it a bug, or was it a mistake to expect
search-forward to find a single high byte in a multibyte buffer in the
first place?
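
The skip-chars-forward variant mentioned above might look like the following. This is a minimal sketch, assuming a unibyte buffer in which \377 terminates each string; the name my-skip-past-terminator is hypothetical. Unlike the char-after loop, which signals an error when it runs off the end of the buffer (char-after returns nil there and /= rejects it), this version returns nil when no terminator follows point.

```elisp
;; Hypothetical helper, not part of cgreek-tlg.el.
(defun my-skip-past-terminator ()
  "Move point just past the next byte 255 in a unibyte buffer.
Return t on success, or nil if no such byte follows point (in
which case point is left at the end of the buffer)."
  (skip-chars-forward "^\377")   ; skip every byte that is not 255
  (unless (eobp)
    (forward-char 1)             ; step over the terminator itself
    t))

;; Usage in a scratch unibyte buffer.
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "ab\377cd")
  (goto-char (point-min))
  (list (my-skip-past-terminator) (point)))
```

Because skip-chars-forward does no case folding, it sidesteps the case-fold-search behavior discussed earlier in the thread.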

Robin Smith






* Re: search-forward in emacs23 lisp
  2010-03-29 15:01     ` rasmith
@ 2010-03-29 15:17       ` Eli Zaretskii
  0 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2010-03-29 15:17 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Mon, 29 Mar 2010 10:01:17 -0500 (CDT)
> Cc: help-gnu-emacs@gnu.org
> From: rasmith@tamu.edu
> 
> In other words, search-forward cannot find byte \377 when searching in
> a *unibyte* buffer, but it can find that same byte if the buffer is
> changed to multibyte.  The reason is that in a unibyte buffer,
> search-forward apparently changes byte \377 to a two-byte
> representation (but not to utf-8, which would be \303\277).  
> 
> The code I had a problem with can be fixed by using char-after
> (or more elegantly, I've now learned, using skip-chars-forward),
> However, there's probably other code out there that's now broken
> because of this.  Is it a bug, or was it a mistake to expect
> search-forward to find a single high byte in a multibyte buffer in the
> first place?

Please ask these questions on emacs-devel@gnu.org.  All the experts
who know the answers are there.





