* search-forward in emacs23 lisp
From: rasmith @ 2010-03-27 20:31 UTC
To: help-gnu-emacs

The behavior of the search-forward function in emacs-lisp has changed
in emacs23 in a way that breaks some scripts I use, in particular
cgreek-tlg.el from Naoto Takahashi's cgreek package. This package
includes facilities for reading files in the Thesaurus Linguae Graecae
(TLG) containing both Greek texts and data about those texts, each in
a format unique to the TLG. Parsing those files requires reading them
into a buffer literally and searching for strings terminated by \xff
(byte 255). Under emacs22, this only required

    (search-forward (char-to-string ?\xff))

However, under emacs23, char-to-string with an 8-bit argument (128
through 255) now returns a two-byte string (\x00\xff). So, these
searches fail.

I tried changing to unibyte-string. In fact, (unibyte-string ?\377)
does return a string containing just one byte (255), as I've verified
with what-cursor-position. However,

    (search-forward (unibyte-string ?\377))

doesn't match an occurrence of 255. Instead, it matches on the
two-byte string \231\277 (\x99bf). That two-byte sequence doesn't
appear to me to be a possible Unicode character (I thought the utf-8
representation of 255 would be \0xc1\0x3f). Perhaps this is something
peculiar to utf-8-emacs?

If I move to the buffer that contains the data to be parsed (which has
its multibyte flag set to nil), then
(search-forward (unibyte-string ?\377)) behaves as above. However, in
that same buffer, a keyboard isearch-forward for \377 finds a \377
with no problem.

So, what I need to know is: is there a way to make search-forward find
a single 8-bit byte between 128 and 255?

Robin Smith
Department of Philosophy        rasmith@tamu.edu
Texas A&M University            http://aristotle.tamu.edu/~rasmith/
4237 TAMU                       Voice +1 979 845 5679
College Station, TX 77843-4237  FAX   +1 979 845 0458
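
One way to check what each string constructor returns, rather than relying on how the string prints, is to ask Emacs for the string's multibyte flag and its byte length. The following is only a minimal sketch, assuming Emacs 23 or later; `my-describe-string` is a hypothetical helper name, while `multibyte-string-p`, `string-bytes`, `char-to-string` and `unibyte-string` are standard functions.

```elisp
;; Hypothetical helper (not part of cgreek): summarize how Emacs stores STR.
(defun my-describe-string (str)
  "Return a list (MULTIBYTE-P CHAR-COUNT BYTE-COUNT) for STR."
  (list (multibyte-string-p str)   ; t if STR is a multibyte string
        (length str)               ; number of characters
        (string-bytes str)))       ; bytes in the internal representation

;; Compare a character string built from code 255 with a raw one-byte string.
(my-describe-string (char-to-string ?\377))
(my-describe-string (unibyte-string ?\377))
```

If the two calls report different multibyte flags, the two search arguments are different kinds of object even though both may print as \377.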
* Re: search-forward in emacs23 lisp
From: rasmith @ 2010-03-28 16:39 UTC
To: help-gnu-emacs

From: rasmith@tamu.edu
Subject: search-forward in emacs23 lisp
Date: Sat, 27 Mar 2010 15:31:48 -0500 (CDT)

Sorry to reply to my own post, but the following rather ugly solution
solves the problem of finding a single FF byte:

    (while (/= (char-after) ?\377)
      (forward-char 1))
    (forward-char 1)

This replaces

    (search-forward (unibyte-string ?\377))

which, in emacs23, no matter what I do, insists on turning the byte
into the two-byte string \231\277 before searching.

But surely there's a better way?

Robin Smith
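
The loop above signals an error when no \377 byte remains, because (char-after) returns nil at the end of the buffer and `/=` rejects a non-number. A possible alternative, sketched here under the assumption that the current buffer is unibyte, uses the standard `skip-chars-forward` primitive; `my-skip-past-byte-ff` is a hypothetical name chosen for illustration.

```elisp
;; Hypothetical helper: a sketch for unibyte buffers only.
(defun my-skip-past-byte-ff ()
  "Move point just past the next \\377 byte.
Return the new position, or nil (leaving point at end of buffer)
if no such byte follows point."
  ;; "^\377" means: skip every character that is not byte 255.
  (skip-chars-forward "^\377")
  (unless (eobp)
    (forward-char 1)
    (point)))
```

Returning nil instead of signalling lets the caller decide how to treat a missing terminator.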
* Re: search-forward in emacs23 lisp
From: Lennart Borgman @ 2010-03-28 16:50 UTC
To: rasmith; +Cc: help-gnu-emacs

On Sun, Mar 28, 2010 at 6:39 PM, <rasmith@tamu.edu> wrote:
> Sorry to reply to my own post, but the following rather ugly solution
> solves the problem of finding a single FF byte:
>
>     (while (/= (char-after) ?\377)
>       (forward-char 1))
>     (forward-char 1)
>
> This replaces
>
>     (search-forward (unibyte-string ?\377))
>
> which, in emacs23, no matter what I do, insists on turning the byte
> into the two-byte string \231\277 before searching.
>
> But surely there's a better way?

Hi Robin,

Someone else knows this much better than me and can explain the
details, but I believe that unibyte-string is a low level function
that you do not need here.

How about just

    (search-forward (char-to-string ?\377))

or

    (search-forward (char-to-string 255))

Does that work for you?
* Re: search-forward in emacs23 lisp
From: rasmith @ 2010-03-28 17:04 UTC
To: lennart.borgman; +Cc: help-gnu-emacs

From: Lennart Borgman <lennart.borgman@gmail.com>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 18:50:46 +0200

> Hi Robin,
>
> Someone else knows this much better than me and can explain the
> details, but I believe that unibyte-string is a low level function
> that you do not need here.
>
> How about just
>
>     (search-forward (char-to-string ?\377))
>
> or
>
>     (search-forward (char-to-string 255))
>
> Does that work for you?

Nope. That's exactly what caused the original problem (that is, the
code that broke was exactly what you suggest). Using either one of
these, what search-forward will look for is a two-byte string. In
other words, it undertakes to convert the high 8-bit character into
something like a utf-8 representation of it (\377 can't occur as the
first byte of a utf-8 character, which is probably what triggers
this).

Robin Smith
* Re: search-forward in emacs23 lisp
From: Lennart Borgman @ 2010-03-28 17:10 UTC
To: rasmith; +Cc: help-gnu-emacs

On Sun, Mar 28, 2010 at 7:04 PM, <rasmith@tamu.edu> wrote:
> Nope. That's exactly what caused the original problem (that is, the
> code that broke was exactly what you suggest). Using either one of
> these, what search-forward will look for is a two-byte string (in
> other words, it undertakes to convert the high 8-bit character into
> something like a utf-8 representation of it (\377 can't occur as the
> first byte of a utf-8 character, which is probably what triggers
> this).

Oh, sorry. I read your first message now. It looks like you have found
a problem with search-forward in this case and a bug in isearch. I
suggest that you file a bug report.
* Re: search-forward in emacs23 lisp
From: rasmith @ 2010-03-28 17:56 UTC
To: lennart.borgman; +Cc: help-gnu-emacs

From: Lennart Borgman <lennart.borgman@gmail.com>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 19:10:59 +0200

>> Nope. That's exactly what caused the original problem (that is, the
>> code that broke was exactly what you suggest). Using either one of
>> these, what search-forward will look for is a two-byte string (in
>> other words, it undertakes to convert the high 8-bit character into
>> something like a utf-8 representation of it (\377 can't occur as the
>> first byte of a utf-8 character, which is probably what triggers
>> this).
>
> Oh, sorry. I read your first message now. It looks like you have found
> a problem with search-forward in this case and a bug in isearch. I
> suggest that you file a bug report.

I'll do that. To say a little more about the problem:

    (char-to-string ?\xff)

produces a *two-byte* string, \0x00\0xff, while

    (unibyte-string ?\377)

produces a *one-byte* string, as it should. However, when *either* of
these is given as an argument to search-forward, what it actually
searches for is the *two-byte* string \231\277. I don't really see
where that's coming from, since I thought the utf-8 representation of
\377 was \303\077 (\xc33f). I know that emacs23 uses a default internal
format with the name utf-8-emacs for buffers, but I don't know its
details.

Robin Smith
* Re: search-forward in emacs23 lisp
From: rasmith @ 2010-03-28 17:59 UTC
To: lennart.borgman; +Cc: help-gnu-emacs

From: Lennart Borgman <lennart.borgman@gmail.com>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 19:10:59 +0200

> Oh, sorry. I read your first message now. It looks like you have found
> a problem with search-forward in this case and a bug in isearch. I
> suggest that you file a bug report.

And as one last addition, I don't think there's any problem with
isearch: C-s C-q 3 7 7 finds byte 255 with no problem at all. The bug
(if that's what it is) is in search-forward.

Robin Smith
* Re: search-forward in emacs23 lisp
From: Lennart Borgman @ 2010-03-28 18:22 UTC
To: rasmith; +Cc: help-gnu-emacs

On Sun, Mar 28, 2010 at 7:59 PM, <rasmith@tamu.edu> wrote:
>> Oh, sorry. I read your first message now. It looks like you have found
>> a problem with search-forward in this case and a bug in isearch. I
>> suggest that you file a bug report.
>
> And as one last addition, I don't think there's any problem with
> isearch: C-s C-q 3 7 7 finds byte 255 with no problem at all. The bug
> (if that's what it is) is in search-forward.

The isearch-forward doc string says

    Type C-q to quote control character to search for it.

Here you are searching for a byte, not a character. So I think it is a
bug. But maybe you do not want to file a bug report for this?
* Re: search-forward in emacs23 lisp
From: Peter Dyballa @ 2010-03-28 21:45 UTC
To: rasmith; +Cc: help-gnu-emacs

On 27.03.2010 at 21:31, rasmith wrote:

> The behavior of the search-forward function in emacs-lisp has changed
> in emacs23 in a way that breaks some scripts I use, in particular
> cgreek-tlg.el from Naoto Takahashi's cgreek package.

Maybe the problem is simply that the buffer is in UTF-8. Then it
really makes no sense to search for that byte, because it does not
exist, like a quark (although baryons and mesons are built from them).
There only exists the two-byte word \xc3\xbf (standing for ÿ, LATIN
SMALL LETTER Y WITH DIAERESIS). Clearly, you can't search for what
does not exist – unless you're Lancelot.

Which coding is used in the buffer? Can you switch to a (raw)
byte-based encoding and test in this state?

--
Greetings

  Pete

I wouldn't recommend sex, drugs or insanity for everyone, but they've
always worked for me.
                                – Hunter S. Thompson
* Re: search-forward in emacs23 lisp
From: rasmith @ 2010-03-29 0:44 UTC
To: Peter_Dyballa; +Cc: help-gnu-emacs

From: Peter Dyballa <Peter_Dyballa@Web.DE>
Subject: Re: search-forward in emacs23 lisp
Date: Sun, 28 Mar 2010 23:45:26 +0200

> On 27.03.2010 at 21:31, rasmith wrote:
>
>> The behavior of the search-forward function in emacs-lisp has changed
>> in emacs23 in a way that breaks some scripts I use, in particular
>> cgreek-tlg.el from Naoto Takahashi's cgreek package.
>
> Maybe the problem is simply that the buffer is in UTF-8. Then it
> really makes no sense to search for that byte, because it does not
> exist, like a quark (although baryons and mesons are built from them).
> There only exists the two-byte word \xc3\xbf (standing for ÿ, LATIN
> SMALL LETTER Y WITH DIAERESIS). Clearly, you can't search for what
> does not exist – unless you're Lancelot.
>
> Which coding is used in the buffer? Can you switch to a (raw)
> byte-based encoding and test in this state?

No, the buffer's not in utf-8. The file was read in with
insert-file-contents literally, and (set-buffer raw) and
(set-buffer-multibyte nil) were executed just before that. When I run
the function containing the problem code, sometimes it just returns a
not found: "\377" and stops, and sometimes it returns an error message
indicating that it's not looking at what it expects (the actual
message is "Unexpected author description introducer" followed by a
pair of bytes in hex). I can then switch into that buffer, and in the
latter case what I find is that the point is sitting just after a pair
of bytes, specifically \231\277 (this is where
(search-forward (char-to-string ?\xff)) stopped). This is well beyond
an earlier occurrence of \377 in the buffer (I won't explain the
rather complicated format of the files in question, but in them \377
is used as a string terminator--and don't ask me to change that, since
the whole purpose of the code is to process files having this format).

While visiting that buffer, it's pretty obvious that it's in raw mode
(all high bytes display in octal, and what-cursor-position identifies
everything you look at as an 8-bit byte, never a utf-8 multibyte
character). Within that buffer, an isearch for \377 finds a 255 byte
with no problem. The problem is entirely in the search-forward
function. I tried inserting

    (search-forward (unibyte-string ?\377))

in the buffer and executing it from there; when I do that, it skips
right over \377 but stops instead at \231\277 (which as I pointed out
is not the utf-8 version of \377). This result happens with all the
possible arguments I've come up with for search-forward, such as:

    (unibyte-string ?\377)
    (string-to-unibyte (unibyte-string ?\377))
    "ÿ"
    "\377"
    "\xff"   (this is even worse: it's translated to two bytes \x00ff)

I've verified that (unibyte-string ?\377) returns exactly what it
should: a string containing just the 8-bit byte \377. However, when
search-forward gets that argument, running from a raw buffer with
multibyte turned off, it first turns it into the two-byte string
\231\277 and then matches on that. If there's a way to keep it from
doing that, I'd like to know.

As I said in a reply to myself, I found a workaround:

    (while (/= (char-after) ?\377)
      (forward-char 1))
    (forward-char 1)

But it would be nice to know exactly what it is that search-forward is
doing here. My knowledge of emacs-lisp is pretty rudimentary, so if
I'm missing something obvious, please let me know.

Thanks,
Robin Smith
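
The buffer setup described above (a unibyte buffer filled with the file's raw bytes) can be sketched as follows. This is an illustration only, not the code from cgreek-tlg.el: `my-call-with-raw-file` is a hypothetical helper, and it assumes the standard `insert-file-contents-literally` is what "read in literally" refers to.

```elisp
;; Hypothetical helper: run FN in a temporary buffer of raw bytes.
(defun my-call-with-raw-file (file fn)
  "Call FN with no arguments in a unibyte buffer holding FILE's bytes.
Return whatever FN returns."
  (with-temp-buffer
    (set-buffer-multibyte nil)              ; one buffer position per byte
    (insert-file-contents-literally file)   ; no decoding or format conversion
    (goto-char (point-min))
    (funcall fn)))
```

Inside FN, `enable-multibyte-characters` should be nil and (char-after) should return plain integers from 0 to 255, which makes it easy to confirm that the buffer really is in raw mode before testing any search.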
* Re: search-forward in emacs23 lisp
From: Johan Bockgård @ 2010-03-28 23:00 UTC
To: help-gnu-emacs

rasmith@tamu.edu writes:

> If I move to the buffer that contains the data to be parsed (which has
> its multibyte flag set to nil), then
> (search-forward (unibyte-string ?\377)) behaves as above. However, in
> that same buffer, a keyboard isearch-forward for \377 finds a \377
> with no problem.

There does seem to be a bug regarding search in unibyte buffers,

    ;; This works
    (let ((case-fold-search nil))
      (search-forward "\377"))

    ;; This actually matches \277 instead!
    (let ((case-fold-search t))
      (search-forward "\377"))

Isearch works, by luck, since it binds case-fold-search to nil because
of this strange behavior of `downcase' in a unibyte context,

    (let ((default-enable-multibyte-characters nil))
      (with-temp-buffer
        (downcase 255))) ; worked correctly in Emacs 22
    => 4194303
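
Following the observation above that the search succeeds when case-fold-search is nil, a byte search can bind that variable locally instead of changing it globally. A minimal sketch, assuming the current buffer is unibyte; `my-search-forward-byte` is a hypothetical wrapper, not an Emacs function.

```elisp
;; Hypothetical wrapper around `search-forward' for a single raw byte.
(defun my-search-forward-byte (byte &optional bound noerror)
  "Search forward for the single raw BYTE (an integer from 0 to 255).
BOUND and NOERROR are passed through to `search-forward'."
  (let ((case-fold-search nil))   ; no case translation of the search string
    (search-forward (unibyte-string byte) bound noerror)))
```

For the terminator discussed in this thread, the call would be (my-search-forward-byte ?\377 nil t), which returns nil rather than signalling when no such byte follows point.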
* Re: search-forward in emacs23 lisp
From: Eli Zaretskii @ 2010-03-29 6:51 UTC
To: help-gnu-emacs

> From: bojohan@gnu.org (Johan Bockgård)
> Date: Mon, 29 Mar 2010 01:00:45 +0200
> Cc:
>
> There does seem to be a bug regarding search in unibyte buffers,

Please report this ASAP to the Emacs bug-tracker. Emacs 23.2 is in
the last stages of pretest, and so we should not waste any time
discussing bugs here, if we want them to be fixed in the next release.

Thanks.
* Re: search-forward in emacs23 lisp
From: rasmith @ 2010-03-29 15:01 UTC
To: eliz; +Cc: help-gnu-emacs

From: Eli Zaretskii <eliz@gnu.org>
Subject: Re: search-forward in emacs23 lisp
Date: Mon, 29 Mar 2010 09:51:07 +0300

>> From: bojohan@gnu.org (Johan Bockgård)
>> Date: Mon, 29 Mar 2010 01:00:45 +0200
>> Cc:
>>
>> There does seem to be a bug regarding search in unibyte buffers,
>
> Please report this ASAP to the Emacs bug-tracker. Emacs 23.2 is in
> the last stages of pretest, and so we should not waste any time
> discussing bugs here, if we want them to be fixed in the next release.

After further investigation, I'm not certain it's a bug: it may be an
intentional part of the modifications to accommodate utf-8. Here are
the details:

In a multibyte buffer (set-buffer-multibyte t),

    (search-forward (char-to-string ?\xff))   matches utf-8 "ÿ" (i.e. \303\277)
    (search-forward (char-to-string ?\377))   matches utf-8 "ÿ"
    (search-forward (unibyte-string ?\377))   matches byte \377

In a unibyte buffer (set-buffer-multibyte nil),

    (search-forward (char-to-string ?\xff))   matches \231\277
    (search-forward (char-to-string ?\377))   matches \231\277
    (search-forward (unibyte-string ?\377))   matches \231\277

In other words, search-forward cannot find byte \377 when searching in
a *unibyte* buffer, but it can find that same byte if the buffer is
changed to multibyte. The reason is that in a unibyte buffer,
search-forward apparently changes byte \377 to a two-byte
representation (but not to utf-8, which would be \303\277).

The code I had a problem with can be fixed by using char-after (or
more elegantly, I've now learned, using skip-chars-forward). However,
there's probably other code out there that's now broken because of
this. Is it a bug, or was it a mistake to expect search-forward to
find a single high byte in a multibyte buffer in the first place?

Robin Smith
* Re: search-forward in emacs23 lisp
From: Eli Zaretskii @ 2010-03-29 15:17 UTC
To: help-gnu-emacs

> Date: Mon, 29 Mar 2010 10:01:17 -0500 (CDT)
> Cc: help-gnu-emacs@gnu.org
> From: rasmith@tamu.edu
>
> In other words, search-forward cannot find byte \377 when searching in
> a *unibyte* buffer, but it can find that same byte if the buffer is
> changed to multibyte. The reason is that in a unibyte buffer,
> search-forward apparently changes byte \377 to a two-byte
> representation (but not to utf-8, which would be \303\277).
>
> The code I had a problem with can be fixed by using char-after
> (or more elegantly, I've now learned, using skip-chars-forward),
> However, there's probably other code out there that's now broken
> because of this. Is it a bug, or was it a mistake to expect
> search-forward to find a single high byte in a multibyte buffer in the
> first place?

Please ask these questions on emacs-devel@gnu.org. All the experts
who know the answers are there.