* Is it valid to use the zero-byte "^@" in regexps?
From: Thorsten Jolitz @ 2014-06-18 9:14 UTC
To: help-gnu-emacs

Hi List,

when matching multi-line text, using the negated zero-byte in a regexp
is convenient to match *any* character, since it should only appear in
binary files, not in text files.

However, I sometimes get strange and somewhat unpredictable results
using this technique. To rule out a fundamental problem - is it valid
to have the zero-byte (inserted with C-q C-@) appear in a regexp like
this?

,--------------------------------------------------------
| "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
`--------------------------------------------------------

If so, this regexp should reliably match any

,-----------------------
| #+begin_src emacs-lisp
| [...]
| #+end_src
`-----------------------

no matter what's inside the block, right?

--
cheers,
Thorsten

* Re: Is it valid to use the zero-byte "^@" in regexps?
From: Nicolas Richard @ 2014-06-18 9:52 UTC
To: Thorsten Jolitz; +Cc: help-gnu-emacs

Thorsten Jolitz <tjolitz@gmail.com> writes:

> To rule out a fundamental problem - is it valid to have the zero-byte
> (inserted with C-q C-@) appear in a regexp like this?
>
> ,--------------------------------------------------------
> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
> `--------------------------------------------------------

I don't see why it wouldn't be valid, but I don't know. Whether it is
desirable is another question: it would be better to search for the
beginning, then search for the end with another regexp.

> If so, this regexp should reliably match any
>
> ,-----------------------
> | #+begin_src emacs-lisp
> | [...]
> | #+end_src
> `-----------------------

From the first occurrence of
#+begin_src emacs-lisp
;; after point to the last occurrence of
#+end_src
in the buffer. If there's more than one, they'll be part of the match
too. E.g. if there's another block in the same document:
#+begin_src sh
echo whatever.
#+end_src
it'll be part of the match too. If you don't want that, make the star
non-greedy by appending a question mark to it:
(re-search-forward
 "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*?\n#\\+end_src")

> no matter what's inside the block, right?

Except NUL characters of course.

--
Nico.
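
The two-step search suggested in the message above (find the opening
delimiter, then find the closing one with a second regexp) could be
sketched roughly as follows. The function name my/collect-elisp-blocks
is hypothetical and does not come from the thread; the sketch also
assumes that blocks are not nested and that delimiters start at the
beginning of a line.

```elisp
;; Hypothetical helper, not from the thread: collect the bodies of all
;; emacs-lisp source blocks by searching for the opening and closing
;; delimiters separately instead of with one multi-line regexp.
(defun my/collect-elisp-blocks ()
  "Return a list of the bodies of all emacs-lisp src blocks in the buffer."
  (let ((case-fold-search t)
        (blocks '()))
    (save-excursion
      (goto-char (point-min))
      ;; Step 1: find the next opening delimiter line.
      (while (re-search-forward
              "^#\\+begin_src[[:space:]]+emacs-lisp.*\n" nil t)
        (let ((body-start (point)))
          ;; Step 2: stop at the first closing delimiter after the
          ;; opener, so a later block is never swallowed.
          (when (re-search-forward "^#\\+end_src" nil t)
            (push (buffer-substring-no-properties
                   body-start (match-beginning 0))
                  blocks)))))
    (nreverse blocks)))
```

Because the second search stops at the first #+end_src it finds, no
non-greedy operator is needed, and the block body may contain any
character, including NUL.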

* Re: Is it valid to use the zero-byte "^@" in regexps?
From: Thorsten Jolitz @ 2014-06-18 10:22 UTC
To: help-gnu-emacs

Nicolas Richard <theonewiththeevillook@yahoo.fr> writes:

> Thorsten Jolitz <tjolitz@gmail.com> writes:
>> To rule out a fundamental problem - is it valid to have the zero-byte
>> (inserted with C-q C-@) appear in a regexp like this?
>>
>> ,--------------------------------------------------------
>> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
>> `--------------------------------------------------------
>
> I don't see why it wouldn't be valid, but I don't know. Whether it is
> desirable is another question: it would be better to search for the
> beginning, then search for the end with another regexp.

That's what I did initially, and it is of course much easier, but it
took twice (?) as long too ...

>> If so, this regexp should reliably match any
>>
>> ,-----------------------
>> | #+begin_src emacs-lisp
>> | [...]
>> | #+end_src
>> `-----------------------
>
> From the first occurrence of
> #+begin_src emacs-lisp
> ;; after point to the last occurrence of
> #+end_src
> in the buffer. If there's more than one, they'll be part of the match
> too. E.g. if there's another block in the same document:
> #+begin_src sh
> echo whatever.
> #+end_src
> it'll be part of the match too. If you don't want that, make the star
> non-greedy by appending a question mark to it:
> (re-search-forward
>  "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*?\n#\\+end_src")

Yes, thanks for the hint. In my real sources I do use the non-greedy *?
(otherwise it would not work), but I forgot about it when writing the
mail.

>> no matter what's inside the block, right?
>
> Except NUL characters of course.

i.e. zero-byte "^@"?

But Emacs can differentiate between NUL characters and the @ character -
or not? NUL chars have blue fonts, and message-mode complains when
trying to send them via email, but e.g. this mail has many @ chars that
are just normal text (just like my test-file) and they are recognized
as such.

Often, but not always, the unmatched source-blocks contain @
characters (but not NUL chars). The strange thing is that the failed
matching happens with these blocks being part of a really big
test file. When I isolate and copy them to a temp buffer and try to
match them there, it just works. That makes testing/bisecting a bit
difficult - whenever I find the problem and isolate it, it's gone ...

Therefore my question - is this technique with negated zero-bytes in
regexps supposed to work, or is it maybe problematic from the beginning?

--
cheers,
Thorsten

* Re: Is it valid to use the zero-byte "^@" in regexps?
From: Nicolas Richard @ 2014-06-18 10:55 UTC
To: Thorsten Jolitz; +Cc: help-gnu-emacs

Thorsten Jolitz <tjolitz@gmail.com> writes:
>> I don't see why it wouldn't be valid, but I don't know. Whether it is
>> desirable is another question: it would be better to search for the
>> beginning, then search for the end with another regexp.
>
> That's what I did initially, and it is of course much easier, but it
> took twice (?) as long too ...

I'm surprised but I guess I'm being too naive.

>> Except NUL characters of course.
>
> i.e. zero-byte "^@"?

Yes, "NUL" is the name you find in most ASCII charts. "zero-byte" less
so, afaict.

> But Emacs can differentiate between NUL characters and the @ character -

Of course. One has ASCII code 0, the other is 64.

NUL is represented by ^@ because of
http://en.wikipedia.org/wiki/Caret_notation

If you hit C-f with point before a NUL, you jump over it; whereas if
you hit C-f with point before the two characters ^@ (i.e. not a NUL),
the cursor only jumps over the ^.

> Often, but not always, the unmatched source-blocks contain @
> characters (but not NUL chars). The strange thing is that the failed
> matching happens with these blocks being part of a really big
> test file. When I isolate and copy them to a temp buffer and try to
> match them there, it just works.

If you have a reproducible recipe (even with a big file) it would
certainly help.

--
Nico.
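
The difference between a real NUL and the two printable characters ^ and
@ can be checked directly in Emacs Lisp. The sketch below is only an
illustration; the names my/nul-string, my/caret-at-string and
my/whole-string-match-p are made up for it. It writes the NUL as the
octal escape "\000", which avoids any doubt about whether a control
character typed with C-q C-@ survived copying or mailing.

```elisp
;; A real NUL is one character with code 0; the caret notation "^@"
;; typed literally is two characters, ?^ (94) and ?@ (64).
(defconst my/nul-string "\000")      ; one NUL, written as an octal escape
(defconst my/caret-at-string "^@")   ; two ordinary printable characters

(length my/nul-string)               ; 1
(aref my/nul-string 0)               ; 0
(length my/caret-at-string)          ; 2

;; Hypothetical helper: does REGEXP match TEXT from start to end?
(defun my/whole-string-match-p (regexp text)
  "Return t if REGEXP matches all of TEXT, nil otherwise."
  (let ((case-fold-search nil))
    (and (string-match (concat "\\`\\(?:" regexp "\\)\\'") text) t)))

;; "[^\000]" means "any character except NUL", newlines included.
(my/whole-string-match-p "[^\000]*" "user@example.org\nline 2")  ; t

;; "[^^@]" with a literal caret and at sign means "anything except ^
;; or @", so it stops at the first @ in ordinary text.
(my/whole-string-match-p "[^^@]*" "user@example.org\nline 2")    ; nil
```

If a regexp was pasted from a rendering such as this archive, where a
NUL is displayed as ^@, the second form is what ends up in the source,
and blocks containing @ would then fail to match.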

* Re: Is it valid to use the zero-byte "^@" in regexps?
From: Thorsten Jolitz @ 2014-06-18 11:16 UTC
To: help-gnu-emacs

Nicolas Richard <theonewiththeevillook@yahoo.fr> writes:

> Thorsten Jolitz <tjolitz@gmail.com> writes:
>>> I don't see why it wouldn't be valid, but I don't know. Whether it is
>>> desirable is another question: it would be better to search for the
>>> beginning, then search for the end with another regexp.
>>
>> That's what I did initially, and it is of course much easier, but it
>> took twice (?) as long too ...
>
> I'm surprised but I guess I'm being too naive.

Most likely not; the speed problem might be unrelated. I have to
double-check again.

>>> Except NUL characters of course.
>>
>> i.e. zero-byte "^@"?
>
> Yes, "NUL" is the name you find in most ASCII charts. "zero-byte" less
> so, afaict.
>
>> But Emacs can differentiate between NUL characters and the @ character -
>
> Of course. One has ASCII code 0, the other is 64.
>
> NUL is represented by ^@ because of
> http://en.wikipedia.org/wiki/Caret_notation
>
> If you hit C-f with point before a NUL, you jump over it; whereas if
> you hit C-f with point before the two characters ^@ (i.e. not a NUL),
> the cursor only jumps over the ^.

Yes, that's what I would expect from a well-behaving Emacs ...

>> Often, but not always, the unmatched source-blocks contain @
>> characters (but not NUL chars). The strange thing is that the failed
>> matching happens with these blocks being part of a really big
>> test file. When I isolate and copy them to a temp buffer and try to
>> match them there, it just works.
>
> If you have a reproducible recipe (even with a big file) it would
> certainly help.

After double-checking my test file again, it seems that the bug was
sitting in front of the computer again.

Although that nice library ert-buffer.el enables me to run buffer tests
on real-world files *without* modifying them, I had some left-over
dangling

,-----------
| #+begin_src
`-----------

delimiters in my test file. I probably called the commands directly
(not via ERT) by accident, a few things went wrong, and that left these
dangling delimiters in the original file.

After undoing this, the diffs of the ERT test now show mainly
indentation and whitespace differences, which is quite encouraging.

Conclusion -> NUL chars in regexps do work, if the test file isn't
messed up.

Thx for your input.

--
cheers,
Thorsten
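
A quick sanity check for the kind of dangling delimiter described above
is to compare the number of opening and closing delimiter lines. This is
only a sketch under the assumption that delimiters start at the
beginning of a line; the function name my/src-delimiter-counts is
invented for the example, while how-many is the standard Emacs function.

```elisp
;; Hypothetical helper, not from the thread: count opening and closing
;; src-block delimiter lines so that a mismatch is easy to spot.
(defun my/src-delimiter-counts ()
  "Return (BEGIN . END), the counts of #+begin_src and #+end_src lines."
  (let ((case-fold-search t))
    (cons (how-many "^#\\+begin_src" (point-min) (point-max))
          (how-many "^#\\+end_src" (point-min) (point-max)))))
```

In a well-formed file both numbers are equal; a larger first number
points to a left-over #+begin_src like the one described above. Equal
counts do not prove that the delimiters are correctly paired, so this
only catches the simple case.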

* Re: Is it valid to use the zero-byte "^@" in regexps?
From: Michael Albinus @ 2014-06-18 11:38 UTC
To: Thorsten Jolitz; +Cc: help-gnu-emacs

Thorsten Jolitz <tjolitz@gmail.com> writes:

> Hi List,

Hi,

> ,--------------------------------------------------------
> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
> `--------------------------------------------------------

"^#\\+begin_src[[:space:]]+emacs-lisp[[:ascii:]]+\n#\\+end_src"

Best regards, Michael.

* Re: Is it valid to use the zero-byte "^@" in regexps?
From: Nicolas Richard @ 2014-06-18 12:15 UTC
To: Michael Albinus; +Cc: help-gnu-emacs, Thorsten Jolitz

Michael Albinus <michael.albinus@gmx.de> writes:

> Thorsten Jolitz <tjolitz@gmail.com> writes:
>
>> Hi List,
>
> Hi,
>
>> ,--------------------------------------------------------
>> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
>> `--------------------------------------------------------
>
> "^#\\+begin_src[[:space:]]+emacs-lisp[[:ascii:]]+\n#\\+end_src"

Many characters are not in ASCII (0-127), so [[:ascii:]] does not match
them.

To include NULs, this would work:
"^#\\+begin_src[[:space:]]+emacs-lisp\\(?:.\\|\n\\)+\n#\\+end_src"

--
Nico.
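
The three ways of writing the "block body" part discussed in this thread
can be compared on a block that contains a non-ASCII character. The
names my/block and my/block-regexp are hypothetical, the non-ASCII
character is written as the escape \u00e9 (e with acute accent), and the
non-greedy operators are a variation on the regexps quoted above, not
the exact forms from the messages.

```elisp
;; Sample block whose body contains one non-ASCII character.
(defconst my/block
  "#+begin_src emacs-lisp\n(message \"caf\u00e9\")\n#+end_src\n")

;; Hypothetical helper: build the block regexp from the thread with
;; BODY as the part between the opening and closing delimiters.
(defun my/block-regexp (body)
  "Return the src-block regexp with BODY as its middle part."
  (concat "^#\\+begin_src[[:space:]]+emacs-lisp" body "\n#\\+end_src"))

;; [[:ascii:]] cannot cross the non-ASCII character, so there is no match.
(string-match (my/block-regexp "[[:ascii:]]+") my/block)        ; nil

;; "Any character or newline" matches, and would also accept a NUL.
(string-match (my/block-regexp "\\(?:.\\|\n\\)+?") my/block)    ; 0

;; "Anything except NUL" matches as well; the NUL is written as \000.
(string-match (my/block-regexp "[^\000]*?") my/block)           ; 0
```

Which of the last two forms is preferable depends on whether NUL should
be allowed inside a block; both span newlines and non-ASCII text.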