Is it valid to use the zero-byte "^@" in regexps?

unofficial mirror of help-gnu-emacs@gnu.org

* Is it valid to use the zero-byte "^@" in regexps?
@ 2014-06-18  9:14 Thorsten Jolitz
  2014-06-18  9:52 ` Nicolas Richard
  2014-06-18 11:38 ` Michael Albinus
  0 siblings, 2 replies; 7+ messages in thread
From: Thorsten Jolitz @ 2014-06-18  9:14 UTC
  To: help-gnu-emacs

Hi List, 

When matching multi-line text, using the negated zero-byte in a regexp
is convenient to match *any* character, since it should only appear in
binary files, not in text files.

However, I sometimes get strange and a bit unpredictable results using
this technique. 

To rule out a fundamental problem - is it valid to have the zero-byte
(inserted with C-q C-@) appear in a regexp like this? 

,--------------------------------------------------------
| "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
`--------------------------------------------------------

If so, this regexp should reliably match any 

,-----------------------
| #+begin_src emacs-lisp
|  [...]
| #+end_src
`-----------------------

no matter what's inside the block, right?

-- 
cheers,
Thorsten

* Re: Is it valid to use the zero-byte "^@" in regexps?
  2014-06-18  9:14 Is it valid to use the zero-byte "^@" in regexps? Thorsten Jolitz
@ 2014-06-18  9:52 ` Nicolas Richard
  2014-06-18 10:22   ` Thorsten Jolitz
  2014-06-18 11:38 ` Michael Albinus
  1 sibling, 1 reply; 7+ messages in thread
From: Nicolas Richard @ 2014-06-18  9:52 UTC
  To: Thorsten Jolitz; +Cc: help-gnu-emacs

Thorsten Jolitz <tjolitz@gmail.com> writes:
> To rule out a fundamental problem - is it valid to have the zero-byte
> (inserted with C-q C-@) appear in a regexp like this? 
>
> ,--------------------------------------------------------
> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
> `--------------------------------------------------------

I don't see why it wouldn't be valid, but I don't know. If it is
desirable is another question : it would be better to search for the
beginning, then search for the end with another regexp.
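
A minimal sketch of the two-step approach suggested above, assuming both delimiters start at the beginning of a line. The helper name `my/next-elisp-src-block` is hypothetical and does not come from this thread; only `re-search-forward` and the match-data functions are standard Emacs.

```elisp
;; Hypothetical helper (not from the thread): find the next emacs-lisp
;; source block after point with two separate searches.
(defun my/next-elisp-src-block ()
  "Return (BEG . END) of the next emacs-lisp src block after point, or nil."
  (let ((case-fold-search t))
    ;; Step 1: find the opening delimiter.
    (when (re-search-forward "^#\\+begin_src[[:space:]]+emacs-lisp" nil t)
      (let ((beg (match-beginning 0)))
        ;; Step 2: from there, find the first closing delimiter.
        (when (re-search-forward "^#\\+end_src" nil t)
          (cons beg (match-end 0)))))))
```

Because the second search stops at the first closing delimiter, this variant cannot run on into a later block, and it does not depend on what characters the block body contains.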

> If so, this regexp should reliably match any 
>
> ,-----------------------
> | #+begin_src emacs-lisp
> |  [...]
> | #+end_src
> `-----------------------

From the first occurrence of
#+begin_src emacs-lisp
after point to the last occurrence of
#+end_src
in the buffer. If there's more than one, they'll be part of the match
too. E.g. if there's another block in the same document:
#+begin_src sh
echo whatever.
#+end_src
it'll be part of the match too. If you don't want that, make the star
non-greedy by appending a question mark to it:
(re-search-forward "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*?\n#\\+end_src")
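
The difference between the greedy and the non-greedy star described above can be checked with a small sketch. The function name `my/greedy-vs-lazy-demo` and the sample buffer contents are assumptions made for illustration; the NUL is written as the octal escape `\0` so that the source contains no literal control character.

```elisp
;; Hypothetical demo (not from the thread): compare "*" and "*?" on a
;; buffer that holds an emacs-lisp block followed by a sh block.
(defun my/greedy-vs-lazy-demo ()
  "Return (GREEDY-MATCH LAZY-MATCH) for a buffer holding two src blocks."
  (with-temp-buffer
    (insert "#+begin_src emacs-lisp\n(+ 1 2)\n#+end_src\n"
            "#+begin_src sh\necho whatever.\n#+end_src\n")
    (let ((prefix "^#\\+begin_src[[:space:]]+emacs-lisp")
          (suffix "\n#\\+end_src"))
      (mapcar (lambda (body)
                (goto-char (point-min))
                (re-search-forward (concat prefix body suffix))
                (match-string 0))
              ;; "\0" is the NUL character written as an octal escape.
              '("[^\0]*" "[^\0]*?")))))
```

The first element of the result runs through the closing delimiter of the sh block, while the second stops at the end of the emacs-lisp block.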

> no matter what's inside the block, right?

Except NUL characters of course.

-- 
Nico.

* Re: Is it valid to use the zero-byte "^@" in regexps?
  2014-06-18  9:52 ` Nicolas Richard
@ 2014-06-18 10:22   ` Thorsten Jolitz
  2014-06-18 10:55     ` Nicolas Richard
  0 siblings, 1 reply; 7+ messages in thread
From: Thorsten Jolitz @ 2014-06-18 10:22 UTC
  To: help-gnu-emacs

Nicolas Richard <theonewiththeevillook@yahoo.fr> writes:

> Thorsten Jolitz <tjolitz@gmail.com> writes:
>> To rule out a fundamental problem - is it valid to have the zero-byte
>> (inserted with C-q C-@) appear in a regexp like this? 
>>
>> ,--------------------------------------------------------
>> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
>> `--------------------------------------------------------
>
> I don't see why it wouldn't be valid, but I don't know. If it is
> desirable is another question : it would be better to search for the
> beginning, then search for the end with another regexp.

That's what I did initially, and what is of course much easier, but took
twice (?) as long too ...

>> If so, this regexp should reliably match any 
>>
>> ,-----------------------
>> | #+begin_src emacs-lisp
>> |  [...]
>> | #+end_src
>> `-----------------------
>
> From the first occurrence of
> #+begin_src emacs-lisp
> after point to the last occurrence of
> #+end_src
> in the buffer. If there's more than one, they'll be part of the match
> too. e.g. if there's another block in the same document :
> #+begin_src sh
> echo whatever.
> #+end_src
> it'll be part of the match too. If you don't want that, make the star
> non-greedy by appending a question mark to it:
> (re-search-forward
> "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*?\n#\\+end_src")

Yes, thanks for the hint. In my real sources I do use the non-greedy *?
(otherwise it would not work), but I forgot about it when writing the
mail.

>> no matter what's inside the block, right?
>
> Except NUL characters of course.

i.e. zero-byte "^@"?

But Emacs can differentiate between NUL characters and the @ character -
or not? NUL chars are shown in blue, and message-mode complains when
trying to send them via email, but e.g. this mail has many @ chars that
are just normal text (just like my test file) and they are recognized as
such.

Often, but not always, the not matched source-blocks contain @
characters (but not NUL chars). The strange thing is that the failed
matching happens with these blocks being part of a really big
testfile. When I isolate and copy them to a temp buffer and try to match
them there, it just works.

That makes testing/bisecting a bit difficult - whenever I find the
problem and isolate it, it's gone ...

Therefore my question - is this technique with negated zero-bytes in
regexps supposed to work, or maybe problematic from the beginning?

-- 
cheers,
Thorsten

* Re: Is it valid to use the zero-byte "^@" in regexps?
  2014-06-18 10:22   ` Thorsten Jolitz
@ 2014-06-18 10:55     ` Nicolas Richard
  2014-06-18 11:16       ` Thorsten Jolitz
  0 siblings, 1 reply; 7+ messages in thread
From: Nicolas Richard @ 2014-06-18 10:55 UTC
  To: Thorsten Jolitz; +Cc: help-gnu-emacs

Thorsten Jolitz <tjolitz@gmail.com> writes:
>> I don't see why it wouldn't be valid, but I don't know. If it is
>> desirable is another question : it would be better to search for the
>> beginning, then search for the end with another regexp.
>
> That's what I did initially, and what is of course much easier, but took
> twice (?) as long too ...

I'm surprised but I guess I'm being too naive.

>> Except NUL characters of course.
>
> i.e. zero-byte "^@"?

Yes, "NUL" is the name you find in most ASCII charts. "zero-byte" less
so, afaict.

> But Emacs can differentiate between NUL characters and the @ character -

Of course. One has ASCII code 0, the other is 64.

NUL is represented by ^@ because of
http://en.wikipedia.org/wiki/Caret_notation

If you hit C-f with point before a NUL, you jump over it; whereas if
you hit C-f with point before the two characters ^@ (i.e. not a NUL), the
cursor only jumps over the ^.
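
The distinction between a real NUL and the two-character text ^@ can also be inspected from Lisp. This is only an illustrative sketch; the function name `my/nul-vs-caret-demo` is hypothetical, while `string`, `aref`, `text-char-description` and `string-match-p` are standard Emacs functions.

```elisp
;; Hypothetical demo (not from the thread): a NUL is one character with
;; code 0; the text "^@" is two ordinary characters, ?^ and ?@.
(defun my/nul-vs-caret-demo ()
  "Return a list of facts about NUL versus the two characters ^ and @."
  (let ((nul (string 0))     ; one character with code 0
        (caret-at "^@"))     ; two characters: ?^ (94) and ?@ (64)
    (list (length nul)                      ; number of characters in NUL string
          (length caret-at)                 ; number of characters in "^@"
          (aref nul 0)                      ; character code of NUL
          (append caret-at nil)             ; character codes of ?^ and ?@
          (text-char-description 0)         ; caret notation used for display
          (string-match-p "[^\0]" nul)      ; negated class rejects NUL
          (string-match-p "[^\0]" "@"))))   ; but accepts a plain @

;; (my/nul-vs-caret-demo) should evaluate to (1 2 0 (94 64) "^@" nil 0)
```

So a regexp containing a negated NUL treats ordinary @ characters like any other text.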

> Often, but not always, the not matched source-blocks contain @
> characters (but not NUL chars). The strange thing is that the failed
> matching happens with these blocks being part of a really big
> testfile. When I isolate and copy them to a temp buffer and try to match
> them there, it just works.

If you have a reproducible recipe (even with a big file) it would
certainly help.

-- 
Nico.



* Re: Is it valid to use the zero-byte "^@" in regexps?
  2014-06-18 10:55     ` Nicolas Richard
@ 2014-06-18 11:16       ` Thorsten Jolitz
  0 siblings, 0 replies; 7+ messages in thread
From: Thorsten Jolitz @ 2014-06-18 11:16 UTC
  To: help-gnu-emacs

Nicolas Richard <theonewiththeevillook@yahoo.fr> writes:

> Thorsten Jolitz <tjolitz@gmail.com> writes:
>>> I don't see why it wouldn't be valid, but I don't know. If it is
>>> desirable is another question : it would be better to search for the
>>> beginning, then search for the end with another regexp.
>>
>> That's what I did initially, and what is of course much easier, but took
>> twice (?) as long too ...
>
> I'm surprised but I guess I'm being too naive.

Most likely not; the speed problem might be unrelated. I have to
double-check.

>>> Except NUL characters of course.
>>
>> i.e. zero-byte "^@"?
>
> Yes, "NUL" is the name you find in most ASCII charts. "zero-byte" less
> so, afaict.
>
>> But Emacs can differentiate between NUL characters and the @ character -
>
> Of course. One has ascii code 0, the other is 64.
>
> NUL is represented by ^@ because of
> http://en.wikipedia.org/wiki/Caret_notation
>
> If you hit C-f with point before a NUL, you jump over it ; whereas if
> you C-f with point before the two characters ^@ (i.e. not a NUL), cursor
> only jumps over the ^.

Yes, that's what I would expect from a well-behaved Emacs ...

>> Often, but not always, the not matched source-blocks contain @
>> characters (but not NUL chars). The strange thing is that the failed
>> matching happens with these blocks being part of a really big
>> testfile. When I isolate and copy them to a temp buffer and try to match
>> them there, it just works.
>
> If you have a reproducible recipe (even with a big file) it would
> certainly help.

After double-checking my test file again, it seems that the bug was
sitting in front of the computer again. Although that nice library
ert-buffer.el enables me to run buffer tests on real-world files
*without* modifying them, I had some left-over dangling

,-----------
| #+begin_src
`-----------

delimiters in my test file.

I probably called the commands directly (not via ERT) by accident, and
a few things went wrong and left these dangling delimiters in the
original file. After undoing this, the diffs of the ERT test now show
mainly indentation and whitespace differences, which is quite
encouraging.

Conclusion -> NUL chars in regexps do work, if the testfile isn't messed
up. Thx for your input.

-- 
cheers,
Thorsten

* Re: Is it valid to use the zero-byte "^@" in regexps?
  2014-06-18  9:14 Is it valid to use the zero-byte "^@" in regexps? Thorsten Jolitz
  2014-06-18  9:52 ` Nicolas Richard
@ 2014-06-18 11:38 ` Michael Albinus
  2014-06-18 12:15   ` Nicolas Richard
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Albinus @ 2014-06-18 11:38 UTC
  To: Thorsten Jolitz; +Cc: help-gnu-emacs

Thorsten Jolitz <tjolitz@gmail.com> writes:

> Hi List, 

Hi,

> ,--------------------------------------------------------
> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
> `--------------------------------------------------------

"^#\\+begin_src[[:space:]]+emacs-lisp[[:ascii:]]+\n#\\+end_src"

Best regards, Michael.



* Re: Is it valid to use the zero-byte "^@" in regexps?
  2014-06-18 11:38 ` Michael Albinus
@ 2014-06-18 12:15   ` Nicolas Richard
  0 siblings, 0 replies; 7+ messages in thread
From: Nicolas Richard @ 2014-06-18 12:15 UTC
  To: Michael Albinus; +Cc: help-gnu-emacs, Thorsten Jolitz

Michael Albinus <michael.albinus@gmx.de> writes:

> Thorsten Jolitz <tjolitz@gmail.com> writes:
>
>> Hi List, 
>
> Hi,
>
>> ,--------------------------------------------------------
>> | "^#\\+begin_src[[:space:]]+emacs-lisp[^^@]*\n#\\+end_src"
>> `--------------------------------------------------------
>
> "^#\\+begin_src[[:space:]]+emacs-lisp[[:ascii:]]+\n#\\+end_src"

Many characters are not in ASCII (0-127), so [[:ascii:]] fails on blocks
that contain any of them.

To include NULs, this would work:
"^#\\+begin_src[[:space:]]+emacs-lisp\\(?:.\\|\n\\)+\n#\\+end_src"
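
As a rough comparison of the three body patterns discussed in this thread, the sketch below tests each one against a string with a non-ASCII character and a string with a NUL. The function names and sample strings are assumptions made for illustration, not part of any mail above.

```elisp
;; Hypothetical helpers (not from the thread) comparing the three ways
;; of matching a block body: [[:ascii:]], [^NUL] and \(?:.\|\n\).
(defun my/body-class-matches-p (class text)
  "Return t if TEXT consists only of characters matched by CLASS."
  (and (string-match-p (concat "\\`" class "+\\'") text) t))

(defun my/compare-body-classes ()
  "Return a list of (CLASS MATCHES-NON-ASCII MATCHES-NUL) entries."
  (let ((classes '("[[:ascii:]]" "[^\0]" "\\(?:.\\|\n\\)"))
        (non-ascii "caf\u00e9\n")   ; contains e-acute (code 233)
        (with-nul "a\0b\n"))        ; contains a NUL (code 0)
    (mapcar (lambda (class)
              (list class
                    (my/body-class-matches-p class non-ascii)
                    (my/body-class-matches-p class with-nul)))
            classes)))
```

Under these assumptions, [[:ascii:]] accepts the NUL string but rejects the non-ASCII one, the negated NUL does the opposite, and the alternation of "." and newline accepts both.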

-- 
Nico.


