unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* [BUG] Regexp compiler, problem with character classes
@ 2006-06-03  1:14 Johan Bockgård
  2006-09-07 21:15 ` Richard Stallman
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Johan Bockgård @ 2006-06-03  1:14 UTC (permalink / raw)



[I'm resending this because I think it's a serious bug. It makes
character classes totally unreliable.]

Character classes are translated to character alternatives during the
regexp compile phase. This is wrong, since the syntax table should be
taken into account during the actual matching. This may be non-trivial
to fix.


    (with-temp-buffer
      (list
       (progn (modify-syntax-entry ?a " ")
              (string-match "x[[:space:]]" "xa"))
       (progn (modify-syntax-entry ?a "w")
              (string-match "x[[:space:]]" "xa"))))
    => (0 0)



0:      /exactn/1/x
3:      /charset [\t\f a\302\200-\303\277]
37:     /succeed
38:     end of pattern.

Compiling pattern: x[[:space:]]

Compiled pattern: 
38 bytes used/174 bytes allocated.
fastmap: x
re_nsub: 0      regs_alloc: 0   can_be_null: 0  no_sub: 0       not_bol: 0      not_eol: 0      syntax: 340204
0:      /exactn/1/x
3:      /charset [\t\f a\302\200-\303\277]
37:     /succeed
38:     end of pattern.
0:      /exactn/1/x
3:      /charset [\t\f a\302\200-\303\277]
37:     /succeed
38:     end of pattern.



As an effect you get the behavior below, since the compiler takes no
care to setup the syntax in the first place:


1)

    emacs -Q

    (with-temp-buffer
      (string-match "x[[:space:]]" "x\n"))

    => nil

(exit Emacs)


2)
    emacs -Q

    (with-temp-buffer
      (char-syntax ?\n)
      (string-match "x[[:space:]]" "x\n"))

    => 0


(Fchar_syntax does
    gl_state.current_syntax_table = current_buffer->syntax_table;)

-- 
This is bad.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-06-03  1:14 [BUG] Regexp compiler, problem with character classes Johan Bockgård
@ 2006-09-07 21:15 ` Richard Stallman
  2006-09-13  9:50   ` Johan Bockgård
  2006-09-07 21:15 ` Richard Stallman
  2006-09-15  3:14 ` Richard Stallman
  2 siblings, 1 reply; 13+ messages in thread
From: Richard Stallman @ 2006-09-07 21:15 UTC (permalink / raw)
  Cc: emacs-devel

    (with-temp-buffer
      (string-match "x[[:space:]]" "x\n"))

    => nil

    (Fchar_syntax does
	gl_state.current_syntax_table = current_buffer->syntax_table;)

re_search_2 does it also, thru the macro
SETUP_SYNTAX_TABLE_FOR_OBJECT.  I just verified that
gl_state.current_buffer->syntax_table has the proper value after that
line.

So I do not understand why string-match returns the wrong value.
There must be a reason, but it is not such a simple reason.
Can someone debug it and figure that out?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-06-03  1:14 [BUG] Regexp compiler, problem with character classes Johan Bockgård
  2006-09-07 21:15 ` Richard Stallman
@ 2006-09-07 21:15 ` Richard Stallman
  2006-09-14 23:20   ` Chong Yidong
  2006-09-15  3:14 ` Richard Stallman
  2 siblings, 1 reply; 13+ messages in thread
From: Richard Stallman @ 2006-09-07 21:15 UTC (permalink / raw)
  Cc: emacs-devel

	(with-temp-buffer
	  (list
	   (progn (modify-syntax-entry ?a " ")
		  (string-match "x[[:space:]]" "xa"))
	   (progn (modify-syntax-entry ?a "w")
		  (string-match "x[[:space:]]" "xa"))))
	=> (0 0)

The easiest way to fix that bug
is to make each element of the compiled regexp cache
specify the syntax table that it corresponds to,
and make modify-syntax-entry clear the cache.
That way, no change in regex.c is needed.

Could someone please do this and then ack?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-07 21:15 ` Richard Stallman
@ 2006-09-13  9:50   ` Johan Bockgård
  2006-09-13 19:25     ` Richard Stallman
  0 siblings, 1 reply; 13+ messages in thread
From: Johan Bockgård @ 2006-09-13  9:50 UTC (permalink / raw)


Richard Stallman <rms@gnu.org> writes:

>     (with-temp-buffer
>       (string-match "x[[:space:]]" "x\n"))
>
>     => nil
>
>     (Fchar_syntax does
> 	gl_state.current_syntax_table = current_buffer->syntax_table;)
>
> re_search_2 does it also, thru the macro
> SETUP_SYNTAX_TABLE_FOR_OBJECT.  I just verified that
> gl_state.current_buffer->syntax_table has the proper value after that
> line.

Yes, but that is too late. What matters is which syntax is seen by the
regexp *compiler* (this is why there is a problem in the first place)
when it transforms [[:space:]] into [\t\f etc]. This has already
happened before re_search_2 comes into play.

-- 
Johan Bockgård

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-13  9:50   ` Johan Bockgård
@ 2006-09-13 19:25     ` Richard Stallman
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Stallman @ 2006-09-13 19:25 UTC (permalink / raw)
  Cc: emacs-devel

    Yes, but that is too late. What matters is which syntax is seen by the
    regexp *compiler* (this is why there is a problem in the first place)
    when it transforms [[:space:]] into [\t\f etc]. This has already
    happened before re_search_2 comes into play.

I see.  Thanks.

Does this fix that problem?

*** regex.c	22 Feb 2006 13:56:52 -0500	1.211
--- regex.c	13 Sep 2006 15:22:57 -0400	
***************
*** 6197,6202 ****
--- 6197,6206 ----
  {
    reg_errcode_t ret;
  
+ #ifdef emacs
+   gl_state.current_syntax_table = current_buffer->syntax_table;
+ #endif
+ 
    /* GNU code is written to assume at least RE_NREGS registers will be set
       (and at least one extra will be -1).  */
    bufp->regs_allocated = REGS_UNALLOCATED;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-07 21:15 ` Richard Stallman
@ 2006-09-14 23:20   ` Chong Yidong
  2006-09-15 14:29     ` Richard Stallman
  2006-09-18  8:43     ` Johan Bockgård
  0 siblings, 2 replies; 13+ messages in thread
From: Chong Yidong @ 2006-09-14 23:20 UTC (permalink / raw)
  Cc: emacs-devel, Johan Bockgård

Richard Stallman <rms@gnu.org> writes:

> 	(with-temp-buffer
> 	  (list
> 	   (progn (modify-syntax-entry ?a " ")
> 		  (string-match "x[[:space:]]" "xa"))
> 	   (progn (modify-syntax-entry ?a "w")
> 		  (string-match "x[[:space:]]" "xa"))))
> 	=> (0 0)
>
> The easiest way to fix that bug
> is to make each element of the compiled regexp cache
> specify the syntax table that it corresponds to,
> and make modify-syntax-entry clear the cache.
> That way, no change in regex.c is needed.
>
> Could someone please do this and then ack?

Does this patch look OK?

*** emacs/src/search.c.~1.212.~	2006-09-08 16:32:57.000000000 -0400
--- emacs/src/search.c	2006-09-14 19:16:24.000000000 -0400
***************
*** 207,212 ****
--- 207,223 ----
      }
  }
  
+ void
+ clear_regexp_cache ()
+ {
+   int i;
+ 
+   BLOCK_INPUT;
+   for (i = 0; i < REGEXP_CACHE_SIZE; ++i)
+     searchbufs[i].regexp = Qnil;
+   UNBLOCK_INPUT;
+ }
+ 
  /* Compile a regexp if necessary, but first check to see if there's one in
     the cache.
     PATTERN is the pattern to compile.
*** /home/cyd/emacs/src/syntax.c.~1.199.~	2006-07-21 14:57:09.000000000 -0400
--- /home/cyd/emacs/src/syntax.c	2006-09-14 19:18:10.000000000 -0400
***************
*** 1039,1044 ****
--- 1039,1046 ----
      check_syntax_table (syntax_table);
  
    SET_RAW_SYNTAX_ENTRY (syntax_table, XINT (c), Fstring_to_syntax (newentry));
+   clear_regexp_cache ();
+ 
    return Qnil;
  }
  \f

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-06-03  1:14 [BUG] Regexp compiler, problem with character classes Johan Bockgård
  2006-09-07 21:15 ` Richard Stallman
  2006-09-07 21:15 ` Richard Stallman
@ 2006-09-15  3:14 ` Richard Stallman
  2 siblings, 0 replies; 13+ messages in thread
From: Richard Stallman @ 2006-09-15  3:14 UTC (permalink / raw)
  Cc: emacs-devel

[I sent this message a week ago but did not get a response.]

	(with-temp-buffer
	  (list
	   (progn (modify-syntax-entry ?a " ")
		  (string-match "x[[:space:]]" "xa"))
	   (progn (modify-syntax-entry ?a "w")
		  (string-match "x[[:space:]]" "xa"))))
	=> (0 0)

The easiest way to fix that bug
is to make each element of the compiled regexp cache
specify the syntax table that it corresponds to,
and make modify-syntax-entry clear the cache.
That way, no change in regex.c is needed.

Could someone please do this and then ack?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-14 23:20   ` Chong Yidong
@ 2006-09-15 14:29     ` Richard Stallman
  2006-09-15 15:13       ` Chong Yidong
  2006-09-18  8:43     ` Johan Bockgård
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Stallman @ 2006-09-15 14:29 UTC (permalink / raw)
  Cc: emacs-devel, bojohan+mail

    + void
    + clear_regexp_cache ()
    + {
    +   int i;
    + 
    +   BLOCK_INPUT;
    +   for (i = 0; i < REGEXP_CACHE_SIZE; ++i)
    +     searchbufs[i].regexp = Qnil;
    +   UNBLOCK_INPUT;
    + }

1. That leaks the memory in the compiled regexps.

2. I don't see a reason for BLOCK_INPUT.
I don't think anything in a signal handler can compile a regexp.
Is that not so?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-15 14:29     ` Richard Stallman
@ 2006-09-15 15:13       ` Chong Yidong
  0 siblings, 0 replies; 13+ messages in thread
From: Chong Yidong @ 2006-09-15 15:13 UTC (permalink / raw)
  Cc: emacs-devel, bojohan+mail

Richard Stallman <rms@gnu.org> writes:

>     + void
>     + clear_regexp_cache ()
>     + {
>     +   int i;
>     + 
>     +   BLOCK_INPUT;
>     +   for (i = 0; i < REGEXP_CACHE_SIZE; ++i)
>     +     searchbufs[i].regexp = Qnil;
>     +   UNBLOCK_INPUT;
>     + }
>
> 1. That leaks the memory in the compiled regexps.

Are you sure?  AFAICT, re_compile_pattern automagically manages the
memory allocated in each re_pattern_buffer struct, based on the value
of bufp->allocated and bufp->buffer.  If we reset searchbuf->regexp to
Qnil, that means that cache element can be used to store a compiled
regexp, and the memory used by any compiled regexp (i.e., the
re_pattern_buffer) previously existing in that cache element is
reused.

This seems to be the existing practice in search.c: the cache elements
are initialized in syms_of_search as

  for (i = 0; i < REGEXP_CACHE_SIZE; ++i)
    {
      searchbufs[i].buf.allocated = 100;
      searchbufs[i].buf.buffer = (unsigned char *) xmalloc (100);
      ...
      searchbufs[i].regexp = Qnil;
      ...
    }

When compile_pattern is called with an uncached regexp, it tries to
cache it in an empty cache element (i.e., one with a nil `regexp'
entry).  If no cache elements are empty, it uses the oldest cache
element by resetting its `regexp' entry and passing it along to
re_compile_pattern.

> 2. I don't see a reason for BLOCK_INPUT.
> I don't think anything in a signal handler can compile a regexp.

That's probably true.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-14 23:20   ` Chong Yidong
  2006-09-15 14:29     ` Richard Stallman
@ 2006-09-18  8:43     ` Johan Bockgård
  2006-09-18 12:53       ` Chong Yidong
  1 sibling, 1 reply; 13+ messages in thread
From: Johan Bockgård @ 2006-09-18  8:43 UTC (permalink / raw)


Chong Yidong <cyd@stupidchicken.com> writes:

> Richard Stallman <rms@gnu.org> writes:
>
>> The easiest way to fix that bug is to make each element of the
>> compiled regexp cache specify the syntax table that it corresponds
>> to, and make modify-syntax-entry clear the cache. That way, no
>> change in regex.c is needed.
>>
>> Could someone please do this and then ack?
>
> Does this patch look OK? [...]

The "make each element of the compiled regexp cache specify the syntax
table that it corresponds to" part is still needed.

Here's an example:

    $ emacs -Q

    Evaluate in *scratch*:

      (list
       (string-match "x[[:space:]]" "x\n")
       (with-temp-buffer
         (string-match "x[[:space:]]" "x\n")))

    => (nil nil)

Expected: (nil 0)

--
Johan Bockgård

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-18  8:43     ` Johan Bockgård
@ 2006-09-18 12:53       ` Chong Yidong
  2006-09-18 13:03         ` Stefan Monnier
  2006-09-18 13:12         ` Johan Bockgård
  0 siblings, 2 replies; 13+ messages in thread
From: Chong Yidong @ 2006-09-18 12:53 UTC (permalink / raw)


bojohan+news@dd.chalmers.se (Johan Bockgård) writes:

>>> The easiest way to fix that bug is to make each element of the
>>> compiled regexp cache specify the syntax table that it corresponds
>>> to, and make modify-syntax-entry clear the cache. That way, no
>>> change in regex.c is needed.
>> Does this patch look OK? [...]
>
> The "make each element of the compiled regexp cache specify the syntax
> table that it corresponds to" part is still needed.
>
> Here's an example:
>
>     $ emacs -Q
>
>     Evaluate in *scratch*:
>
>       (list
>        (string-match "x[[:space:]]" "x\n")
>        (with-temp-buffer
>          (string-match "x[[:space:]]" "x\n")))
>
>     => (nil nil)
>
> Expected: (nil 0)

You're right; the following additional patch should fix that.

*** emacs/src/search.c.~1.213.~	2006-09-18 08:39:59.000000000 -0400
--- emacs/src/search.c	2006-09-18 08:51:50.000000000 -0400
***************
*** 41,47 ****
  struct regexp_cache
  {
    struct regexp_cache *next;
!   Lisp_Object regexp, whitespace_regexp;
    struct re_pattern_buffer buf;
    char fastmap[0400];
    /* Nonzero means regexp was compiled to do full POSIX backtracking.  */
--- 41,47 ----
  struct regexp_cache
  {
    struct regexp_cache *next;
!   Lisp_Object regexp, whitespace_regexp, syntax_table;
    struct re_pattern_buffer buf;
    char fastmap[0400];
    /* Nonzero means regexp was compiled to do full POSIX backtracking.  */
***************
*** 167,172 ****
--- 167,173 ----
    cp->posix = posix;
    cp->buf.multibyte = multibyte;
    cp->whitespace_regexp = Vsearch_spaces_regexp;
+   cp->syntax_table = current_buffer->syntax_table;
    /* Doing BLOCK_INPUT here has the effect that
       the debugger won't run if an error occurs.
       Why is BLOCK_INPUT needed here?  */
***************
*** 256,261 ****
--- 257,263 ----
  	  && EQ (cp->buf.translate, (! NILP (translate) ? translate : make_number (0)))
  	  && cp->posix == posix
  	  && cp->buf.multibyte == multibyte
+ 	  && EQ (cp->syntax_table, current_buffer->syntax_table)
  	  && !NILP (Fequal (cp->whitespace_regexp, Vsearch_spaces_regexp)))
  	break;
  
***************
*** 3114,3121 ****
--- 3116,3125 ----
        searchbufs[i].buf.fastmap = searchbufs[i].fastmap;
        searchbufs[i].regexp = Qnil;
        searchbufs[i].whitespace_regexp = Qnil;
+       searchbufs[i].syntax_table = Qnil;
        staticpro (&searchbufs[i].regexp);
        staticpro (&searchbufs[i].whitespace_regexp);
+       staticpro (&searchbufs[i].syntax_table);
        searchbufs[i].next = (i == REGEXP_CACHE_SIZE-1 ? 0 : &searchbufs[i+1]);
      }
    searchbuf_head = &searchbufs[0];

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-18 12:53       ` Chong Yidong
@ 2006-09-18 13:03         ` Stefan Monnier
  2006-09-18 13:12         ` Johan Bockgård
  1 sibling, 0 replies; 13+ messages in thread
From: Stefan Monnier @ 2006-09-18 13:03 UTC (permalink / raw)
  Cc: emacs-devel

>>>> The easiest way to fix that bug is to make each element of the
>>>> compiled regexp cache specify the syntax table that it corresponds
>>>> to, and make modify-syntax-entry clear the cache. That way, no
>>>> change in regex.c is needed.
>>> Does this patch look OK? [...]
>> 
>> The "make each element of the compiled regexp cache specify the syntax
>> table that it corresponds to" part is still needed.
>> 
>> Here's an example:
>> 
>> $ emacs -Q
>> 
>> Evaluate in *scratch*:
>> 
>> (list
>> (string-match "x[[:space:]]" "x\n")
>> (with-temp-buffer
>> (string-match "x[[:space:]]" "x\n")))
>> 
>> => (nil nil)
>> 
>> Expected: (nil 0)

> You're right; the following additional patch should fix that.

Looks good.  But please add a comment/todo item that the syntax-table should
only be stored&checked in the (rare) case that something like [[:space:]]
is used.


        Stefan


> *** emacs/src/search.c.~1.213.~	2006-09-18 08:39:59.000000000 -0400
> --- emacs/src/search.c	2006-09-18 08:51:50.000000000 -0400
> ***************
> *** 41,47 ****
>   struct regexp_cache
>   {
>     struct regexp_cache *next;
> !   Lisp_Object regexp, whitespace_regexp;
>     struct re_pattern_buffer buf;
>     char fastmap[0400];
>     /* Nonzero means regexp was compiled to do full POSIX backtracking.  */
> --- 41,47 ----
>   struct regexp_cache
>   {
>     struct regexp_cache *next;
> !   Lisp_Object regexp, whitespace_regexp, syntax_table;
>     struct re_pattern_buffer buf;
>     char fastmap[0400];
>     /* Nonzero means regexp was compiled to do full POSIX backtracking.  */
> ***************
> *** 167,172 ****
> --- 167,173 ----
cp-> posix = posix;
cp-> buf.multibyte = multibyte;
cp-> whitespace_regexp = Vsearch_spaces_regexp;
> +   cp->syntax_table = current_buffer->syntax_table;
>     /* Doing BLOCK_INPUT here has the effect that
>        the debugger won't run if an error occurs.
>        Why is BLOCK_INPUT needed here?  */
> ***************
> *** 256,261 ****
> --- 257,263 ----
>   	  && EQ (cp->buf.translate, (! NILP (translate) ? translate : make_number (0)))
>   	  && cp->posix == posix
>   	  && cp->buf.multibyte == multibyte
> + 	  && EQ (cp->syntax_table, current_buffer->syntax_table)
>   	  && !NILP (Fequal (cp->whitespace_regexp, Vsearch_spaces_regexp)))
>   	break;
  
> ***************
> *** 3114,3121 ****
> --- 3116,3125 ----
>         searchbufs[i].buf.fastmap = searchbufs[i].fastmap;
>         searchbufs[i].regexp = Qnil;
>         searchbufs[i].whitespace_regexp = Qnil;
> +       searchbufs[i].syntax_table = Qnil;
>         staticpro (&searchbufs[i].regexp);
>         staticpro (&searchbufs[i].whitespace_regexp);
> +       staticpro (&searchbufs[i].syntax_table);
>         searchbufs[i].next = (i == REGEXP_CACHE_SIZE-1 ? 0 : &searchbufs[i+1]);
>       }
>     searchbuf_head = &searchbufs[0];


> _______________________________________________
> Emacs-devel mailing list
> Emacs-devel@gnu.org
> http://lists.gnu.org/mailman/listinfo/emacs-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] Regexp compiler, problem with character classes
  2006-09-18 12:53       ` Chong Yidong
  2006-09-18 13:03         ` Stefan Monnier
@ 2006-09-18 13:12         ` Johan Bockgård
  1 sibling, 0 replies; 13+ messages in thread
From: Johan Bockgård @ 2006-09-18 13:12 UTC (permalink / raw)


Chong Yidong <cyd@stupidchicken.com> writes:

> You're right; the following additional patch should fix that.

Thanks a lot.

-- 
Johan Bockgård

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2006-09-18 13:12 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-03  1:14 [BUG] Regexp compiler, problem with character classes Johan Bockgård
2006-09-07 21:15 ` Richard Stallman
2006-09-13  9:50   ` Johan Bockgård
2006-09-13 19:25     ` Richard Stallman
2006-09-07 21:15 ` Richard Stallman
2006-09-14 23:20   ` Chong Yidong
2006-09-15 14:29     ` Richard Stallman
2006-09-15 15:13       ` Chong Yidong
2006-09-18  8:43     ` Johan Bockgård
2006-09-18 12:53       ` Chong Yidong
2006-09-18 13:03         ` Stefan Monnier
2006-09-18 13:12         ` Johan Bockgård
2006-09-15  3:14 ` Richard Stallman

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).