all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Michal Nazarewicz <mina86@mina86.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 24020@debbugs.gnu.org
Subject: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Date: Mon, 18 Jul 2016 20:07:18 +0200	[thread overview]
Message-ID: <xa1tk2gi252h.fsf@mina86.com> (raw)
In-Reply-To: <83r3ar0z0u.fsf@gnu.org>

On Mon, Jul 18 2016, Eli Zaretskii wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>
>> Date: Mon, 18 Jul 2016 16:04:44 +0200
>> 
>> mutually_exclusive_p did not check for the claass bits of an charset
>> opcode when comparing it with an exactn which resulted in situation
>> where it thought a multibyte character could not match the character
>> class.
>> 
>> This assumption caused incorrect optimisation of the regular expression
>> and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.
>> 
>> The issue affects all multibyte word characters as well as other
>> character classes which may match multibyte characters.
>
> Thanks.
>
> Unfortunately, the above description is too terse for me to understand
> the issue and the way you propose to fix it.  Could you please provide
> more details,

‘[[:word:]]*\u2620’ ends up being as:

0:	/on_failure_keep_string_jump to 28
3:	/charset [ !extends past end of pattern! $-%0-9A-Za-z]has-range-table
25:	/jump to 3
28:	/exactn/3/â// 
33:	/succeed
34:	end of pattern.

while ‘\sw*\u2620’ as:

0:	/on_failure_jump to 8
3:	/syntaxspec/2
5:	/jump to 0
8:	/exactn/3/â// 
13:	/succeed
14:	end of pattern.

Apart from a different opcode to match the word class the crux of the
difference is the first opcode: on_failure_keep_string_jump
vs. on_failure_jump.

As a matter of fact, regex_compile puts a on_failure_jump_smart opcode
at the beginning which is optimised by re_match_2_internal (debug code
removed for brevity):

	/* This operation is used for greedy *.
	   Compare the beginning of the repeat with what in the
	   pattern follows its end. If we can establish that there
	   is nothing that they would both match, i.e., that we
	   would have to backtrack because of (as in, e.g., `a*a')
	   then we can use a non-backtracking loop based on
	   on_failure_keep_string_jump instead of on_failure_jump.  */
	case on_failure_jump_smart:
	  EXTRACT_NUMBER_AND_INCR (mcnt, p);
	  {
	    re_char *p1 = p; /* Next operation.  */
	    /* Here, we discard `const', making re_match non-reentrant.  */
	    unsigned char *p2 = (unsigned char*) p + mcnt; /* Jump dest.  */
	    unsigned char *p3 = (unsigned char*) p - 3; /* opcode location.  */

	    p -= 3;		/* Reset so that we will re-execute the
				   instruction once it's been changed. */

	    EXTRACT_NUMBER (mcnt, p2 - 2);

	    /* Ensure this is a indeed the trivial kind of loop
	       we are expecting.  */
	    if (mutually_exclusive_p (bufp, p1, p2))
	      {
		/* Use a fast `on_failure_keep_string_jump' loop.  */
		*p3 = (unsigned char) on_failure_keep_string_jump;
		STORE_NUMBER (p2 - 2, mcnt + 3);
	      }
	    else
	      {
		/* Default to a safe `on_failure_jump' loop.  */
		*p3 = (unsigned char) on_failure_jump;
	      }
	  }

In other words, in our example, the code checks whether ‘[[:word:]]’ can
match ‘💀’.  If it cannot than we can be greedy about matching
‘[[:word:]]*’ and never backtrace looking for a shorter match; if it
can, we may need to backtrace if the overall matching fails.

mutually_exclusive_p concludes that ‘[[:word:]]’ cannot match ‘💀’ (or
any non-ASCII characters really) but as a matter of fact, word class
does match skull character.

So when ‘[[:word:]]*💀’ matches ‘foo💀’ the following happens:
1. ‘[[:word:]]*’ matches the whole string.
2. String is now empty so ‘💀’ doesn’t match.
3. Because of incorrect assumptions, the engine does not shorten the
   initial ‘[[:word:]]*’ match.

(I may be butchering the exact terms and algorithm that is being applied
but the general idea is, I hope, shown).

> Note that some of the classes deliberately don't work on multibyte
> characters, and are documented as such.

This is irrelevant.  ‘[[:word:]]*’ matches ‘foo’ thus ‘[[:word:]]*b’
must match ‘foob’ (which it does) and ‘[[:word:]]*☠’ must match ‘foo☠’
(which it doesn’t).

> including what problems you saw in classes other than [:word:]?

The problem happens for any class which matches multibyte characters.
For example:

    (string-match-p "[[:alpha:]]*" "żółć")
    => 0
    (string-match-p "[[:alpha:]]*w" "żółw")
    => 0
    (string-match-p "[[:alpha:]]*ć" "żółć")
    => nil  (should be 0)
    ;; or even more simply:
    (string-match-p "[[:alpha:]]*ć" "ż")
    => nil  (should be 0)

In general, for a class FOO, if a multibyte character C matches that
class regex "[[:FOO]]*C") should match the character itself but it
doesn’t.

> So if we are changing that, there should be documentation changes and
> an entry in NEWS as well (but I suggest not to make such changes too
> easily, not without measuring the impact on performance, if any).
>
>> * src/regex.c (executing_charset): A new function for executing the
>> charset and charset_not opcodes.  It performs check on the character
>> taking into consideration existing bitmap, rang table and class bits.
>                                              ^^^^
> A typo.
>
>> +#ifdef emacs
>> +  else if (rtp)
>> +    {
>> +      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
>> +      re_wchar_t range_start, range_end;
>> +
>> +  /* Sort tests by the most commonly used classes with some adjustment to which
>> +     tests are easiest to perform.  Frequencies of character class names as of
>> +     2016-07-15:
>
> Not sure what files you used for this.  Are those Emacs source files?
>
>> diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
>> new file mode 100644
>> index 0000000..a2dd4f0
>> --- /dev/null
>> +++ b/test/src/regex-tests.el
>> @@ -0,0 +1,75 @@
>> +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
>        ^^^^^^^^^^^^^^^
>
> Copy-paste error.
>
>> +;;; buffer-tests.el ends here
>
> And another one.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»





  reply	other threads:[~2016-07-18 18:07 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-18 14:04 bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ Michal Nazarewicz
2016-07-18 15:03 ` Eli Zaretskii
2016-07-18 18:07   ` Michal Nazarewicz [this message]
2016-07-18 23:30   ` bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020) Michal Nazarewicz
2016-07-19  8:00     ` Andreas Schwab
2016-07-20 12:36       ` Michal Nazarewicz
2016-07-25 21:54         ` Michal Nazarewicz
2016-07-27 16:22 ` bug#24020: [PATCH] Fix ‘is multibyte’ test regex.c’s mutually_exclusive_p (bug#24020) Michal Nazarewicz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xa1tk2gi252h.fsf@mina86.com \
    --to=mina86@mina86.com \
    --cc=24020@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.