From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: "Miguel V. S. Frasson" Newsgroups: gmane.emacs.devel Subject: Improving regexp-opt Date: Thu, 7 Feb 2019 14:41:04 -0200 Message-ID: Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000dd2cdf0581508229" Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="186985"; mail-complaints-to="usenet@blaine.gmane.org" To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Feb 07 17:47:29 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1grmpi-000mTf-Es for ged-emacs-devel@m.gmane.org; Thu, 07 Feb 2019 17:47:26 +0100 Original-Received: from localhost ([127.0.0.1]:43170 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1grmph-0006Jy-E1 for ged-emacs-devel@m.gmane.org; Thu, 07 Feb 2019 11:47:25 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:40216) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1grmk5-0004gX-90 for emacs-devel@gnu.org; Thu, 07 Feb 2019 11:41:38 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1grmk0-0003Dv-H7 for emacs-devel@gnu.org; Thu, 07 Feb 2019 11:41:33 -0500 Original-Received: from mail-it1-x12c.google.com ([2607:f8b0:4864:20::12c]:33671) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1grmjn-0003A5-4S for emacs-devel@gnu.org; Thu, 07 Feb 2019 11:41:22 -0500 Original-Received: by mail-it1-x12c.google.com with SMTP id q78so6024886itc.0 for ; Thu, 07 Feb 2019 08:41:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=KrZoNiO8zYj3zgjjcB7qEbEEJ3hClgo5VjAGftNY+Ic=; b=BNPIz2MdPHGhQchnrJIlhPYAlDC41yyCEHhHQc9WKpiqeMkI2zKJjnVmocYDQu8bKo 1myaHjcHqxoNpKEoItfPEsP7Aogxr98Tn0uco9CNrZTtUJsVakzqzdcGVvjTOE+COqKZ 90H60QC0PIEGZ1sI7paGcwXSHqZ8ZLwVkhOJnABFljAzbRnHx/8PrypkmKGicC3PvqjC Ou8b4HIClzF0ZNi/37bAGEHedufdQEE8J9TBo92KoIuFs2XYJ9K//rII+HQFmobU0dys kptP6L618L1u7yH7fZyJFX5DYnZEtjzFjl8u7RLnE/eB56JMX4F4+7WrRVGGG0sXpoPg DEog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=KrZoNiO8zYj3zgjjcB7qEbEEJ3hClgo5VjAGftNY+Ic=; b=DPW7NXdr41rP3hOvrhWKXUTWXHkOw97q4IACl6jR696Uh8xsvMaVtV54IOQCSjR3EP mWbyDJwG4tTFWASnWKtHKZgSWUG8WAg9sOXFyupubF0GRlBmPQtemMjvgDIu1qi/ofGd g0nPa9PKjQF1hqUzShIYkKQxCSnAM3chd20mFe8vh1Ltaqh6rCLL0XNlWeYPFm02egZx 7//vXJHCgse6fXd/zaIH/3mJIeXNZ9m5AXJlxUMn/hjfUvQ9nq/I4oGZB67MkFb++l5h axSh/cYDu4uXfERKv2+8U1SInBNTjiiUFTbreFdXu27TNtvM9fjp4lLHYsKlaVfiwf/f tNZQ== X-Gm-Message-State: AHQUAuZgOufTEyPF9igZn0TeNciRZ/DMxAE7v2807UPAzLWHtmrmDno/ qHCU2ZLQlyIrnyTfk0vlhXdIrjnPeFZUavH4fI1lqA== X-Google-Smtp-Source: AHgI3IY2hIXc4IS9uDqPRpzQczrSfnmAYySjAdiQq5ZSLjG1UyyqGSCLJ5IXyTQ/EiQdnjeGkrO+8G8SbXDcdcCcE5s= X-Received: by 2002:a24:ac62:: with SMTP id m34mr5426883iti.129.1549557676451; Thu, 07 Feb 2019 08:41:16 -0800 (PST) X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4864:20::12c X-Mailman-Approved-At: Thu, 07 Feb 2019 11:45:47 -0500 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:233096 Archived-At: --000000000000dd2cdf0581508229 Content-Type: text/plain; charset="UTF-8" Hi I had an idea to improve regexp-opt. (I use Emacs 25.3.2). In a regexp when you have a group with alternatives, sometimes all alternatives *finish* with one or more common atom regexps. You could take the common part out of the group and try to improve the remaining smaller group, splitting all strings that match and recurse regexp-opt. Example 1: we read from regexp-opt.el: ;; One possible improvement would be to compile '("aa" "ab" "ba" "bb") ;; into "[ab][ab]" rather than "a[ab]\\|b[ab]". I'm not sure it's worth ;; it but if someone knows how to do it without going through too many ;; contortions, I'm all ears. Evaluating, (regexp-opt '("aa" "ab" "ba" "bb")) -> "\\(?:a[ab]\\|b[ab]\\)" All alternatives finish with "[ab]", so it is equivalent to "\\(?:a\\|b\\)[ab]" The remaining group is "\\(?:a\\|b\\)". We can further improve making a list of all strings that match it and recourse regexp-opt: "\\(?:a\\|b\\)" all-matchs = ("a" "b") (regexp-opt '("a" "b")) -> "[ab]" Finally join the two regexps: "[ab][ab]" ... and the wish of the developer is fulfilled :) Example 2: (regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr")) -> "\\(?:c\\(?:\\(?:a[ad]\\|d[ad]\\|[ad]\\)r\\)\\)" First of all, there is an (apparently) unnecessary group around the result. Look the inner group of the result: "\\(?:a[ad]\\|d[ad]\\|[ad]\\)" Notice that all alternatives finish with "[ad]", so this group is equivalent to "\\(?:a\\|d\\|\\)[ad]" The smaller group matches the strings ("a" "d" "") and (regexp-opt '("a" "d" "")) ->"[ad]?" and the inner group is equivalent to "[ad]?[ad]" So, (regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr")) could return (eliminating outer group) "c[ad]?[ad]r" Of course the splitting and recurse not always improve smaller group. For example, (regexp-opt '("master" "monster" "mister")) -> "\\(?:m\\(?:\\(?:on\\|[ai]\\)ster\\)\\)" It would be better (imo) eliminate unnecessary groups (those without "\\|") that the result was "m\\(?:on\\|[ai]\\)ster" Cheers Miguel Frasson --000000000000dd2cdf0581508229 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi

I had an idea to improve regexp-opt. (I use Emacs 25.3.2).

In a regexp when you have= a group with alternatives, sometimes all alternatives *finish* with one or= more common atom regexps. You could take the common part out of the group = and try to improve the remaining smaller group, splitting all strings that = match and recurse regexp-opt.

Example 1: we read from regexp-opt.el:
;; One = possible improvement would be to compile '("aa" "ab"= ; "ba" "bb")
;; into "[ab][= ab]" rather than "a[ab]\\|b[ab]".=C2=A0 I'm not sure it&= #39;s worth
;; it but if someone knows how to do it = without going through too many
;; contortions, I'= ;m all ears.

Evaluating,=
(regexp-opt '("aa" "ab"=C2= =A0 "ba" "bb"))
=C2=A0-> &quo= t;\\(?:a[ab]\\|b[ab]\\)"

All alternatives finish with "[ab]", so it is equivalent to=
=C2=A0 "\\(?:a\\|b\\)[ab]"

The remaining group is "\\(?:a\\= |b\\)". We can further improve making a list of all strings that match= it and recourse regexp-opt:
"\\(?:a\\|b\\)&quo= t;
all-matchs =3D ("a" "b")
(regexp-opt '("a" "b")) -> &q= uot;[ab]"

Finally j= oin the two regexps: "[ab][ab]"
... and th= e wish of the developer is fulfilled :)=C2=A0

Example 2:
(regexp-opt '(&= quot;car" "cdr" "caar" "cadr" "cdar= " "cddr"))
-> "\\(?:c\\(?:\\(= ?:a[ad]\\|d[ad]\\|[ad]\\)r\\)\\)"

First of all, there is an (apparently) unnecessary group aro= und the result.=C2=A0

Lo= ok the inner group of the result:
=C2=A0 "\\(?:= a[ad]\\|d[ad]\\|[ad]\\)"
Notice that all altern= atives finish with "[ad]", so this group is equivalent to
=C2=A0"\\(?:a\\|d\\|\\)[ad]"=C2=A0

The smaller group matches the strings= ("a" "d" "") and
=C2= =A0 (regexp-opt '("a" "d" "")) ->"= ;[ad]?"
and the inner group is equivalent to=C2= =A0
=C2=A0"[ad]?[ad]"

So,
=C2=A0 (regexp-op= t '("car" "cdr" "caar" "cadr" &= quot;cdar" "cddr"))
could return (eli= minating outer group)=C2=A0
=C2=A0 "c[ad]?[ad]r= "

Of course the spl= itting and recurse not always improve smaller group. For example,=C2=A0
(regexp-opt '("master" "monster"= ; "mister"))
-> "\\(?:m\\(?:\\(?:o= n\\|[ai]\\)ster\\)\\)"

It would be better (imo) eliminate unnecessary groups (those without &q= uot;\\|") that the result was
=C2=A0 "m\\(= ?:on\\|[ai]\\)ster"

Cheers

Miguel Frasson

--000000000000dd2cdf0581508229--