From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Monnier Newsgroups: gmane.emacs.devel Subject: Re: Improving regexp-opt Date: Thu, 07 Feb 2019 22:48:01 -0500 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="214304"; mail-complaints-to="usenet@blaine.gmane.org" User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux) To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 08 04:48:27 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1grx9O-000td6-So for ged-emacs-devel@m.gmane.org; Fri, 08 Feb 2019 04:48:27 +0100 Original-Received: from localhost ([127.0.0.1]:50535 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1grx9N-0006eb-QI for ged-emacs-devel@m.gmane.org; Thu, 07 Feb 2019 22:48:25 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:51107) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1grx9C-0006eG-Mr for emacs-devel@gnu.org; Thu, 07 Feb 2019 22:48:15 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1grx9C-0004he-15 for emacs-devel@gnu.org; Thu, 07 Feb 2019 22:48:14 -0500 Original-Received: from [195.159.176.226] (port=39352 helo=blaine.gmane.org) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1grx9B-0004f1-QD for emacs-devel@gnu.org; Thu, 07 Feb 2019 22:48:13 -0500 Original-Received: from list by blaine.gmane.org with local (Exim 4.89) (envelope-from ) id 1grx99-000tKN-6X for emacs-devel@gnu.org; Fri, 08 Feb 2019 04:48:11 +0100 X-Injected-Via-Gmane: http://gmane.org/ Cancel-Lock: sha1:pVeh711xUf80mQRIgKaCq3n96w8= X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 195.159.176.226 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:233103 Archived-At: > (regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr")) > -> "\\(?:c\\(?:\\(?:a[ad]\\|d[ad]\\|[ad]\\)r\\)\\)" > First of all, there is an (apparently) unnecessary group around the result. FWIW, I think this is not an error: we want (concat (regexp-opt STRS) "*") to have a well-defined behavior (i.e. allow any number of repetitions of STRS). > (regexp-opt '("master" "monster" "mister")) > -> "\\(?:m\\(?:\\(?:on\\|[ai]\\)ster\\)\\)" > It would be better (imo) eliminate unnecessary groups (those without "\\|") > that the result was > "m\\(?:on\\|[ai]\\)ster" Here, OTOH, the second (shy) subgroup is indeed unnecessary. Regarding improving regexp-opt, in the general case you're looking at minimizing finite state automatons. When regexp-opt was written, the main purpose was to try and reduce backtracking and for that it's perfectly sufficient to turn ("ack" "attack") into "a\\(?:ck\\|ttack\\)". I later added "tail sharing" so that ("ack" "attack") turns into "a\\(?:tta\\)?ck" but that's not really much use in practice. We could try and get fancier, but it will tend to slow down regexp-opt even more for rather small benefits (except in corner cases). A much better approach is to go for a real "regexp to NFA/DFA conversion". The `lex.el` package is one such example, but it's very inefficient (in terms of building the FA and in the size of the FA, not in terms of running the FA). Stefan