From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Stefan Monnier Newsgroups: gmane.emacs.devel Subject: Re: Regular expression libraries Date: Fri, 16 Dec 2016 00:15:18 -0500 Message-ID: References: <01d7e608-04d2-84a4-6143-e954bc9d569f@mit.edu> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Trace: blaine.gmane.org 1481865380 1074 195.159.176.226 (16 Dec 2016 05:16:20 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Fri, 16 Dec 2016 05:16:20 +0000 (UTC) User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux) To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Dec 16 06:16:10 2016 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cHksL-0006oa-D2 for ged-emacs-devel@m.gmane.org; Fri, 16 Dec 2016 06:16:09 +0100 Original-Received: from localhost ([::1]:58528 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cHksP-0007ll-Gq for ged-emacs-devel@m.gmane.org; Fri, 16 Dec 2016 00:16:13 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:45003) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cHkrp-0007gl-Bs for emacs-devel@gnu.org; Fri, 16 Dec 2016 00:15:38 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cHkrl-0002CM-AE for emacs-devel@gnu.org; Fri, 16 Dec 2016 00:15:37 -0500 Original-Received: from [195.159.176.226] (port=47423 helo=blaine.gmane.org) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1cHkrk-0002Bj-Pj for emacs-devel@gnu.org; Fri, 16 Dec 2016 00:15:33 -0500 Original-Received: from list by blaine.gmane.org with local (Exim 4.84_2) (envelope-from ) id 1cHkrc-0008G3-3p for emacs-devel@gnu.org; Fri, 16 Dec 2016 06:15:24 +0100 X-Injected-Via-Gmane: http://gmane.org/ Original-Lines: 50 Original-X-Complaints-To: usenet@blaine.gmane.org Cancel-Lock: sha1:WV0N9Az+fk4sSJBnWyLmXI1d18s= X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 195.159.176.226 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:210497 Archived-At: > * Emacs has special regexp features based on syntax classes, or the position of the point. These features don't seem to be supported in gnulib nor glibc > * Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the subject string The gnulib regexp code used to be a derivative of the code we have in Emacs (or rather they both derived from the same codebase). Since then it's been replaced with a completely different codebase. But IIRC the two did merge back (or close enough) before gnulib' code was replaced with something else. > Stefan also expressed a desire to move to a better regular expression > engine — ideally one that uses a non-backtracking implementation when > possible, to avoid the exponential blowups that Emacs currently > suffers from on certain regular expressions. >From my point of view, the main problem with the current code is the blowup in pathological cases, and it's the only one that justifies moving to another implementation. Some users may want to change the engine so as to get new features, but FWIW, I think it'd be much easier to just add the features to the current engine. The main reason such features are currently lacking is that I've been resisting their addition because I don't want such features to later make it harder to switch to a non-backtracking engine. > * There are only two usable, non-backtracking regexp libraries: What about the one in glibc? > - TRE (http://laurikari.net/tre/) uses backtracking if needed, but > otherwise uses a finite automaton. It's stable, can be relatively > easily linked into Emacs, and performance is similar to what we > have (I benchmarked it). It does support matching on a gap buffer > (it has an interface to use the moral equivalent of a random > iterator, reguexec). It doesn't support Emacs' native encoding > out of the box, though, and the library is unmaintained; there are > multiple forks which fix various security vulnerabilities. So, this would be acceptable, but would mean we'd still use "our own code" and maintain it ourselves. > If I understand our current implementation, our regexp.c functions > take the moral equivalent of two strings, and match over that, > pretending that it's just one large string. In practice, these two > strings are the two halves of the gap buffer. Correct? That's right. Stefan