From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: =?UTF-8?Q?Cl=c3=a9ment_Pit--Claudel?= Newsgroups: gmane.emacs.devel Subject: Regular expression libraries Date: Thu, 15 Dec 2016 14:00:25 -0500 Message-ID: <01d7e608-04d2-84a4-6143-e954bc9d569f@mit.edu> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/signed; protocol="application/pkcs7-signature"; micalg=sha-256; boundary="------------ms030208010000040705000407" X-Trace: blaine.gmane.org 1481832131 4505 195.159.176.226 (15 Dec 2016 20:02:11 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Thu, 15 Dec 2016 20:02:11 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 To: Emacs developers Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Dec 15 21:02:06 2016 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cHcE9-00006c-Oe for ged-emacs-devel@m.gmane.org; Thu, 15 Dec 2016 21:02:05 +0100 Original-Received: from localhost ([::1]:56791 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cHcEE-0007D5-2Q for ged-emacs-devel@m.gmane.org; Thu, 15 Dec 2016 15:02:10 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:50085) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cHbGb-0002wT-IE for emacs-devel@gnu.org; Thu, 15 Dec 2016 14:00:34 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cHbGX-0004zf-MX for emacs-devel@gnu.org; Thu, 15 Dec 2016 14:00:33 -0500 Original-Received: from mout.kundenserver.de ([212.227.17.13]:57759) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1cHbGX-0004w5-Bd for emacs-devel@gnu.org; Thu, 15 Dec 2016 14:00:29 -0500 Original-Received: from [18.26.2.123] ([18.26.2.123]) by mrelayeu.kundenserver.de (mreue101 [212.227.15.184]) with ESMTPSA (Nemesis) id 0MIe3k-1cJn7a2cZ6-002DkU for ; Thu, 15 Dec 2016 20:00:26 +0100 X-Provags-ID: V03:K0:0R9GKob27KJAF0XXZOIbwAZXF4adf/7wY6ltkYXRJnBrWAtEe3r OMSALM5dL37Z94DwNrLRQTe8cwyWD9VYVAplC2lF8s5KE4XThiq/TW54ngqM0Xmc71INvS3 m9QmhmQnn8Qej7LRJoIzNgxlQttW9GHgK//ywqcsxej7O1sV2f6HG4dzomn5OWzhrJsS8/N bMcy+DP+cLxcF9qMApp4Q== X-UI-Out-Filterresults: notjunk:1;V01:K0:sNO4snnqwA4=:pPTuHMImKo/SMTortIDpE0 U3R1CnBGrHWtps2csE0wO2MmUXBk+GPw/CimQ/PA6KtVY4o9CiS38vRwoCf8Hg2/e5xwh/69H GNsp6rRU14n25U+WXcI2Nk5R2nXcsaqa/GBYgmHeplIMsU6WmUAn77V2ncTkhR7K+2ST59QSY 35ue1DcoJdGa7jR6omtJrQouj1Ba7Y3a23Xi4pHIHCQNg2+kQoPe/Csq9yi9EIMF1tusgfdk4 ZVMso+kavZe5G1VFAq5n3oGdq/iSRVqv5Hv2rlkWoogiPF7zXQd8yXfggtgJR2EgVN5EtIevd jMOdQvzjFJbshmptmoJzogjdroE6PiaroXVjJOQbqriSGaK/go1B3buFOgPcKca2IiM3opVxJ hET+NVaxtzQ3xVkfUgojMEM++vr3Iz0q7qyUuvgcIea8+K3tLXLMaoSXdZ5JeR2LrpekpXJXu Awr2asyGiNNLsGcM9xyI9k+X5nNdQKi25cUanaXSbLS5Ks0A70+uRp+VgsDflPsc/47cvnbQX WNAF68o5rcggueeLRh449Ek9zZXXv2HHhCBitxeVpsQEM3sPuY/ojbp1Y40RLALifPQjpBtTv mfjcGJnYRykUax5sq+nYeetS9uE2EMDsp+NJRIvqCDQdKGVnFVZK9FLjh/lCTjkHADNWx+kQn +8ih7kZKkK1yjhgCKaAW5PZduLNra11L7mnbm79fUZQs8wWdbVw3RkQfVLXaCW8LrX6M= X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 212.227.17.13 X-Mailman-Approved-At: Thu, 15 Dec 2016 15:01:06 -0500 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:210485 Archived-At: This is a cryptographically signed message in MIME format. --------------ms030208010000040705000407 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hi emacs-devel, The top of regex.c says this: /* Ignore some GCC warnings for now. This section should go away once the Emacs and Gnulib regex code is merged. */ And indeed I've seen brief discussions of this merge in the past (https:/= /lists.gnu.org/archive/html/emacs-devel/2002-04/msg00083.html , https://l= ists.gnu.org/archive/html/emacs-devel/2016-07/msg01115.html). Looking at= this is more detail, though, it is not clear how such a merge could happ= en: * Emacs has special regexp features based on syntax classes, or the posit= ion of the point. These features don't seem to be supported in gnulib no= r glibc * Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for t= he subject string Stefan also expressed a desire to move to a better regular expression eng= ine =E2=80=94 ideally one that uses a non-backtracking implementation whe= n possible, to avoid the exponential blowups that Emacs currently suffers= from on certain regular expressions. I've had a quick look at the existing options. Here's what I gathered (t= he list below includes only libraries whose licensing is compatible with = Emacs): * There are only two usable, non-backtracking regexp libraries: - RE2 (from Google, C++) doesn't support any form of backtracking, thou= gh, so "\\(.*\\)\1" won't work anymore. Additionally, RE2 does not suppo= rt matching over a gap buffer (the subject string needs to be a contiguou= s chunk of memory). Discussion here: https://github.com/google/re2/issue= s/126 - TRE (http://laurikari.net/tre/) uses backtracking if needed, but othe= rwise uses a finite automaton. It's stable, can be relatively easily lin= ked into Emacs, and performance is similar to what we have (I benchmarked= it). It does support matching on a gap buffer (it has an interface to u= se the moral equivalent of a random iterator, reguexec). It doesn't supp= ort Emacs' native encoding out of the box, though, and the library is unm= aintained; there are multiple forks which fix various security vulnerabil= ities. In both cases, the syntax isn't compatible with that of Emacs, which wo= uld force us to write a compatibility layer. And Emacs' internal encodin= g isn't supported either. * Backtracking implementations are more common, but none seem to fit the = bill: - Oniguruma and Onigmo (C, used in Ruby, the Atom text editor, and othe= rs) support a wide range of encodings, including allegedly Emacs' interna= l encoding, as well as many syntaxes, including Emacs' syntax. But they = don't support matching on a gap buffer either (they take a contiguous str= ing). Discussion here: https://github.com/k-takata/Onigmo/issues/83 - PCRE (C, used in Perl and others) doesn't have Emacs encoding, nor Em= acs syntax. It requires a contiguous buffer, but it does support partial= matches. Partial matches can apparently be restarted, which could almos= t work in our case, but the semantics of restarting a search from a parti= al match are broken. - Boost (C++) can use a custom random-access iterator for the subject s= tring, which should work for gap buffers; but I'm not sure how good the e= ncoding story is. If I understand our current implementation, our regexp.c functions take t= he moral equivalent of two strings, and match over that, pretending that = it's just one large string. In practice, these two strings are the two h= alves of the gap buffer. Correct? The bottom line is that, even though it should be relatively easy to chan= ge string-match-p to use onigmo, changing re-search-forward sounds tricky= =2E Did I miss something? Was there a concrete plan for merging with gl= ibc, or for using an automaton-based matching strategy? The advantages o= f switching could be support for a more common syntax, more flexible regu= lar expressions (named groups, assertions, =E2=80=A6), better performance= , and better handling of special cases. Cheers, Cl=C3=A9ment. --------------ms030208010000040705000407 Content-Type: application/pkcs7-signature; name="smime.p7s" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="smime.p7s" Content-Description: S/MIME Cryptographic Signature MIAGCSqGSIb3DQEHAqCAMIACAQExDzANBglghkgBZQMEAgEFADCABgkqhkiG9w0BBwEAAKCC A8gwggPEMIIDLaADAgECAhAFTpEFd7pdiMcr6MwkhoRoMA0GCSqGSIb3DQEBCwUAMGwxCzAJ BgNVBAYTAlVTMRYwFAYDVQQIEw1NYXNzYWNodXNldHRzMS4wLAYDVQQKEyVNYXNzYWNodXNl dHRzIEluc3RpdHV0ZSBvZiBUZWNobm9sb2d5MRUwEwYDVQQLEwxDbGllbnQgQ0EgdjEwHhcN MTYwNzIyMjAzNzM3WhcNMTcwNzMxMjAzNzM3WjCBrDELMAkGA1UEBhMCVVMxFjAUBgNVBAgT DU1hc3NhY2h1c2V0dHMxLjAsBgNVBAoTJU1hc3NhY2h1c2V0dHMgSW5zdGl0dXRlIG9mIFRl Y2hub2xvZ3kxFTATBgNVBAsTDENsaWVudCBDQSB2MTEeMBwGA1UEAxMVQ2xlbWVudCBGIFBp dC1DbGF1ZGVsMR4wHAYJKoZIhvcNAQkBFg9jcGl0Y2xhQE1JVC5FRFUwggEiMA0GCSqGSIb3 DQEBAQUAA4IBDwAwggEKAoIBAQDsGIVjnDysgVLsrxleGDQEZl+iGBLP/jTIQQ+YIyZZYRVI 99cMACDLph3Qcm4BaRcTho8JOavaLhh4Z2+ZmSfjweyV0xnZWBJCTBeNI1oEoyJNbjFHWTIl TTvTt5dIjs3a+zFYTw1MWAZ4pafu9Pf9h/HaEPTUKlzSZxDeMvPOcgy4EdnY8dtL01we1Ify 75izdeVA5I5w6zRXctD3CGoXBrGiItYDMqWBK9TXYto3nv/Gqr9uww7OVp71lL3NU5B3Sf/L KluHbBFvTOSzW2/SKY1Rx7vr5y+pB3x8dlAYUW6u7pRFDVDHPMWP++ywzdBfLXifLjYu559Q 6hzHTYbPAgMBAAGjgaEwgZ4wCQYDVR0TBAIwADARBglghkgBhvhCAQEEBAMCBaAwHQYDVR0l BBYwFAYIKwYBBQUHAwQGCCsGAQUFBwMCMAsGA1UdDwQEAwIF4DAdBgNVHQ4EFgQU7aMDTdzd UPsZO6dMaXbmqp2doQUwMwYDVR0fBCwwKjAooCagJIYiaHR0cDovL2NhLm1pdC5lZHUvY2Ev bWl0Y2xpZW50LmNybDANBgkqhkiG9w0BAQsFAAOBgQC/dYWdWhW8tzDOax/vqKDpffMjVeT2 ITDAndaxp6RTMKo+TWczZJ3e3xaKHMmTKvvtL94l1gcxFkWwKeZY47IQB5r/6IodFek6RWMg BjoypsLaE+f/tRw3iNds+jJyrMpRqRbEIBvxMTwhYc5MQU9o4xOgg4TXPHH6nf6VAm6+TjGC A7EwggOtAgEBMIGAMGwxCzAJBgNVBAYTAlVTMRYwFAYDVQQIEw1NYXNzYWNodXNldHRzMS4w LAYDVQQKEyVNYXNzYWNodXNldHRzIEluc3RpdHV0ZSBvZiBUZWNobm9sb2d5MRUwEwYDVQQL EwxDbGllbnQgQ0EgdjECEAVOkQV3ul2IxyvozCSGhGgwDQYJYIZIAWUDBAIBBQCgggIBMBgG CSqGSIb3DQEJAzELBgkqhkiG9w0BBwEwHAYJKoZIhvcNAQkFMQ8XDTE2MTIxNTE5MDAyNVow LwYJKoZIhvcNAQkEMSIEIK5m+6hwMtusXjDYIiFzcj45mNLfz2KvcmTPAyFoSUjeMGwGCSqG SIb3DQEJDzFfMF0wCwYJYIZIAWUDBAEqMAsGCWCGSAFlAwQBAjAKBggqhkiG9w0DBzAOBggq hkiG9w0DAgICAIAwDQYIKoZIhvcNAwICAUAwBwYFKw4DAgcwDQYIKoZIhvcNAwICASgwgZEG CSsGAQQBgjcQBDGBgzCBgDBsMQswCQYDVQQGEwJVUzEWMBQGA1UECBMNTWFzc2FjaHVzZXR0 czEuMCwGA1UEChMlTWFzc2FjaHVzZXR0cyBJbnN0aXR1dGUgb2YgVGVjaG5vbG9neTEVMBMG A1UECxMMQ2xpZW50IENBIHYxAhAFTpEFd7pdiMcr6MwkhoRoMIGTBgsqhkiG9w0BCRACCzGB g6CBgDBsMQswCQYDVQQGEwJVUzEWMBQGA1UECBMNTWFzc2FjaHVzZXR0czEuMCwGA1UEChMl TWFzc2FjaHVzZXR0cyBJbnN0aXR1dGUgb2YgVGVjaG5vbG9neTEVMBMGA1UECxMMQ2xpZW50 IENBIHYxAhAFTpEFd7pdiMcr6MwkhoRoMA0GCSqGSIb3DQEBAQUABIIBANgIya62IPONHIWj RFIuVvJY7uhXOJjy7YTqy4ocqa6zDNmRicpG2sgwlGWzp/b7sfyXXTLuPf6gpWd+1YHehLm3 2K1ZiyPUhhfPt1+m5M1q6owfQHCjSOWAeUOA9AKJaDk+43a603ptdaQOQazDjU2Wi16Y98Qu Tiez+dyfNIAci37uQ7pCTHP/CTWVEFmZ4/Yui4ULjte6P9P/R1md2OwgeqBj5vkB1yNRBrJc dur8kHWzzM09sF9b7fsvobhX1P+K2qCe8Q0P8kwxkpofqqUyNDPKxBEy2m1YA1JJMkDzqeG3 NKghVm1Eepo5pxvOVqog+DbNDcIgt38jLBCaSnUAAAAAAAA= --------------ms030208010000040705000407--