From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: =?UTF-8?Q?Cl=c3=a9ment_Pit--Claudel?= Newsgroups: gmane.emacs.devel Subject: Re: Char-folding: how can we implement matching multiple characters as a single "thing"? Date: Mon, 30 Nov 2015 17:49:35 +0100 Message-ID: <565C7E1F.10204@gmail.com> References: NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="06DbjO78LNMVDxtAwKqcLsN4afuAjfA5F" X-Trace: ger.gmane.org 1448902192 27134 80.91.229.3 (30 Nov 2015 16:49:52 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Mon, 30 Nov 2015 16:49:52 +0000 (UTC) To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Nov 30 17:49:45 2015 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1a3Re4-00077T-G6 for ged-emacs-devel@m.gmane.org; Mon, 30 Nov 2015 17:49:44 +0100 Original-Received: from localhost ([::1]:42064 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3Re3-00054U-Rc for ged-emacs-devel@m.gmane.org; Mon, 30 Nov 2015 11:49:43 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44267) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3Re0-00054N-Ri for emacs-devel@gnu.org; Mon, 30 Nov 2015 11:49:41 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a3Rdx-0000XO-JG for emacs-devel@gnu.org; Mon, 30 Nov 2015 11:49:40 -0500 Original-Received: from mout.kundenserver.de ([212.227.126.135]:49456) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3Rdx-0000X2-9h for emacs-devel@gnu.org; Mon, 30 Nov 2015 11:49:37 -0500 Original-Received: from [192.168.1.82] ([109.24.225.43]) by mrelayeu.kundenserver.de (mreue003) with ESMTPSA (Nemesis) id 0MaX6b-1ZjqSf0uE5-00K5Vb for ; Mon, 30 Nov 2015 17:49:36 +0100 X-Enigmail-Draft-Status: N1110 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 In-Reply-To: X-Provags-ID: V03:K0:e1t/O/3kN2paCOp3nGk9Nb5YylGl9eKNjt3U4Ukqs1Zqb5Hydjt fM8hyrRF//VHyuyomwD3hltqt81QVW2DKOVzdOp6eujrp+jetPvM3p+Z8etw/hJRrEldHk5 w/BkYvzFANDnbfMEYX1CDfpjz41VaO3SMY70mMIj81B+7vqlKtyJg844boKU4U/RP8HbOiY v9oHtQy7A6k2UC05XTpQw== X-UI-Out-Filterresults: notjunk:1;V01:K0:W4ynsmbHQzs=:HWKNspvaMffDu2bm0pZDyb tDfRCeQCWxUMoZXUkGTEG8wE9crdz5yl8jQrRYBVIiA0+JrcTIvOBY8ykUmeMVLFftCOCR0Vx jk5J7wxT+LbtZVG3O/GFDcmRL9QTYJrs3Wla9xT0C73TShpkFnvDbmmHHZH0JhG3UpIbJpgle dInIAfipEU6rCdTzX/v00uCK4O2BvO4ljBPF5hsMSQjmBtbXpBoucBK2mHnLkyySSFID1T8N2 XOWak/wThXyF9IlfXmEDKdDMklVXJltlhRRrAUQxmDi+cYlkrnSlkI7962HyQi0pSY07ZKZJT P0NeG881VRQpwUBzpU0twr9bUBDsEgrTByBgFxBiFDRLZRdl5Phm0dTYEBvVLg7v9/5HQXcxc iCHhacrPAaKzc9Qnmo2TYaIA4kW35wL96ZCnJfqYg3AtGQrpctkErnkpgfKhLCvqoayjIJqrv 5JQk8fiZDu60+zEN9l/EaOhyRrnpbnF4XFQFcvgl0GjPEdT5qvE74kKVih3r+V10gxLNTUtbV IsEStBmd4Etv7VwNEl79SSoIJToYDXTnGty2EKt2OfhMLdLTM1EeT+di3z6MHzwKzjW4UC20I h74fYfUJTkAvGgZl1Cym0Azt5E8Dy/emxACY1b4MN7ukwI5npn4xzwNsK5hfrmrUTYbjTjJij +XhAjXHQyqP0vGnAnI2LSMG0OlEJmWwDqkrhM9335s4sKerrXTFHsJ9kC7TI/VR+3rMY= X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 212.227.126.135 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:195625 Archived-At: This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --06DbjO78LNMVDxtAwKqcLsN4afuAjfA5F Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 11/30/2015 04:54 PM, Artur Malabarba wrote: > As you can see. The number of branches will scale badly with the=20 > number of input characters. Which means the number of output=20 > characters will be much more than 40 times the input. And even if > the next character does NOT combine with 'x', it's hard to write an=20 > algorithm that can identify that and close all branches before=20 > continuing with the conversion. >=20 > And that is why I disabled this code for now. With a short string=20 > like "endless-visit-notifications" it was return regexps longer than=20 > 10k characters. Which is more than Emacs handle. Hi Arthur, Thanks for your efforts on this; it would be very convenient for "ff" or = "fi" to match their ligature counterparts. Are the difficulties due to the sheer size of the regular expressions, or= to their complexity (too much branching?). If the size is the real issue= (and if that size mostly comes from repeated folding of individual chara= cters), then I can think of two options: * Extend the C-level implementation of regular expressions to make charac= ter-folding a capability of the C engine, allowing for succinct regexps += a flag. * Add a new backslash to regular expressions for matching char-folded cha= racters (custom character categories almost already allow for this, so th= e implementation would probably be similar). These two solutions would not help with the branching, but they would all= ow for smaller input to the regexp engine (again, assuming that the size = issue is due to the large brackets, not to the branching). On the other hand, if excessive branching is the issue (either due to the= size of the regexps even with the ideas above applied, or because the en= gine has trouble with all the backtracking), then maybe normalization can= help: instead of searching the actual buffer, the regexp engine could be= made to search the buffer itself, and a copy of it with all text normali= zed, for some "good" definition of normalization. So looking for "iff" in= "di=EF=AC=83cult" would really look for "iff" in both "di=EF=AC=83cult" = and "difficult". Since normalization is a per-character operation, this d= oes not require maintaining a second copy of the buffer: instead, one can= just feed two streams of characters to the regexp engine: the un-normali= zed one, and the normalized one. As a side note, it may also be possible to track at the buffer level whet= her the whole file is ascii; if it is, then case folding can be omitted, = making any performance degradation stemming from character folding more a= cceptable. One can also use finer-grained control, especially if characte= r folding happens at the C level in the regexp engine. Cheers, Cl=C3=A9ment. --06DbjO78LNMVDxtAwKqcLsN4afuAjfA5F Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAEBAgAGBQJWXH4fAAoJEPqg+cTm90wja6QP/A7XuxdpyJ8T0bPAxAWdVb/F 2kwSCT4QHtyYdPRJy6G3iMWe//HygtWkrprzk+ZbdOiSlfpF7k/pEl46Tnuj9Ie0 6VdQPnJiSgqG+Qug03+jXInsHOFbJayuJAnYTWj2Ct3BRBxaVmsTY9/Hr1VXOm0j fDCkWrP0AR51tYcTltRF/bf9xQ4jfcodo+eIb8LS/QGF9IwuRNctZeT9SZZR/IfD u0ZlmdUmSnls9YeDSd7jSiZb0LG84hPLJhWVV5/aytK6jT4+SykNU+KNcHYdZzfT x3p4XUFZlVP8y6XX2R2Tmz8M7vQzyDI4KnrniFQ8n1HEis7JUsmlAYKlrlSqExgp IndK7xSLsYlOMyp1KJm8otbJ5fu/pXGy2S9N4osQLmZ8dVLJAWyPnZ3db0iepEuv PT3MRe+Qgpw0+A5Ki2Q4PFghHYNQqkOOLF+hsIwWzsl96NUn6oWrbAcyqmyGvRts 66bY6a70RtwyvEcaCBVbM3sOPV99OsXEjCJLlWBwU2gCv4w25pAyJdxyusvalaXf T1hXoQHYDvVo9esExrvlpy16AeqMlHhfiUmmC+S1wXKn6bY4YVcZdnwk4910RcRD a4ogUaT1ZNEfUxn4uCHfhljDM9LOkwCisnVHICSiFx7l4cHIgSoZ5dRfws8bIu+2 3bs1C+dDnwkIH4AxyLrA =4Qcz -----END PGP SIGNATURE----- --06DbjO78LNMVDxtAwKqcLsN4afuAjfA5F--