From: Eli Zaretskii
To: bruce.connor.am@gmail.com
Cc: clement.pit@gmail.com, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: Char-folding: how can we implement matching multiple characters as a single "thing"?
Date: Tue, 01 Dec 2015 17:50:12 +0200
Message-ID: <837fkykw23.fsf@gnu.org>
References: <565C7E1F.10204@gmail.com> <837fkzmkxi.fsf@gnu.org>

> Date: Tue, 1 Dec 2015 14:18:30 +0000
> From: Artur Malabarba
>
> There's also a 3rd option. I posted some code here a while ago that
> implemented char-folding by temporarily replacing the
> (current-case-table) with a char-fold-table. This was fast, and much
> nicer than the current regexps, but it had the limitation of only
> being a character-to-character relation. So it couldn't do something
> as basic as 'a' matching "ä" (because that's 1 char matching 2).
>
> However, it's possible that we could combine the two solutions, using
> this case-table for as much as possible and then using regexps for
> anything else. This way the regexp pattern that replaces each input
> character would likely be considerably smaller than 45 chars (I'd
> guess between 3 and 15 depending on the character).
> The number of branches would still scale badly with the input string
> size. but the smaller multiplicative factor should give us more leeway
> before scaling up to 10k chars.

My gut feeling is that if we go to the C level, we should implement this properly.
Coding another partial solution will almost certainly bump into some subtle limitations. In particular, any solution that requires a literal search to use regexps under the hood will present restrictions, because it will not play well with other regexp-based features, like word search and C-M-s itself.
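
To make the hybrid scheme from the quoted message concrete, here is a minimal Python sketch. It is not Emacs code: the table contents and the names FOLD_TABLE, table_search, regexp_search and hybrid_search are hypothetical, chosen only for the example. A one-to-one translation table stands in for the case-table approach, and a small regexp that allows trailing combining marks stands in for the fallback.

```python
import re

# Hypothetical one-to-one fold table, standing in for a case-table-like
# mapping: each precomposed character maps to exactly one base character.
FOLD_TABLE = str.maketrans({"\u00e4": "a", "\u00e9": "e", "\u00f6": "o"})

# Combining Diacritical Marks block, used by the regexp fallback.
COMBINING = "[\u0300-\u036f]*"


def table_search(needle, haystack):
    """Literal search after translating both strings through FOLD_TABLE.

    Only handles character-to-character folding, so a base letter
    followed by a separate combining mark is not matched.
    """
    return haystack.translate(FOLD_TABLE).find(needle.translate(FOLD_TABLE))


def regexp_search(needle, haystack):
    """Regexp search: every input character may be followed by combining marks."""
    pattern = "".join(re.escape(ch) + COMBINING for ch in needle)
    match = re.search(pattern, haystack)
    return match.start() if match else -1


def hybrid_search(needle, haystack):
    """Fold one-to-one cases with the table, then use the regexp for the rest.

    The table maps one character to one character, so positions in the
    folded haystack are also positions in the original haystack.
    """
    return regexp_search(needle.translate(FOLD_TABLE),
                         haystack.translate(FOLD_TABLE))


if __name__ == "__main__":
    precomposed = "b\u00e4r"   # the accented letter as a single code point
    decomposed = "ba\u0308r"   # 'a' followed by U+0308 COMBINING DIAERESIS
    print(table_search("bar", precomposed))   # table alone finds this one
    print(table_search("bar", decomposed))    # table alone misses this one
    print(hybrid_search("bar", decomposed))   # the regexp part covers it
```

Even in this toy form, hybrid_search turns a literal query into a regexp whose length grows with every input character, and that generated regexp is what would have to coexist with other regexp-based search features.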