From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Artur Malabarba Newsgroups: gmane.emacs.devel Subject: Char-folding: how can we implement matching multiple characters as a single "thing"? Date: Mon, 30 Nov 2015 15:54:50 +0000 Message-ID: Reply-To: bruce.connor.am@gmail.com NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Trace: ger.gmane.org 1448898910 3220 80.91.229.3 (30 Nov 2015 15:55:10 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Mon, 30 Nov 2015 15:55:10 +0000 (UTC) Cc: emacs-devel To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Nov 30 16:55:09 2015 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1a3QnD-0005m7-E9 for ged-emacs-devel@m.gmane.org; Mon, 30 Nov 2015 16:55:07 +0100 Original-Received: from localhost ([::1]:41684 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3QnC-0003Vv-Ow for ged-emacs-devel@m.gmane.org; Mon, 30 Nov 2015 10:55:06 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:48484) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3Qn0-0003Vp-2x for emacs-devel@gnu.org; Mon, 30 Nov 2015 10:54:54 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a3Qmz-00059D-72 for emacs-devel@gnu.org; Mon, 30 Nov 2015 10:54:54 -0500 Original-Received: from mail-lf0-x22d.google.com ([2a00:1450:4010:c07::22d]:34232) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3Qmx-00058m-Lp; Mon, 30 Nov 2015 10:54:51 -0500 Original-Received: by lffu14 with SMTP id u14so203037843lff.1; Mon, 30 Nov 2015 07:54:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:sender:date:message-id:subject:from:to:cc :content-type:content-transfer-encoding; bh=NQW1uKGSMk5n1XyYtwHYl3NuLWgj2DEeBA5jzJKcz9E=; b=guNQOcY6qVCdPqR6XD0cE4oMxEN7LI+Qnrgf/2rNVrJsQ+7jEZSASS5a2nIlhJTPra lidMGY7tFLae0wH5SP/nLeCKQfytn6uuHRL2F4rI8IrhmXBedzDDz3SG3DwO/wd/T/XX o7Lc0fjI7adGG1GfQ3x74LBNVdp5HuDoZ6rOeuJ8++/yPSRAKGw8hGiSAlRgXcX8vbbG +B8JzkPzLbafyHlIcm6Lm1jMHmuVSbES/14EgexKMR6J7wxH679EkvBU3m5lk71Dug5Z 0LW3nRBXTlbm+YjeZ/NY98ymEbgcbl42EyhcIkPmN3k44vhw2dh2sgxepzZqUfHcvQwQ AaAg== X-Received: by 10.112.242.167 with SMTP id wr7mr1231051lbc.69.1448898890783; Mon, 30 Nov 2015 07:54:50 -0800 (PST) Original-Received: by 10.112.202.99 with HTTP; Mon, 30 Nov 2015 07:54:50 -0800 (PST) X-Google-Sender-Auth: 1_Y5GcuLZWAGzamCBx4oziDBAiU X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2a00:1450:4010:c07::22d X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:195612 Archived-At: I pushed this feature a couple of days ago, but I had to disable it now (the code is in `character-fold-to-regexp'). For those who don't know, the way char-folding works is by replacing each character in the search string with a regexp of the characters it can match. For instance, a character-folding search for "fi" is actually a regexp search for "\\(?:f=CC=87\\|[f=E1=B6=A0=E1=B8=9F=E2=93=95=EF=BD=86=F0=9D=90= =9F=F0=9D=91=93=F0=9D=92=87=F0=9D=92=BB=F0=9D=93=AF=F0=9D=94=A3=F0=9D=95=97= =F0=9D=96=8B=F0=9D=96=BF=F0=9D=97=B3=F0=9D=98=A7=F0=9D=99=9B=F0=9D=9A=8F]\\= )\\(?:i[=CC=80-=CC=84=CC=86=CC=88=CC=89=CC=8C=CC=8F=CC=91=CC=A3=CC=A8=CC=B0= ]\\|[i=C3=AC-=C3=AF=C4=A9=C4=AB=C4=AD=C4=AF=C7=90=C8=89=C8=8B=E1=B5=A2=E1= =B8=AD=E1=BB=89=E1=BB=8B=E2=81=B1=E2=84=B9=E2=85=88=E2=85=B0=E2=93=98=EF=BD= =89=F0=9D=90=A2=F0=9D=91=96=F0=9D=92=8A=F0=9D=92=BE=F0=9D=93=B2=F0=9D=94=A6= =F0=9D=95=9A=F0=9D=96=8E=F0=9D=97=82=F0=9D=97=B6=F0=9D=98=AA=F0=9D=99=9E=F0= =9D=9A=92]\\)". As you can see, this multiplies the length of the string by a factor of about 45. But that's still acceptable. However, this only includes foldings for the 'f' character and the 'i' character independently. As it happens, the string "fi" can also match the ligature '=EF=AC=81'. Now, for the sake of conciseness, let me denote the f and i regexps above as [f] and [i]. The code I pushed a couple of days ago added support for multi-char matches (such as "fi" matching that ligature), but returning a regexp like this: "\\([f][i]\\|=EF=AC=81\\)" This might not look like much. It only adds a few chars to a regexp that already has about 90. But the problem is that it introduces a new branch. Let's say we want to search for the word "fix". The letters 'i' and 'x' could also represent the character '=E2=85=B8' (roman numeral nine). So the only way to write that regexp is: "\\([f]\\([i][x]\\|=E2=85=B8\\)\\|=EF=AC=81[x]\\)". Now the pattern [x] (which I remind you has ~ 40 chars) appears twice in the regexp, and we've gained another branch. If the next character the search string can also combine with 'x', then we go from 3 to 5 branches. As you can see. The number of branches will scale badly with the number of input characters. Which means the number of output characters will be much more than 40 times the input. And even if the next character does NOT combine with 'x', it's hard to write an algorithm that can identify that and close all branches before continuing with the conversion. And that is why I disabled this code for now. With a short string like "endless-visit-notifications" it was return regexps longer than 10k characters. Which is more than Emacs handle. Does anyone have alternative ideas?