From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Artur Malabarba Newsgroups: gmane.emacs.devel Subject: Re: Char-folding: how can we implement matching multiple characters as a single "thing"? Date: Tue, 1 Dec 2015 14:18:30 +0000 Message-ID: References: <565C7E1F.10204@gmail.com> <837fkzmkxi.fsf@gnu.org> Reply-To: bruce.connor.am@gmail.com NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Trace: ger.gmane.org 1448979549 5476 80.91.229.3 (1 Dec 2015 14:19:09 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Tue, 1 Dec 2015 14:19:09 +0000 (UTC) To: Eli Zaretskii , =?UTF-8?B?Q2zDqW1lbnQgUGl0LS1DbGF1ZGVs?= , emacs-devel Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Dec 01 15:19:09 2015 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1a3llc-0004zW-3Y for ged-emacs-devel@m.gmane.org; Tue, 01 Dec 2015 15:18:52 +0100 Original-Received: from localhost ([::1]:52874 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3llb-0003K4-86 for ged-emacs-devel@m.gmane.org; Tue, 01 Dec 2015 09:18:51 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:37789) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3llJ-0003Iv-Ss for emacs-devel@gnu.org; Tue, 01 Dec 2015 09:18:37 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a3llJ-0000xS-05 for emacs-devel@gnu.org; Tue, 01 Dec 2015 09:18:33 -0500 Original-Received: from mail-lf0-x236.google.com ([2a00:1450:4010:c07::236]:33642) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a3llH-0000x7-AS; Tue, 01 Dec 2015 09:18:31 -0500 Original-Received: by lfaz4 with SMTP id z4so10055400lfa.0; Tue, 01 Dec 2015 06:18:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:sender:in-reply-to:references:date:message-id :subject:from:to:content-type:content-transfer-encoding; bh=8s2RhrxnpmTRB0lOZJINzMq6xnBYihR9KFjbyMjFrbE=; b=aZ7tvRWUZewFXtZVfJ4jOFXmf/Y9qqk+6apQCFthU2TI7BTZBqkY80XhhXrNDqUcCe /CdSUHfho8cn3ug6C7SyqQmK78VQSAA0IR6GTxu1dQ4wX9EwxlAwJrtpiUrwxTX6mjVw axbJ/0ncttjrHMjzlPBoDeAbDI3RLbYBBDOnej4I0vTBQH3meChd7NSz8J17e7IsCdT2 RkzEtgu8ZXDelEZh1WKfbj4E6WRAxwWPjXS1U0nh5Eht8AyK6X7PcYorJxyLy+xj8cAw oci6TiL/nq658sJuS10G3xoQgd5eVjUSxGZm6ClVFQSEO52JadeaRRK9P4seWqOesTVG EpNA== X-Received: by 10.112.161.33 with SMTP id xp1mr27438357lbb.141.1448979510462; Tue, 01 Dec 2015 06:18:30 -0800 (PST) Original-Received: by 10.112.202.99 with HTTP; Tue, 1 Dec 2015 06:18:30 -0800 (PST) In-Reply-To: X-Google-Sender-Auth: arxOj0XfX9yvO9SXmWsbiNIHbRE X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 2a00:1450:4010:c07::236 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:195670 Archived-At: All right. For now, I've gone with Paul's suggestion and just made the algorithm dumber. It won't catch every single scenario, but that's better than catching none. I too agree that the ideal approach would be to implement this entirely in C, but so far we lack the necessary human effort. There's also a 3rd option. I posted some code here a while ago that implemented char-folding by temporarily replacing the (current-case-table) with a char-fold-table. This was fast, and much nicer than the current regexps, but it had the limitation of only being a character-to-character relation. So it couldn't do something as basic as 'a' matching "=C3=A4" (because that's 1 char matching 2). However, it's possible that we could combine the two solutions, using this case-table for as much as possible and then using regexps for anything else. This way the regexp pattern that replaces each input character would likely be considerably smaller than 45 chars (I'd guess between 3 and 15 depending on the character). The number of branches would still scale badly with the input string size. but the smaller multiplicative factor should give us more leeway before scaling up to 10k chars. 2015-11-30 21:48 GMT+00:00 John Wiegley : >>>>>> Eli Zaretskii writes: > >> Volunteers are welcome to work on the ultimate solution, which should in= deed >> include normalization of both the search string and the buffer/string te= xt >> that is searched. > > I imagine this would be done iteratively, with caching of what had been > normalized if we happen to back-track within a certain bound. > > Any takers for working on the "ultimate solution"? > > John >