From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Stefan Monnier Newsgroups: gmane.emacs.devel Subject: Re: Character group folding in searches Date: Sun, 08 Feb 2015 09:03:23 -0500 Message-ID: References: <83zj8rcdpi.fsf@gnu.org> <83k2zudfqk.fsf@gnu.org> <83bnl6d8d6.fsf@gnu.org> <831tm2cdm3.fsf@gnu.org> <83bnl5buvm.fsf@gnu.org> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: ger.gmane.org 1423404229 1616 80.91.229.3 (8 Feb 2015 14:03:49 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Sun, 8 Feb 2015 14:03:49 +0000 (UTC) Cc: bruce.connor.am@gmail.com, emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Feb 08 15:03:48 2015 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1YKSSg-0006Ho-LN for ged-emacs-devel@m.gmane.org; Sun, 08 Feb 2015 15:03:46 +0100 Original-Received: from localhost ([::1]:56666 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YKSSg-0000Q8-1i for ged-emacs-devel@m.gmane.org; Sun, 08 Feb 2015 09:03:46 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:48082) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YKSST-0000Q3-OU for emacs-devel@gnu.org; Sun, 08 Feb 2015 09:03:34 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YKSSO-0000WR-F4 for emacs-devel@gnu.org; Sun, 08 Feb 2015 09:03:33 -0500 Original-Received: from chene.dit.umontreal.ca ([132.204.246.20]:43421) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YKSSO-0000WA-9M; Sun, 08 Feb 2015 09:03:28 -0500 Original-Received: from pastel.home (lechon.iro.umontreal.ca [132.204.27.242]) by chene.dit.umontreal.ca (8.14.1/8.14.1) with ESMTP id t18E3OAV017559; Sun, 8 Feb 2015 09:03:24 -0500 Original-Received: by pastel.home (Postfix, from userid 20848) id EF5901137; Sun, 8 Feb 2015 09:03:23 -0500 (EST) In-Reply-To: <83bnl5buvm.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 07 Feb 2015 17:31:57 +0200") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux) X-NAI-Spam-Flag: NO X-NAI-Spam-Threshold: 5 X-NAI-Spam-Score: 0 X-NAI-Spam-Rules: 1 Rules triggered RV5211=0 X-NAI-Spam-Version: 2.3.0.9393 : core <5211> : inlines <2052> : streams <1386829> : uri <1849766> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 132.204.246.20 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:182630 Archived-At: > I'm sorry, I don't understand how this will solve the use-cases > brought up in this thread. Can you explain? Every equivalence class selected by such a DFA can match any set of strings that can be described by a regular expression, so it should be more than sufficiently powerful. > . exact match -- only exactly the same codepoints match The DFA is trivial, matches any (and only) one-char sequences and returns the char. > . base-character match -- this ignores any combining marks, > diacriticals, etc. Admittedly, less trivial since we have to remember the base char after matching it, while skipping subsequent combining marks and diacriticals. > . matching ligatures, such as =EF=AC=83 and ffi Straightforward. > . ignoring punctuation, like string-collate-equalp does, > i.e. "foobar" will match "foo.bar" Easy: the DFA will simply loop back when it sees a ".". > . ignoring isolated zero-width or non-combining marks and > directional controls Same. > I understand very well how these can be handled by several different > char-tables, but you seem to say that a single char-table can do all > this, and I don't see how. Not sure what you mean by "single char-table" or why you think I said something about single-vs-multiple char-tables. A first implementation of DFAs could use internally char-tables (where each node of the DFA is a char-table) but I think it's something entirely different from what you mean by "different char-tables" or "single char-table", since you'd choose one DFA (which may have any number of char-tables inside). > Now I'm completely confused: char-tables don't need this optimization, > as you well know: they already are space-efficient for storing > characters that map to the table's default value. So I probably > misunderstand your whole idea, if it does need such an optimization. A DFA can have hundreds of nodes (hence hundreds of char-tables if we use char-tables for that), most of which map one or two chars to a special value while all others are mapped to "the default", so there can be significant gains from using a more specialized representation. >> PS: And this same kind of "char-table extended into a DFA" could be >> useful for syntax-tables in order to provide much more flexible support >> for multi-character comment markers or "paren-like nested elements". > If that's your itch to scratch, I'm impatiently waiting for patches ;-) It's been in the back of my mind for many years. Stefan