From: Miles Bader
To: Kenichi Handa
Cc: emacs-devel@gnu.org, rms@gnu.org, monnier@iro.umontreal.ca
Newsgroups: gmane.emacs.devel
Subject: Re: announcing thaiword.el?
Date: Tue, 29 Mar 2005 17:35:15 +0900
In-Reply-To: <200503280047.JAA25472@etlken.m17n.org> (Kenichi Handa's message of "Mon, 28 Mar 2005 09:47:09 +0900 (JST)")

On Mon, 28 Mar 2005 09:47:09 +0900 (JST), Kenichi Handa wrote:
> To handle the regular expression "\\b" and "\\B" correctly
> for Thai, we need a bigger change in regex.c.  For the
> moment, I have no idea how to do that.

Current extensions to "word syntax", using `word-separating-categories'
etc., seem to do the correct thing with regexps.[*]  Perhaps some
extension to that mechanism would work.

For instance, what if entries in `word-separating-categories' could
have an optional predicate function -- in addition to the current
(CAT1 . CAT2) format, allow (CAT1 CAT2 PREDICATE-FUN), and only
consider the entry to match if PREDICATE-FUN (called with some
appropriate args) also returns true?

Then for a case like Thai, where you want to do more complicated tests
to establish word boundaries inside sequences of non-delimited text,
one could use a "degenerate" entry in `word-separating-categories'
with both CAT1 and CAT2 the same, but also with a predicate attached
to do the more complicated test.

I suppose that would slow down word matching when the predicate is
called, but it would only happen for text where that is appropriate.

-Miles

[*] I was surprised that this is true, and I don't understand why from
my quick look at regex.c :-/ ...  But my simple tests seem to show
that it really does work.  E.g., I can add '(?C . ?C) to
`word-separating-categories', and then a regexp search will suddenly
start considering every single kanji character as a standalone word.

-- 
Do not taunt Happy Fun Ball.
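
To make the (CAT1 CAT2 PREDICATE-FUN) idea above a bit more concrete,
here is a rough model of the boundary test, written in Python rather
than as a patch to regex.c.  Everything in it is an assumption made for
illustration: the toy category table, the names `is_boundary',
`split_words' and `thai_boundary', the arguments passed to the
predicate (the text and a position), and the two-word "dictionary".
It is only a sketch of the proposed matching rule, not how Emacs
actually implements `word-separating-categories'.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical entry shape: (CAT1, CAT2, PREDICATE), where PREDICATE may be
# None to model the existing (CAT1 . CAT2) form.
Predicate = Callable[[str, int], bool]
Entry = Tuple[str, str, Optional[Predicate]]


def categories(ch: str) -> Set[str]:
    """Toy category table (assumed labels): 'C' for kanji, 't' for Thai."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return {"C"}
    if 0x0E00 <= cp <= 0x0E7F:
        return {"t"}
    return set()


def is_boundary(text: str, pos: int, entries: List[Entry]) -> bool:
    """True if some entry separates text[pos - 1] from text[pos]."""
    if not 0 < pos < len(text):
        return False
    left, right = categories(text[pos - 1]), categories(text[pos])
    for cat1, cat2, pred in entries:
        if cat1 in left and cat2 in right:
            # The predicate is consulted only after the categories match,
            # so plain (CAT1 . CAT2) entries pay no extra cost.
            if pred is None or pred(text, pos):
                return True
    return False


def split_words(text: str, entries: List[Entry]) -> List[str]:
    """Cut TEXT at every position that is_boundary accepts."""
    words, start = [], 0
    for pos in range(1, len(text)):
        if is_boundary(text, pos, entries):
            words.append(text[start:pos])
            start = pos
    words.append(text[start:])
    return words


# Stand-in for a real Thai segmenter: a boundary follows any listed word.
TOY_DICT = {"ภาษา", "ไทย"}


def thai_boundary(text: str, pos: int) -> bool:
    return any(text[:pos].endswith(word) for word in TOY_DICT)


# Existing behaviour: (C . C) makes every kanji its own word.
print(split_words("漢字", [("C", "C", None)]))              # ['漢', '字']
# Proposed behaviour: a "degenerate" (t t PRED) entry for Thai.
print(split_words("ภาษาไทย", [("t", "t", thai_boundary)]))  # ['ภาษา', 'ไทย']
```

Without the predicate, a ("t", "t") entry would split the Thai text
after every character, which is why the degenerate entry only makes
sense with the extra test attached.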