From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: "Stephen J. Turnbull" Newsgroups: gmane.emacs.devel Subject: Re: Case mapping of sharp s Date: Wed, 18 Nov 2009 14:33:57 +0900 Message-ID: <876398wg56.fsf@uwakimon.sk.tsukuba.ac.jp> References: <19200.4158.380820.761685@a1i15.kph.uni-mainz.de> <83lji6mgg4.fsf@gnu.org> <83iqd9m14h.fsf@gnu.org> <83bpj0mq2v.fsf@gnu.org> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: ger.gmane.org 1258522051 7345 80.91.229.12 (18 Nov 2009 05:27:31 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Wed, 18 Nov 2009 05:27:31 +0000 (UTC) Cc: ulm@gentoo.org, emacs-devel@gnu.org, Kenichi Handa To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Nov 18 06:27:24 2009 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1NAd4d-0001bV-OA for ged-emacs-devel@m.gmane.org; Wed, 18 Nov 2009 06:27:24 +0100 Original-Received: from localhost ([127.0.0.1]:50561 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1NAd4c-0003P7-Vd for ged-emacs-devel@m.gmane.org; Wed, 18 Nov 2009 00:27:22 -0500 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1NAd4W-0003MW-5o for emacs-devel@gnu.org; Wed, 18 Nov 2009 00:27:16 -0500 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1NAd4Q-0003GV-Ih for emacs-devel@gnu.org; Wed, 18 Nov 2009 00:27:14 -0500 Original-Received: from [199.232.76.173] (port=52874 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1NAd4Q-0003GS-DZ for emacs-devel@gnu.org; Wed, 18 Nov 2009 00:27:10 -0500 Original-Received: from mtps02.sk.tsukuba.ac.jp ([130.158.97.224]:35477) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1NAd4O-0005tC-95; Wed, 18 Nov 2009 00:27:08 -0500 Original-Received: from uwakimon.sk.tsukuba.ac.jp (uwakimon.sk.tsukuba.ac.jp [130.158.99.156]) by mtps02.sk.tsukuba.ac.jp (Postfix) with ESMTP id BE1C48212; Wed, 18 Nov 2009 14:27:05 +0900 (JST) Original-Received: by uwakimon.sk.tsukuba.ac.jp (Postfix, from userid 1000) id 3BA591A265E; Wed, 18 Nov 2009 14:33:58 +0900 (JST) In-Reply-To: <83bpj0mq2v.fsf@gnu.org> X-Mailer: VM 8.0.12-devo-585 under 21.5 (beta29) "garbanzo" d20e0a45a4b2 XEmacs Lucid (x86_64-unknown-linux) X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6, seldom 2.4 (older, 4) X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:117149 Archived-At: Eli Zaretskii writes: > My idea was to use the same technique to fold the text as you search > through it. You could, for example, fold one character at a time, so > that instead of comparing against X you compare against fold(X). That's precisely how case-insensitive Boyer-Moore is done. But it's not that easy for Unicode, because you will need to check *characters*, not octets, while Boyer-Moore is very much oriented toward fixed-width charsets. I think the problem here is that because of the nature of Unicode, this folding often crosses character blocks (here meaning "sets of characters with the same bits except for the lowest 6"). That means that you will need to keep track not only of the folding for the current octet, but you will also need to be able to backtrack to check the higher bits. By contrast, with the Mule internal encoding the case pairs are always in the same block of 96 characters. I don't think that it's *impossible* to implement Boyer-Moore with translation for UTF-8, but it's nowhere near as easy as for the unibyte charset case, which Mule internal can in practice be reduced to. It's possible that all this fiddling could result in a performance hit of the same order of magnitude as the performance hit of brute-forcing the comparison.