From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Kenichi Handa Newsgroups: gmane.emacs.devel Subject: Re: Case mapping of sharp s Date: Wed, 25 Nov 2009 11:13:23 +0900 Message-ID: References: <4B05A11F.5000700@gmx.de> <4B05D3EE.2000101@gmx.de> <4B0759BA.2010303@gmx.de> <19207.50135.691132.983395@a1i15.kph.uni-mainz.de> <4B0C32BF.2020708@gmx.de> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: ger.gmane.org 1259115233 24650 80.91.229.12 (25 Nov 2009 02:13:53 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Wed, 25 Nov 2009 02:13:53 +0000 (UTC) Cc: ulm@gentoo.org, schwab@linux-m68k.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org To: grischka Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Nov 25 03:13:45 2009 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1ND7O4-0000yc-Vl for ged-emacs-devel@m.gmane.org; Wed, 25 Nov 2009 03:13:45 +0100 Original-Received: from localhost ([127.0.0.1]:54909 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1ND7O4-0000fa-ES for ged-emacs-devel@m.gmane.org; Tue, 24 Nov 2009 21:13:44 -0500 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1ND7O0-0000fV-48 for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:40 -0500 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1ND7Nv-0000cY-Lt for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:39 -0500 Original-Received: from [199.232.76.173] (port=35120 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1ND7Nv-0000cV-FL for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:35 -0500 Original-Received: from mx1.aist.go.jp ([150.29.246.133]:40189) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1ND7Nu-0003Cz-Rx for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:35 -0500 Original-Received: from rqsmtp1.aist.go.jp (rqsmtp1.aist.go.jp [150.29.254.115]) by mx1.aist.go.jp with ESMTP id nAP2DQ74018131; Wed, 25 Nov 2009 11:13:26 +0900 (JST) env-from (handa@m17n.org) Original-Received: from smtp3.aist.go.jp by rqsmtp1.aist.go.jp with ESMTP id nAP2DQUg026415; Wed, 25 Nov 2009 11:13:26 +0900 (JST) env-from (handa@m17n.org) Original-Received: by smtp3.aist.go.jp with ESMTP id nAP2DNWe024310; Wed, 25 Nov 2009 11:13:23 +0900 (JST) env-from (handa@m17n.org) Original-Received: from handa by etlken with local (Exim 4.69) (envelope-from ) id 1ND7Nj-0006Y5-E2; Wed, 25 Nov 2009 11:13:23 +0900 In-Reply-To: <4B0C32BF.2020708@gmx.de> (message from grischka on Tue, 24 Nov 2009 20:23:43 +0100) X-detected-operating-system: by monty-python.gnu.org: Solaris 9 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:117722 Archived-At: In article <4B0C32BF.2020708@gmx.de>, grischka writes: > DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH > are (as with Andy's patch). Moderately slower, still ;) So, changing the current backward matching to forward matching should is effective. > The originally observed slowness was not because of the usage of > CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as > using unrelated "best_below" and "best_above" in the same expression. > For the numbers, with my 100MB file test case: > backward search previously: > 14 .. 90 s (random) > backward search with fixed CHAR_TO_BYTE: > 5.6 s I don't see any fix of CHAR_TO_BYTE in the current CVS code. Where is it? > > (2) Pre-compute the character codes in PAT in an integer > > array if LEN is not that long (perhaps LEN < 256, or > > at most, sizeof (int) * LEN < MAX_ALLOCA). > > > > Then, you don't need the repeated STRING_CHAR on PAT. This > > can be applicable to forward search too. > > > In practice searching a string is mostly about searching the first > char in the string, basically like strchr(buf, pat[0]). (That is > unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is > not a practical example) > So as to pre-computing the pattern, you'd get the most improvement > already from just pre-computing "pat[0]" or "pat[len-1]" if you > want to. When there's no match, it's true that pre-computing of the whole pattern is a waste. But, when there's a match, we anyway compute character codes of the whole pattern. So perhaps the good strategy will be to record the computed character codes gradually when we need it. > > (3) In addition to (2), pre-compute the character codes in > > BUF too in an array of the same length as (2). > > > > Then you can avoid using STRING_CHAR and TRANSLATE > > repeatedly on the same place of BUF. This requires modulo > > calculation to get an index of the array, but I think it's > > faster than the combination of STRING_CHAR and TRANSLATE. > Because the first char matches rarely (on average), a repeated > translation of the same place happens rarely too. > Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is > slow, so a translation table as C array for say the 0..127 > range, would help indeed. > In any case, with some tweaking it is possible to improve both > directions by ~70% (that is down to about 1 sec for the test > case). I still don't know why boyer_moore with a one-char > pattern takes only 0.5 seconds though. It's amazingly fast. Are you comparing both methods with the same value of case-fold-search? > Btw it seems that long loading time for the big file has much to > do with inefficient counting of newlines. Appearently it takes > ~2 sec to load the file and then another ~6 sec to scan newlines. > It should be (far) under 0.5 sec. Why is the code of counting newlines called when we just visit a file? --- Kenichi Handa handa@m17n.org