From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Case mapping of sharp s
Date: Wed, 25 Nov 2009 11:13:23 +0900
Message-ID: <tl7r5rn9wsc.fsf@m17n.org>
References: <4B05A11F.5000700@gmx.de>
	<jwvskcai43z.fsf-monnier+emacs@gnu.org>	<4B05D3EE.2000101@gmx.de>
	<jwvws1mt0ao.fsf-monnier+emacs@gnu.org>	<4B0759BA.2010303@gmx.de>	<19207.50135.691132.983395@a1i15.kph.uni-mainz.de>
	<m21vjst7ha.fsf@igel.home> <tl7vdh09kie.fsf@m17n.org>
	<4B0C32BF.2020708@gmx.de>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1259115233 24650 80.91.229.12 (25 Nov 2009 02:13:53 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 25 Nov 2009 02:13:53 +0000 (UTC)
Cc: ulm@gentoo.org, schwab@linux-m68k.org, monnier@iro.umontreal.ca,
	emacs-devel@gnu.org
To: grischka <grishka@gmx.de>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Nov 25 03:13:45 2009
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1ND7O4-0000yc-Vl
	for ged-emacs-devel@m.gmane.org; Wed, 25 Nov 2009 03:13:45 +0100
Original-Received: from localhost ([127.0.0.1]:54909 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1ND7O4-0000fa-ES
	for ged-emacs-devel@m.gmane.org; Tue, 24 Nov 2009 21:13:44 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1ND7O0-0000fV-48
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:40 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1ND7Nv-0000cY-Lt
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:39 -0500
Original-Received: from [199.232.76.173] (port=35120 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1ND7Nv-0000cV-FL
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:35 -0500
Original-Received: from mx1.aist.go.jp ([150.29.246.133]:40189)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <handa@m17n.org>) id 1ND7Nu-0003Cz-Rx
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 21:13:35 -0500
Original-Received: from rqsmtp1.aist.go.jp (rqsmtp1.aist.go.jp [150.29.254.115])
	by mx1.aist.go.jp  with ESMTP id nAP2DQ74018131;
	Wed, 25 Nov 2009 11:13:26 +0900 (JST) env-from (handa@m17n.org)
Original-Received: from smtp3.aist.go.jp
	by rqsmtp1.aist.go.jp  with ESMTP id nAP2DQUg026415;
	Wed, 25 Nov 2009 11:13:26 +0900 (JST) env-from (handa@m17n.org)
Original-Received: by smtp3.aist.go.jp  with ESMTP id nAP2DNWe024310;
	Wed, 25 Nov 2009 11:13:23 +0900 (JST) env-from (handa@m17n.org)
Original-Received: from handa by etlken with local (Exim 4.69)
	(envelope-from <handa@m17n.org>)
	id 1ND7Nj-0006Y5-E2; Wed, 25 Nov 2009 11:13:23 +0900
In-Reply-To: <4B0C32BF.2020708@gmx.de> (message from grischka on Tue,
	24 Nov 2009 20:23:43 +0100)
X-detected-operating-system: by monty-python.gnu.org: Solaris 9
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:117722
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/117722>

In article <4B0C32BF.2020708@gmx.de>, grischka <grishka@gmx.de> writes:

> DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
> are (as with Andy's patch).  Moderately slower, still ;)

So, changing the current backward matching to forward
matching should is effective.

> The originally observed slowness was not because of the usage of
> CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
> using unrelated "best_below" and "best_above" in the same expression.

> For the numbers, with my 100MB file test case:

> backward search previously:
> 	14 .. 90 s (random)
> backward search with fixed CHAR_TO_BYTE:
> 	5.6 s

I don't see any fix of CHAR_TO_BYTE in the current CVS
code.  Where is it?

> > (2) Pre-compute the character codes in PAT in an integer
> >     array if LEN is not that long (perhaps LEN < 256, or
> >     at most, sizeof (int) * LEN < MAX_ALLOCA).
> > 
> > Then, you don't need the repeated STRING_CHAR on PAT.  This
> > can be applicable to forward search too.
> >

> In practice searching a string is mostly about searching the first
> char in the string, basically like strchr(buf, pat[0]).  (That is
> unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is
> not a practical example)

> So as to pre-computing the pattern, you'd get the most improvement
> already from just pre-computing "pat[0]" or "pat[len-1]" if you
> want to.

When there's no match, it's true that pre-computing of the
whole pattern is a waste.  But, when there's a match, we
anyway compute character codes of the whole pattern.  So
perhaps the good strategy will be to record the computed
character codes gradually when we need it.

> > (3) In addition to (2), pre-compute the character codes in
> >     BUF too in an array of the same length as (2).
> > 
> > Then you can avoid using STRING_CHAR and TRANSLATE
> > repeatedly on the same place of BUF.  This requires modulo
> > calculation to get an index of the array, but I think it's
> > faster than the combination of STRING_CHAR and TRANSLATE.

> Because the first char matches rarely (on average), a repeated
> translation of the same place happens rarely too.

> Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is
> slow,  so a translation table as C array for say the 0..127
> range, would help indeed.

> In any case, with some tweaking it is possible to improve both
> directions by ~70% (that is down to about 1 sec for the test
> case).  I still don't know why boyer_moore with a one-char
> pattern takes only 0.5 seconds though.  It's amazingly fast.

Are you comparing both methods with the same value of
case-fold-search?

> Btw it seems that long loading time for the big file has much to
> do with inefficient counting of newlines.  Appearently it takes
> ~2 sec to load the file and then another ~6 sec to scan newlines.
> It should be (far) under 0.5 sec.

Why is the code of counting newlines called when we just
visit a file?

---
Kenichi Handa
handa@m17n.org