From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: grischka <grishka@gmx.de>
Newsgroups: gmane.emacs.devel
Subject: Re: Case mapping of sharp s
Date: Tue, 24 Nov 2009 20:23:43 +0100
Message-ID: <4B0C32BF.2020708@gmx.de>
References: <4B05A11F.5000700@gmx.de>
	<jwvskcai43z.fsf-monnier+emacs@gnu.org>	<4B05D3EE.2000101@gmx.de>
	<jwvws1mt0ao.fsf-monnier+emacs@gnu.org>	<4B0759BA.2010303@gmx.de>	<19207.50135.691132.983395@a1i15.kph.uni-mainz.de>
	<m21vjst7ha.fsf@igel.home> <tl7vdh09kie.fsf@m17n.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: ger.gmane.org 1259090806 19934 80.91.229.12 (24 Nov 2009 19:26:46 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 24 Nov 2009 19:26:46 +0000 (UTC)
Cc: ulm@gentoo.org, Andreas Schwab <schwab@linux-m68k.org>,
	monnier@iro.umontreal.ca, emacs-devel@gnu.org
To: Kenichi Handa <handa@m17n.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Nov 24 20:26:38 2009
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1ND125-0004q2-Ty
	for ged-emacs-devel@m.gmane.org; Tue, 24 Nov 2009 20:26:38 +0100
Original-Received: from localhost ([127.0.0.1]:60933 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1ND125-0007pq-CU
	for ged-emacs-devel@m.gmane.org; Tue, 24 Nov 2009 14:26:37 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1ND10b-0005sT-Et
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:05 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1ND10W-0005lT-TW
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:04 -0500
Original-Received: from [199.232.76.173] (port=55756 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1ND10W-0005lA-Ly
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:00 -0500
Original-Received: from mail.gmx.net ([213.165.64.20]:44517)
	by monty-python.gnu.org with smtp (Exim 4.60)
	(envelope-from <grishka@gmx.de>) id 1ND10V-000661-QP
	for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:00 -0500
Original-Received: (qmail invoked by alias); 24 Nov 2009 19:24:39 -0000
Original-Received: from p57A09B59.dip0.t-ipconnect.de (EHLO [192.168.1.2])
	[87.160.155.89]
	by mail.gmx.net (mp058) with SMTP; 24 Nov 2009 20:24:39 +0100
X-Authenticated: #18588216
X-Provags-ID: V01U2FsdGVkX1+sTjNcgpSsDrN9MT152JG6xvcoFhSMnj00JFbT1v
	tJ5KfRT1/nsM6t
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
In-Reply-To: <tl7vdh09kie.fsf@m17n.org>
X-Y-GMX-Trusted: 0
X-FuHaFi: 0.46
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6 (newer, 3)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:117701
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/117701>

Kenichi Handa wrote:
> In article <m21vjst7ha.fsf@igel.home>, Andreas Schwab <schwab@linux-m68k.org> writes:
> 
>> Ulrich Mueller <ulm@gentoo.org> writes:
>>> Is that the reason for the backwards search in ERC buffers being
>>> extremely slow? It may keep Emacs busy for several *minutes*. And it's
>>> not interruptible with C-g.
> 
>> Does this patch help?
> 
> Here are some ideas to improve it.
> 
> (1) Do forward matching in backward search.
> 
> The original code roughly does this to search "012abc"
> for "012" from the tail.
>    check if "012" matches with "abc"
>    check if "012" matches with "2ab"
>    ...
> 
> But the new code does this:
>    check if "210" matches with "cba"
>    check if "210" matches with "ba2"
>    ...
> 
> As INC_BOTH is faster than DEC_BOTH, the original way of
> check matching is faster.  

DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
are (as with Andy's patch).  Moderately slower, still ;)

> The slowness of the orignal code
> was caused by using CHAR_TO_BYTE to find the place of "2"
> when you know the place of "a".  Use DEC_BOTH here only.

The originally observed slowness was not because of the usage of
CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
using unrelated "best_below" and "best_above" in the same expression.

For the numbers, with my 100MB file test case:

backward search previously:
	14 .. 90 s (random)
backward search with fixed CHAR_TO_BYTE:
	5.6 s
backward search without CHAR_TO_BYTE (Andy's patch):
	4.1 s
forward search:
	3.6 s

> (2) Pre-compute the character codes in PAT in an integer
>     array if LEN is not that long (perhaps LEN < 256, or
>     at most, sizeof (int) * LEN < MAX_ALLOCA).
> 
> Then, you don't need the repeated STRING_CHAR on PAT.  This
> can be applicable to forward search too.
>

In practice searching a string is mostly about searching the first
char in the string, basically like strchr(buf, pat[0]).  (That is
unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is
not a practical example)

So as to pre-computing the pattern, you'd get the most improvement
already from just pre-computing "pat[0]" or "pat[len-1]" if you
want to.

> (3) In addition to (2), pre-compute the character codes in
>     BUF too in an array of the same length as (2).
> 
> Then you can avoid using STRING_CHAR and TRANSLATE
> repeatedly on the same place of BUF.  This requires modulo
> calculation to get an index of the array, but I think it's
> faster than the combination of STRING_CHAR and TRANSLATE.

Because the first char matches rarely (on average), a repeated
translation of the same place happens rarely too.

Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is
slow,  so a translation table as C array for say the 0..127
range, would help indeed.

In any case, with some tweaking it is possible to improve both
directions by ~70% (that is down to about 1 sec for the test
case).  I still don't know why boyer_moore with a one-char
pattern takes only 0.5 seconds though.  It's amazingly fast.

Btw it seems that long loading time for the big file has much to
do with inefficient counting of newlines.  Appearently it takes
~2 sec to load the file and then another ~6 sec to scan newlines.
It should be (far) under 0.5 sec.

--- grischka

> ---
> Kenichi Handa
> handa@m17n.org
>