From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: grischka Newsgroups: gmane.emacs.devel Subject: Re: Case mapping of sharp s Date: Tue, 24 Nov 2009 20:23:43 +0100 Message-ID: <4B0C32BF.2020708@gmx.de> References: <4B05A11F.5000700@gmx.de> <4B05D3EE.2000101@gmx.de> <4B0759BA.2010303@gmx.de> <19207.50135.691132.983395@a1i15.kph.uni-mainz.de> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Trace: ger.gmane.org 1259090806 19934 80.91.229.12 (24 Nov 2009 19:26:46 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Tue, 24 Nov 2009 19:26:46 +0000 (UTC) Cc: ulm@gentoo.org, Andreas Schwab , monnier@iro.umontreal.ca, emacs-devel@gnu.org To: Kenichi Handa Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Nov 24 20:26:38 2009 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1ND125-0004q2-Ty for ged-emacs-devel@m.gmane.org; Tue, 24 Nov 2009 20:26:38 +0100 Original-Received: from localhost ([127.0.0.1]:60933 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1ND125-0007pq-CU for ged-emacs-devel@m.gmane.org; Tue, 24 Nov 2009 14:26:37 -0500 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1ND10b-0005sT-Et for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:05 -0500 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1ND10W-0005lT-TW for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:04 -0500 Original-Received: from [199.232.76.173] (port=55756 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1ND10W-0005lA-Ly for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:00 -0500 Original-Received: from mail.gmx.net ([213.165.64.20]:44517) by monty-python.gnu.org with smtp (Exim 4.60) (envelope-from ) id 1ND10V-000661-QP for emacs-devel@gnu.org; Tue, 24 Nov 2009 14:25:00 -0500 Original-Received: (qmail invoked by alias); 24 Nov 2009 19:24:39 -0000 Original-Received: from p57A09B59.dip0.t-ipconnect.de (EHLO [192.168.1.2]) [87.160.155.89] by mail.gmx.net (mp058) with SMTP; 24 Nov 2009 20:24:39 +0100 X-Authenticated: #18588216 X-Provags-ID: V01U2FsdGVkX1+sTjNcgpSsDrN9MT152JG6xvcoFhSMnj00JFbT1v tJ5KfRT1/nsM6t User-Agent: Thunderbird 2.0.0.23 (Windows/20090812) In-Reply-To: X-Y-GMX-Trusted: 0 X-FuHaFi: 0.46 X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6 (newer, 3) X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:117701 Archived-At: Kenichi Handa wrote: > In article , Andreas Schwab writes: > >> Ulrich Mueller writes: >>> Is that the reason for the backwards search in ERC buffers being >>> extremely slow? It may keep Emacs busy for several *minutes*. And it's >>> not interruptible with C-g. > >> Does this patch help? > > Here are some ideas to improve it. > > (1) Do forward matching in backward search. > > The original code roughly does this to search "012abc" > for "012" from the tail. > check if "012" matches with "abc" > check if "012" matches with "2ab" > ... > > But the new code does this: > check if "210" matches with "cba" > check if "210" matches with "ba2" > ... > > As INC_BOTH is faster than DEC_BOTH, the original way of > check matching is faster. DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH are (as with Andy's patch). Moderately slower, still ;) > The slowness of the orignal code > was caused by using CHAR_TO_BYTE to find the place of "2" > when you know the place of "a". Use DEC_BOTH here only. The originally observed slowness was not because of the usage of CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as using unrelated "best_below" and "best_above" in the same expression. For the numbers, with my 100MB file test case: backward search previously: 14 .. 90 s (random) backward search with fixed CHAR_TO_BYTE: 5.6 s backward search without CHAR_TO_BYTE (Andy's patch): 4.1 s forward search: 3.6 s > (2) Pre-compute the character codes in PAT in an integer > array if LEN is not that long (perhaps LEN < 256, or > at most, sizeof (int) * LEN < MAX_ALLOCA). > > Then, you don't need the repeated STRING_CHAR on PAT. This > can be applicable to forward search too. > In practice searching a string is mostly about searching the first char in the string, basically like strchr(buf, pat[0]). (That is unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is not a practical example) So as to pre-computing the pattern, you'd get the most improvement already from just pre-computing "pat[0]" or "pat[len-1]" if you want to. > (3) In addition to (2), pre-compute the character codes in > BUF too in an array of the same length as (2). > > Then you can avoid using STRING_CHAR and TRANSLATE > repeatedly on the same place of BUF. This requires modulo > calculation to get an index of the array, but I think it's > faster than the combination of STRING_CHAR and TRANSLATE. Because the first char matches rarely (on average), a repeated translation of the same place happens rarely too. Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is slow, so a translation table as C array for say the 0..127 range, would help indeed. In any case, with some tweaking it is possible to improve both directions by ~70% (that is down to about 1 sec for the test case). I still don't know why boyer_moore with a one-char pattern takes only 0.5 seconds though. It's amazingly fast. Btw it seems that long loading time for the big file has much to do with inefficient counting of newlines. Appearently it takes ~2 sec to load the file and then another ~6 sec to scan newlines. It should be (far) under 0.5 sec. --- grischka > --- > Kenichi Handa > handa@m17n.org >