Re: Case mapping of sharp s

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: grischka <grishka@gmx.de>
To: Kenichi Handa <handa@m17n.org>
Cc: ulm@gentoo.org, Andreas Schwab <schwab@linux-m68k.org>,
	monnier@iro.umontreal.ca, emacs-devel@gnu.org
Subject: Re: Case mapping of sharp s
Date: Tue, 24 Nov 2009 20:23:43 +0100	[thread overview]
Message-ID: <4B0C32BF.2020708@gmx.de> (raw)
In-Reply-To: <tl7vdh09kie.fsf@m17n.org>

Kenichi Handa wrote:
> In article <m21vjst7ha.fsf@igel.home>, Andreas Schwab <schwab@linux-m68k.org> writes:
> 
>> Ulrich Mueller <ulm@gentoo.org> writes:
>>> Is that the reason for the backwards search in ERC buffers being
>>> extremely slow? It may keep Emacs busy for several *minutes*. And it's
>>> not interruptible with C-g.
> 
>> Does this patch help?
> 
> Here are some ideas to improve it.
> 
> (1) Do forward matching in backward search.
> 
> The original code roughly does this to search "012abc"
> for "012" from the tail.
>    check if "012" matches with "abc"
>    check if "012" matches with "2ab"
>    ...
> 
> But the new code does this:
>    check if "210" matches with "cba"
>    check if "210" matches with "ba2"
>    ...
> 
> As INC_BOTH is faster than DEC_BOTH, the original way of
> check matching is faster.  

DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
are (as with Andy's patch).  Moderately slower, still ;)

> The slowness of the orignal code
> was caused by using CHAR_TO_BYTE to find the place of "2"
> when you know the place of "a".  Use DEC_BOTH here only.

The originally observed slowness was not because of the usage of
CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
using unrelated "best_below" and "best_above" in the same expression.

For the numbers, with my 100MB file test case:

backward search previously:
	14 .. 90 s (random)
backward search with fixed CHAR_TO_BYTE:
	5.6 s
backward search without CHAR_TO_BYTE (Andy's patch):
	4.1 s
forward search:
	3.6 s

> (2) Pre-compute the character codes in PAT in an integer
>     array if LEN is not that long (perhaps LEN < 256, or
>     at most, sizeof (int) * LEN < MAX_ALLOCA).
> 
> Then, you don't need the repeated STRING_CHAR on PAT.  This
> can be applicable to forward search too.
>

In practice searching a string is mostly about searching the first
char in the string, basically like strchr(buf, pat[0]).  (That is
unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is
not a practical example)

So as to pre-computing the pattern, you'd get the most improvement
already from just pre-computing "pat[0]" or "pat[len-1]" if you
want to.

> (3) In addition to (2), pre-compute the character codes in
>     BUF too in an array of the same length as (2).
> 
> Then you can avoid using STRING_CHAR and TRANSLATE
> repeatedly on the same place of BUF.  This requires modulo
> calculation to get an index of the array, but I think it's
> faster than the combination of STRING_CHAR and TRANSLATE.

Because the first char matches rarely (on average), a repeated
translation of the same place happens rarely too.

Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is
slow,  so a translation table as C array for say the 0..127
range, would help indeed.

In any case, with some tweaking it is possible to improve both
directions by ~70% (that is down to about 1 sec for the test
case).  I still don't know why boyer_moore with a one-char
pattern takes only 0.5 seconds though.  It's amazingly fast.

Btw it seems that long loading time for the big file has much to
do with inefficient counting of newlines.  Appearently it takes
~2 sec to load the file and then another ~6 sec to scan newlines.
It should be (far) under 0.5 sec.

--- grischka

> ---
> Kenichi Handa
> handa@m17n.org
>

next prev parent reply	other threads:[~2009-11-24 19:23 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-19 19:48 Case mapping of sharp s grischka
2009-11-19 21:49 ` Stefan Monnier
2009-11-19 22:43   ` David Kastrup
2009-11-20  2:08     ` Stefan Monnier
2009-11-20  8:03       ` David Kastrup
2009-11-20 14:14         ` Stefan Monnier
2009-11-20  3:41     ` Stephen J. Turnbull
2009-11-20  4:20       ` Stefan Monnier
2009-11-20  7:13         ` Stephen J. Turnbull
2009-11-21  0:02           ` Richard Stallman
2009-11-21 12:39             ` David Kastrup
2009-11-21 17:40               ` Stephen J. Turnbull
2009-11-21 19:15                 ` Eli Zaretskii
2009-11-22  2:58                   ` Stephen J. Turnbull
2009-11-22  4:28                     ` Eli Zaretskii
2009-11-22  8:27                       ` Stephen J. Turnbull
2009-11-23  1:30                 ` Kenichi Handa
2009-11-21 22:52               ` Richard Stallman
2009-11-20  8:10     ` Ulrich Mueller
2009-11-20 11:46       ` Stephen J. Turnbull
2009-11-20 14:43         ` Ulrich Mueller
2009-11-21  4:33           ` Stephen J. Turnbull
2009-11-19 23:25   ` grischka
2009-11-20  2:11     ` Stefan Monnier
2009-11-21  3:08       ` grischka
2009-11-21  8:58         ` Eli Zaretskii
2009-11-21  9:33           ` Andreas Schwab
2009-11-21 11:45             ` Eli Zaretskii
2009-11-21 15:33           ` grischka
2009-11-21 10:41         ` Ulrich Mueller
2009-11-21 11:58           ` Andreas Schwab
2009-11-21 17:01             ` Ulrich Mueller
2009-11-22 12:11               ` Andreas Schwab
2009-11-22 20:15                 ` Stefan Monnier
2009-11-24 12:26             ` Kenichi Handa
2009-11-24 19:23               ` grischka [this message]
2009-11-25  2:13                 ` Kenichi Handa
2009-11-26 13:07                   ` grischka
2009-11-29 22:03                   ` Juri Linkov
2009-11-30  1:22                     ` Stefan Monnier
2009-11-30  1:28                     ` Kenichi Handa
2009-11-30  1:36                       ` Kenichi Handa
2009-11-30  7:01                     ` Ulrich Mueller
2009-11-30 12:01                       ` Juri Linkov
2009-11-30 13:09                         ` martin rudalics
2009-11-30 21:57                       ` Juri Linkov
2009-11-30 22:34                         ` Ulrich Mueller
2009-12-01  0:02                           ` Juri Linkov
  -- strict thread matches above, loose matches on Subject: below --
2009-11-15 14:29 Ulrich Mueller
2009-11-16 12:06 ` Kenichi Handa
2009-11-16 16:38   ` Ulrich Mueller
2009-11-17  7:36     ` Kenichi Handa
2009-11-17 21:23       ` Reiner Steib
2009-11-16 19:12   ` Eli Zaretskii
2009-11-17  7:43     ` martin rudalics
2009-11-17  7:49     ` Kenichi Handa
2009-11-17 18:56       ` Eli Zaretskii
2009-11-18  1:00         ` Kenichi Handa
2009-11-18  4:09           ` Eli Zaretskii
2009-11-18  5:33             ` Stephen J. Turnbull
2009-11-18  6:26             ` Kenichi Handa
2009-11-18 14:44               ` Stefan Monnier
2009-11-18 19:05                 ` Ulrich Mueller
2009-11-19  1:16                   ` Stefan Monnier
2009-11-18 17:58               ` Eli Zaretskii
2009-11-19  1:57                 ` Stephen J. Turnbull

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4B0C32BF.2020708@gmx.de \
    --to=grishka@gmx.de \
    --cc=emacs-devel@gnu.org \
    --cc=handa@m17n.org \
    --cc=monnier@iro.umontreal.ca \
    --cc=schwab@linux-m68k.org \
    --cc=ulm@gentoo.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).