unofficial mirror of emacs-devel@gnu.org 

* New buffer-case-table makes search_buffer painfully slow
@ 2006-05-04 13:46 Elias Oltmanns
  2006-05-06 19:10 ` Elias Oltmanns
  0 siblings, 1 reply; 6+ messages in thread
From: Elias Oltmanns @ 2006-05-04 13:46 UTC


Hi all,

switching from emacs 21 to emacs 22 has a very significant performance
impact on packages that make heavy use of search_buffer. An example
that actually made me aware of this problem is gnus processing large
mbox files. Further analysis of this problem revealed that in emacs 22
an "i" in the search string makes search_buffer use simple_search()
instead of boyer_moore(). This means that, for instance, a loop
repeatedly calling re-search-forward with the search string
"X-Gnus-Article-Number" takes several orders of magnitude more time
in emacs 22 than in emacs 21 just because of the "i" in "Article" --
at least in a multibyte buffer. The cause for this seems to be a
change in the buffer-case-table. Comparing the output of M-x
describe-buffer-case-table in emacs 21 and emacs 22 makes me wonder
whether a match of a certain character in unicode row 32 with "i" in
the emacs 22 table might be the cause for this trouble. If so, what
would be the right thing to do about it? Of course, applications like
gnus have to open the mbox files in multibyte mode simply because
mails in different languages and charsets may be stored in these
files. Yet, I'm quite confident that quite a few people if not the
majority will never need the match of i with this obscure character
but would certainly prefer the boyer_moore algorithm when searching
for strings containing an "i".
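
A rough way to reproduce the comparison, as a minimal sketch only:
`my-time-search' is a hypothetical helper, not part of emacs or gnus,
and it assumes case folding is switched on, since that is when the
case table gets consulted. The buffer contents are made up for
illustration.

```elisp
;; Hypothetical helper: fill a temporary multibyte buffer with fake
;; mbox-like headers and time a search loop over it.
(defun my-time-search (needle copies)
  "Search for NEEDLE in a temporary buffer holding COPIES fake headers.
Return a cons (HITS . SECONDS)."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (dotimes (_ copies)
      (insert "From nobody\nSubject: filler\n" needle ": 1\n\n"))
    (goto-char (point-min))
    (let ((case-fold-search t)          ; assumption: folding is on
          (hits 0)
          (start (float-time)))
      (while (re-search-forward (regexp-quote needle) nil t)
        (setq hits (1+ hits)))
      (cons hits (- (float-time) start)))))

;; Compare a needle containing "i" with one that has none.
(list (my-time-search "X-Gnus-Article-Number" 20000)
      (my-time-search "X-Gnus-Art-cle-Number" 20000))
```

If the analysis above is right, the first timing should be much larger
than the second on emacs 22, and roughly equal on emacs 21.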

Any ideas and thoughts concerning this problem?

Regards,

Elias

* Re: New buffer-case-table makes search_buffer painfully slow
  2006-05-04 13:46 New buffer-case-table makes search_buffer painfully slow Elias Oltmanns
@ 2006-05-06 19:10 ` Elias Oltmanns
  2006-05-06 20:17   ` Andreas Schwab
  2006-05-07  5:01   ` Richard Stallman
  0 siblings, 2 replies; 6+ messages in thread
From: Elias Oltmanns @ 2006-05-06 19:10 UTC

Elias Oltmanns <oltmanns@uni-bonn.de> wrote:
> Hi all,
>
> switching from emacs 21 to emacs 22 has a very significant performance
> impact on packages that make heavy use of search_buffer. An example
> that actually made me aware of this problem is gnus processing large
> mbox files. Further analysis of this problem revealed that in emacs 22
> an "i" in the search string makes search_buffer use simple_search()
> instead of boyer_moore(). 

Emacs 22's EQUIVALENCES table relates i, and thus I as well, to two
more characters with character codes 331857 and 331856. On
www.unicode.org the character look up engine couldn't find a match for
U+51051 or U+51050 saying that most likely those codes weren't
assigned to any characters yet.

So, here is a plain question: Is there a bug in the case-table in
emacs 22 or does the search engine on www.unicode.org for some reason
miss certain character ranges? Slightly biased, I'm disregarding the
possibility of me being unable to use www.unicode.org properly, which,
in fact, might well be the reason for my confusion.

Second question: If the case-table was right, what would be the right
way to tackle the problem described in my original post? For me the
following snippet in .emacs solves the problem:
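--- ~/.emacs ---
;; 331856 and 331857 are the two extra character codes that the
;; EQUIVALENCES table ties to i and I in emacs 22.  Making them
;; case-invariant word constituents removes that link.
(unless (< emacs-major-version 22)
  (set-case-syntax 331856 "w" (standard-case-table))
  (set-case-syntax 331857 "w" (standard-case-table)))
--- ~/.emacs ---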

This, of course, is a dirty hack and I'm wondering whether emacs
should provide a feature to "clean up" the EQUIVALENCES table in the
ascii range in order to avoid falling back to a slow search
algorithm when we are searching for pure ascii strings. Or do you
think that packages like gnus which make heavy use of
re-search-forward should handle these performance issues
themselves---or indeed the users?

Regards,

Elias

* Re: New buffer-case-table makes search_buffer painfully slow
  2006-05-06 19:10 ` Elias Oltmanns
@ 2006-05-06 20:17   ` Andreas Schwab
  2006-05-07  5:01   ` Richard Stallman
  1 sibling, 0 replies; 6+ messages in thread
From: Andreas Schwab @ 2006-05-06 20:17 UTC
  Cc: emacs-devel

Elias Oltmanns <oltmanns@uni-bonn.de> writes:

> Emacs 22's EQUIVALENCES table relates i, and thus I as well, to two
> more characters with character codes 331857 and 331856.

These are Emacs internal codes, corresponding to U+0131 LATIN SMALL LETTER
DOTLESS I and U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
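
The U+51051 lookup comes from reading the decimal internal code as if
it were a Unicode code point. A small sketch of the difference; the
`encode-char' call is an assumption about how to ask Emacs 22 for the
Unicode value of an internal code, and versions with a different
internal representation may answer differently:

```elisp
;; 331857 printed in hex is the value that was looked up on
;; www.unicode.org.
(format "U+%X" 331857)   ; => "U+51051"

;; Assumption: `encode-char' with the `ucs' charset maps an internal
;; character code to its Unicode code point.
(format "U+%04X" (encode-char 331857 'ucs))
```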

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: New buffer-case-table makes search_buffer painfully slow
  2006-05-06 19:10 ` Elias Oltmanns
  2006-05-06 20:17   ` Andreas Schwab
@ 2006-05-07  5:01   ` Richard Stallman
  2006-05-12 14:16     ` Elias Oltmanns
  1 sibling, 1 reply; 6+ messages in thread
From: Richard Stallman @ 2006-05-07  5:01 UTC
  Cc: emacs-devel

    Emacs 22's EQUIVALENCES table relates i, and thus I as well, to two
    more characters with character codes 331857 and 331856. On
    www.unicode.org the character look up engine couldn't find a match for
    U+51051 or U+51050 saying that most likely those codes weren't
    assigned to any characters yet.

I think this has to do with the special characters for Turkish,
lower-case i without dot and upper-case I with dot.  In Turkish,
upcasing and downcasing preserve the dot, or the absence of the dot.

I think these lines in characters.el are the cause of the problem.

  (set-downcase-syntax  ?İ ?i tbl)
  (set-upcase-syntax    ?I ?ı tbl)

They set up only half of what Turkish needs.
They make dotless-i upcase into I, and they make
I-with-dot downcase into i.  They can't do vice versa
because that would break things for other languages.
So they are not really useful.  We could simply delete them.
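
The one-way nature of these mappings can be checked on a private copy
of the case table. This is only a sketch: `my-one-way-case-demo' is a
hypothetical name, and the characters are built with `decode-char' on
the assumption that this yields the right internal codes on the
running Emacs.

```elisp
(require 'case-table)

;; Hypothetical demo function.  The two set-*-syntax calls mirror the
;; lines from characters.el, applied to a copy of the standard case
;; table.  Side effect: both helpers also give the characters word
;; syntax in the standard syntax table.
(defun my-one-way-case-demo ()
  "Return a list of four case conversions under the one-way mappings.
The elements are the downcase of I-with-dot, the upcase of dotless-i,
the upcase of i, and the downcase of I."
  (let ((tbl (copy-case-table (standard-case-table)))
        (dotted-I (decode-char 'ucs #x130))
        (dotless-i (decode-char 'ucs #x131)))
    (set-downcase-syntax dotted-I ?i tbl)
    (set-upcase-syntax ?I dotless-i tbl)
    (with-temp-buffer
      (set-case-table tbl)
      (list (downcase dotted-I) (upcase dotless-i)
            (upcase ?i) (downcase ?I)))))
```

The first two elements should come back as plain i and I, while i and
I themselves still convert only into each other.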

We could also add a minor mode to set up the case table all the way
for Turkish.

Would someone like to do that?
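
As a starting point, such a minor mode might look roughly like this.
It is a sketch under assumptions, not a finished design: the names
`my-turkish-case-table' and `my-turkish-case-mode' are hypothetical,
and turning the mode off simply falls back to the standard case table
rather than restoring whatever table the buffer had before.

```elisp
(require 'case-table)

;; Hypothetical case table with the full Turkish pairs:
;; i <-> I-with-dot and dotless-i <-> I.
(defvar my-turkish-case-table
  (let ((tbl (copy-case-table (standard-case-table)))
        (dotted-I (decode-char 'ucs #x130))
        (dotless-i (decode-char 'ucs #x131)))
    (set-case-syntax-pair dotted-I ?i tbl)
    (set-case-syntax-pair ?I dotless-i tbl)
    tbl)
  "Case table pairing i with I-with-dot and dotless-i with I.")

;; Hypothetical buffer-local minor mode that swaps the case table.
(define-minor-mode my-turkish-case-mode
  "Use Turkish case pairs in the current buffer."
  :lighter " TrCase"
  (set-case-table (if my-turkish-case-mode
                      my-turkish-case-table
                    ;; Assumption: the buffer used the standard table.
                    (standard-case-table))))
```

With the mode on, upcasing i should give I-with-dot and downcasing I
should give dotless-i; other buffers keep the standard table.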

* Re: New buffer-case-table makes search_buffer painfully slow
  2006-05-07  5:01   ` Richard Stallman
@ 2006-05-12 14:16     ` Elias Oltmanns
  2006-05-13  4:53       ` Richard Stallman
  0 siblings, 1 reply; 6+ messages in thread
From: Elias Oltmanns @ 2006-05-12 14:16 UTC

Richard Stallman <rms@gnu.org> wrote:
>     Emacs 22's EQUIVALENCES table relates i, and thus I as well, to
>     two more characters with character codes 331857 and 331856. On
>     www.unicode.org the character look up engine couldn't find a
>     match for U+51051 or U+51050 saying that most likely those codes
>     weren't assigned to any characters yet.
>
> I think this has to do with the special characters for Turkish,
> lower-case i without dot and upper-case I with dot. In Turkish,
> upcasing and downcasing preserve the dot, or the absence of the dot.
>
> I think these lines in characters.el are the cause of the problem.
>
>   (set-downcase-syntax ?İ ?i tbl)
>   (set-upcase-syntax ?I ?ı tbl)
>
> They set up only half of what Turkish needs. They make dotless-i
> upcase into I, and they make I-with-dot downcase into i. They can't
> do vice versa because that would break things for other languages.
> So they are not really useful. We could simply delete them.
>
> We could also add a minor mode to set up the case table all the way
> for Turkish.

When I come to think of it, I'm not quite sure I understand what
exactly you have in mind with regard to the minor mode option.
Unfortunately, I don't know anything about Turkish at all, but I'd
imagine that while you're editing pure Turkish texts, you'd like to
have a matching pair of dotless and dotted up- and downcase i
respectively. That way up- and downcasing work properly and case
insensitive searches for an i would not match the dotless
versions---as expected, I suppose.

If you're editing mixed texts as, for instance, Turkish and English,
the current behaviour with i matching all four characters might be
more convenient; the same applies if you switch between Turkish and
other languages rather frequently.

The third option, which from my very biased point of view should be
the default, is that ASCII i should only match its ASCII upcase
counterpart.

How would you realise all these needs?

Regards,

Elias

* Re: New buffer-case-table makes search_buffer painfully slow
  2006-05-12 14:16     ` Elias Oltmanns
@ 2006-05-13  4:53       ` Richard Stallman
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Stallman @ 2006-05-13  4:53 UTC
  Cc: emacs-devel

    When I come to think of it, I'm not quite sure I understand what
    exactly you have in mind with regard to the minor mode option.
    Unfortunately, I don't know anything about Turkish at all, but I'd

I am sure there are people on the list who understand the issue
and can fix the problem.  Would one of them please do so?
