Case mapping of sharp s

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Case mapping of sharp s
@ 2009-11-15 14:29 Ulrich Mueller
  2009-11-16 12:06 ` Kenichi Handa
  0 siblings, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-15 14:29 UTC (permalink / raw)
  To: emacs-devel

In Unicode since version 5.1.0 the U+1E9E code point is assigned to
"LATIN CAPITAL LETTER SHARP S". Would it be possible to add a mapping
from this to the lower case ß, as in the patch below?

However, I've noticed that similar mappings for Turkish ı (dotless i)
and İ (I with dot) were commented out [1]. Is it still so that such a
change would "make searches slow", as stated in the comment?

Ulrich

[1] <http://cvs.savannah.gnu.org/viewvc/emacs/lisp/international/characters.el?root=emacs&r1=1.56&r2=1.57>


--- emacs/lisp/international/characters.el
+++ emacs/lisp/international/characters.el
@@ -589,6 +589,11 @@
 	 (or (<= c #x1e94) (>= c #x1ea0))
 	 (set-case-syntax-pair c (1+ c) tbl))
     (setq c (1+ c)))
+  ;; U+1E9E LATIN CAPITAL LETTER SHARP S
+  ;; Note: In regular German orthography the sharp s has no upper case
+  ;; form, but is written as double S instead. Therefore the inverse
+  ;; mapping from lower to upper case shouldn't be done by default.
+  (set-downcase-syntax ?ẞ ?ß tbl)
 
   ;; Greek
   (modify-category-entry '(#x0370 . #x03ff) ?g)




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-15 14:29 Case mapping of sharp s Ulrich Mueller
@ 2009-11-16 12:06 ` Kenichi Handa
  2009-11-16 16:38   ` Ulrich Mueller
  2009-11-16 19:12   ` Eli Zaretskii
  0 siblings, 2 replies; 66+ messages in thread
From: Kenichi Handa @ 2009-11-16 12:06 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: emacs-devel

In article <19200.4158.380820.761685@a1i15.kph.uni-mainz.de>, Ulrich Mueller <ulm@gentoo.org> writes:

> In Unicode since version 5.1.0 the U+1E9E code point is assigned to
> "LATIN CAPITAL LETTER SHARP S". Would it be possible to add a mapping
> from this to the lower case ß, as in the patch below?

> However, I've noticed that similar mappings for Turkish ı (dotless i)
> and İ (I with dot) were commented out [1]. Is it still so that such a
> change would "make searches slow", as stated in the comment?

That kind of setting surely makes the searching of ß and ẞ
slow because we can't use BM search when case-fold-search is
non-nil.  BM search is possible only when all
case-equivalent characters are represented by the same byte
length, and differ only in the last byte.

So, if you are sure that searching of ß is very rare (I have
no idea), please install it.

By the way, I think it's possible to improve the current
BM-search for such a case.  For instance, to search
"straße", we at first do BM-search for "stra" part and then
check the remaining "ße" part.  Aren't there any challenger?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-16 12:06 ` Kenichi Handa
@ 2009-11-16 16:38   ` Ulrich Mueller
  2009-11-17  7:36     ` Kenichi Handa
  2009-11-16 19:12   ` Eli Zaretskii
  1 sibling, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-16 16:38 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

>>>>> On Mon, 16 Nov 2009, Kenichi Handa wrote:

> In article <19200.4158.380820.761685@a1i15.kph.uni-mainz.de>, Ulrich Mueller <ulm@gentoo.org> writes:
>> In Unicode since version 5.1.0 the U+1E9E code point is assigned
>> to "LATIN CAPITAL LETTER SHARP S". Would it be possible to add a
>> mapping from this to the lower case ß, as in the patch below?

>> However, I've noticed that similar mappings for Turkish ı (dotless
>> i) and İ (I with dot) were commented out [1]. Is it still so that
>> such a change would "make searches slow", as stated in the comment?

> That kind of setting surely makes the searching of ß and ẞ slow
> because we can't use BM search when case-fold-search is non-nil.
> BM search is possible only when all case-equivalent characters are
> represented by the same byte length, and differ only in the last
> byte.

So do I understand this right: In order to perform a Boyer-Moore
search, the characters have to be either both ASCII, or must be in the
same group of 64 adjacent characters (because the last byte in UTF-8
encodes 6 bits)?

Is that the reason why also ÿ and Ÿ (U+00FF and U+0178, small/capital
y with diaeresis) don't form a case pair?

> So, if you are sure that searching of ß is very rare (I have
> no idea), please install it.

Usage of (lower case) ß is very common in a German language context,
so I'd guess that searching for it is not so rare.

On the other hand, capital ẞ is not used in regular German orthography
(that's probably the reason why the character was added to Unicode
only in 2008). So if the change would cause large tradeoffs in search
speed, then I think it's not worthwhile.

By what factor is the non-BM search slower, as compared to the BM
search?

> By the way, I think it's possible to improve the current BM-search
> for such a case. For instance, to search "straße", we at first do
> BM-search for "stra" part and then check the remaining "ße" part.
> Aren't there any challenger?

Ulrich

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-16 16:38   ` Ulrich Mueller
@ 2009-11-17  7:36     ` Kenichi Handa
  2009-11-17 21:23       ` Reiner Steib
  0 siblings, 1 reply; 66+ messages in thread
From: Kenichi Handa @ 2009-11-17  7:36 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: emacs-devel

In article <19201.32770.352944.474086@a1i15.kph.uni-mainz.de>, Ulrich Mueller <ulm@gentoo.org> writes:

> So do I understand this right: In order to perform a Boyer-Moore
> search, the characters have to be either both ASCII, or must be in the
> same group of 64 adjacent characters (because the last byte in UTF-8
> encodes 6 bits)?

Yes.

> Is that the reason why also ÿ and Ÿ (U+00FF and U+0178, small/capital
> y with diaeresis) don't form a case pair?

Yes.

> > So, if you are sure that searching of ß is very rare (I have
> > no idea), please install it.

> Usage of (lower case) ß is very common in a German language context,
> so I'd guess that searching for it is not so rare.

> On the other hand, capital ẞ is not used in regular German orthography
> (that's probably the reason why the character was added to Unicode
> only in 2008). So if the change would cause large tradeoffs in search
> speed, then I think it's not worthwhile.

> By what factor is the non-BM search slower, as compared to the BM
> search?

I don't know exactly.  It depends on the length of searching
string; longer the string is, the more BM search is faster
than simple serach.  At least, when this code was active,
  ;; (set-downcase-syntax  ?İ ?i tbl)
  ;; (set-upcase-syntax    ?I ?ı tbl)
there were complaints about the slowdown.

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-17  7:36     ` Kenichi Handa
@ 2009-11-17 21:23       ` Reiner Steib
  0 siblings, 0 replies; 66+ messages in thread
From: Reiner Steib @ 2009-11-17 21:23 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Ulrich Mueller, emacs-devel

[ The following message is a courtesy copy of an article that has
  been posted to news:gmane.emacs.devel as well. ]

On Tue, Nov 17 2009, Kenichi Handa wrote:

> I don't know exactly.  It depends on the length of searching
> string; longer the string is, the more BM search is faster
> than simple serach.  At least, when this code was active,
>   ;; (set-downcase-syntax  ?İ ?i tbl)
>   ;; (set-upcase-syntax    ?I ?ı tbl)
> there were complaints about the slowdown.

Particularly in Gnus, searching for "X-Gnus-Article-Number" made
operations on large mbox/nnfolder files extremely slow.

See the tread "Slow operations on buffers of tens of megabytes" 
(November 2006):

<http://thread.gmane.org/or4ptezc71.fsf%40fsfla.org> or
<http://thread.gmane.org/gmane.emacs.gnus.general/63925/focus=63929>
for the full thread.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-16 12:06 ` Kenichi Handa
  2009-11-16 16:38   ` Ulrich Mueller
@ 2009-11-16 19:12   ` Eli Zaretskii
  2009-11-17  7:43     ` martin rudalics
  2009-11-17  7:49     ` Kenichi Handa
  1 sibling, 2 replies; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-16 19:12 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ulm, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Mon, 16 Nov 2009 21:06:38 +0900
> Cc: emacs-devel@gnu.org
> 
> In article <19200.4158.380820.761685@a1i15.kph.uni-mainz.de>, Ulrich Mueller <ulm@gentoo.org> writes:
> 
> > In Unicode since version 5.1.0 the U+1E9E code point is assigned to
> > "LATIN CAPITAL LETTER SHARP S". Would it be possible to add a mapping
> > from this to the lower case ß, as in the patch below?
> 
> > However, I've noticed that similar mappings for Turkish ı (dotless i)
> > and İ (I with dot) were commented out [1]. Is it still so that such a
> > change would "make searches slow", as stated in the comment?
> 
> That kind of setting surely makes the searching of ß and ẞ
> slow because we can't use BM search when case-fold-search is
> non-nil.  BM search is possible only when all
> case-equivalent characters are represented by the same byte
> length, and differ only in the last byte.

I think we need to solve this limitation anyway, if we want a decent
support for Unicode.  There are many more pairs of characters that
should normally be considered equal in search.

Wouldn't the technique described in UTS 18
(http://www.unicode.org/reports/tr18/) help here?





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-16 19:12   ` Eli Zaretskii
@ 2009-11-17  7:43     ` martin rudalics
  2009-11-17  7:49     ` Kenichi Handa
  1 sibling, 0 replies; 66+ messages in thread
From: martin rudalics @ 2009-11-17  7:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ulm, emacs-devel, Kenichi Handa

 > I think we need to solve this limitation anyway, if we want a decent
 > support for Unicode.  There are many more pairs of characters that
 > should normally be considered equal in search.

... and sorting ...

 > Wouldn't the technique described in UTS 18
 > (http://www.unicode.org/reports/tr18/) help here?

... and that of UTS 10 (http://www.unicode.org/reports/tr10/) for
sorting.

martin




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-16 19:12   ` Eli Zaretskii
  2009-11-17  7:43     ` martin rudalics
@ 2009-11-17  7:49     ` Kenichi Handa
  2009-11-17 18:56       ` Eli Zaretskii
  1 sibling, 1 reply; 66+ messages in thread
From: Kenichi Handa @ 2009-11-17  7:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ulm, emacs-devel

In article <83lji6mgg4.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > That kind of setting surely makes the searching of ß and ẞ
> > slow because we can't use BM search when case-fold-search is
> > non-nil.  BM search is possible only when all
> > case-equivalent characters are represented by the same byte
> > length, and differ only in the last byte.

> I think we need to solve this limitation anyway, if we want a decent
> support for Unicode.  There are many more pairs of characters that
> should normally be considered equal in search.

I agree.

> Wouldn't the technique described in UTS 18
> (http://www.unicode.org/reports/tr18/) help here?

Which technique do you mean?  That report mostly describes
just rules of matching, not about fast implementation.

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-17  7:49     ` Kenichi Handa
@ 2009-11-17 18:56       ` Eli Zaretskii
  2009-11-18  1:00         ` Kenichi Handa
  0 siblings, 1 reply; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-17 18:56 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ulm, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: ulm@gentoo.org, emacs-devel@gnu.org
> Date: Tue, 17 Nov 2009 16:49:53 +0900
> 
> > Wouldn't the technique described in UTS 18
> > (http://www.unicode.org/reports/tr18/) help here?
> 
> Which technique do you mean?

The one described under "3.10 Folded Matching".




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-17 18:56       ` Eli Zaretskii
@ 2009-11-18  1:00         ` Kenichi Handa
  2009-11-18  4:09           ` Eli Zaretskii
  0 siblings, 1 reply; 66+ messages in thread
From: Kenichi Handa @ 2009-11-18  1:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ulm, emacs-devel

In article <83iqd9m14h.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > From: Kenichi Handa <handa@m17n.org>
> > Cc: ulm@gentoo.org, emacs-devel@gnu.org
> > Date: Tue, 17 Nov 2009 16:49:53 +0900
> > 
> > > Wouldn't the technique described in UTS 18
> > > (http://www.unicode.org/reports/tr18/) help here?
> > 
> > Which technique do you mean?

> The one described under "3.10 Folded Matching".

But, that section just tells the way of providing pre-folded
text in matching, which, I think, is not an acceptable way
for searching a big buffer.

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18  1:00         ` Kenichi Handa
@ 2009-11-18  4:09           ` Eli Zaretskii
  2009-11-18  5:33             ` Stephen J. Turnbull
  2009-11-18  6:26             ` Kenichi Handa
  0 siblings, 2 replies; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-18  4:09 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ulm, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: ulm@gentoo.org, emacs-devel@gnu.org
> Date: Wed, 18 Nov 2009 10:00:57 +0900
> 
> In article <83iqd9m14h.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > > From: Kenichi Handa <handa@m17n.org>
> > > Cc: ulm@gentoo.org, emacs-devel@gnu.org
> > > Date: Tue, 17 Nov 2009 16:49:53 +0900
> > > 
> > > > Wouldn't the technique described in UTS 18
> > > > (http://www.unicode.org/reports/tr18/) help here?
> > > 
> > > Which technique do you mean?
> 
> > The one described under "3.10 Folded Matching".
> 
> But, that section just tells the way of providing pre-folded
> text in matching, which, I think, is not an acceptable way
> for searching a big buffer.

My idea was to use the same technique to fold the text as you search
through it.  You could, for example, fold one character at a time, so
that instead of comparing against X you compare against fold(X).

I admit that I didn't think about this more than a few seconds,
though, so perhaps this idea doesn't "hold water".




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18  4:09           ` Eli Zaretskii
@ 2009-11-18  5:33             ` Stephen J. Turnbull
  2009-11-18  6:26             ` Kenichi Handa
  1 sibling, 0 replies; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-18  5:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ulm, emacs-devel, Kenichi Handa

Eli Zaretskii writes:

 > My idea was to use the same technique to fold the text as you search
 > through it.  You could, for example, fold one character at a time, so
 > that instead of comparing against X you compare against fold(X).

That's precisely how case-insensitive Boyer-Moore is done.  But it's
not that easy for Unicode, because you will need to check
*characters*, not octets, while Boyer-Moore is very much oriented
toward fixed-width charsets.

I think the problem here is that because of the nature of Unicode,
this folding often crosses character blocks (here meaning "sets of
characters with the same bits except for the lowest 6").  That means
that you will need to keep track not only of the folding for the
current octet, but you will also need to be able to backtrack to check
the higher bits.  By contrast, with the Mule internal encoding the
case pairs are always in the same block of 96 characters.

I don't think that it's *impossible* to implement Boyer-Moore with
translation for UTF-8, but it's nowhere near as easy as for the
unibyte charset case, which Mule internal can in practice be reduced
to.  It's possible that all this fiddling could result in a
performance hit of the same order of magnitude as the performance hit
of brute-forcing the comparison.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18  4:09           ` Eli Zaretskii
  2009-11-18  5:33             ` Stephen J. Turnbull
@ 2009-11-18  6:26             ` Kenichi Handa
  2009-11-18 14:44               ` Stefan Monnier
  2009-11-18 17:58               ` Eli Zaretskii
  1 sibling, 2 replies; 66+ messages in thread
From: Kenichi Handa @ 2009-11-18  6:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ulm, emacs-devel

In article <83bpj0mq2v.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> My idea was to use the same technique to fold the text as you search
> through it.  You could, for example, fold one character at a time, so
> that instead of comparing against X you compare against fold(X).

BM-search already does it.  The problem is that the current
`fold' function can be applicable only to the last byte of a
character.

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18  6:26             ` Kenichi Handa
@ 2009-11-18 14:44               ` Stefan Monnier
  2009-11-18 19:05                 ` Ulrich Mueller
  2009-11-18 17:58               ` Eli Zaretskii
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Monnier @ 2009-11-18 14:44 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Eli Zaretskii, ulm, emacs-devel

>> My idea was to use the same technique to fold the text as you search
>> through it.  You could, for example, fold one character at a time, so
>> that instead of comparing against X you compare against fold(X).

> BM-search already does it.  The problem is that the current
> `fold' function can be applicable only to the last byte of a
> character.

We'd need to use a variant of BM that can search a set of strings rather
than a single string.  Given the fact that all the strings in the set
are very similar, it should be possible to come up with an efficient
algorithm for it.  The basic idea I'd try is to build the BM table for
each of the alternative strings and then merge them into a single table
that takes for each entry the smallest shift distance of each table.
There's a lot more to it, I'm sure, but it should be workable.
Would make a nice project for a student.

        Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18 14:44               ` Stefan Monnier
@ 2009-11-18 19:05                 ` Ulrich Mueller
  2009-11-19  1:16                   ` Stefan Monnier
  0 siblings, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-18 19:05 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, emacs-devel, Kenichi Handa

>>>>> On Wed, 18 Nov 2009, Stefan Monnier wrote:

> Given the fact that all the strings in the set are very similar,
> it should be possible to come up with an efficient algorithm for it.
> The basic idea I'd try is to build the BM table for each of the
> alternative strings and then merge them into a single table that
> takes for each entry the smallest shift distance of each table.

So if the string has n characters and each of them has two case
variants, then 2^n keys would be needed for the search? (Or even more,
if the extra slots of the case table are used.)

Ulrich




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18 19:05                 ` Ulrich Mueller
@ 2009-11-19  1:16                   ` Stefan Monnier
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Monnier @ 2009-11-19  1:16 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: Eli Zaretskii, Kenichi Handa, emacs-devel

>> Given the fact that all the strings in the set are very similar,
>> it should be possible to come up with an efficient algorithm for it.
>> The basic idea I'd try is to build the BM table for each of the
>> alternative strings and then merge them into a single table that
>> takes for each entry the smallest shift distance of each table.
> So if the string has n characters and each of them has two case
> variants, then 2^n keys would be needed for the search? (Or even more,
> if the extra slots of the case table are used.)

Except that you only need to do that for those special chars where the
current approach doesn't work.  And except that you don't need to
generate all the combinations: you can generate the table directly from
the single input string (you'll probably need to keep something like
a "spread" count that says how many bytes there are between the shortest
and longest representation of the part of the string you've already
processed).
I don't claim it's trivial, but I'm pretty sure it's workable.

        Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18  6:26             ` Kenichi Handa
  2009-11-18 14:44               ` Stefan Monnier
@ 2009-11-18 17:58               ` Eli Zaretskii
  2009-11-19  1:57                 ` Stephen J. Turnbull
  1 sibling, 1 reply; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-18 17:58 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ulm, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: ulm@gentoo.org, emacs-devel@gnu.org
> Date: Wed, 18 Nov 2009 15:26:32 +0900
> 
> In article <83bpj0mq2v.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > My idea was to use the same technique to fold the text as you search
> > through it.  You could, for example, fold one character at a time, so
> > that instead of comparing against X you compare against fold(X).
> 
> BM-search already does it.  The problem is that the current
> `fold' function can be applicable only to the last byte of a
> character.

Yes, I know: you explained it earlier in this thread.  What I thought
would help is that fold("SS") and fold("ß") have the same number of
bytes, which I think was the problem that prevented the use of BM.





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-18 17:58               ` Eli Zaretskii
@ 2009-11-19  1:57                 ` Stephen J. Turnbull
  0 siblings, 0 replies; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-19  1:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ulm, emacs-devel, Kenichi Handa

Eli Zaretskii writes:

 > Yes, I know: you explained it earlier in this thread.  What I thought
 > would help is that fold("SS") and fold("ß") have the same number of
 > bytes, which I think was the problem that prevented the use of BM.

No, the problem is that BM thinks in terms of code units, and only
works "out of the box" if you are comparing a single code unit
(perhaps transformed by "folding").  In fixed width representations
you could use a 16-bit unit, but UTF-8 is not fixed width, so you
can't guarantee the appropriate alignment.  Thus the code unit is 1
octet, and the problem is that the width of fold("SS") is not 1 code
unit.

You also bloat the table from 256 bytes to 65536 bytes, which is
perhaps not large compared to modern memories, but still is a pretty
alarming factor of increase.  It's possible a sparse representation
would work OK, but that would clearly have a dramatic negative impact
on performance.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
@ 2009-11-19 19:48 grischka
  2009-11-19 21:49 ` Stefan Monnier
  0 siblings, 1 reply; 66+ messages in thread
From: grischka @ 2009-11-19 19:48 UTC (permalink / raw)
  To: handa; +Cc: emacs-devel

 > > By what factor is the non-BM search slower, as compared to the BM
 > > search?
 >
 > I don't know exactly.  It depends on the length of searching
 > string; longer the string is, the more BM search is faster
 > than simple serach.  At least, when this code was active,
 >   ;; (set-downcase-syntax  ?Ä ?i tbl)
 >  ;; (set-upcase-syntax    ?I ?Ä tbl)
 > there were complaints about the slowdown.

Actually I think there is something simply wrong with the simple
search, as it's much slower even for single chars (where bm doesn't
have any advantage) and additionally in some weird random fashion
it's again slower for backwards search, such as 14, 37, 66 ... 94
secs, where the bm takes 0.5 secs and simple forward constantly
~3.7 secs, all for isearch'ing one character in a 100Mb file.

--- grischka

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-19 19:48 grischka
@ 2009-11-19 21:49 ` Stefan Monnier
  2009-11-19 22:43   ` David Kastrup
  2009-11-19 23:25   ` grischka
  0 siblings, 2 replies; 66+ messages in thread
From: Stefan Monnier @ 2009-11-19 21:49 UTC (permalink / raw)
  To: grischka; +Cc: emacs-devel, handa

> Actually I think there is something simply wrong with the simple
> search, as it's much slower even for single chars (where bm doesn't
> have any advantage) and additionally in some weird random fashion
> it's again slower for backwards search, such as 14, 37, 66 ... 94
> secs, where the bm takes 0.5 secs and simple forward constantly
> ~3.7 secs, all for isearch'ing one character in a 100Mb file.

I can guess why it's much slower going backward: the simple search
operates on chars rather than bytes.  The internal encoding we use
(currently based on utf-8) is designed to be easy to parse going forward
but not so easy going backward (IIRC our encoding is actually even a bit
more painful in this case than pure utf-8).  BM on the other hand works
on bytes, so there's no such slowdown.

As for the general slowdown, it may also be due to having to parse the
char (encoded in utf-8) and then look it up in the corresponding char table
(a tree of arrays).

But maybe we're doing something silly somewhere.

        Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-19 21:49 ` Stefan Monnier
@ 2009-11-19 22:43   ` David Kastrup
  2009-11-20  2:08     ` Stefan Monnier
                       ` (2 more replies)
  2009-11-19 23:25   ` grischka
  1 sibling, 3 replies; 66+ messages in thread
From: David Kastrup @ 2009-11-19 22:43 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:

>> Actually I think there is something simply wrong with the simple
>> search, as it's much slower even for single chars (where bm doesn't
>> have any advantage) and additionally in some weird random fashion
>> it's again slower for backwards search, such as 14, 37, 66 ... 94
>> secs, where the bm takes 0.5 secs and simple forward constantly
>> ~3.7 secs, all for isearch'ing one character in a 100Mb file.
>
> I can guess why it's much slower going backward: the simple search
> operates on chars rather than bytes.  The internal encoding we use
> (currently based on utf-8) is designed to be easy to parse going forward
> but not so easy going backward (IIRC our encoding is actually even a bit
> more painful in this case than pure utf-8).

I don't think so.  The utf-8 _scheme_ can be used to encode 21bits in 4
characters.  We stay within that range, in the utf-8 4 character scheme,
but outside of the Unicode range 2^20+2^16.

> BM on the other hand works on bytes, so there's no such slowdown.

With utf-8, I think that apart from character ranges, search forward and
backward should work perfectly like on 8-bit characters.  Exception is
incomplete character matches, but since the utf-8 scheme can immediately
tell "is a 7-bit character" "is the first character of a multibyte
sequence of length n" "is last or intermediate character of multibyte
sequence" this is not a serious problem.

> But maybe we're doing something silly somewhere.

The Emacs 22 multibyte scheme likely had worse properties for reverse
searching.  So maybe something might be simplified nowadays.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-19 22:43   ` David Kastrup
@ 2009-11-20  2:08     ` Stefan Monnier
  2009-11-20  8:03       ` David Kastrup
  2009-11-20  3:41     ` Stephen J. Turnbull
  2009-11-20  8:10     ` Ulrich Mueller
  2 siblings, 1 reply; 66+ messages in thread
From: Stefan Monnier @ 2009-11-20  2:08 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> With utf-8, I think that apart from character ranges, search forward and
> backward should work perfectly like on 8-bit characters.  Exception is

I suspect you forgot case-folding.


        Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20  2:08     ` Stefan Monnier
@ 2009-11-20  8:03       ` David Kastrup
  2009-11-20 14:14         ` Stefan Monnier
  0 siblings, 1 reply; 66+ messages in thread
From: David Kastrup @ 2009-11-20  8:03 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> With utf-8, I think that apart from character ranges, search forward and
>> backward should work perfectly like on 8-bit characters.  Exception is
>
> I suspect you forgot case-folding.

How is that easier forwards than backwards?

-- 
David Kastrup





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20  8:03       ` David Kastrup
@ 2009-11-20 14:14         ` Stefan Monnier
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Monnier @ 2009-11-20 14:14 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

>>> With utf-8, I think that apart from character ranges, search forward and
>>> backward should work perfectly like on 8-bit characters.  Exception is
>> I suspect you forgot case-folding.
> How is that easier forwards than backwards?

It's not.  I'm only replying the quoted two lines which do not
distinguish between backward and forward.


        Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-19 22:43   ` David Kastrup
  2009-11-20  2:08     ` Stefan Monnier
@ 2009-11-20  3:41     ` Stephen J. Turnbull
  2009-11-20  4:20       ` Stefan Monnier
  2009-11-20  8:10     ` Ulrich Mueller
  2 siblings, 1 reply; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  3:41 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > > But maybe we're doing something silly somewhere.
 > 
 > The Emacs 22 multibyte scheme likely had worse properties for reverse
 > searching.  So maybe something might be simplified nowadays.

Nope.  The basic nature of the representation and even algorithms are
the same.  The main difference is that the leading-byte to character
length map in Mule coding is somewhat arbitrary, while in UTF-8
there's an algorithm for computing it.  In both cases, the sane
algorithm is to keep a 256-entry table of corresponding lengths and
use the octet as an index into that table.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20  3:41     ` Stephen J. Turnbull
@ 2009-11-20  4:20       ` Stefan Monnier
  2009-11-20  7:13         ` Stephen J. Turnbull
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Monnier @ 2009-11-20  4:20 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: David Kastrup, emacs-devel

>> > But maybe we're doing something silly somewhere.
>> The Emacs 22 multibyte scheme likely had worse properties for reverse
>> searching.  So maybe something might be simplified nowadays.

> Nope.  The basic nature of the representation and even algorithms are
> the same.  The main difference is that the leading-byte to character
> length map in Mule coding is somewhat arbitrary, while in UTF-8
> there's an algorithm for computing it.  In both cases, the sane
> algorithm is to keep a 256-entry table of corresponding lengths and
> use the octet as an index into that table.

Actually, there is a significant difference when going backward, because
of the special encoding used for eight-bit-graphics in Emacs-22 and the
more regular and less space-efficient encoding used for eight-bit chars
in Emacs-23.


        Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20  4:20       ` Stefan Monnier
@ 2009-11-20  7:13         ` Stephen J. Turnbull
  2009-11-21  0:02           ` Richard Stallman
  0 siblings, 1 reply; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  7:13 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: David Kastrup, emacs-devel

Stefan Monnier writes:

 > Actually, there is a significant difference when going backward, because
 > of the special encoding used for eight-bit-graphics in Emacs-22 and the
 > more regular and less space-efficient encoding used for eight-bit chars
 > in Emacs-23.

Premature optimization is the root of all error.  Fortunately, it took
you only 3 releases to figure that out. ;-)




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20  7:13         ` Stephen J. Turnbull
@ 2009-11-21  0:02           ` Richard Stallman
  2009-11-21 12:39             ` David Kastrup
  0 siblings, 1 reply; 66+ messages in thread
From: Richard Stallman @ 2009-11-21  0:02 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, monnier, emacs-devel

I don't think the design of MULE was an error in the 1990s.




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21  0:02           ` Richard Stallman
@ 2009-11-21 12:39             ` David Kastrup
  2009-11-21 17:40               ` Stephen J. Turnbull
  2009-11-21 22:52               ` Richard Stallman
  0 siblings, 2 replies; 66+ messages in thread
From: David Kastrup @ 2009-11-21 12:39 UTC (permalink / raw)
  To: rms; +Cc: Stephen J. Turnbull, monnier, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> I don't think the design of MULE was an error in the 1990s.

I think that the design of utf-8 that makes character starts immediately
recognizable without the need for rescanning or synchronization has been
an excellent idea.  MULE coding lacks this feature.  "less than optimal"
is not the same as "an error".

The coding scheme of utf-8 has enough ingenuity and non-obviousness in
it to make it eligible for a software patent.  Good thing we don't have
to worry about this one.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 12:39             ` David Kastrup
@ 2009-11-21 17:40               ` Stephen J. Turnbull
  2009-11-21 19:15                 ` Eli Zaretskii
  2009-11-23  1:30                 ` Kenichi Handa
  2009-11-21 22:52               ` Richard Stallman
  1 sibling, 2 replies; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21 17:40 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel, rms, monnier

David Kastrup writes:
 > Richard Stallman <rms@gnu.org> writes:
 > 
 > > I don't think the design of MULE was an error in the 1990s.

Of course it was, at least as applied to the ISO 8859 family of
scripts.  In fact the ISO 8859 standard makes plain that characters
with the same name are identical across the ISO 8859 family.
Distinguishing (make-char 'latin-iso8859-1 32) from (make-char
'latin-iso8859-15 32) was a mistake, and it caused a lot of pain for
users and developers.

I agree that in Japan the design was plausible in the early 90s.  In
hindsight, I think it was an unfortunate choice, though.  It would
have been better for the Mule Lab (which has a fair amount of prestige
in this country) to lead the way toward open, universal standards by
working out the difficulties of dealing with multilingual text written
in a Unihan script (ie, Unicode).  In the end internationalized
encodings based on ISO 2022 extension techniques (such as TRON code
and Mule code) are all dead (except for ISO-2022-JP, still commonly
used in email), but Shift JIS remains in wide use, with only Unicode
gaining share.

 > I think that the design of utf-8 that makes character starts
 > immediately recognizable without the need for rescanning or
 > synchronization has been an excellent idea.  MULE coding lacks this
 > feature.

It does not lack that feature: C0 and GL codes are ASCII (one byte
characters), C1 codes are leading bytes, and GR codes are trailing
bytes.  Ie, all bytes less than 160 are character starters.  AFAIK,
Mule code developed this feature at about the same time that FSS-UTF
was invented (Mule development started in mid-1991, and the earliest
reference I can find to FSS-UTF is Ken Thompson's fss-utf.c dated
1992).  You'd have to ask Ken'ichi Handa for the exact date and
whether he was aware of FSS-UTF and such techniques when the Mule
encoding was designed.

UTF-8 doesn't really have any algorithmic string-processing advantages
over Mule code.  Even the fact that you can compute the length of a
character algorithmically from a UTF-8 leading byte is unimportant,
since it's much more efficient to use a table lookup for that.  The
big advantage of UTF-8 is that it's based on Unicode, so characters
that never should have been distinguished in the first place don't
have to be reidentified in Lisp.  Not to mention all of the useful
character data and the bidi algorithm, etc.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 17:40               ` Stephen J. Turnbull
@ 2009-11-21 19:15                 ` Eli Zaretskii
  2009-11-22  2:58                   ` Stephen J. Turnbull
  2009-11-23  1:30                 ` Kenichi Handa
  1 sibling, 1 reply; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-21 19:15 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, rms, monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Sun, 22 Nov 2009 02:40:09 +0900
> Cc: emacs-devel@gnu.org, rms@gnu.org, monnier@iro.umontreal.ca
> 
> UTF-8 doesn't really have any algorithmic string-processing advantages
> over Mule code.  Even the fact that you can compute the length of a
> character algorithmically from a UTF-8 leading byte is unimportant,
> since it's much more efficient to use a table lookup for that.  The
> big advantage of UTF-8 is that it's based on Unicode, so characters
> that never should have been distinguished in the first place don't
> have to be reidentified in Lisp.  Not to mention all of the useful
> character data and the bidi algorithm, etc.

Not that it's important, but since we are talking principles here: the
bidi algorithm has nothing to do with Unicode, let alone UTF-8.  It is
not even the best algorithm to handle bidirectional text (rumor has it
that Microsoft lobbied the consortium into the algorithm they
developed for their word processors).  Better algorithms with saner
results were available years before UAX#9 was published, and they
worked with codepoints from DOS the codepage 862 just dandy.  You
(Stephen) may recall that I presented one of them in Tsukuba 9 years
ago.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 19:15                 ` Eli Zaretskii
@ 2009-11-22  2:58                   ` Stephen J. Turnbull
  2009-11-22  4:28                     ` Eli Zaretskii
  0 siblings, 1 reply; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-22  2:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel, rms, monnier

Eli Zaretskii writes:

 > Not that it's important, but since we are talking principles here: the
 > bidi algorithm has nothing to do with Unicode, let alone UTF-8.  It is
 > not even the best algorithm to handle bidirectional text (rumor has it
 > that Microsoft lobbied the consortium into the algorithm they
 > developed for their word processors).

That would not be surprising, and it might even be the best result.
The "POSIX_ME_HARDER" argument notwithstanding, a mediocre standard is
often better than a proliferation of excellent nonstandards.

 > Better algorithms with saner results were available years before
 > UAX#9 was published, and they worked with codepoints from DOS the
 > codepage 862 just dandy.  You (Stephen) may recall that I presented
 > one of them in Tsukuba 9 years ago.

Sure, and it's still not usable in any popular editor AFAIK, while
millions of people are using the Unicode standard algorithm if it is
in fact the one the Microsoft uses in its wordprocessors.  Bash and
other nonconforming derivatives of the POSIX sh, on the other hand,
are in universal use.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-22  2:58                   ` Stephen J. Turnbull
@ 2009-11-22  4:28                     ` Eli Zaretskii
  2009-11-22  8:27                       ` Stephen J. Turnbull
  0 siblings, 1 reply; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-22  4:28 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel, rms, monnier

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: dak@gnu.org,
>     rms@gnu.org,
>     monnier@iro.umontreal.ca,
>     emacs-devel@gnu.org
> Date: Sun, 22 Nov 2009 11:58:53 +0900
> 
>  > Better algorithms with saner results were available years before
>  > UAX#9 was published, and they worked with codepoints from DOS the
>  > codepage 862 just dandy.  You (Stephen) may recall that I presented
>  > one of them in Tsukuba 9 years ago.
> 
> Sure, and it's still not usable in any popular editor AFAIK, while
> millions of people are using the Unicode standard algorithm if it is
> in fact the one the Microsoft uses in its wordprocessors.

Just because no one in the Emacs development was interested enough.
Welcome to Free Software!




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-22  4:28                     ` Eli Zaretskii
@ 2009-11-22  8:27                       ` Stephen J. Turnbull
  0 siblings, 0 replies; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-22  8:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, rms, monnier, emacs-devel

Eli Zaretskii writes:

 > > Sure, and it's still not usable in any popular editor AFAIK, while
 > > millions of people are using the Unicode standard algorithm if it is
 > > in fact the one the Microsoft uses in its wordprocessors.
 > 
 > Just because no one in the Emacs development was interested enough.
 > Welcome to Free Software!

Agreed.  I know that doesn't make the outcome any happier.





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 17:40               ` Stephen J. Turnbull
  2009-11-21 19:15                 ` Eli Zaretskii
@ 2009-11-23  1:30                 ` Kenichi Handa
  1 sibling, 0 replies; 66+ messages in thread
From: Kenichi Handa @ 2009-11-23  1:30 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, rms, monnier, emacs-devel

In article <87zl6fdbeu.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> You'd have to ask Ken'ichi Handa for the exact date and
> whether he was aware of FSS-UTF and such techniques when the Mule
> encoding was designed.

Actually, the original design of Mule encoding was by RMS.
When I visited MIT to discuss how to integrate the facility
of Nemacs (Japanized version of Emacs) into Emacs, he
explained the idea of using 0x80..0x9F for leading bytes.
Based on it, I implented Mule, and then it is integrated
into Emacs.  I don't remember exact date of that visit;
perhaps 1990 or 1991, but I'm sure I've never heard about
FSS-UTF at that time.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 12:39             ` David Kastrup
  2009-11-21 17:40               ` Stephen J. Turnbull
@ 2009-11-21 22:52               ` Richard Stallman
  1 sibling, 0 replies; 66+ messages in thread
From: Richard Stallman @ 2009-11-21 22:52 UTC (permalink / raw)
  To: David Kastrup; +Cc: stephen, monnier, emacs-devel

The MULE encoding saves space compared with UTF-8.  We designed it for
compactness of representation.  That was a big concern, 15 years ago.
Nowadays it isn't so important.




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-19 22:43   ` David Kastrup
  2009-11-20  2:08     ` Stefan Monnier
  2009-11-20  3:41     ` Stephen J. Turnbull
@ 2009-11-20  8:10     ` Ulrich Mueller
  2009-11-20 11:46       ` Stephen J. Turnbull
  2 siblings, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-20  8:10 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

>>>>> On Thu, 19 Nov 2009, David Kastrup wrote:

>> I can guess why it's much slower going backward: the simple search
>> operates on chars rather than bytes. The internal encoding we use
>> (currently based on utf-8) is designed to be easy to parse going
>> forward but not so easy going backward (IIRC our encoding is
>> actually even a bit more painful in this case than pure utf-8).

> I don't think so. The utf-8 _scheme_ can be used to encode 21bits in
> 4 characters.

The original UTF-8 (specified in RFC 2279) was good for encoding of
the full range of 2^31 characters in up to 6 bytes. The limitation to
2^20.1 came later and is artificial.

> We stay within that range, in the utf-8 4 character scheme, but
> outside of the Unicode range 2^20+2^16.

character.h says it's up to 22 bits encoded in up to 5 bytes:

,----
|    character code	1st byte   byte sequence
|    --------------	--------   -------------
|         0-7F		00..7F	   0xxxxxxx
|        80-7FF		C2..DF	   110xxxxx 10xxxxxx
|       800-FFFF		E0..EF	   1110xxxx 10xxxxxx 10xxxxxx
|     10000-1FFFFF	F0..F7	   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
|    200000-3FFF7F	F8	   11111000 1000xxxx 10xxxxxx 10xxxxxx 10xxxxxx
|    3FFF80-3FFFFF	C0..C1	   1100000x 10xxxxxx (for eight-bit-char)
|    400000-...		invalid
`----

>> BM on the other hand works on bytes, so there's no such slowdown.

> With utf-8, I think that apart from character ranges, search forward and
> backward should work perfectly like on 8-bit characters.  Exception is
> incomplete character matches, but since the utf-8 scheme can immediately
> tell "is a 7-bit character" "is the first character of a multibyte
> sequence of length n" "is last or intermediate character of multibyte
> sequence" this is not a serious problem.

When the search is for equivalence classes of characters (e.g. case
folding), then I think it must operate on whole characters and
therefore has to find the start of each multibyte sequence.

Ulrich




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20  8:10     ` Ulrich Mueller
@ 2009-11-20 11:46       ` Stephen J. Turnbull
  2009-11-20 14:43         ` Ulrich Mueller
  0 siblings, 1 reply; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20 11:46 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: David Kastrup, emacs-devel

Ulrich Mueller writes:

 > When the search is for equivalence classes of characters (e.g. case
 > folding), then I think it must operate on whole characters and
 > therefore has to find the start of each multibyte sequence.

This is false for certain equivalence classes, namely those that cause
only one octet of the multibyte representation to change.  For Mule
encoding, this works for ranges of 96 characters, such as all the
unibyte charsets.  For UTF-8, it works for ASCII, and IIRC for letters
in the Latin-1 set, and maybe many other Latin letters.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20 11:46       ` Stephen J. Turnbull
@ 2009-11-20 14:43         ` Ulrich Mueller
  2009-11-21  4:33           ` Stephen J. Turnbull
  0 siblings, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-20 14:43 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: David Kastrup, emacs-devel

>>>>> On Fri, 20 Nov 2009, Stephen J Turnbull wrote:

>> When the search is for equivalence classes of characters (e.g. case
>> folding), then I think it must operate on whole characters and
>> therefore has to find the start of each multibyte sequence.

> This is false for certain equivalence classes, namely those that
> cause only one octet of the multibyte representation to change. For
> Mule encoding, this works for ranges of 96 characters, such as all
> the unibyte charsets. For UTF-8, it works for ASCII, and IIRC for
> letters in the Latin-1 set, and maybe many other Latin letters.

And how do you know that a UTF-8 encoded character is e.g. in the
Latin-1 set, without testing the first byte(s) of the multibyte
sequence?

Ulrich




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20 14:43         ` Ulrich Mueller
@ 2009-11-21  4:33           ` Stephen J. Turnbull
  0 siblings, 0 replies; 66+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21  4:33 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: David Kastrup, emacs-devel

Ulrich Mueller writes:

 > And how do you know that a UTF-8 encoded character is e.g. in the
 > Latin-1 set, without testing the first byte(s) of the multibyte
 > sequence?

We have tested the first bytes in the *pattern*, and that's enough
because we're doing Boyer-Moore.





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-19 21:49 ` Stefan Monnier
  2009-11-19 22:43   ` David Kastrup
@ 2009-11-19 23:25   ` grischka
  2009-11-20  2:11     ` Stefan Monnier
  1 sibling, 1 reply; 66+ messages in thread
From: grischka @ 2009-11-19 23:25 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, handa

Stefan Monnier wrote:
>> Actually I think there is something simply wrong with the simple
>> search, as it's much slower even for single chars (where bm doesn't
>> have any advantage) and additionally in some weird random fashion
>> it's again slower for backwards search, such as 14, 37, 66 ... 94
>> secs, where the bm takes 0.5 secs and simple forward constantly
>> ~3.7 secs, all for isearch'ing one character in a 100Mb file.
> 
> I can guess why it's much slower going backward: the simple search
> operates on chars rather than bytes.  The internal encoding we use
> (currently based on utf-8) is designed to be easy to parse going forward
> but not so easy going backward (IIRC our encoding is actually even a bit
> more painful in this case than pure utf-8).  BM on the other hand works
> on bytes, so there's no such slowdown.

Okay, it's utf-8 (aka multibyte I suppose) however I was using
the same buffer in both cases (just a different search string).
So whatever makes it more difficult to scan backwards the same
situation exists for bm too.  Also how can it happen that a C
function varies between 4 and 90 seconds for the same action.

> As for the general slowdown, it may also be due to having to parse the
> char (encoded in utf-8) and then look it up in the corresponding char table
> (a tree of arrays).

 From what I saw the simple_search() calls out to lisp for every
single buffer-byte whereas boyer_moore() just translates it
with a prefabricated C table.

However if simple_search just would prepare the first char of
the pattern in both lower and uppercase version readily for
comparison it could be as fast as bm I guess (for a pattern
length of less than 4 that is)

--- grischka

> 
> But maybe we're doing something silly somewhere.
> 
> 
>         Stefan
> 





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-19 23:25   ` grischka
@ 2009-11-20  2:11     ` Stefan Monnier
  2009-11-21  3:08       ` grischka
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Monnier @ 2009-11-20  2:11 UTC (permalink / raw)
  To: grischka; +Cc: emacs-devel, handa

> Also how can it happen that a C function varies between 4 and 90
> seconds for the same action.

I have no explanation for that.

>> As for the general slowdown, it may also be due to having to parse the
>> char (encoded in utf-8) and then look it up in the corresponding char table
>> (a tree of arrays).
> From what I saw the simple_search() calls out to lisp for every
> single buffer-byte

I really don't think that's the case.

> whereas boyer_moore() just translates it with a prefabricated C table.

> However if simple_search just would prepare the first char of
> the pattern in both lower and uppercase version readily for
> comparison it could be as fast as bm I guess (for a pattern
> length of less than 4 that is)

There may be more than 2 chars in the same "case-fold"
equivalence class, so it may be a bit more difficult in general, but
you're probably right that it can be made faster.


        Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-20  2:11     ` Stefan Monnier
@ 2009-11-21  3:08       ` grischka
  2009-11-21  8:58         ` Eli Zaretskii
  2009-11-21 10:41         ` Ulrich Mueller
  0 siblings, 2 replies; 66+ messages in thread
From: grischka @ 2009-11-21  3:08 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, handa

Stefan Monnier wrote:
>> Also how can it happen that a C function varies between 4 and 90
>> seconds for the same action.
> 
> I have no explanation for that.

Turned out that the time of backwards simple_search depends mostly
on the number of buffer markers in the buffer.

That's because of CHAR_TO_BYTE in the inner loop and then because
that one doesn't mind checking hundreds of markers for each single
char in the file.

--- grischka

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21  3:08       ` grischka
@ 2009-11-21  8:58         ` Eli Zaretskii
  2009-11-21  9:33           ` Andreas Schwab
  2009-11-21 15:33           ` grischka
  2009-11-21 10:41         ` Ulrich Mueller
  1 sibling, 2 replies; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-21  8:58 UTC (permalink / raw)
  To: grischka; +Cc: handa, monnier, emacs-devel

> Date: Sat, 21 Nov 2009 04:08:42 +0100
> From: grischka <grishka@gmx.de>
> Cc: emacs-devel@gnu.org, handa@m17n.org
> 
> Stefan Monnier wrote:
> >> Also how can it happen that a C function varies between 4 and 90
> >> seconds for the same action.
> > 
> > I have no explanation for that.
> 
> Turned out that the time of backwards simple_search depends mostly
> on the number of buffer markers in the buffer.
> 
> That's because of CHAR_TO_BYTE in the inner loop and then because
> that one doesn't mind checking hundreds of markers for each single
> char in the file.

CHAR_TO_BYTE could be expensive, yes.  But how else can you convert an
arbitrary character position to the corresponding byte position?  When
you scan forward, you know the byte length of a multi-byte UTF-8
sequence by the first byte, but what do you do when you scan backwards?

The markers CHAR_TO_BYTE considers are a kind of cache, and are
supposed to speed things up.  I don't know what measurements were done
at the time this caching was introduced, nor whether those
measurements were repeated when Emacs switched from Mule encoding to
the current extended UTF-8 encoding of characters.  Maybe nowadays
this caching no longer helps.  Maybe UTF-8 allows a simpler conversion
than just counting bytes since the beginning of the buffer.  Or maybe
this particular use-case does not benefit from the cache, and we
should have a no-cache method for doing the same.

IOW, more research is needed.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21  8:58         ` Eli Zaretskii
@ 2009-11-21  9:33           ` Andreas Schwab
  2009-11-21 11:45             ` Eli Zaretskii
  2009-11-21 15:33           ` grischka
  1 sibling, 1 reply; 66+ messages in thread
From: Andreas Schwab @ 2009-11-21  9:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: grischka, emacs-devel, monnier, handa

Eli Zaretskii <eliz@gnu.org> writes:

> CHAR_TO_BYTE could be expensive, yes.  But how else can you convert an
> arbitrary character position to the corresponding byte position?  When
> you scan forward, you know the byte length of a multi-byte UTF-8
> sequence by the first byte, but what do you do when you scan backwards?

The first byte of a UTF-8 sequence is easily recognizable, so finding
out the length of the previous UTF-8 sequence is nearly as easy (see
DEC_POS).

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21  9:33           ` Andreas Schwab
@ 2009-11-21 11:45             ` Eli Zaretskii
  0 siblings, 0 replies; 66+ messages in thread
From: Eli Zaretskii @ 2009-11-21 11:45 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: grishka, handa, monnier, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Date: Sat, 21 Nov 2009 10:33:47 +0100
> Cc: grischka <grishka@gmx.de>, emacs-devel@gnu.org, monnier@iro.umontreal.ca,
> 	handa@m17n.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > CHAR_TO_BYTE could be expensive, yes.  But how else can you convert an
> > arbitrary character position to the corresponding byte position?  When
> > you scan forward, you know the byte length of a multi-byte UTF-8
> > sequence by the first byte, but what do you do when you scan backwards?
> 
> The first byte of a UTF-8 sequence is easily recognizable, so finding
> out the length of the previous UTF-8 sequence is nearly as easy (see
> DEC_POS).

I guess simple_search should use DEC_POS, then.




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21  8:58         ` Eli Zaretskii
  2009-11-21  9:33           ` Andreas Schwab
@ 2009-11-21 15:33           ` grischka
  1 sibling, 0 replies; 66+ messages in thread
From: grischka @ 2009-11-21 15:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, monnier, emacs-devel

Eli Zaretskii wrote:
> Or maybe
> this particular use-case does not benefit from the cache, and we
> should have a no-cache method for doing the same.

In particular this use-case not only does not benefit from that
cache but severely suffers from a bug in there, that is for calls
with adjacent descending (probably also ascending) positions in
a row.

--- grischka




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21  3:08       ` grischka
  2009-11-21  8:58         ` Eli Zaretskii
@ 2009-11-21 10:41         ` Ulrich Mueller
  2009-11-21 11:58           ` Andreas Schwab
  1 sibling, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-21 10:41 UTC (permalink / raw)
  To: grischka; +Cc: handa, Stefan Monnier, emacs-devel

>>>>> On Sat, 21 Nov 2009, grischka  wrote:

>>> Also how can it happen that a C function varies between 4 and 90
>>> seconds for the same action.
>> 
>> I have no explanation for that.

> Turned out that the time of backwards simple_search depends mostly
> on the number of buffer markers in the buffer.

> That's because of CHAR_TO_BYTE in the inner loop and then because
> that one doesn't mind checking hundreds of markers for each single
> char in the file.

Is that the reason for the backwards search in ERC buffers being
extremely slow? It may keep Emacs busy for several *minutes*. And it's
not interruptible with C-g.

Ulrich




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 10:41         ` Ulrich Mueller
@ 2009-11-21 11:58           ` Andreas Schwab
  2009-11-21 17:01             ` Ulrich Mueller
  2009-11-24 12:26             ` Kenichi Handa
  0 siblings, 2 replies; 66+ messages in thread
From: Andreas Schwab @ 2009-11-21 11:58 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: grischka, emacs-devel, Stefan Monnier, handa

Ulrich Mueller <ulm@gentoo.org> writes:

> Is that the reason for the backwards search in ERC buffers being
> extremely slow? It may keep Emacs busy for several *minutes*. And it's
> not interruptible with C-g.

Does this patch help?

Andreas.

Index: search.c
===================================================================
RCS file: /sources/emacs/emacs/src/search.c,v
retrieving revision 1.247
diff -u -a -p -r1.247 search.c
--- search.c	21 Nov 2009 11:52:29 -0000	1.247
+++ search.c	21 Nov 2009 11:53:02 -0000
@@ -1609,39 +1609,36 @@ simple_search (n, pat, len, len_byte, tr
 	while (1)
 	  {
 	    /* Try matching at position POS.  */
-	    EMACS_INT this_pos = pos - len;
-	    EMACS_INT this_pos_byte;
+	    EMACS_INT this_pos = pos;
+	    EMACS_INT this_pos_byte = pos_byte;
 	    int this_len = len;
-	    unsigned char *p = pat;
+	    unsigned char *p = pat + len_byte;
 
-	    if (this_pos < lim || (pos_byte - len_byte) < lim_byte)
+	    if (this_pos - len < lim || (pos_byte - len_byte) < lim_byte)
 	      goto stop;
-	    this_pos_byte = CHAR_TO_BYTE (this_pos);
-	    match_byte = pos_byte - this_pos_byte;
 
 	    while (this_len > 0)
 	      {
-		int charlen, buf_charlen;
+		int charlen;
 		int pat_ch, buf_ch;
 
-		pat_ch = STRING_CHAR_AND_LENGTH (p, charlen);
-		buf_ch = STRING_CHAR_AND_LENGTH (BYTE_POS_ADDR (this_pos_byte),
-						 buf_charlen);
+		DEC_BOTH (this_pos, this_pos_byte);
+		PREV_CHAR_BOUNDARY (p, pat);
+		pat_ch = STRING_CHAR (p);
+		buf_ch = STRING_CHAR (BYTE_POS_ADDR (this_pos_byte));
 		TRANSLATE (buf_ch, trt, buf_ch);
 
 		if (buf_ch != pat_ch)
 		  break;
 
 		this_len--;
-		p += charlen;
-		this_pos_byte += buf_charlen;
-		this_pos++;
 	      }
 
 	    if (this_len == 0)
 	      {
-		pos -= len;
-		pos_byte -= match_byte;
+		match_byte = pos_byte - this_pos_byte;
+		pos = this_pos;
+		pos_byte = this_pos_byte;
 		break;
 	      }
 

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 11:58           ` Andreas Schwab
@ 2009-11-21 17:01             ` Ulrich Mueller
  2009-11-22 12:11               ` Andreas Schwab
  2009-11-24 12:26             ` Kenichi Handa
  1 sibling, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-21 17:01 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: grischka, emacs-devel, Stefan Monnier, handa

>>>>> On Sat, 21 Nov 2009, Andreas Schwab wrote:

> Ulrich Mueller <ulm@gentoo.org> writes:
>> Is that the reason for the backwards search in ERC buffers being
>> extremely slow? It may keep Emacs busy for several *minutes*. And
>> it's not interruptible with C-g.

> Does this patch help?

Definitely. For a quick test that I did, backwards search was sped up
by about a factor of four.

Ulrich




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 17:01             ` Ulrich Mueller
@ 2009-11-22 12:11               ` Andreas Schwab
  2009-11-22 20:15                 ` Stefan Monnier
  0 siblings, 1 reply; 66+ messages in thread
From: Andreas Schwab @ 2009-11-22 12:11 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: grischka, emacs-devel, Stefan Monnier, handa

Ulrich Mueller <ulm@gentoo.org> writes:

>>>>>> On Sat, 21 Nov 2009, Andreas Schwab wrote:
>
>> Ulrich Mueller <ulm@gentoo.org> writes:
>>> Is that the reason for the backwards search in ERC buffers being
>>> extremely slow? It may keep Emacs busy for several *minutes*. And
>>> it's not interruptible with C-g.
>
>> Does this patch help?
>
> Definitely. For a quick test that I did, backwards search was sped up
> by about a factor of four.

Thanks for testing, I've installed it now.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-22 12:11               ` Andreas Schwab
@ 2009-11-22 20:15                 ` Stefan Monnier
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Monnier @ 2009-11-22 20:15 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Ulrich Mueller, emacs-devel, grischka, handa

>> Definitely. For a quick test that I did, backwards search was sped up
>> by about a factor of four.

> Thanks for testing, I've installed it now.

Thanks Andreas,


        Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-21 11:58           ` Andreas Schwab
  2009-11-21 17:01             ` Ulrich Mueller
@ 2009-11-24 12:26             ` Kenichi Handa
  2009-11-24 19:23               ` grischka
  1 sibling, 1 reply; 66+ messages in thread
From: Kenichi Handa @ 2009-11-24 12:26 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: ulm, emacs-devel, grishka, monnier

In article <m21vjst7ha.fsf@igel.home>, Andreas Schwab <schwab@linux-m68k.org> writes:

> Ulrich Mueller <ulm@gentoo.org> writes:
> > Is that the reason for the backwards search in ERC buffers being
> > extremely slow? It may keep Emacs busy for several *minutes*. And it's
> > not interruptible with C-g.

> Does this patch help?

Here are some ideas to improve it.

(1) Do forward matching in backward search.

The original code roughly does this to search "012abc"
for "012" from the tail.
   check if "012" matches with "abc"
   check if "012" matches with "2ab"
   ...

But the new code does this:
   check if "210" matches with "cba"
   check if "210" matches with "ba2"
   ...

As INC_BOTH is faster than DEC_BOTH, the original way of
check matching is faster.  The slowness of the orignal code
was caused by using CHAR_TO_BYTE to find the place of "2"
when you know the place of "a".  Use DEC_BOTH here only.

(2) Pre-compute the character codes in PAT in an integer
    array if LEN is not that long (perhaps LEN < 256, or
    at most, sizeof (int) * LEN < MAX_ALLOCA).

Then, you don't need the repeated STRING_CHAR on PAT.  This
can be applicable to forward search too.

(3) In addition to (2), pre-compute the character codes in
    BUF too in an array of the same length as (2).

Then you can avoid using STRING_CHAR and TRANSLATE
repeatedly on the same place of BUF.  This requires modulo
calculation to get an index of the array, but I think it's
faster than the combination of STRING_CHAR and TRANSLATE.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-24 12:26             ` Kenichi Handa
@ 2009-11-24 19:23               ` grischka
  2009-11-25  2:13                 ` Kenichi Handa
  0 siblings, 1 reply; 66+ messages in thread
From: grischka @ 2009-11-24 19:23 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ulm, Andreas Schwab, monnier, emacs-devel

Kenichi Handa wrote:
> In article <m21vjst7ha.fsf@igel.home>, Andreas Schwab <schwab@linux-m68k.org> writes:
> 
>> Ulrich Mueller <ulm@gentoo.org> writes:
>>> Is that the reason for the backwards search in ERC buffers being
>>> extremely slow? It may keep Emacs busy for several *minutes*. And it's
>>> not interruptible with C-g.
> 
>> Does this patch help?
> 
> Here are some ideas to improve it.
> 
> (1) Do forward matching in backward search.
> 
> The original code roughly does this to search "012abc"
> for "012" from the tail.
>    check if "012" matches with "abc"
>    check if "012" matches with "2ab"
>    ...
> 
> But the new code does this:
>    check if "210" matches with "cba"
>    check if "210" matches with "ba2"
>    ...
> 
> As INC_BOTH is faster than DEC_BOTH, the original way of
> check matching is faster.  

DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
are (as with Andy's patch).  Moderately slower, still ;)

> The slowness of the orignal code
> was caused by using CHAR_TO_BYTE to find the place of "2"
> when you know the place of "a".  Use DEC_BOTH here only.

The originally observed slowness was not because of the usage of
CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
using unrelated "best_below" and "best_above" in the same expression.

For the numbers, with my 100MB file test case:

backward search previously:
	14 .. 90 s (random)
backward search with fixed CHAR_TO_BYTE:
	5.6 s
backward search without CHAR_TO_BYTE (Andy's patch):
	4.1 s
forward search:
	3.6 s

> (2) Pre-compute the character codes in PAT in an integer
>     array if LEN is not that long (perhaps LEN < 256, or
>     at most, sizeof (int) * LEN < MAX_ALLOCA).
> 
> Then, you don't need the repeated STRING_CHAR on PAT.  This
> can be applicable to forward search too.
>

In practice searching a string is mostly about searching the first
char in the string, basically like strchr(buf, pat[0]).  (That is
unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is
not a practical example)

So as to pre-computing the pattern, you'd get the most improvement
already from just pre-computing "pat[0]" or "pat[len-1]" if you
want to.

> (3) In addition to (2), pre-compute the character codes in
>     BUF too in an array of the same length as (2).
> 
> Then you can avoid using STRING_CHAR and TRANSLATE
> repeatedly on the same place of BUF.  This requires modulo
> calculation to get an index of the array, but I think it's
> faster than the combination of STRING_CHAR and TRANSLATE.

Because the first char matches rarely (on average), a repeated
translation of the same place happens rarely too.

Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is
slow,  so a translation table as C array for say the 0..127
range, would help indeed.

In any case, with some tweaking it is possible to improve both
directions by ~70% (that is down to about 1 sec for the test
case).  I still don't know why boyer_moore with a one-char
pattern takes only 0.5 seconds though.  It's amazingly fast.

Btw it seems that long loading time for the big file has much to
do with inefficient counting of newlines.  Appearently it takes
~2 sec to load the file and then another ~6 sec to scan newlines.
It should be (far) under 0.5 sec.

--- grischka

> ---
> Kenichi Handa
> handa@m17n.org
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-24 19:23               ` grischka
@ 2009-11-25  2:13                 ` Kenichi Handa
  2009-11-26 13:07                   ` grischka
  2009-11-29 22:03                   ` Juri Linkov
  0 siblings, 2 replies; 66+ messages in thread
From: Kenichi Handa @ 2009-11-25  2:13 UTC (permalink / raw)
  To: grischka; +Cc: ulm, schwab, monnier, emacs-devel

In article <4B0C32BF.2020708@gmx.de>, grischka <grishka@gmx.de> writes:

> DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
> are (as with Andy's patch).  Moderately slower, still ;)

So, changing the current backward matching to forward
matching should is effective.

> The originally observed slowness was not because of the usage of
> CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
> using unrelated "best_below" and "best_above" in the same expression.

> For the numbers, with my 100MB file test case:

> backward search previously:
> 	14 .. 90 s (random)
> backward search with fixed CHAR_TO_BYTE:
> 	5.6 s

I don't see any fix of CHAR_TO_BYTE in the current CVS
code.  Where is it?

> > (2) Pre-compute the character codes in PAT in an integer
> >     array if LEN is not that long (perhaps LEN < 256, or
> >     at most, sizeof (int) * LEN < MAX_ALLOCA).
> > 
> > Then, you don't need the repeated STRING_CHAR on PAT.  This
> > can be applicable to forward search too.
> >

> In practice searching a string is mostly about searching the first
> char in the string, basically like strchr(buf, pat[0]).  (That is
> unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is
> not a practical example)

> So as to pre-computing the pattern, you'd get the most improvement
> already from just pre-computing "pat[0]" or "pat[len-1]" if you
> want to.

When there's no match, it's true that pre-computing of the
whole pattern is a waste.  But, when there's a match, we
anyway compute character codes of the whole pattern.  So
perhaps the good strategy will be to record the computed
character codes gradually when we need it.

> > (3) In addition to (2), pre-compute the character codes in
> >     BUF too in an array of the same length as (2).
> > 
> > Then you can avoid using STRING_CHAR and TRANSLATE
> > repeatedly on the same place of BUF.  This requires modulo
> > calculation to get an index of the array, but I think it's
> > faster than the combination of STRING_CHAR and TRANSLATE.

> Because the first char matches rarely (on average), a repeated
> translation of the same place happens rarely too.

> Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is
> slow,  so a translation table as C array for say the 0..127
> range, would help indeed.

> In any case, with some tweaking it is possible to improve both
> directions by ~70% (that is down to about 1 sec for the test
> case).  I still don't know why boyer_moore with a one-char
> pattern takes only 0.5 seconds though.  It's amazingly fast.

Are you comparing both methods with the same value of
case-fold-search?

> Btw it seems that long loading time for the big file has much to
> do with inefficient counting of newlines.  Appearently it takes
> ~2 sec to load the file and then another ~6 sec to scan newlines.
> It should be (far) under 0.5 sec.

Why is the code of counting newlines called when we just
visit a file?

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-25  2:13                 ` Kenichi Handa
@ 2009-11-26 13:07                   ` grischka
  2009-11-29 22:03                   ` Juri Linkov
  1 sibling, 0 replies; 66+ messages in thread
From: grischka @ 2009-11-26 13:07 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ulm, schwab, monnier, emacs-devel

Kenichi Handa wrote:
> In article <4B0C32BF.2020708@gmx.de>, grischka <grishka@gmx.de> writes:
> 
>> DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
>> are (as with Andy's patch).  Moderately slower, still ;)
> 
> So, changing the current backward matching to forward
> matching should is effective.
> 

No, there is no such condition.  There are several ways to avoid
the duplicate DEC_POS, on being to handle the "pattern_len == 0"
case right at the top of the function, for all its branches.

>> The originally observed slowness was not because of the usage of
>> CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
>> using unrelated "best_below" and "best_above" in the same expression.
> 
>> For the numbers, with my 100MB file test case:
> 
>> backward search previously:
>> 	14 .. 90 s (random)
>> backward search with fixed CHAR_TO_BYTE:
>> 	5.6 s
> 
> I don't see any fix of CHAR_TO_BYTE in the current CVS
> code.  Where is it?

Those tests were made with ad hoc modifications as needed. There
was also some code to measure the times, of course.

>> In any case, with some tweaking it is possible to improve both
>> directions by ~70% (that is down to about 1 sec for the test
>> case).  I still don't know why boyer_moore with a one-char
>> pattern takes only 0.5 seconds though.  It's amazingly fast.
> 
> Are you comparing both methods with the same value of
> case-fold-search?

Same value, but not same search patterns.  One with "sharp s",
one without.

Actually I just wanted to check the facts with the originally in
this thread proposed "sharp s" patch, because some people wrote it
would be too slow.  FWIW I don't think it would be any problem.

>> Btw it seems that long loading time for the big file has much to
>> do with inefficient counting of newlines.  Appearently it takes
>> ~2 sec to load the file and then another ~6 sec to scan newlines.
>> It should be (far) under 0.5 sec.
> 
> Why is the code of counting newlines called when we just
> visit a file?

I have no idea why.  Opening the 100MB file would call scan_buffer
(for \n) 67637 times.  The file has 3142771 lines though, so I take
it back: it's probably not "counting newlines" in that sense.  Maybe
it comes from  "Loading cc-langs ..." which happens after the first
2 seconds.

--- grischka

> 
> ---
> Kenichi Handa
> handa@m17n.org
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-25  2:13                 ` Kenichi Handa
  2009-11-26 13:07                   ` grischka
@ 2009-11-29 22:03                   ` Juri Linkov
  2009-11-30  1:22                     ` Stefan Monnier
                                       ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: Juri Linkov @ 2009-11-29 22:03 UTC (permalink / raw)
  To: emacs-devel

BTW, does anyone know how to put a set of characters into one category
for the purpose of searching, without affecting case mapping?  I.e. to
not change the current case mappings, but allow a character from one
category to match all other characters from the same category.

`set-case-syntax-pair' is not suitable because it does this only for
a pair of characters and also changes case mappings.

Of course, it's possible to search with a regexp that matches
a character set in [...], or more convenient when a space in a regexp
is substituted with a set of characters from `search-spaces-regexp'.

I can't find a function like `set-case-syntax-pair' that would define
a set of equivalent characters for the search, so maybe this is not
possible.

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-29 22:03                   ` Juri Linkov
@ 2009-11-30  1:22                     ` Stefan Monnier
  2009-11-30  1:28                     ` Kenichi Handa
  2009-11-30  7:01                     ` Ulrich Mueller
  2 siblings, 0 replies; 66+ messages in thread
From: Stefan Monnier @ 2009-11-30  1:22 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

> BTW, does anyone know how to put a set of characters into one category
> for the purpose of searching, without affecting case mapping?  I.e. to
> not change the current case mappings, but allow a character from one
> category to match all other characters from the same category.

> `set-case-syntax-pair' is not suitable because it does this only for
> a pair of characters and also changes case mappings.

> Of course, it's possible to search with a regexp that matches
> a character set in [...], or more convenient when a space in a regexp
> is substituted with a set of characters from `search-spaces-regexp'.

> I can't find a function like `set-case-syntax-pair' that would define
> a set of equivalent characters for the search, so maybe this is not
> possible.

In theory, there is some amount of support for it via the slot nb 2 of
case tables:

   (defun get-eqvcase-table (case-table)
     "Return the eqvcase table of CASE-TABLE."
     (get-case-table-extra-slot case-table 2))

but AFAIK it's not used by search primitives, which instead implement
case-fold by something like (eq (lower CHAR1) (lower CHAR2)) which is
often cheaper than checking (memq CHAR1 <eqvcase>).

I bumped into it while working on a "lex in Elisp", where using such an
equiv-table makes a lot more sense (it makes the DFA larger, but speeds
up the matching by removing a call to `lower' from the inner loop).


        Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-29 22:03                   ` Juri Linkov
  2009-11-30  1:22                     ` Stefan Monnier
@ 2009-11-30  1:28                     ` Kenichi Handa
  2009-11-30  1:36                       ` Kenichi Handa
  2009-11-30  7:01                     ` Ulrich Mueller
  2 siblings, 1 reply; 66+ messages in thread
From: Kenichi Handa @ 2009-11-30  1:28 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

In article <87ljhpau88.fsf@mail.jurta.org>, Juri Linkov <juri@jurta.org> writes:

> BTW, does anyone know how to put a set of characters into one category
> for the purpose of searching, without affecting case mapping?  I.e. to
> not change the current case mappings, but allow a character from one
> category to match all other characters from the same category.

The concept of "character category" exists exactly for that
purpose.  See the elisp info secion "35.9 Categories".

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-30  1:28                     ` Kenichi Handa
@ 2009-11-30  1:36                       ` Kenichi Handa
  0 siblings, 0 replies; 66+ messages in thread
From: Kenichi Handa @ 2009-11-30  1:36 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: juri, emacs-devel

In article <tl74oocajj1.fsf@m17n.org>, Kenichi Handa <handa@m17n.org> writes:

> In article <87ljhpau88.fsf@mail.jurta.org>, Juri Linkov <juri@jurta.org> writes:
> > BTW, does anyone know how to put a set of characters into one category
> > for the purpose of searching, without affecting case mapping?  I.e. to
> > not change the current case mappings, but allow a character from one
> > category to match all other characters from the same category.

> The concept of "character category" exists exactly for that
> purpose.  See the elisp info secion "35.9 Categories".

Perhaps, I misunderstood your intention.  For "character
category", you must use regexp-search by specifying category
mnemonic.

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-29 22:03                   ` Juri Linkov
  2009-11-30  1:22                     ` Stefan Monnier
  2009-11-30  1:28                     ` Kenichi Handa
@ 2009-11-30  7:01                     ` Ulrich Mueller
  2009-11-30 12:01                       ` Juri Linkov
  2009-11-30 21:57                       ` Juri Linkov
  2 siblings, 2 replies; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-30  7:01 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Mon, 30 Nov 2009, Juri Linkov wrote:

> BTW, does anyone know how to put a set of characters into one
> category for the purpose of searching, without affecting case
> mapping? I.e. to not change the current case mappings, but allow a
> character from one category to match all other characters from the
> same category.

> `set-case-syntax-pair' is not suitable because it does this only for
> a pair of characters and also changes case mappings.

> Of course, it's possible to search with a regexp that matches a
> character set in [...], or more convenient when a space in a regexp
> is substituted with a set of characters from `search-spaces-regexp'.

> I can't find a function like `set-case-syntax-pair' that would
> define a set of equivalent characters for the search, so maybe this
> is not possible.

I'm not sure if it's exactly what you need, but I use the following to
search for characters while ignoring any accents:

   (let ((eqv-list '("aAàÀáÁâÂãÃäÄåÅ"
		     "cCçÇ"
		     "eEèÈéÉêÊëË"
		     "iIìÌíÍîÎïÏ"
		     "nNñÑ"
		     "oOòÒóÓôÔõÕöÖøØ"
		     "uUùÙúÚûÛüÜ"
		     "yYýÝÿ"))
	 (table (standard-case-table))
	 canon)
     (setq canon (copy-sequence table))
     (mapcar (lambda (s)
	       (mapcar (lambda (c) (aset canon c (aref s 0))) s))
	     eqv-list)
     (set-char-table-extra-slot table 1 canon)
     (set-char-table-extra-slot table 2 nil)
     (set-standard-case-table table))

I don't know if there is a simpler way to achieve this.

See also the following thread, from 1994 :-)
<http://groups.google.de/group/gnu.emacs.bug/browse_thread/thread/f337b3650fbee9fe/abd8d16ab34c264e>

Ulrich




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-30  7:01                     ` Ulrich Mueller
@ 2009-11-30 12:01                       ` Juri Linkov
  2009-11-30 13:09                         ` martin rudalics
  2009-11-30 21:57                       ` Juri Linkov
  1 sibling, 1 reply; 66+ messages in thread
From: Juri Linkov @ 2009-11-30 12:01 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: emacs-devel

> I'm not sure if it's exactly what you need, but I use the following to
> search for characters while ignoring any accents:
>
>    (let ((eqv-list '("aAàÀáÁâÂãÃäÄåÅ"
> 		     "cCçÇ"
> 		     "eEèÈéÉêÊëË"
> 		     "iIìÌíÍîÎïÏ"
> 		     "nNñÑ"
> 		     "oOòÒóÓôÔõÕöÖøØ"
> 		     "uUùÙúÚûÛüÜ"
> 		     "yYýÝÿ"))
> 	 (table (standard-case-table))
> 	 canon)
>      (setq canon (copy-sequence table))
>      (mapcar (lambda (s)
> 	       (mapcar (lambda (c) (aset canon c (aref s 0))) s))
> 	     eqv-list)
>      (set-char-table-extra-slot table 1 canon)
>      (set-char-table-extra-slot table 2 nil)
>      (set-standard-case-table table))
>
> I don't know if there is a simpler way to achieve this.

This is exactly what I need - to ignore accents while searching.
I wonder why there is no mode like `search-ignore-accents-mode'
that does this.

-- 
Juri Linkov
http://www.jurta.org/emacs/




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-30 12:01                       ` Juri Linkov
@ 2009-11-30 13:09                         ` martin rudalics
  0 siblings, 0 replies; 66+ messages in thread
From: martin rudalics @ 2009-11-30 13:09 UTC (permalink / raw)
  To: Juri Linkov; +Cc: Ulrich Mueller, emacs-devel

 > This is exactly what I need - to ignore accents while searching.

... and a way to ignore accents while sorting lines.

 > I wonder why there is no mode like `search-ignore-accents-mode'
 > that does this.

... or a corresponding argument for `sort-lines'.

martin




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-30  7:01                     ` Ulrich Mueller
  2009-11-30 12:01                       ` Juri Linkov
@ 2009-11-30 21:57                       ` Juri Linkov
  2009-11-30 22:34                         ` Ulrich Mueller
  1 sibling, 1 reply; 66+ messages in thread
From: Juri Linkov @ 2009-11-30 21:57 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: emacs-devel

>    (let ((eqv-list '("aAàÀáÁâÂãÃäÄåÅ"
> 		     "cCçÇ"
> 		     "eEèÈéÉêÊëË"
> 		     "iIìÌíÍîÎïÏ"
> 		     "nNñÑ"
> 		     "oOòÒóÓôÔõÕöÖøØ"
> 		     "uUùÙúÚûÛüÜ"
> 		     "yYýÝÿ"))
> 	 (table (standard-case-table))
> 	 canon)
>      (setq canon (copy-sequence table))
>      (mapcar (lambda (s)
> 	       (mapcar (lambda (c) (aset canon c (aref s 0))) s))
> 	     eqv-list)
>      (set-char-table-extra-slot table 1 canon)
>      (set-char-table-extra-slot table 2 nil)
>      (set-standard-case-table table))

I tried this out, but it has no effect.  Maybe this used to work
in some older versions, but doesn't work in 23.1.

In addition to equivalent groups above, it would be nice also
to ignore combining accents.

-- 
Juri Linkov
http://www.jurta.org/emacs/




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-30 21:57                       ` Juri Linkov
@ 2009-11-30 22:34                         ` Ulrich Mueller
  2009-12-01  0:02                           ` Juri Linkov
  0 siblings, 1 reply; 66+ messages in thread
From: Ulrich Mueller @ 2009-11-30 22:34 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Mon, 30 Nov 2009, Juri Linkov wrote:

> I tried this out, but it has no effect.  Maybe this used to work
> in some older versions, but doesn't work in 23.1.

Strange. Here it works with both 23.1 and current CVS trunk.
After evaluating the expression, did you create a new buffer?
(set-standard-case-table sets the case table for new buffers only.)

Ulrich




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Case mapping of sharp s
  2009-11-30 22:34                         ` Ulrich Mueller
@ 2009-12-01  0:02                           ` Juri Linkov
  0 siblings, 0 replies; 66+ messages in thread
From: Juri Linkov @ 2009-12-01  0:02 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: emacs-devel

>> I tried this out, but it has no effect.  Maybe this used to work
>> in some older versions, but doesn't work in 23.1.
>
> Strange. Here it works with both 23.1 and current CVS trunk.
> After evaluating the expression, did you create a new buffer?
> (set-standard-case-table sets the case table for new buffers only.)

Thanks, I missed the fact that it affects new buffers.  So it works.
Provided it doesn't make the search slow, it would be nice to add
it to Emacs activating on some user settings.

Ignoring combining accents would be nice to have too, but perhaps
this is not possible to do using a case table?

-- 
Juri Linkov
http://www.jurta.org/emacs/




^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2009-12-01  0:02 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-15 14:29 Case mapping of sharp s Ulrich Mueller
2009-11-16 12:06 ` Kenichi Handa
2009-11-16 16:38   ` Ulrich Mueller
2009-11-17  7:36     ` Kenichi Handa
2009-11-17 21:23       ` Reiner Steib
2009-11-16 19:12   ` Eli Zaretskii
2009-11-17  7:43     ` martin rudalics
2009-11-17  7:49     ` Kenichi Handa
2009-11-17 18:56       ` Eli Zaretskii
2009-11-18  1:00         ` Kenichi Handa
2009-11-18  4:09           ` Eli Zaretskii
2009-11-18  5:33             ` Stephen J. Turnbull
2009-11-18  6:26             ` Kenichi Handa
2009-11-18 14:44               ` Stefan Monnier
2009-11-18 19:05                 ` Ulrich Mueller
2009-11-19  1:16                   ` Stefan Monnier
2009-11-18 17:58               ` Eli Zaretskii
2009-11-19  1:57                 ` Stephen J. Turnbull
  -- strict thread matches above, loose matches on Subject: below --
2009-11-19 19:48 grischka
2009-11-19 21:49 ` Stefan Monnier
2009-11-19 22:43   ` David Kastrup
2009-11-20  2:08     ` Stefan Monnier
2009-11-20  8:03       ` David Kastrup
2009-11-20 14:14         ` Stefan Monnier
2009-11-20  3:41     ` Stephen J. Turnbull
2009-11-20  4:20       ` Stefan Monnier
2009-11-20  7:13         ` Stephen J. Turnbull
2009-11-21  0:02           ` Richard Stallman
2009-11-21 12:39             ` David Kastrup
2009-11-21 17:40               ` Stephen J. Turnbull
2009-11-21 19:15                 ` Eli Zaretskii
2009-11-22  2:58                   ` Stephen J. Turnbull
2009-11-22  4:28                     ` Eli Zaretskii
2009-11-22  8:27                       ` Stephen J. Turnbull
2009-11-23  1:30                 ` Kenichi Handa
2009-11-21 22:52               ` Richard Stallman
2009-11-20  8:10     ` Ulrich Mueller
2009-11-20 11:46       ` Stephen J. Turnbull
2009-11-20 14:43         ` Ulrich Mueller
2009-11-21  4:33           ` Stephen J. Turnbull
2009-11-19 23:25   ` grischka
2009-11-20  2:11     ` Stefan Monnier
2009-11-21  3:08       ` grischka
2009-11-21  8:58         ` Eli Zaretskii
2009-11-21  9:33           ` Andreas Schwab
2009-11-21 11:45             ` Eli Zaretskii
2009-11-21 15:33           ` grischka
2009-11-21 10:41         ` Ulrich Mueller
2009-11-21 11:58           ` Andreas Schwab
2009-11-21 17:01             ` Ulrich Mueller
2009-11-22 12:11               ` Andreas Schwab
2009-11-22 20:15                 ` Stefan Monnier
2009-11-24 12:26             ` Kenichi Handa
2009-11-24 19:23               ` grischka
2009-11-25  2:13                 ` Kenichi Handa
2009-11-26 13:07                   ` grischka
2009-11-29 22:03                   ` Juri Linkov
2009-11-30  1:22                     ` Stefan Monnier
2009-11-30  1:28                     ` Kenichi Handa
2009-11-30  1:36                       ` Kenichi Handa
2009-11-30  7:01                     ` Ulrich Mueller
2009-11-30 12:01                       ` Juri Linkov
2009-11-30 13:09                         ` martin rudalics
2009-11-30 21:57                       ` Juri Linkov
2009-11-30 22:34                         ` Ulrich Mueller
2009-12-01  0:02                           ` Juri Linkov

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).