unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* Does anyone have a better scm_string_hash ?
@ 2003-11-14 18:11 Roland Orre
  2003-11-14 20:09 ` Andreas Rottmann
  0 siblings, 1 reply; 9+ messages in thread
From: Roland Orre @ 2003-11-14 18:11 UTC (permalink / raw)


I've earlier posted this to guile-user but I think it's such a
serious issue, as string hashing is essential to an interactive
language, that it needs to be addressed also here.
---

I noticed that our data mining software run very very slow with a
new data base. I localized the problem to scm_string_hash.

A hash table in this case was loaded with 14166 strings. I have
a function which creates a reasonable sized hash table, in this
case the hash table size was 8209.

13856 of these strings were hashed to the same index= 1067.
303 strings got index = 8061.
2 strings got the index = 754.
8201 entries were empty.

We are running guile 1.6 but I checked the scm_string_hash from a recent
1.7 CVS also and the function in hash.c there is identical.

I added a few of the symbols hashing to 1067 below. One can of course
argue that the symbols in this case should be hashed as numbers.
Anyway, does anyone have any hint or have a better string hash function?

	Best regards
	Roland Orre


A few strings hashing to entry 1067 for hash table length 8209:
"01632001" "01627301" "01626801" "01626601" "01626501" "01626401"
"01626301" "01626101" "01626001" "01625901" "01625801" "01625701"
"01625401" "01625301" "01625101" "01625001" "01624801" "01624601"
"01624501" "01624401" "01624301" "01624101" "01624001" "01623901"
"01623801" "01623701" "01623601" "01623501" "01623401" "01623201"
"01623101" "01622901" "01622801" "01622701" "01622601" "01622401"





_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
  2003-11-14 18:11 Roland Orre
@ 2003-11-14 20:09 ` Andreas Rottmann
  0 siblings, 0 replies; 9+ messages in thread
From: Andreas Rottmann @ 2003-11-14 20:09 UTC (permalink / raw)
  Cc: guile-devel

Roland Orre <orre@nada.kth.se> writes:

> A hash table in this case was loaded with 14166 strings. I have
> a function which creates a reasonable sized hash table, in this
> case the hash table size was 8209.
>
> 13856 of these strings were hashed to the same index= 1067.
> 303 strings got index = 8061.
> 2 strings got the index = 754.
> 8201 entries were empty.
>
> We are running guile 1.6 but I checked the scm_string_hash from a recent
> 1.7 CVS also and the function in hash.c there is identical.
>
> I added a few of the symbols hashing to 1067 below. One can of course
> argue that the symbols in this case should be hashed as numbers.
> Anyway, does anyone have any hint or have a better string hash function?
>
Maybe have a look at how g_str_hash (from GLib) does it.

Gtx, Andi
-- 
Andreas Rottmann         | Rotty@ICQ      | 118634484@ICQ | a.rottmann@gmx.at
http://www.8ung.at/rotty | GnuPG Key: http://www.8ung.at/rotty/gpg.asc
Fingerprint              | DFB4 4EB4 78A4 5EEE 6219  F228 F92F CFC5 01FD 5B62

This reality is really just a fucked-up dream -- Papa Roach


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
       [not found]         ` <874qx3rogk.fsf@zagadka.ping.de>
@ 2003-11-17 16:02           ` Marius Vollmer
  2003-11-17 16:29             ` Marius Vollmer
  2003-11-19  9:04             ` Ludovic Courtès
  0 siblings, 2 replies; 9+ messages in thread
From: Marius Vollmer @ 2003-11-17 16:02 UTC (permalink / raw)
  Cc: guile-user

Marius Vollmer <mvo@zagadka.de> writes:

> Both should be much better than the one we have right now and I'm
> going to install the one from Roland without the hash_mask.

Ok, I have installed the following hash function into 1.7.  As a nice
side effect, guile seems to start a bit faster now.  (Although the new
function does more work than the old.  Talk about being too clever.)

    unsigned long 
    scm_string_hash (const unsigned char *str, size_t len)
    {
      /* from suggestion at: */
      /* http://srfi.schemers.org/srfi-13/mail-archive/msg00112.html */

      unsigned long h = 0;
      while (len-- > 0)
        {
          /* h = (str[i] + h*37) % 2^bits_per_long; */
          h = *str++ + (h<<5) + (h<<2) + h;
        }
      return h;
    }

I removed 'i' from the calculation since neither the glib function nor
the one from Java uses it and I'm wary of heuristical magic when it
comes to random number generation and hash functions.  Permutations of
the input will still lead to different hash values since, for example
str[0]+str[1]*37 is different from str[1]+str[0]*37.

Just for kicks, I'm now going to see what kind of code GCC generates
for h*37 as compared to (h<<5) + (h<<2) + h...

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
  2003-11-17 16:02           ` Marius Vollmer
@ 2003-11-17 16:29             ` Marius Vollmer
  2003-11-18 20:15               ` Kevin Ryde
  2003-11-19  9:04             ` Ludovic Courtès
  1 sibling, 1 reply; 9+ messages in thread
From: Marius Vollmer @ 2003-11-17 16:29 UTC (permalink / raw)
  Cc: guile-user

Marius Vollmer <mvo@zagadka.de> writes:

> Just for kicks, I'm now going to see what kind of code GCC generates
> for h*37 as compared to (h<<5) + (h<<2) + h...

Interesting.  For h = a + (h<<5) + (h<<2) + h we get this sequence
(one line is one machine instruction):

    x = h
    x = x << 5
    a = a + x
    a = a + h*4
    h = a + h

and for h = a + h*37 we get

    x = h + h*8
    x = h + x*4
    h = x + a*1

which is nearly twice as clever...

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
  2003-11-17 16:29             ` Marius Vollmer
@ 2003-11-18 20:15               ` Kevin Ryde
  0 siblings, 0 replies; 9+ messages in thread
From: Kevin Ryde @ 2003-11-18 20:15 UTC (permalink / raw)


Marius Vollmer <mvo@zagadka.de> writes:
>
> and for h = a + h*37 we get

You probably want to write *37 like that, gcc is usually good at
reducing constant multipliers when it can help a given chip.


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
  2003-11-17 16:02           ` Marius Vollmer
  2003-11-17 16:29             ` Marius Vollmer
@ 2003-11-19  9:04             ` Ludovic Courtès
  2003-11-19 15:02               ` Marius Vollmer
  1 sibling, 1 reply; 9+ messages in thread
From: Ludovic Courtès @ 2003-11-19  9:04 UTC (permalink / raw)
  Cc: guile-user, guile-devel

Hi,

One day, 17 hours, 34 seconds ago, 
Marius Vollmer wrote:
> Ok, I have installed the following hash function into 1.7.  As a nice
> side effect, guile seems to start a bit faster now.  (Although the new
> function does more work than the old.  Talk about being too clever.)

BTW, is this function also used for hashing symbol names?  If so, then
it must slightly improve performance!

Thanks,
Ludovic.


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
  2003-11-19  9:04             ` Ludovic Courtès
@ 2003-11-19 15:02               ` Marius Vollmer
  0 siblings, 0 replies; 9+ messages in thread
From: Marius Vollmer @ 2003-11-19 15:02 UTC (permalink / raw)
  Cc: guile-user

Ludovic Courtès <ludovic.courtes@laas.fr> writes:

> Hi,
>
> BTW, is this function also used for hashing symbol names?

Yes.

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
@ 2003-11-20 15:48 Stephen Compall
  2003-11-20 17:29 ` Marius Vollmer
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen Compall @ 2003-11-20 15:48 UTC (permalink / raw)


Being ever so useful, I was reading through gnulib in coreutils when I
spotted "hash.c".  Thought, I'd better take a look.  Anyway, here's
the string hashing code in there, even if you really don't want to
explore this further.

Er, of course, this is GPLed....

/* Return a hash index for a NUL-terminated STRING between 0 and N_BUCKETS-1.
   This is a convenience routine for constructing other hashing functions.  */

#if USE_DIFF_HASH

/* About hashings, Paul Eggert writes to me (FP), on 1994-01-01: "Please see
   B. J. McKenzie, R. Harries & T. Bell, Selecting a hashing algorithm,
   Software--practice & experience 20, 2 (Feb 1990), 209-224.  Good hash
   algorithms tend to be domain-specific, so what's good for [diffutils'] io.c
   may not be good for your application."  */

size_t
hash_string (const char *string, size_t n_buckets)
{
# define ROTATE_LEFT(Value, Shift) \
  ((Value) << (Shift) | (Value) >> ((sizeof (size_t) * CHAR_BIT) - (Shift)))
# define HASH_ONE_CHAR(Value, Byte) \
  ((Byte) + ROTATE_LEFT (Value, 7))

  size_t value = 0;

  for (; *string; string++)
    value = HASH_ONE_CHAR (value, (unsigned char) *string);
  return value % n_buckets;

# undef ROTATE_LEFT
# undef HASH_ONE_CHAR
}

#else /* not USE_DIFF_HASH */

/* This one comes from `recode', and performs a bit better than the above as
   per a few experiments.  It is inspired from a hashing routine found in the
   very old Cyber `snoop', itself written in typical Greg Mansfield style.
   (By the way, what happened to this excellent man?  Is he still alive?)  */

size_t
hash_string (const char *string, size_t n_buckets)
{
  size_t value = 0;

  while (*string)
    value = (value * 31 + (unsigned char) *string++) % n_buckets;
  return value;
}

#endif /* not USE_DIFF_HASH */

--
Stephen Compall or s11 or sirian

Better to use medicines at the outset than at the last moment.

Sears Tower rs9512c Sundevil AK-47 lynch red noise Treasury global SHA
SRI Clinton kibo blackjack Rand Corporation Crypto AG


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Does anyone have a better scm_string_hash ?
  2003-11-20 15:48 Does anyone have a better scm_string_hash ? Stephen Compall
@ 2003-11-20 17:29 ` Marius Vollmer
  0 siblings, 0 replies; 9+ messages in thread
From: Marius Vollmer @ 2003-11-20 17:29 UTC (permalink / raw)
  Cc: guile-devel

Stephen Compall <s11@member.fsf.org> writes:

> /* This one comes from `recode', and performs a bit better than the above as
>    per a few experiments.  It is inspired from a hashing routine found in the
>    very old Cyber `snoop', itself written in typical Greg Mansfield style.
>    (By the way, what happened to this excellent man?  Is he still alive?)  */
>
> size_t
> hash_string (const char *string, size_t n_buckets)
> {
>   size_t value = 0;
>
>   while (*string)
>     value = (value * 31 + (unsigned char) *string++) % n_buckets;
>   return value;
> }

This is essentially(?) the one used by glib.  We use it with 37
instead of 31 and like glib, compute the module after the loop.

    unsigned long 
    scm_string_hash (const unsigned char *str, size_t len)
    {
      /* from suggestion at: */
      /* http://srfi.schemers.org/srfi-13/mail-archive/msg00112.html */

      unsigned long h = 0;
      while (len-- > 0)
        h = *str++ + h*37;
      return h;
    }

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2003-11-20 17:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-11-20 15:48 Does anyone have a better scm_string_hash ? Stephen Compall
2003-11-20 17:29 ` Marius Vollmer
     [not found] <E1AKPRR-00036l-00@surf.glug.org>
     [not found] ` <8765hnf308.fsf@zagadka.ping.de>
     [not found]   ` <1068823738.13123.54.camel@localhost>
     [not found]     ` <20031114155148.GI16650@powergnu.laas.fr>
     [not found]       ` <1069058032.1638.21.camel@localhost>
     [not found]         ` <874qx3rogk.fsf@zagadka.ping.de>
2003-11-17 16:02           ` Marius Vollmer
2003-11-17 16:29             ` Marius Vollmer
2003-11-18 20:15               ` Kevin Ryde
2003-11-19  9:04             ` Ludovic Courtès
2003-11-19 15:02               ` Marius Vollmer
  -- strict thread matches above, loose matches on Subject: below --
2003-11-14 18:11 Roland Orre
2003-11-14 20:09 ` Andreas Rottmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).