unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* Does anyone have a better scm_string_hash ?
@ 2003-11-14 18:11 Roland Orre
  2003-11-14 20:09 ` Andreas Rottmann
  0 siblings, 1 reply; 9+ messages in thread
From: Roland Orre @ 2003-11-14 18:11 UTC (permalink / raw)


I've earlier posted this to guile-user but I think it's such a
serious issue, as string hashing is essential to an interactive
language, that it needs to be addressed also here.
---

I noticed that our data mining software run very very slow with a
new data base. I localized the problem to scm_string_hash.

A hash table in this case was loaded with 14166 strings. I have
a function which creates a reasonable sized hash table, in this
case the hash table size was 8209.

13856 of these strings were hashed to the same index= 1067.
303 strings got index = 8061.
2 strings got the index = 754.
8201 entries were empty.

We are running guile 1.6 but I checked the scm_string_hash from a recent
1.7 CVS also and the function in hash.c there is identical.

I added a few of the symbols hashing to 1067 below. One can of course
argue that the symbols in this case should be hashed as numbers.
Anyway, does anyone have any hint or have a better string hash function?

	Best regards
	Roland Orre


A few strings hashing to entry 1067 for hash table length 8209:
"01632001" "01627301" "01626801" "01626601" "01626501" "01626401"
"01626301" "01626101" "01626001" "01625901" "01625801" "01625701"
"01625401" "01625301" "01625101" "01625001" "01624801" "01624601"
"01624501" "01624401" "01624301" "01624101" "01624001" "01623901"
"01623801" "01623701" "01623601" "01623501" "01623401" "01623201"
"01623101" "01622901" "01622801" "01622701" "01622601" "01622401"





_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <E1AKPRR-00036l-00@surf.glug.org>]
* Re: Does anyone have a better scm_string_hash ?
@ 2003-11-20 15:48 Stephen Compall
  2003-11-20 17:29 ` Marius Vollmer
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen Compall @ 2003-11-20 15:48 UTC (permalink / raw)


Being ever so useful, I was reading through gnulib in coreutils when I
spotted "hash.c".  Thought, I'd better take a look.  Anyway, here's
the string hashing code in there, even if you really don't want to
explore this further.

Er, of course, this is GPLed....

/* Return a hash index for a NUL-terminated STRING between 0 and N_BUCKETS-1.
   This is a convenience routine for constructing other hashing functions.  */

#if USE_DIFF_HASH

/* About hashings, Paul Eggert writes to me (FP), on 1994-01-01: "Please see
   B. J. McKenzie, R. Harries & T. Bell, Selecting a hashing algorithm,
   Software--practice & experience 20, 2 (Feb 1990), 209-224.  Good hash
   algorithms tend to be domain-specific, so what's good for [diffutils'] io.c
   may not be good for your application."  */

size_t
hash_string (const char *string, size_t n_buckets)
{
# define ROTATE_LEFT(Value, Shift) \
  ((Value) << (Shift) | (Value) >> ((sizeof (size_t) * CHAR_BIT) - (Shift)))
# define HASH_ONE_CHAR(Value, Byte) \
  ((Byte) + ROTATE_LEFT (Value, 7))

  size_t value = 0;

  for (; *string; string++)
    value = HASH_ONE_CHAR (value, (unsigned char) *string);
  return value % n_buckets;

# undef ROTATE_LEFT
# undef HASH_ONE_CHAR
}

#else /* not USE_DIFF_HASH */

/* This one comes from `recode', and performs a bit better than the above as
   per a few experiments.  It is inspired from a hashing routine found in the
   very old Cyber `snoop', itself written in typical Greg Mansfield style.
   (By the way, what happened to this excellent man?  Is he still alive?)  */

size_t
hash_string (const char *string, size_t n_buckets)
{
  size_t value = 0;

  while (*string)
    value = (value * 31 + (unsigned char) *string++) % n_buckets;
  return value;
}

#endif /* not USE_DIFF_HASH */

--
Stephen Compall or s11 or sirian

Better to use medicines at the outset than at the last moment.

Sears Tower rs9512c Sundevil AK-47 lynch red noise Treasury global SHA
SRI Clinton kibo blackjack Rand Corporation Crypto AG


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2003-11-20 17:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-11-14 18:11 Does anyone have a better scm_string_hash ? Roland Orre
2003-11-14 20:09 ` Andreas Rottmann
     [not found] <E1AKPRR-00036l-00@surf.glug.org>
     [not found] ` <8765hnf308.fsf@zagadka.ping.de>
     [not found]   ` <1068823738.13123.54.camel@localhost>
     [not found]     ` <20031114155148.GI16650@powergnu.laas.fr>
     [not found]       ` <1069058032.1638.21.camel@localhost>
     [not found]         ` <874qx3rogk.fsf@zagadka.ping.de>
2003-11-17 16:02           ` Marius Vollmer
2003-11-17 16:29             ` Marius Vollmer
2003-11-18 20:15               ` Kevin Ryde
2003-11-19  9:04             ` Ludovic Courtès
2003-11-19 15:02               ` Marius Vollmer
  -- strict thread matches above, loose matches on Subject: below --
2003-11-20 15:48 Stephen Compall
2003-11-20 17:29 ` Marius Vollmer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).