* Does anyone have a better scm_string_hash ?
@ 2003-11-14 18:11 Roland Orre
2003-11-14 20:09 ` Andreas Rottmann
0 siblings, 1 reply; 9+ messages in thread
From: Roland Orre @ 2003-11-14 18:11 UTC (permalink / raw)
I've earlier posted this to guile-user but I think it's such a
serious issue, as string hashing is essential to an interactive
language, that it needs to be addressed also here.
---
I noticed that our data mining software run very very slow with a
new data base. I localized the problem to scm_string_hash.
A hash table in this case was loaded with 14166 strings. I have
a function which creates a reasonable sized hash table, in this
case the hash table size was 8209.
13856 of these strings were hashed to the same index= 1067.
303 strings got index = 8061.
2 strings got the index = 754.
8201 entries were empty.
We are running guile 1.6 but I checked the scm_string_hash from a recent
1.7 CVS also and the function in hash.c there is identical.
I added a few of the symbols hashing to 1067 below. One can of course
argue that the symbols in this case should be hashed as numbers.
Anyway, does anyone have any hint or have a better string hash function?
Best regards
Roland Orre
A few strings hashing to entry 1067 for hash table length 8209:
"01632001" "01627301" "01626801" "01626601" "01626501" "01626401"
"01626301" "01626101" "01626001" "01625901" "01625801" "01625701"
"01625401" "01625301" "01625101" "01625001" "01624801" "01624601"
"01624501" "01624401" "01624301" "01624101" "01624001" "01623901"
"01623801" "01623701" "01623601" "01623501" "01623401" "01623201"
"01623101" "01622901" "01622801" "01622701" "01622601" "01622401"
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Does anyone have a better scm_string_hash ?
2003-11-14 18:11 Does anyone have a better scm_string_hash ? Roland Orre
@ 2003-11-14 20:09 ` Andreas Rottmann
0 siblings, 0 replies; 9+ messages in thread
From: Andreas Rottmann @ 2003-11-14 20:09 UTC (permalink / raw)
Cc: guile-devel
Roland Orre <orre@nada.kth.se> writes:
> A hash table in this case was loaded with 14166 strings. I have
> a function which creates a reasonable sized hash table, in this
> case the hash table size was 8209.
>
> 13856 of these strings were hashed to the same index= 1067.
> 303 strings got index = 8061.
> 2 strings got the index = 754.
> 8201 entries were empty.
>
> We are running guile 1.6 but I checked the scm_string_hash from a recent
> 1.7 CVS also and the function in hash.c there is identical.
>
> I added a few of the symbols hashing to 1067 below. One can of course
> argue that the symbols in this case should be hashed as numbers.
> Anyway, does anyone have any hint or have a better string hash function?
>
Maybe have a look at how g_str_hash (from GLib) does it.
Gtx, Andi
--
Andreas Rottmann | Rotty@ICQ | 118634484@ICQ | a.rottmann@gmx.at
http://www.8ung.at/rotty | GnuPG Key: http://www.8ung.at/rotty/gpg.asc
Fingerprint | DFB4 4EB4 78A4 5EEE 6219 F228 F92F CFC5 01FD 5B62
This reality is really just a fucked-up dream -- Papa Roach
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <E1AKPRR-00036l-00@surf.glug.org>]
* Re: Does anyone have a better scm_string_hash ?
@ 2003-11-20 15:48 Stephen Compall
2003-11-20 17:29 ` Marius Vollmer
0 siblings, 1 reply; 9+ messages in thread
From: Stephen Compall @ 2003-11-20 15:48 UTC (permalink / raw)
Being ever so useful, I was reading through gnulib in coreutils when I
spotted "hash.c". Thought, I'd better take a look. Anyway, here's
the string hashing code in there, even if you really don't want to
explore this further.
Er, of course, this is GPLed....
/* Return a hash index for a NUL-terminated STRING between 0 and N_BUCKETS-1.
This is a convenience routine for constructing other hashing functions. */
#if USE_DIFF_HASH
/* About hashings, Paul Eggert writes to me (FP), on 1994-01-01: "Please see
B. J. McKenzie, R. Harries & T. Bell, Selecting a hashing algorithm,
Software--practice & experience 20, 2 (Feb 1990), 209-224. Good hash
algorithms tend to be domain-specific, so what's good for [diffutils'] io.c
may not be good for your application." */
size_t
hash_string (const char *string, size_t n_buckets)
{
# define ROTATE_LEFT(Value, Shift) \
((Value) << (Shift) | (Value) >> ((sizeof (size_t) * CHAR_BIT) - (Shift)))
# define HASH_ONE_CHAR(Value, Byte) \
((Byte) + ROTATE_LEFT (Value, 7))
size_t value = 0;
for (; *string; string++)
value = HASH_ONE_CHAR (value, (unsigned char) *string);
return value % n_buckets;
# undef ROTATE_LEFT
# undef HASH_ONE_CHAR
}
#else /* not USE_DIFF_HASH */
/* This one comes from `recode', and performs a bit better than the above as
per a few experiments. It is inspired from a hashing routine found in the
very old Cyber `snoop', itself written in typical Greg Mansfield style.
(By the way, what happened to this excellent man? Is he still alive?) */
size_t
hash_string (const char *string, size_t n_buckets)
{
size_t value = 0;
while (*string)
value = (value * 31 + (unsigned char) *string++) % n_buckets;
return value;
}
#endif /* not USE_DIFF_HASH */
--
Stephen Compall or s11 or sirian
Better to use medicines at the outset than at the last moment.
Sears Tower rs9512c Sundevil AK-47 lynch red noise Treasury global SHA
SRI Clinton kibo blackjack Rand Corporation Crypto AG
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Does anyone have a better scm_string_hash ?
2003-11-20 15:48 Stephen Compall
@ 2003-11-20 17:29 ` Marius Vollmer
0 siblings, 0 replies; 9+ messages in thread
From: Marius Vollmer @ 2003-11-20 17:29 UTC (permalink / raw)
Cc: guile-devel
Stephen Compall <s11@member.fsf.org> writes:
> /* This one comes from `recode', and performs a bit better than the above as
> per a few experiments. It is inspired from a hashing routine found in the
> very old Cyber `snoop', itself written in typical Greg Mansfield style.
> (By the way, what happened to this excellent man? Is he still alive?) */
>
> size_t
> hash_string (const char *string, size_t n_buckets)
> {
> size_t value = 0;
>
> while (*string)
> value = (value * 31 + (unsigned char) *string++) % n_buckets;
> return value;
> }
This is essentially(?) the one used by glib. We use it with 37
instead of 31 and like glib, compute the module after the loop.
unsigned long
scm_string_hash (const unsigned char *str, size_t len)
{
/* from suggestion at: */
/* http://srfi.schemers.org/srfi-13/mail-archive/msg00112.html */
unsigned long h = 0;
while (len-- > 0)
h = *str++ + h*37;
return h;
}
--
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405
_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2003-11-20 17:29 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-11-14 18:11 Does anyone have a better scm_string_hash ? Roland Orre
2003-11-14 20:09 ` Andreas Rottmann
[not found] <E1AKPRR-00036l-00@surf.glug.org>
[not found] ` <8765hnf308.fsf@zagadka.ping.de>
[not found] ` <1068823738.13123.54.camel@localhost>
[not found] ` <20031114155148.GI16650@powergnu.laas.fr>
[not found] ` <1069058032.1638.21.camel@localhost>
[not found] ` <874qx3rogk.fsf@zagadka.ping.de>
2003-11-17 16:02 ` Marius Vollmer
2003-11-17 16:29 ` Marius Vollmer
2003-11-18 20:15 ` Kevin Ryde
2003-11-19 9:04 ` Ludovic Courtès
2003-11-19 15:02 ` Marius Vollmer
-- strict thread matches above, loose matches on Subject: below --
2003-11-20 15:48 Stephen Compall
2003-11-20 17:29 ` Marius Vollmer
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).