From: Roland Orre <orre@nada.kth.se>
Cc: guile-user@gnu.org
Subject: Re: Does anyone have a better scm_string_hash ?
Date: Mon, 17 Nov 2003 09:33:53 +0100 [thread overview]
Message-ID: <1069058032.1638.21.camel@localhost> (raw)
In-Reply-To: <20031114155148.GI16650@powergnu.laas.fr>
Thanks,
I replaced the scm_string_hash with the one below, reminding about what
Ludovic Courtès suggested. Of course it could have been preferrably to
use it as a private hash function only, but I prefer to use the
standard functions at all places, especially as I hadn't designed
around using a private hash function (it was used in many places...).
The performance increase we obtained with this hash function were
tremendous. With the problematic symbol table the average lookup
time went down from 10 ms to 9 us on this specific machine (1 GHz PIII).
The max number of collisions went down from 13856 to 7 (on a total
of 14166 strings in a table of length 8209). The number of empty
entries went down from 8201 to 909. The run time for a data mining
scan on a database went down from 5 hours to 15 minutes. The reason
I added i in the hash function was to decrease the risk that two
strings with same but permuted contents would hash to the same value.
I have not done any thorough tests on this hash function yet.
Best regards
Roland Orre
/* replaced by Roland Orre orre@nada.kth.se */
#define hash_mask 0x00FFFFFF
unsigned long
scm_string_hash (const unsigned char *str, size_t len)
{
/* from suggestion at: */
/* http://srfi.schemers.org/srfi-13/mail-archive/msg00112.html */
int i;
/* originally h=0 was suggested. orre@nada.kth.se */
unsigned long h=177;
for (i = len-1; i >= 0; i--)
{
/* h = (str[i] +i+ h*37) & hash_mask; */
h = (str[i] + i + (h<<5) + (h<<2) + h) & hash_mask;
}
return h;
} /* scm_string_hash */
On Fri, 2003-11-14 at 16:51, Ludovic Courtès wrote:
> You might want to look at
> http://srfi.schemers.org/srfi-13/mail-archive/msg00112.html .
>
> Basically, the idea there is that a hash key for string cn..c0 is:
> h = (c0 + c1*37 + c2*37^2 + ...) % hash_size.
> I used that and it *seems* to work quite well.
>
> This can be written as follows:
>
> k = 0;
> while (*string)
> {
> k = (k * 37) + (*string);
> k %= hash_size;
> string++;
> }
> return k;
>
> Cheers,
> Ludovic.
>
> Today, 15 minutes, 40 seconds ago, Roland Orre wrote:
> > I noticed that our data mining software run very very slow with a
> > new data base. I localized the problem to scm_string_hash.
> >
> > A hash table in this case was loaded with 14166 strings. I have
> > a function which creates a reasonable sized hash table, in this
> > case the hash table size was 8209.
> >
> > 13856 of these strings were hashed to the same index= 1067.
> > 303 strings got index = 8061.
> > 2 strings got the index = 754.
> > 8201 entries were empty.
> >
> > We are running guile 1.6 but I checked the scm_string_hash from a recent
> > 1.7 CVS also and the function in hash.c there is identical.
> >
> > I added a few of the symbols hashing to 1067 below. One can of course
> > argue that the symbols in this case should be hashed as numbers.
> > Anyway, does anyone have any hint or have a better string hash function?
> >
> > Best regards
> > Roland Orre
> >
> >
> > A few strings hashing to entry 1067 for hash table length 8209:
> > "01632001" "01627301" "01626801" "01626601" "01626501" "01626401"
> > "01626301" "01626101" "01626001" "01625901" "01625801" "01625701"
> > "01625401" "01625301" "01625101" "01625001" "01624801" "01624601"
> > "01624501" "01624401" "01624301" "01624101" "01624001" "01623901"
> > "01623801" "01623701" "01623601" "01623501" "01623401" "01623201"
> > "01623101" "01622901" "01622801" "01622701" "01622601" "01622401"
> >
> >
> >
> >
> > _______________________________________________
> > Guile-user mailing list
> > Guile-user@gnu.org
> > http://mail.gnu.org/mailman/listinfo/guile-user
_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user
next prev parent reply other threads:[~2003-11-17 8:33 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2003-11-13 21:55 cmod-play 1 available + modsup.h additions Thien-Thi Nguyen
2003-11-14 8:26 ` Ludovic Courtès
2003-11-14 13:10 ` Thien-Thi Nguyen
2003-11-14 13:37 ` Ludovic Courtès
2003-11-14 17:38 ` Thien-Thi Nguyen
2003-11-14 14:29 ` Marius Vollmer
2003-11-14 14:17 ` Marius Vollmer
2003-11-14 15:28 ` Does anyone have a better scm_string_hash ? Roland Orre
2003-11-14 15:51 ` Ludovic Courtès
2003-11-17 8:33 ` Roland Orre [this message]
2003-11-17 13:01 ` Ludovic Courtès
2003-11-17 15:42 ` Marius Vollmer
2003-11-17 16:02 ` Marius Vollmer
2003-11-17 16:29 ` Marius Vollmer
2003-11-17 16:48 ` Allister MacLeod
2003-11-17 17:57 ` Marius Vollmer
2003-11-17 19:17 ` OT: x86 assembly timings/size (was Re: Does anyone have a better scm_string_hash ?) Allister MacLeod
2003-11-17 21:27 ` OT: x86 assembly timings/size Marius Vollmer
2003-11-19 9:04 ` Does anyone have a better scm_string_hash ? Ludovic Courtès
2003-11-19 15:02 ` Marius Vollmer
2003-11-14 17:40 ` cmod-play 1 available + modsup.h additions Thien-Thi Nguyen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1069058032.1638.21.camel@localhost \
--to=orre@nada.kth.se \
--cc=guile-user@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).