From mboxrd@z Thu Jan 1 00:00:00 1970 Path: main.gmane.org!not-for-mail From: Roland Orre Newsgroups: gmane.lisp.guile.user Subject: Re: Does anyone have a better scm_string_hash ? Date: Mon, 17 Nov 2003 09:33:53 +0100 Organization: Royal Institute of Technology Sender: guile-user-bounces+guile-user=m.gmane.org@gnu.org Message-ID: <1069058032.1638.21.camel@localhost> References: <8765hnf308.fsf@zagadka.ping.de> <1068823738.13123.54.camel@localhost> <20031114155148.GI16650@powergnu.laas.fr> NNTP-Posting-Host: deer.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Trace: sea.gmane.org 1069058429 19539 80.91.224.253 (17 Nov 2003 08:40:29 GMT) X-Complaints-To: usenet@sea.gmane.org NNTP-Posting-Date: Mon, 17 Nov 2003 08:40:29 +0000 (UTC) Cc: guile-user@gnu.org Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Mon Nov 17 09:40:26 2003 Return-path: Original-Received: from monty-python.gnu.org ([199.232.76.173]) by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian)) id 1ALevq-0000jd-00 for ; Mon, 17 Nov 2003 09:40:26 +0100 Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org) by monty-python.gnu.org with esmtp (Exim 4.24) id 1ALfs5-0005sC-66 for guile-user@m.gmane.org; Mon, 17 Nov 2003 04:40:37 -0500 Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.24) id 1ALfrN-0005s4-Pd for guile-user@gnu.org; Mon, 17 Nov 2003 04:39:53 -0500 Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.24) id 1ALfqo-0005mk-Su for guile-user@gnu.org; Mon, 17 Nov 2003 04:39:51 -0500 Original-Received: from [130.237.222.202] (helo=smtp.nada.kth.se) by monty-python.gnu.org with esmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.24) id 1ALfqo-0005mP-Cl for guile-user@gnu.org; Mon, 17 Nov 2003 04:39:18 -0500 Original-Received: from bari.bacon.su.se (bari.bacon.su.se [130.237.152.231]) (authenticated bits=0) by smtp.nada.kth.se (8.12.10/8.12.1) with ESMTP id hAH8boEf007635 (version=TLSv1/SSLv3 cipher=RC4-MD5 bits=128 verify=NO); Mon, 17 Nov 2003 09:37:51 +0100 (MET) Original-To: Ludovic =?ISO-8859-1?Q?Court=E8s?= In-Reply-To: <20031114155148.GI16650@powergnu.laas.fr> X-Mailer: Ximian Evolution 1.4.5 X-MIME-Autoconverted: from 8bit to quoted-printable by smtp.nada.kth.se id hAH8boEf007635 X-BeenThere: guile-user@gnu.org X-Mailman-Version: 2.1.2 Precedence: list List-Id: General Guile related discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org Xref: main.gmane.org gmane.lisp.guile.user:2391 X-Report-Spam: http://spam.gmane.org/gmane.lisp.guile.user:2391 Thanks, I replaced the scm_string_hash with the one below, reminding about what Ludovic Court=E8s suggested. Of course it could have been preferrably to use it as a private hash function only, but I prefer to use the standard functions at all places, especially as I hadn't designed around using a private hash function (it was used in many places...). The performance increase we obtained with this hash function were tremendous. With the problematic symbol table the average lookup time went down from 10 ms to 9 us on this specific machine (1 GHz PIII). The max number of collisions went down from 13856 to 7 (on a total of 14166 strings in a table of length 8209). The number of empty entries went down from 8201 to 909. The run time for a data mining scan on a database went down from 5 hours to 15 minutes. The reason I added i in the hash function was to decrease the risk that two strings with same but permuted contents would hash to the same value. I have not done any thorough tests on this hash function yet. Best regards Roland Orre /* replaced by Roland Orre orre@nada.kth.se */ #define hash_mask 0x00FFFFFF unsigned long=20 scm_string_hash (const unsigned char *str, size_t len) { /* from suggestion at: */ /* http://srfi.schemers.org/srfi-13/mail-archive/msg00112.html */ int i; /* originally h=3D0 was suggested. orre@nada.kth.se */ unsigned long h=3D177; for (i =3D len-1; i >=3D 0; i--) { /* h =3D (str[i] +i+ h*37) & hash_mask; */ h =3D (str[i] + i + (h<<5) + (h<<2) + h) & hash_mask; } return h; } /* scm_string_hash */ On Fri, 2003-11-14 at 16:51, Ludovic Court=E8s wrote: > You might want to look at > http://srfi.schemers.org/srfi-13/mail-archive/msg00112.html . >=20 > Basically, the idea there is that a hash key for string cn..c0 is: > h =3D (c0 + c1*37 + c2*37^2 + ...) % hash_size. > I used that and it *seems* to work quite well. >=20 > This can be written as follows: >=20 > k =3D 0; > while (*string) > { > k =3D (k * 37) + (*string); > k %=3D hash_size; > string++; > } > return k; >=20 > Cheers, > Ludovic. >=20 > Today, 15 minutes, 40 seconds ago, Roland Orre wrote: > > I noticed that our data mining software run very very slow with a > > new data base. I localized the problem to scm_string_hash. > >=20 > > A hash table in this case was loaded with 14166 strings. I have > > a function which creates a reasonable sized hash table, in this > > case the hash table size was 8209. > >=20 > > 13856 of these strings were hashed to the same index=3D 1067. > > 303 strings got index =3D 8061. > > 2 strings got the index =3D 754. > > 8201 entries were empty. > >=20 > > We are running guile 1.6 but I checked the scm_string_hash from a rec= ent > > 1.7 CVS also and the function in hash.c there is identical. > >=20 > > I added a few of the symbols hashing to 1067 below. One can of course > > argue that the symbols in this case should be hashed as numbers. > > Anyway, does anyone have any hint or have a better string hash functi= on? > >=20 > > Best regards > > Roland Orre > >=20 > >=20 > > A few strings hashing to entry 1067 for hash table length 8209: > > "01632001" "01627301" "01626801" "01626601" "01626501" "01626401" > > "01626301" "01626101" "01626001" "01625901" "01625801" "01625701" > > "01625401" "01625301" "01625101" "01625001" "01624801" "01624601" > > "01624501" "01624401" "01624301" "01624101" "01624001" "01623901" > > "01623801" "01623701" "01623601" "01623501" "01623401" "01623201" > > "01623101" "01622901" "01622801" "01622701" "01622601" "01622401" > >=20 > >=20 > >=20 > >=20 > > _______________________________________________ > > Guile-user mailing list > > Guile-user@gnu.org > > http://mail.gnu.org/mailman/listinfo/guile-user _______________________________________________ Guile-user mailing list Guile-user@gnu.org http://mail.gnu.org/mailman/listinfo/guile-user