From mboxrd@z Thu Jan 1 00:00:00 1970 Path: main.gmane.org!not-for-mail From: Joris van der Hoeven Newsgroups: gmane.lisp.guile.devel,gmane.lisp.guile.user Subject: Re: Resizing hash tables in Guile Date: Thu, 13 Feb 2003 15:24:32 +0100 (MET) Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Message-ID: References: NNTP-Posting-Host: main.gmane.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Trace: main.gmane.org 1045147077 30875 80.91.224.249 (13 Feb 2003 14:37:57 GMT) X-Complaints-To: usenet@main.gmane.org NNTP-Posting-Date: Thu, 13 Feb 2003 14:37:57 +0000 (UTC) Cc: Greg Troxel Return-path: Original-Received: from monty-python.gnu.org ([199.232.76.173]) by main.gmane.org with esmtp (Exim 3.35 #1 (Debian)) id 18jKI9-00079h-00 for ; Thu, 13 Feb 2003 15:24:45 +0100 Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org) by monty-python.gnu.org with esmtp (Exim 4.10.13) id 18jKJc-0002OF-08 for guile-devel@m.gmane.org; Thu, 13 Feb 2003 09:26:16 -0500 Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.10.13) id 18jKIB-0001zS-00 for guile-devel@gnu.org; Thu, 13 Feb 2003 09:24:47 -0500 Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.10.13) id 18jKI5-0001qv-00 for guile-devel@gnu.org; Thu, 13 Feb 2003 09:24:41 -0500 Original-Received: from matups.math.u-psud.fr ([129.175.50.4]) by monty-python.gnu.org with esmtp (Exim 4.10.13) id 18jKI2-0001dp-00; Thu, 13 Feb 2003 09:24:38 -0500 Original-Received: from anh.math.u-psud.fr (anh.math.u-psud.fr [129.175.50.156]) h1DEOXZ06795 ; Thu, 13 Feb 2003 15:24:33 +0100 (MET) Original-Received: from anh (anh [129.175.50.156]) by anh.math.u-psud.fr (Postfix) with SMTP id 6AB1FB2C8; Thu, 13 Feb 2003 15:24:32 +0100 (MET) X-Sender: texmacs@anh Original-To: "Harvey J. Stein" In-Reply-To: Original-cc: djurfeldt@nada.kth.se Original-cc: guile-user@gnu.org Original-cc: Roland Orre Original-cc: Joris van der Hoeven Original-cc: guile-devel@gnu.org Original-cc: Marius Vollmer X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1b5 Precedence: list List-Id: Developers list for Guile, the GNU extensibility library List-Help: List-Post: List-Subscribe: , List-Archive: List-Unsubscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: main.gmane.org gmane.lisp.guile.devel:1930 gmane.lisp.guile.user:1649 X-Report-Spam: http://spam.gmane.org/gmane.lisp.guile.devel:1930 > > Regarding reshuffling time: Yes, rehuffling means that every operation > > isn't O(1), but it *does* mean that they are O(1) on average. You can > > understand this by a trick sometimes used in algorithm run-time > > analysis called "amortization": > > > > The idea is that for every operation on the hash table you "pay" some > > extra virtual time, and the sum of this time can then be used to > > account for the reshuffling time. In my implementation, the hash > > table is roughly doubled for each rehash. Before rehashing occurs, > > you have inserted N elements. This has cost you less than cN seconds. > > Rehashing is O(2N) = O(N), so we can say it will cost us less than dN > > seconds. If we now pay off d seconds per operation in advance, and > > note that the argument above holds equally well for each rehashing > > point, we realize that each operation costs less than c + d seconds on > > average. This means that, on average, operations are O(1). > > Inserts are, but lookups aren't necessarily. Both aren't necessarily, because inserting requires looking up too. > Lookups being O(1) requires uniformity of bucket sizes. > Worst case hash table lookup time is still O(N). You can also store a binary search tree in each of the buckets, if you think that your hash function is bad. > And good hashing functions are still hard to write. I do not really agree. A good hash algorithm for lists (or strings), which I use in TeXmacs, is to rotate the 32 bit integer hash values of each of the members by a prime number like 3, 5, 7 or 11 and progressively take the exclusive or. This seems to lead to bucket sizes as predicted by probability theory, even for hash tables of size 2^p. > People overestimate log(N) and overuse O(). When comparing an O(1) > algorithm to an O(log(N)) algorithm, it really comes down to the > actual functions involved, and actual problem size, not just the > asymptotic behavior. 2^32 is over 4,000,000,000. A factor 10 is still a factor 10 though. (2^10 ~~ 1000). > With this many > items, log(N) is still just 32, so an O(log(N)) algorithm will still > beat an O(1) algorithm if it's really log_2(N) vs 32. Yes, but the O(1) is really *table lookup* multiplied by a small constant here, so this is *fast*. You may adjust the small constant by choosing an appropriate threshold for "size/nr buckets". > Also, if a person's relying on O(1) for hash table performance, it might be > not because they need that on average, but because they need an upper > bound on the operation time, in which case automatic resizing would > also violate this, even though it maintains O(1) on average. This is a more serious drawback of standard hash tables, but, as I said before, we already have garbage collection in Guile anyway... > Trees also sort the data for you, which hash tables don't give you. But you need a compairison operation for that, which may be even less natural than a hash function. > So, ideally, one would have a hash table object with & without > resizing, and various sorts of tree (AVL, red/black, B*, etc) objects. > insert and delete and map would be methods that work on all of the > above, with map on trees returning the data in sorted order. For that > matter, insert & delete might as well also work on lists... Agreed: ideally, we have everything :^) _______________________________________________ Guile-devel mailing list Guile-devel@gnu.org http://mail.gnu.org/mailman/listinfo/guile-devel