From: HJSTEIN@bloomberg.com (Harvey J. Stein)
Newsgroups: gmane.lisp.guile.user, gmane.lisp.guile.devel
Subject: Re: Resizing hash tables in Guile
Date: 13 Feb 2003 08:55:23 -0500
Mikael Djurfeldt writes:

> Regarding reshuffling time: Yes, reshuffling means that every operation
> isn't O(1), but it *does* mean that they are O(1) on average.  You can
> understand this by a trick sometimes used in algorithm run-time
> analysis called "amortization":
>
> The idea is that for every operation on the hash table you "pay" some
> extra virtual time, and the sum of this time can then be used to
> account for the reshuffling time.  In my implementation, the hash
> table is roughly doubled for each rehash.  Before rehashing occurs,
> you have inserted N elements.  This has cost you less than cN seconds.
> Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
> seconds.  If we now pay off d seconds per operation in advance, and
> note that the argument above holds equally well for each rehashing
> point, we realize that each operation costs less than c + d seconds on
> average.  This means that, on average, operations are O(1).

Inserts are, but lookups aren't necessarily.  O(1) lookups require the
keys to be spread uniformly across the buckets; in the worst case, when
every key lands in the same bucket, hash table lookup is still O(N).
And good hashing functions are still hard to write.

People overestimate log(N) and overuse O().
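Mikael's amortization argument can be checked numerically.  The sketch below (illustrative Python, not Guile's actual implementation; the `ToyHash` class and its cost accounting are invented for this demonstration) counts the total "work" done by a table that doubles whenever its load factor reaches 1, and shows that the average work per insert stays bounded by a small constant no matter how many inserts you do:

```python
# Toy model of a hash table that doubles its bucket vector when full.
# We count abstract work units: 1 per insert, plus N per rehash (every
# stored item gets moved once).  The claim is that total work is O(n)
# over n inserts, i.e. O(1) per insert amortized.

class ToyHash:
    def __init__(self):
        self.size = 8      # initial number of buckets
        self.count = 0     # number of stored items
        self.work = 0      # work units spent so far, including rehashes

    def insert(self):
        self.work += 1                # cost of the insert itself
        self.count += 1
        if self.count > self.size:    # load factor exceeded: rehash
            self.work += self.count   # rehashing touches every item once
            self.size *= 2            # roughly double, as in the thread

def amortized_cost(n):
    """Average work units per insert over n consecutive inserts."""
    t = ToyHash()
    for _ in range(n):
        t.insert()
    return t.work / n

# The rehash costs form a geometric series (9 + 17 + 33 + ...), so the
# average stays below 3 work units per insert for any n -- this is the
# "pay d seconds per operation in advance" argument made concrete.
```

Note that this bounds the *average* only: the single insert that triggers a rehash still pays O(N) by itself, which is exactly why amortized O(1) is not the same as a worst-case upper bound per operation.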
When comparing an O(1) algorithm to an O(log(N)) algorithm, it really
comes down to the actual constant factors and the actual problem size,
not just the asymptotic behavior.  2^32 is over 4,000,000,000, and even
with that many items log_2(N) is still just 32, so an O(log(N))
algorithm will still beat an O(1) algorithm whenever the constant hidden
in the O(1) exceeds 32.  And you can't have 2^32 items in your hash
table unless this is on a 64-bit machine.  With that many items you'd
need 16GB (32GB on a 64-bit machine) just to store the addresses, let
alone the keys and the items themselves, so by that point you're
probably hitting swap so hard that the whole question is moot.

Also, if a person is relying on O(1) hash table performance, it might
not be because they need that on average, but because they need an upper
bound on the time of each operation, in which case automatic resizing
would violate that requirement even though it maintains O(1) on average.

Trees also keep the data sorted for you, which hash tables don't.

So, ideally, one would have hash table objects with and without
resizing, and various sorts of tree objects (AVL, red/black, B*, etc.).
insert, delete, and map would be methods that work on all of the above,
with map on trees returning the data in sorted order.  For that matter,
insert & delete might as well also work on lists...

BTW, the hash table implementation in Tcl seems to be quite good.

-- 
Harvey Stein
Bloomberg LP
hjstein@bloomberg.com

_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user