Re: Resizing hash tables in Guile

From: HJSTEIN@bloomberg.com (Harvey J. Stein)
Cc: Greg Troxel <gdt@ir.bbn.com>
Subject: Re: Resizing hash tables in Guile
Date: 13 Feb 2003 08:55:23 -0500	[thread overview]
Message-ID: <kiw3cmsl1ys.fsf@blinky.bloomberg.com> (raw)
In-Reply-To: Mikael Djurfeldt's message of "Thu, 13 Feb 2003 10:35:40 +0100"

Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

 > Regarding reshuffling time: Yes, rehuffling means that every operation
 > isn't O(1), but it *does* mean that they are O(1) on average.  You can
 > understand this by a trick sometimes used in algorithm run-time
 > analysis called "amortization":
 > 
 > The idea is that for every operation on the hash table you "pay" some
 > extra virtual time, and the sum of this time can then be used to
 > account for the reshuffling time.  In my implementation, the hash
 > table is roughly doubled for each rehash.  Before rehashing occurs,
 > you have inserted N elements.  This has cost you less than cN seconds.
 > Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
 > seconds.  If we now pay off d seconds per operation in advance, and
 > note that the argument above holds equally well for each rehashing
 > point, we realize that each operation costs less than c + d seconds on
 > average.  This means that, on average, operations are O(1).

Inserts are, but lookups aren't necessarily.  Lookups being O(1)
requires uniformity of bucket sizes.  Worst case hash table lookup
time is still O(N).  And good hashing functions are still hard to
write.

People overestimate log(N) and overuse O().  When comparing an O(1)
algorithm to an O(log(N)) algorithm, it really comes down to the
actual functions involved, and actual problem size, not just the
asymptotic behavior.  2^32 is over 4,000,000,000.  With this many
items, log(N) is still just 32, so an O(log(N)) algorithm will still
beat an O(1) algorithm if it's really log_2(N) vs 32.  And you can't
have 2^32 items in your hash table unless this is on a 64bit machine.
And with this many items in your hash table, you'd need 16gb (32gb on
a 64bit machine) just to store the addresses, let alone the keys and
the items themselves, so by this time you're probably hitting
the swap so hard that the whole question is moot.

Also, if a person's relying on O(1) for hash table performance, it might be
not because they need that on average, but because they need an upper
bound on the operation time, in which case automatic resizing would
also violate this, even though it maintains O(1) on average.

Trees also sort the data for you, which hash tables don't give you.

So, ideally, one would have a hash table object with & without
resizing, and various sorts of tree (AVL, red/black, B*, etc) objects.
insert and delete and map would be methods that work on all of the
above, with map on trees returning the data in sorted order.  For that
matter, insert & delete might as well also work on lists...

BTW, the hash table implementation in Tcl seems to be quite good.

-- 
Harvey Stein
Bloomberg LP
hjstein@bloomberg.com

_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user