unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* Resizing hash tables in Guile
       [not found]         ` <xy74r7ckqmy.fsf@nada.kth.se>
@ 2003-02-11 13:59           ` Mikael Djurfeldt
  2003-02-11 17:34             ` Roland Orre
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mikael Djurfeldt @ 2003-02-11 13:59 UTC (permalink / raw)
  Cc: Greg Troxel

I've just committed resizing hash table functionality to the Guile
core hashtable functions (in hashtab.c).

Currently, the only thing which has changed in the API is that the
size argument to the hash table constructors is now optional.  If
omitted, you get a resizing table.

Please comment on this API and also take a look at the implementation
in hashtab.c.

Questions:

Is this a good thing?  Should we keep it?

I made the hash tables thread-safe (locking/unlocking a mutex at hash
table access and rehashing).  Is that good?  An alternativ is to
require of the programmer to make sure the hash tables aren't accessed
in parallel.

The implementation isn't quite finished.  Removal should be rewritten
for the new tables.  Also, weak hash tables need to be handled in
hashtab.c in order to maintain a correct item count.

M


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-11 13:59           ` Resizing hash tables in Guile Mikael Djurfeldt
@ 2003-02-11 17:34             ` Roland Orre
       [not found]               ` <ljy94lhgkb.fsf@burns.dt.e-technik.uni-dortmund.de>
  2003-02-12 16:10             ` Marius Vollmer
  2003-02-12 20:55             ` Rob Browning
  2 siblings, 1 reply; 15+ messages in thread
From: Roland Orre @ 2003-02-11 17:34 UTC (permalink / raw)
  Cc: Greg Troxel

On Tue, 2003-02-11 at 14:59, Mikael Djurfeldt wrote:
> I've just committed resizing hash table functionality to the Guile
> core hashtable functions (in hashtab.c).
> 
> Currently, the only thing which has changed in the API is that the
> size argument to the hash table constructors is now optional.  If
> omitted, you get a resizing table.
> 
> Please comment on this API and also take a look at the implementation
> in hashtab.c.

I've studied the code and consider this a perfect minimalistic solution.
It works without disturbing any old code. I'll test drive the code
tonight.

> Questions:
> 
> Is this a good thing?  Should we keep it?
It's definitely a good thing. In our application we sometimes need just
a few tens of items hashed but often many millions of items without
really knowing beforehand.
It's  definitely a good thing to not need to bother about analyzing
optimal table sizes any longer.

We hope it will stay and we will start using it in our application as
soon as possible.

> I made the hash tables thread-safe (locking/unlocking a mutex at hash
> table access and rehashing).  Is that good?  An alternativ is to
> require of the programmer to make sure the hash tables aren't accessed
> in parallel.

In our case I think we will most likely set up the different processes
to use different hash tables but we have still not parallellized our
applications. Could/should the mutex protection maybe be optional?

	Best regards
	Roland




_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-11 13:59           ` Resizing hash tables in Guile Mikael Djurfeldt
  2003-02-11 17:34             ` Roland Orre
@ 2003-02-12 16:10             ` Marius Vollmer
  2003-02-12 17:53               ` Mikael Djurfeldt
  2003-02-12 20:55             ` Rob Browning
  2 siblings, 1 reply; 15+ messages in thread
From: Marius Vollmer @ 2003-02-12 16:10 UTC (permalink / raw)
  Cc: Greg Troxel

Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

> Is this a good thing?  Should we keep it?

I think resizing hash tables are important.  Thanks for writing them!

> The implementation isn't quite finished.  Removal should be rewritten
> for the new tables.  Also, weak hash tables need to be handled in
> hashtab.c in order to maintain a correct item count.

If you are not going to finish this soonish, please make a note in
workbook/tasks/TODO about what needs to still be done.

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
       [not found]               ` <ljy94lhgkb.fsf@burns.dt.e-technik.uni-dortmund.de>
@ 2003-02-12 17:47                 ` Mikael Djurfeldt
  2003-02-12 20:44                   ` Rob Browning
  0 siblings, 1 reply; 15+ messages in thread
From: Mikael Djurfeldt @ 2003-02-12 17:47 UTC (permalink / raw)
  Cc: guile-devel

Marius Vollmer <marius.vollmer@uni-dortmund.de> writes:

> Roland Orre <orre@nada.kth.se> writes:
>
>> > I made the hash tables thread-safe (locking/unlocking a mutex at hash
>> > table access and rehashing).  Is that good?  An alternativ is to
>> > require of the programmer to make sure the hash tables aren't accessed
>> > in parallel.
>> 
>> In our case I think we will most likely set up the different processes
>> to use different hash tables but we have still not parallellized our
>> applications. Could/should the mutex protection maybe be optional?
>
> Since users must consciously ask for resizable hash tables, we can
> require them to do the locking themselves.  Should we?  I would say
> yes, since people need to be aware of thread issues anyway, and maybe
> they have a better scheme for ensuring thread safe access to a hash
> table.

I agree.  When implementing the current thread safety I've encountered
a cluster of very interesting problems.  I'll write up a small text
about this, put in the workbook and post on the list.  I'd say I can
do this within a few weeks.  For example, I've discovered that the
with-mutex construct we provide in threads.scm is of dubious utility.
It also seems hard to second-guess the "users" needs when trying to
provide high-level thread-safety constructs in general.

I'll remove the thread protection code from hashtab.c.

> Removing the locking from the core code should improve performance,
> right?

Yes, at least marginally.

> However, the hash tables should be somewhat thread safe: they might
> not work as advertised when multiple threads access them, but the
> application might not crash.  That is, adding an element from one
> thread while another thread is doing the same thing might make one of
> the elements disappear, but it must leave the hash table in a valid
> state.
>
> Do the non-resizing hash tables behgave that way, incidentally?

I'd say yes.

> (Ahh, I just love cooperative threading.  These things are so much
> easier and efficient when there is only one running thread with
> defined switch points... :-)

I think the current restriction on signals only causing exceptions at
well defined points takes us a good way towards that.  I doubt that
there are still many points in Guile which need to be modified to get
the thread safety to where we want it.

M


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 16:10             ` Marius Vollmer
@ 2003-02-12 17:53               ` Mikael Djurfeldt
  2003-02-12 20:17                 ` Roland Orre
  0 siblings, 1 reply; 15+ messages in thread
From: Mikael Djurfeldt @ 2003-02-12 17:53 UTC (permalink / raw)
  Cc: Greg Troxel

Marius Vollmer <mvo@zagadka.de> writes:

> Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:
>
>> Is this a good thing?  Should we keep it?
>
> I think resizing hash tables are important.

I've been thinking that maybe we should continue the move and let the
resizing tables entirely replace the fixed size ones.  It seems a
little silly to have to explain to Guile users that there are two
(sorry, eight) different kinds of hash tables...  Also, I think the
opacity of the resizing table objects is an advantage rather than a
disadvantage.  If they are opaque, we can any time modify the
underlying implementation (the well-known data abstraction argument).

What do you say?

>> The implementation isn't quite finished.  Removal should be rewritten
>> for the new tables.  Also, weak hash tables need to be handled in
>> hashtab.c in order to maintain a correct item count.
>
> If you are not going to finish this soonish, please make a note in
> workbook/tasks/TODO about what needs to still be done.

Right.  Removal is fixed now.  Weak tables could be fixed on a train
ride during the weekend. :)

(BTW, at that time it would be good if we have come to a decision with
regard to dropping the fixed-size vector of alist tables.)

M


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 17:53               ` Mikael Djurfeldt
@ 2003-02-12 20:17                 ` Roland Orre
  2003-02-13  9:35                   ` Mikael Djurfeldt
  2003-02-13  9:52                   ` Joris van der Hoeven
  0 siblings, 2 replies; 15+ messages in thread
From: Roland Orre @ 2003-02-12 20:17 UTC (permalink / raw)
  Cc: Greg Troxel

On Wed, 2003-02-12 at 18:53, Mikael Djurfeldt wrote:
> I've been thinking that maybe we should continue the move and let the
> resizing tables entirely replace the fixed size ones.  It seems a
> little silly to have to explain to Guile users that there are two
> (sorry, eight) different kinds of hash tables...  Also, I think the
> opacity of the resizing table objects is an advantage rather than a
> disadvantage.  If they are opaque, we can any time modify the
> underlying implementation (the well-known data abstraction argument).
> 
> What do you say?

I think that this is a a good idea, but I would like the optional size
argument to still be there as an initial guess about the number of
items to expect to avoid resizing and to optimize performance when
we have advance information about how many items to expect.

Of course this additional level of abstraction is nice as it also gives
a future option to use trees if the user e.g gives an optional
comparision operator.

By the way, Joris van der Hoeven mentioned that this type of resizing
hash tables are faster than trees, which have a time complexity
of O(log n)+reshuffling time if they are balanced. Do you have any
numbers showing how much faster and if possible if there are any
conditions? The reshuffling time will grow at least O(n) when the
size of our linear tables increases if I have understood right.

	Best regards
	Roland Orre



_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 17:47                 ` Mikael Djurfeldt
@ 2003-02-12 20:44                   ` Rob Browning
  0 siblings, 0 replies; 15+ messages in thread
From: Rob Browning @ 2003-02-12 20:44 UTC (permalink / raw)
  Cc: Marius Vollmer

Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

> I agree.  When implementing the current thread safety I've
> encountered a cluster of very interesting problems.  I'll write up a
> small text about this, put in the workbook and post on the list.
> I'd say I can do this within a few weeks.  For example, I've
> discovered that the with-mutex construct we provide in threads.scm
> is of dubious utility.  It also seems hard to second-guess the
> "users" needs when trying to provide high-level thread-safety
> constructs in general.

Sorry I haven't had a chance to comment sooner, but I also feel that
we should probably let the user handle thread-safety.

Way back when I first started programming heavily with threads, I had
the naive impression that everything should "just be thread safe" :>
After a lot of hacking, I would have to say I was definitively wrong.
In my experience, the issues surrounding thread-safety and the
tradeoffs involved are too complex to be easily managed in a blanket
way.  As a trivial example, it can be frustrating if you're trying to
get high performance for a segment of code, to have no way to avoid
hundreds or thousands of mutex grabs when you already *know* that the
code segment in question is protected by some other means.

-- 
Rob Browning
rlb @defaultvalue.org, @linuxdevel.com, and @debian.org
Previously @cs.utexas.edu
GPG starting 2002-11-03 = 14DD 432F AE39 534D B592  F9A0 25C8 D377 8C7E 73A4


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-11 13:59           ` Resizing hash tables in Guile Mikael Djurfeldt
  2003-02-11 17:34             ` Roland Orre
  2003-02-12 16:10             ` Marius Vollmer
@ 2003-02-12 20:55             ` Rob Browning
  2003-02-13 10:43               ` Mikael Djurfeldt
  2 siblings, 1 reply; 15+ messages in thread
From: Rob Browning @ 2003-02-12 20:55 UTC (permalink / raw)
  Cc: Greg Troxel

Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

> I've just committed resizing hash table functionality to the Guile
> core hashtable functions (in hashtab.c).
>
> Currently, the only thing which has changed in the API is that the
> size argument to the hash table constructors is now optional.  If
> omitted, you get a resizing table.

Looks good to me.

If you have time, could you (very) briefly summarize the behavior?
Mostly I mean wrt how the implementation handles resizes (i.e. when
does it decide to resize, how does it pick the new size, etc.).
That'd be nice to have in comments in the file so that you can get an
overview without having to read the code in detail first.

Thanks

-- 
Rob Browning
rlb @defaultvalue.org, @linuxdevel.com, and @debian.org
Previously @cs.utexas.edu
GPG starting 2002-11-03 = 14DD 432F AE39 534D B592  F9A0 25C8 D377 8C7E 73A4


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 20:17                 ` Roland Orre
@ 2003-02-13  9:35                   ` Mikael Djurfeldt
  2003-02-13 13:55                     ` Harvey J. Stein
  2003-02-13  9:52                   ` Joris van der Hoeven
  1 sibling, 1 reply; 15+ messages in thread
From: Mikael Djurfeldt @ 2003-02-13  9:35 UTC (permalink / raw)
  Cc: Greg Troxel

Roland Orre <orre@nada.kth.se> writes:

> On Wed, 2003-02-12 at 18:53, Mikael Djurfeldt wrote:
>> I've been thinking that maybe we should continue the move and let the
>> resizing tables entirely replace the fixed size ones.
[...]

> I think that this is a a good idea, but I would like the optional size
> argument to still be there as an initial guess about the number of
> items to expect
[...]

Agreed.

> Of course this additional level of abstraction is nice as it also gives
> a future option to use trees if the user e.g gives an optional
> comparision operator.
>
> By the way, Joris van der Hoeven mentioned that this type of resizing
> hash tables are faster than trees, which have a time complexity
> of O(log n)+reshuffling time if they are balanced. Do you have any
> numbers showing how much faster and if possible if there are any
> conditions? The reshuffling time will grow at least O(n) when the
> size of our linear tables increases if I have understood right.

I doubt that we'll ever use trees under the hash table API, the reason
being that the user expects O(1) for hash table operations.  Even
though O(log n) grows slowly, it is substantially different for large
enough data sets.

Regarding reshuffling time: Yes, rehuffling means that every operation
isn't O(1), but it *does* mean that they are O(1) on average.  You can
understand this by a trick sometimes used in algorithm run-time
analysis called "amortization":

The idea is that for every operation on the hash table you "pay" some
extra virtual time, and the sum of this time can then be used to
account for the reshuffling time.  In my implementation, the hash
table is roughly doubled for each rehash.  Before rehashing occurs,
you have inserted N elements.  This has cost you less than cN seconds.
Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
seconds.  If we now pay off d seconds per operation in advance, and
note that the argument above holds equally well for each rehashing
point, we realize that each operation costs less than c + d seconds on
average.  This means that, on average, operations are O(1).

Mikael


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 20:17                 ` Roland Orre
  2003-02-13  9:35                   ` Mikael Djurfeldt
@ 2003-02-13  9:52                   ` Joris van der Hoeven
  1 sibling, 0 replies; 15+ messages in thread
From: Joris van der Hoeven @ 2003-02-13  9:52 UTC (permalink / raw)
  Cc: Greg Troxel


> > I've been thinking that maybe we should continue the move and let the
> > resizing tables entirely replace the fixed size ones.  It seems a
> > little silly to have to explain to Guile users that there are two
> > (sorry, eight) different kinds of hash tables...  Also, I think the
> > opacity of the resizing table objects is an advantage rather than a
> > disadvantage.  If they are opaque, we can any time modify the
> > underlying implementation (the well-known data abstraction argument).
> > 
> > What do you say?
> 
> I think that this is a a good idea, but I would like the optional size
> argument to still be there as an initial guess about the number of
> items to expect to avoid resizing and to optimize performance when
> we have advance information about how many items to expect.

In fact, it is even better to specify the resizing behaviour
using optional arguments. For instance:

	up-ratio   : size up when size/slots >= up-ratio
	up-factor  : new nr slots := old nr slots * up-factor
	down-ratio : size up when size/slots < down-ratio
	down-factor: new nr slots := old nr slots / down-factor

Using a small up-ratio will improve the constant factor in the lookup speed,
but require the usage of a larger number of slots.

> Of course this additional level of abstraction is nice as it also gives
> a future option to use trees if the user e.g gives an optional
> comparision operator.

Yes.

> By the way, Joris van der Hoeven mentioned that this type of resizing
> hash tables are faster than trees, which have a time complexity
> of O(log n)+reshuffling time if they are balanced. Do you have any
> numbers showing how much faster and if possible if there are any
> conditions? The reshuffling time will grow at least O(n) when the
> size of our linear tables increases if I have understood right.

Well: lookup time in a balanced search tree is O(log n),
while lookup time in a table is O(1)...



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 20:55             ` Rob Browning
@ 2003-02-13 10:43               ` Mikael Djurfeldt
  0 siblings, 0 replies; 15+ messages in thread
From: Mikael Djurfeldt @ 2003-02-13 10:43 UTC (permalink / raw)
  Cc: Greg Troxel

Rob Browning <rlb@defaultvalue.org> writes:

> If you have time, could you (very) briefly summarize the behavior?
> Mostly I mean wrt how the implementation handles resizes (i.e. when
> does it decide to resize, how does it pick the new size, etc.).

Good idea.

Done.

M


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13  9:35                   ` Mikael Djurfeldt
@ 2003-02-13 13:55                     ` Harvey J. Stein
  2003-02-13 14:24                       ` Joris van der Hoeven
  0 siblings, 1 reply; 15+ messages in thread
From: Harvey J. Stein @ 2003-02-13 13:55 UTC (permalink / raw)
  Cc: Greg Troxel


Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

 > Regarding reshuffling time: Yes, rehuffling means that every operation
 > isn't O(1), but it *does* mean that they are O(1) on average.  You can
 > understand this by a trick sometimes used in algorithm run-time
 > analysis called "amortization":
 > 
 > The idea is that for every operation on the hash table you "pay" some
 > extra virtual time, and the sum of this time can then be used to
 > account for the reshuffling time.  In my implementation, the hash
 > table is roughly doubled for each rehash.  Before rehashing occurs,
 > you have inserted N elements.  This has cost you less than cN seconds.
 > Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
 > seconds.  If we now pay off d seconds per operation in advance, and
 > note that the argument above holds equally well for each rehashing
 > point, we realize that each operation costs less than c + d seconds on
 > average.  This means that, on average, operations are O(1).

Inserts are, but lookups aren't necessarily.  Lookups being O(1)
requires uniformity of bucket sizes.  Worst case hash table lookup
time is still O(N).  And good hashing functions are still hard to
write.

People overestimate log(N) and overuse O().  When comparing an O(1)
algorithm to an O(log(N)) algorithm, it really comes down to the
actual functions involved, and actual problem size, not just the
asymptotic behavior.  2^32 is over 4,000,000,000.  With this many
items, log(N) is still just 32, so an O(log(N)) algorithm will still
beat an O(1) algorithm if it's really log_2(N) vs 32.  And you can't
have 2^32 items in your hash table unless this is on a 64bit machine.
And with this many items in your hash table, you'd need 16gb (32gb on
a 64bit machine) just to store the addresses, let alone the keys and
the items themselves, so by this time you're probably hitting
the swap so hard that the whole question is moot.

Also, if a person's relying on O(1) for hash table performance, it might be
not because they need that on average, but because they need an upper
bound on the operation time, in which case automatic resizing would
also violate this, even though it maintains O(1) on average.

Trees also sort the data for you, which hash tables don't give you.

So, ideally, one would have a hash table object with & without
resizing, and various sorts of tree (AVL, red/black, B*, etc) objects.
insert and delete and map would be methods that work on all of the
above, with map on trees returning the data in sorted order.  For that
matter, insert & delete might as well also work on lists...

BTW, the hash table implementation in Tcl seems to be quite good.

-- 
Harvey Stein
Bloomberg LP
hjstein@bloomberg.com


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13 13:55                     ` Harvey J. Stein
@ 2003-02-13 14:24                       ` Joris van der Hoeven
  2003-02-13 18:30                         ` Harvey J. Stein
  0 siblings, 1 reply; 15+ messages in thread
From: Joris van der Hoeven @ 2003-02-13 14:24 UTC (permalink / raw)
  Cc: Greg Troxel


>  > Regarding reshuffling time: Yes, rehuffling means that every operation
>  > isn't O(1), but it *does* mean that they are O(1) on average.  You can
>  > understand this by a trick sometimes used in algorithm run-time
>  > analysis called "amortization":
>  >
>  > The idea is that for every operation on the hash table you "pay" some
>  > extra virtual time, and the sum of this time can then be used to
>  > account for the reshuffling time.  In my implementation, the hash
>  > table is roughly doubled for each rehash.  Before rehashing occurs,
>  > you have inserted N elements.  This has cost you less than cN seconds.
>  > Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
>  > seconds.  If we now pay off d seconds per operation in advance, and
>  > note that the argument above holds equally well for each rehashing
>  > point, we realize that each operation costs less than c + d seconds on
>  > average.  This means that, on average, operations are O(1).
> 
> Inserts are, but lookups aren't necessarily.

Both aren't necessarily, because inserting requires looking up too.

> Lookups being O(1) requires uniformity of bucket sizes.
> Worst case hash table lookup time is still O(N).

You can also store a binary search tree in each of the buckets,
if you think that your hash function is bad.

> And good hashing functions are still hard to write.

I do not really agree. A good hash algorithm for lists (or strings),
which I use in TeXmacs, is to rotate the 32 bit integer hash values of
each of the members by a prime number like 3, 5, 7 or 11 and progressively
take the exclusive or. This seems to lead to bucket sizes as
predicted by probability theory, even for hash tables of size 2^p.

> People overestimate log(N) and overuse O().  When comparing an O(1)
> algorithm to an O(log(N)) algorithm, it really comes down to the
> actual functions involved, and actual problem size, not just the
> asymptotic behavior.  2^32 is over 4,000,000,000.

A factor 10 is still a factor 10 though.
(2^10 ~~ 1000).

> With this many
> items, log(N) is still just 32, so an O(log(N)) algorithm will still
> beat an O(1) algorithm if it's really log_2(N) vs 32.

Yes, but the O(1) is really *table lookup* multiplied by a small
constant here, so this is *fast*. You may adjust the small constant
by choosing an appropriate threshold for "size/nr buckets".

> Also, if a person's relying on O(1) for hash table performance, it might be
> not because they need that on average, but because they need an upper
> bound on the operation time, in which case automatic resizing would
> also violate this, even though it maintains O(1) on average.

This is a more serious drawback of standard hash tables, but,
as I said before, we already have garbage collection in Guile anyway...

> Trees also sort the data for you, which hash tables don't give you.

But you need a compairison operation for that,
which may be even less natural than a hash function.

> So, ideally, one would have a hash table object with & without
> resizing, and various sorts of tree (AVL, red/black, B*, etc) objects.
> insert and delete and map would be methods that work on all of the
> above, with map on trees returning the data in sorted order.  For that
> matter, insert & delete might as well also work on lists...

Agreed: ideally, we have everything :^)



_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13 14:24                       ` Joris van der Hoeven
@ 2003-02-13 18:30                         ` Harvey J. Stein
  2003-02-13 20:02                           ` Paul Jarc
  0 siblings, 1 reply; 15+ messages in thread
From: Harvey J. Stein @ 2003-02-13 18:30 UTC (permalink / raw)
  Cc: Greg Troxel

Joris van der Hoeven <TeXmacs@math.u-psud.fr> writes:

 > >  > average.  This means that, on average, operations are O(1).
 > > 
 > > Inserts are, but lookups aren't necessarily.
 > 
 > Both aren't necessarily, because inserting requires looking up too.

Yes, if you want to return an error for inserting 2 items with the
same key.  No, if you allow multiple items with the same key.

 > > And good hashing functions are still hard to write.
 > 
 > I do not really agree. A good hash algorithm for lists (or strings),
 > which I use in TeXmacs, is ...

I guess "hard" isn't quite the word for it.  What I meant was that the
obvious ones don't work well.  You basically need to look up a good
one to get good behavior.  Tcl's fcn works pretty well too.  I forget
what it does.

 > > With this many items, log(N) is still just 32, so an O(log(N))
 > > algorithm will still beat an O(1) algorithm if it's really
 > > log_2(N) vs 32.
 > 
 > Yes, but the O(1) is really *table lookup* multiplied by a small
 > constant here, so this is *fast*. You may adjust the small constant
 > by choosing an appropriate threshold for "size/nr buckets".

Not just table lookup.  Also hash calculation from the key, which can
be more time consuming than a key comparison, and key comparisons for
everything in the bucket.

 > > Trees also sort the data for you, which hash tables don't give you.
 > 
 > But you need a compairison operation for that,
 > which may be even less natural than a hash function.

Yes.  For hash tables you just need a key equality test.  For trees
you need to be able to order the keys.

 > > So, ideally, one would have a hash table object with & without
 > > resizing, and various sorts of tree (AVL, red/black, B*, etc) objects.
 > > insert and delete and map would be methods that work on all of the
 > > above, with map on trees returning the data in sorted order.  For that
 > > matter, insert & delete might as well also work on lists...
 > 
 > Agreed: ideally, we have everything :^)

I think we're on the same page here...

STk had a nice scheme interface for hash tables.  The code uses the
Tcl hash tables, they autoresize and if I recall correctly, many of
the important parameters can be modified.  You might want to model the
guile stuff on this, or utilize this work.  I advocated this when the
discussion came up years ago on the guile mailing list.

BTW, there was a long discussion of hash tables on the guile list some
number of years ago.  At the time I didn't like liked the guile (aka
scm) hash tables.  They were a very minimal implementation.  The user
had to do too much to use them.  I don't know what it's like these
days, but it'd certainly be nice to have a good implementation
available which includes autosizing.

-- 
Harvey Stein
Bloomberg LP
hjstein@bloomberg.com


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13 18:30                         ` Harvey J. Stein
@ 2003-02-13 20:02                           ` Paul Jarc
  0 siblings, 0 replies; 15+ messages in thread
From: Paul Jarc @ 2003-02-13 20:02 UTC (permalink / raw)


HJSTEIN@bloomberg.net (Harvey J. Stein) wrote:
> Joris van der Hoeven <TeXmacs@math.u-psud.fr> writes:
>> inserting requires looking up too.
>
> Yes, if you want to return an error for inserting 2 items with the
> same key.

Or if you want to replace existing entries, as Guile's current hash
tables do.

>> But you need a compairison operation for that,
>> which may be even less natural than a hash function.
>
> Yes.  For hash tables you just need a key equality test.  For trees
> you need to be able to order the keys.

It's easy enough to support this: (< (hash key1 0) (hash key2 0))
But this is probably only useful in the general case.  It should be
possible to substitute a less expensive ordering function when one is
available.


paul


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2003-02-13 20:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.GSO.3.96.1030208160743.22945E-100000@anh>
     [not found] ` <87lm0o7951.fsf@alice.rotty.yi.org>
     [not found]   ` <rmiu1fcnrhj.fsf@fnord.ir.bbn.com>
     [not found]     ` <1044889242.1033.310.camel@localhost>
     [not found]       ` <xy77kc8krhr.fsf@nada.kth.se>
     [not found]         ` <xy74r7ckqmy.fsf@nada.kth.se>
2003-02-11 13:59           ` Resizing hash tables in Guile Mikael Djurfeldt
2003-02-11 17:34             ` Roland Orre
     [not found]               ` <ljy94lhgkb.fsf@burns.dt.e-technik.uni-dortmund.de>
2003-02-12 17:47                 ` Mikael Djurfeldt
2003-02-12 20:44                   ` Rob Browning
2003-02-12 16:10             ` Marius Vollmer
2003-02-12 17:53               ` Mikael Djurfeldt
2003-02-12 20:17                 ` Roland Orre
2003-02-13  9:35                   ` Mikael Djurfeldt
2003-02-13 13:55                     ` Harvey J. Stein
2003-02-13 14:24                       ` Joris van der Hoeven
2003-02-13 18:30                         ` Harvey J. Stein
2003-02-13 20:02                           ` Paul Jarc
2003-02-13  9:52                   ` Joris van der Hoeven
2003-02-12 20:55             ` Rob Browning
2003-02-13 10:43               ` Mikael Djurfeldt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).