unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed
* Efficiency and flexibility of hash-tables
@ 2003-02-08 11:00 Joris van der Hoeven
  2003-02-08 13:57 ` Roland Orre
  0 siblings, 1 reply; 32+ messages in thread
From: Joris van der Hoeven @ 2003-02-08 11:00 UTC (permalink / raw)



Hi,

When declaring a hash table using

	(define H (make-hash-table 100))

does this mean that the number of slots will *always* remain 100?

I am frequently dealing with hash tables where I do not
have a reasonable estimation of number of entires in advance.
In TeXmacs, I therefore implemented a hash table type which
doubles the number of slots each time that the number of entries
becomes larger than a constant times the number of slots
(and divides by two the number of slots when the number of
entries becomes smaller than a constant times the number of slots).
Has a similar system been implemented in (an extension of) guile?

Thanks for your help, Joris


-----------------------------------------------------------
Joris van der Hoeven <vdhoeven@texmacs.org>
http://www.texmacs.org: GNU TeXmacs scientific text editor
http://www.math.u-psud.fr/~vdhoeven: personal homepage
-----------------------------------------------------------



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 11:00 Efficiency and flexibility of hash-tables Joris van der Hoeven
@ 2003-02-08 13:57 ` Roland Orre
  2003-02-08 14:14   ` Joris van der Hoeven
  0 siblings, 1 reply; 32+ messages in thread
From: Roland Orre @ 2003-02-08 13:57 UTC (permalink / raw)
  Cc: guile-user

On Sat, 2003-02-08 at 12:00, Joris van der Hoeven wrote:
> Hi,
> 
> When declaring a hash table using
> 
> 	(define H (make-hash-table 100))
No, the hash table is a vector of entries to lists where the actual
information is stored. A hash table in guile can therefore contain
any number of items. The number of entries is merely a choice of what
performance you need. If you declare too few entries you will get
a lot of linear search through the lists from each entry.
I myself use to estimate it so that the lists will rarely be deeper
than two or three to get a reasonable performance.

The performance is also reflected upon the hash function versus the
vector length. Usually it is advisable to use a prime number to avoid
systematic hashing to the same entries. Sometime it happened I missed
this and sloppily declared the hash table length to e.g. 1000000 if
needing about 3000000 items. The run took several hours instead of
the expected half an hour, which I got when changing the length to
1000003. If you have access to some mathematical package like maple
there is often a function nextprime which can be helpful.

Usually the built-in hash functions works fine but you may also
consider making a special hash functions for special needs if
the built-in function doesn't spread good enough.

	Best regards
	Roland Orre

> 
> does this mean that the number of slots will *always* remain 100?
> 
> I am frequently dealing with hash tables where I do not
> have a reasonable estimation of number of entires in advance.
> In TeXmacs, I therefore implemented a hash table type which
> doubles the number of slots each time that the number of entries
> becomes larger than a constant times the number of slots
> (and divides by two the number of slots when the number of
> entries becomes smaller than a constant times the number of slots).
> Has a similar system been implemented in (an extension of) guile?
> 
> Thanks for your help, Joris
> 
> 
> -----------------------------------------------------------
> Joris van der Hoeven <vdhoeven@texmacs.org>
> http://www.texmacs.org: GNU TeXmacs scientific text editor
> http://www.math.u-psud.fr/~vdhoeven: personal homepage
> -----------------------------------------------------------
> 
> 
> 
> _______________________________________________
> Guile-user mailing list
> Guile-user@gnu.org
> http://mail.gnu.org/mailman/listinfo/guile-user
-- 



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 13:57 ` Roland Orre
@ 2003-02-08 14:14   ` Joris van der Hoeven
  2003-02-08 14:55     ` Roland Orre
  0 siblings, 1 reply; 32+ messages in thread
From: Joris van der Hoeven @ 2003-02-08 14:14 UTC (permalink / raw)
  Cc: Joris van der Hoeven


Hi,

Thanks for your reply. Unfortunately, I think that
you did not fully understand my question.

> > When declaring a hash table using
> > 
> > 	(define H (make-hash-table 100))
> > 
> > does this mean that the number of slots will *always* remain 100?
>
> No, the hash table is a vector of entries to lists where the actual
> information is stored. A hash table in guile can therefore contain
> any number of items. The number of entries is merely a choice of what
> performance you need. If you declare too few entries you will get
> a lot of linear search through the lists from each entry.
> I myself use to estimate it so that the lists will rarely be deeper
> than two or three to get a reasonable performance.

That is why I distinguished the word 'slots' from the word 'entries'.
The number of slots is the length of the vector you mention.
So the ratio 'nr entries / nr slots' should be small in order to
get a good performance.

My question was: is the number of slots automatically adapted
as a function of the number or entries, or is it not?
If you cannot have a good estimate for the number of entries,
then this auto-adaptation may be important.
In fact, I think that a good low level implementation of
general purpose hash tables should have this feature.

> The performance is also reflected upon the hash function versus the
> vector length. Usually it is advisable to use a prime number to avoid
> systematic hashing to the same entries. Sometime it happened I missed
> this and sloppily declared the hash table length to e.g. 1000000 if
> needing about 3000000 items. The run took several hours instead of
> the expected half an hour, which I got when changing the length to
> 1000003. If you have access to some mathematical package like maple
> there is often a function nextprime which can be helpful.
> 
> Usually the built-in hash functions works fine but you may also
> consider making a special hash functions for special needs if
> the built-in function doesn't spread good enough.
> 
> > I am frequently dealing with hash tables where I do not
> > have a reasonable estimation of number of entires in advance.
> > In TeXmacs, I therefore implemented a hash table type which
> > doubles the number of slots each time that the number of entries
> > becomes larger than a constant times the number of slots
> > (and divides by two the number of slots when the number of
> > entries becomes smaller than a constant times the number of slots).
> > Has a similar system been implemented in (an extension of) guile?
> > 
> > Thanks for your help, Joris
> > 
> > 
> > -----------------------------------------------------------
> > Joris van der Hoeven <vdhoeven@texmacs.org>
> > http://www.texmacs.org: GNU TeXmacs scientific text editor
> > http://www.math.u-psud.fr/~vdhoeven: personal homepage
> > -----------------------------------------------------------
> > 
> > 
> > 
> > _______________________________________________
> > Guile-user mailing list
> > Guile-user@gnu.org
> > http://mail.gnu.org/mailman/listinfo/guile-user
> -- 
> 



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 14:14   ` Joris van der Hoeven
@ 2003-02-08 14:55     ` Roland Orre
  2003-02-08 15:14       ` Joris van der Hoeven
  2003-02-12 20:47       ` Efficiency and flexibility of hash-tables Paul Jarc
  0 siblings, 2 replies; 32+ messages in thread
From: Roland Orre @ 2003-02-08 14:55 UTC (permalink / raw)
  Cc: guile-user

On Sat, 2003-02-08 at 15:14, Joris van der Hoeven wrote:
> Hi,
> 
> Thanks for your reply. Unfortunately, I think that
> you did not fully understand my question.

Yes, I misunderstood it.

> My question was: is the number of slots automatically adapted
> as a function of the number or entries, or is it not?
> If you cannot have a good estimate for the number of entries,
> then this auto-adaptation may be important.
> In fact, I think that a good low level implementation of
> general purpose hash tables should have this feature.

I agree. I once developed (in Ada) for a debugging system, a hash
table where the hash entries were balanced trees instead of lists.
This could maybe be a good approach to reach good performance as
balanced trees are quite good from this point of view. (like the
new Reiser file system in Linux is using)

Your idea, about adaptive hash table length, could easily be made into
guile but from my opinion not as a general solution, as there will be
a lot of reshuffling of every item each time the vector length changes.

I guess somewhere out there, a balanced tree solution may exist for guile,
otherwise I could try to find my old ada-source and give it a try. The
disadvantage with the balanced tree solution though is that it's not
enough with a hash function (and equal?), it also needs a comparision
(<) function.

	Best regards
	Roland




_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 14:55     ` Roland Orre
@ 2003-02-08 15:14       ` Joris van der Hoeven
  2003-02-08 15:31         ` Mikael Djurfeldt
  2003-02-08 15:44         ` Roland Orre
  2003-02-12 20:47       ` Efficiency and flexibility of hash-tables Paul Jarc
  1 sibling, 2 replies; 32+ messages in thread
From: Joris van der Hoeven @ 2003-02-08 15:14 UTC (permalink / raw)
  Cc: Joris van der Hoeven


> > My question was: is the number of slots automatically adapted
> > as a function of the number or entries, or is it not?
> > If you cannot have a good estimate for the number of entries,
> > then this auto-adaptation may be important.
> > In fact, I think that a good low level implementation of
> > general purpose hash tables should have this feature.
> 
> I agree. I once developed (in Ada) for a debugging system, a hash
> table where the hash entries were balanced trees instead of lists.
> This could maybe be a good approach to reach good performance as
> balanced trees are quite good from this point of view. (like the
> new Reiser file system in Linux is using)
> 
> Your idea, about adaptive hash table length, could easily be made into
> guile but from my opinion not as a general solution, as there will be
> a lot of reshuffling of every item each time the vector length changes.

That is not really true: the reshuffling only takes place
when the number of entries changes in an important way.
Also, shuffling is linear in time. So if it only occurs
after adding or removing a proportional number of entries,
then the overall running time remains linear.
This is better than using a O(log n) overhead for all operations.
However, the price to pay is rather that the shuffling can occur
at arbitrary moments. But this is already the case for garbage
collection anyway. In fact, I implemented this kind of
hash tables in TeXmacs.

> I guess somewhere out there, a balanced tree solution may exist for guile,
> otherwise I could try to find my old ada-source and give it a try. The
> disadvantage with the balanced tree solution though is that it's not
> enough with a hash function (and equal?), it also needs a comparision
> (<) function.

So does there exist an adaptive-number-of-slots solution in guile?
If I understand you well, this is certainly not the case for
the standard implementation... :^(((



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 15:14       ` Joris van der Hoeven
@ 2003-02-08 15:31         ` Mikael Djurfeldt
  2003-02-11 11:14           ` Joris van der Hoeven
  2003-02-08 15:44         ` Roland Orre
  1 sibling, 1 reply; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-08 15:31 UTC (permalink / raw)
  Cc: Roland Orre

Joris van der Hoeven <TeXmacs@math.u-psud.fr> writes:

> So does there exist an adaptive-number-of-slots solution in guile?

No.  We'd be happy to include adaptation in the standard Guile
hash-tables, though.  Would it be possible for you to contribute a
patch?

Best regards,
Mikael Djurfeldt



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 15:14       ` Joris van der Hoeven
  2003-02-08 15:31         ` Mikael Djurfeldt
@ 2003-02-08 15:44         ` Roland Orre
  2003-02-10  9:55           ` Andreas Rottmann
  1 sibling, 1 reply; 32+ messages in thread
From: Roland Orre @ 2003-02-08 15:44 UTC (permalink / raw)
  Cc: guile-user

On Sat, 2003-02-08 at 16:14, Joris van der Hoeven wrote:
> So does there exist an adaptive-number-of-slots solution in guile?
> If I understand you well, this is certainly not the case for
> the standard implementation... :^(((

Hmm :) When you asked, I took a look at my old code and, yes I almost
have such a solution which I made a few years ago (which actually seems
to be used quite often in my "standard" code). It uses a predefined
list of prime numbers and calculates a reasonable hash table size
from the inital assumption of number of items, but at a closer look
it seems that I didn't implement the resizing/reshuffling part, and
it's also quite hard tied to my application at the moment, but I'll
take a look at it and try to extract something useful.

But, anyone, who may have sometime similar. Don't wait for me!

(even though I need all kind of excuses to distract me from writing
 on what I should do at the moment :)

	Best regards
	Roland




_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 15:44         ` Roland Orre
@ 2003-02-10  9:55           ` Andreas Rottmann
  2003-02-10 14:24             ` Greg Troxel
  0 siblings, 1 reply; 32+ messages in thread
From: Andreas Rottmann @ 2003-02-10  9:55 UTC (permalink / raw)
  Cc: Joris van der Hoeven

Roland Orre <orre@nada.kth.se> writes:

> On Sat, 2003-02-08 at 16:14, Joris van der Hoeven wrote:
> > So does there exist an adaptive-number-of-slots solution in guile?
> > If I understand you well, this is certainly not the case for
> > the standard implementation... :^(((
> 
> Hmm :) When you asked, I took a look at my old code and, yes I almost
> have such a solution which I made a few years ago (which actually seems
> to be used quite often in my "standard" code). It uses a predefined
> list of prime numbers and calculates a reasonable hash table size
> from the inital assumption of number of items, but at a closer look
> it seems that I didn't implement the resizing/reshuffling part, and
> it's also quite hard tied to my application at the moment, but I'll
> take a look at it and try to extract something useful.
> 
> But, anyone, who may have sometime similar. Don't wait for me!
> 
Maybe one could re-use the GLib code, they have resizing hash tables.

Regards, Andy
-- 
Andreas Rottmann         | Dru@ICQ        | 118634484@ICQ | a.rottmann@gmx.at
http://www.8ung.at/rotty | GnuPG Key: http://www.8ung.at/rotty/gpg.asc
Fingerprint              | DFB4 4EB4 78A4 5EEE 6219  F228 F92F CFC5 01FD 5B62


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-10  9:55           ` Andreas Rottmann
@ 2003-02-10 14:24             ` Greg Troxel
  2003-02-10 15:00               ` Roland Orre
  0 siblings, 1 reply; 32+ messages in thread
From: Greg Troxel @ 2003-02-10 14:24 UTC (permalink / raw)
  Cc: Roland Orre

  Maybe one could re-use the GLib code, they have resizing hash tables.

This prompted a few questions:

  Would it be reasonable for guile to depend on glib?
  I wouldn't mind, but I suspect others would view this as a problem.

  If not, would a 'guile-glib' package that has interfaces to the glib
  functions be appropriate.  The new guile-gobject might support this
  already.

  Didn't Jay Glascoe have auto-resizing hash tables implemented before?

Using glib's implementation (in glib, not copying the code) really
seems preferable, either natively in guile or via (possibly enhanced)
guile-gobject.

        Greg Troxel <gdt@ir.bbn.com>


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-10 14:24             ` Greg Troxel
@ 2003-02-10 15:00               ` Roland Orre
  2003-02-10 16:52                 ` Mikael Djurfeldt
  0 siblings, 1 reply; 32+ messages in thread
From: Roland Orre @ 2003-02-10 15:00 UTC (permalink / raw)
  Cc: Joris van der Hoeven

On Mon, 2003-02-10 at 15:24, Greg Troxel wrote:
>   Maybe one could re-use the GLib code, they have resizing hash tables.
> 

Wouldn't glib's implementation cause gc problems? 

Today there are three different types of hash tables: vectors, weak hash
tables and double weak hash tables. The only real difference between 
them is how their content is handled by gc, "tag wise" they are close.

My idea (which I haven't revealed yet) about resizing hash tables may be
considered a little "dirty" but when thinking about it I don't really
see a problem with it.

My idea was to modify the routines 
scm_c_make_hash_table in hashtab.c,
scm_make_weak_key_hash_table and scm_make_weak_value_hash_table in 
weaks.c so they allocate three extra items in the mallocated block
but hides this to the gc by setting the actual vector length to the
ordinary one. These extra items will contain the actual size in
number of items stored in the table and current upper and lower
thresholds for resizing the table.

Then modify the routines:
scm_hash_fn_create_handle_x and scm_hash_fn_remove_x in hashtab.c
to do resizing when needed, using a table of prime numbers for next
and previous size.

The advantage with this solution is that it's minimalistic, i.e. all 
routines working with hash tables will behave exactly the same. Only
the routines changing the number of items in the table will be affected.

The reason why I don't see a problem with this solution is that the
free() routine uses information which is independent from guile's view
about the size of the allocated memory block.

The only disadvantage I see is the risk that someone are not using
the routine "make-hash-table" in their code. If they use "normal"
vectors as hash tables it may cause segmentation fault, but this is
not a serious one as I see it.

Comments?

	Best regards
	Roland Orre


> This prompted a few questions:
> 
>   Would it be reasonable for guile to depend on glib?
>   I wouldn't mind, but I suspect others would view this as a problem.
> 
>   If not, would a 'guile-glib' package that has interfaces to the glib
>   functions be appropriate.  The new guile-gobject might support this
>   already.
> 
>   Didn't Jay Glascoe have auto-resizing hash tables implemented before?
> 
> Using glib's implementation (in glib, not copying the code) really
> seems preferable, either natively in guile or via (possibly enhanced)
> guile-gobject.
> 
>         Greg Troxel <gdt@ir.bbn.com>




_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-10 15:00               ` Roland Orre
@ 2003-02-10 16:52                 ` Mikael Djurfeldt
  2003-02-10 17:09                   ` Roland Orre
  2003-02-10 17:11                   ` Mikael Djurfeldt
  0 siblings, 2 replies; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-10 16:52 UTC (permalink / raw)
  Cc: Greg Troxel

Roland Orre <orre@nada.kth.se> writes:

(To Roland: Ledsen att jag inte hunnit svara tidigare.)

> My idea was to modify the routines 
> scm_c_make_hash_table in hashtab.c,
> scm_make_weak_key_hash_table and scm_make_weak_value_hash_table in 
> weaks.c so they allocate three extra items in the mallocated block
> but hides this to the gc by setting the actual vector length to the
> ordinary one. These extra items will contain the actual size in
> number of items stored in the table and current upper and lower
> thresholds for resizing the table.
>
> Then modify the routines:
> scm_hash_fn_create_handle_x and scm_hash_fn_remove_x in hashtab.c
> to do resizing when needed, using a table of prime numbers for next
> and previous size.
>
> The advantage with this solution is that it's minimalistic, i.e. all 
> routines working with hash tables will behave exactly the same. Only
> the routines changing the number of items in the table will be affected.

[...]

> Comments?

Firstly: People have suggested that we should re-use other
implementations of rehashing (resizable, adaptive, ...) hash-tables.

However, the kind of hash-tables we're using in Guile is basic stuff
described in any computer science textbook.  When I asked for a patch,
I asked for the Guile implemention.

Also, hash-tables are used throughout Guile so we need them in the
core, not as a loadable library.

Now a comment on Roland's suggestion:

If we want to store extra information in the hash-table object, in the
way you suggest, that information will be hidden from the user unless
we provide special selector functions which can extract it.  But this
defeats the very purpose of representing a hash-table as a simple
Scheme type, and we could just as well provide a "real" hash-table
object, isn't that so?

Also, how would a hash-function, when receiving a hash-table
represented as a vector, know that it safely can access memory
locations after it's last index?

I also think this would obfuscate the source.

What about this even more minimalistic solution:

The idea of having an upper threshold comes from the idea of
maintaining the load factor of the hash table in a reasonable
interval.  But the purpose of the load factor is to prevent to large
clusters forming in the same hash bucket.  What about simply counting
the pairs in a bucket during insertion and rehashing when we reach a
threshold?  Then no extra data needs to be stored in the table.  (We
would need to globally serialize rehashing with an external mutex,
though.)

One disadvantage of this solution is that since we're putting a hard
limit on the maximal length of a bucket list, the number of buckets
will be sensitive to things causing collisions, such as bad hash
functions or bad luck.

Another disadvantage is that the tables wouldn't shrink.  Is this an
issue?

Mikael


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-10 16:52                 ` Mikael Djurfeldt
@ 2003-02-10 17:09                   ` Roland Orre
  2003-02-10 17:11                   ` Mikael Djurfeldt
  1 sibling, 0 replies; 32+ messages in thread
From: Roland Orre @ 2003-02-10 17:09 UTC (permalink / raw)
  Cc: Greg Troxel

On Mon, 2003-02-10 at 17:52, Mikael Djurfeldt wrote:
> The idea of having an upper threshold comes from the idea of
> maintaining the load factor of the hash table in a reasonable
> interval.  But the purpose of the load factor is to prevent to large
> clusters forming in the same hash bucket.  What about simply counting
> the pairs in a bucket during insertion and rehashing when we reach a
> threshold?  Then no extra data needs to be stored in the table.  (We
> would need to globally serialize rehashing with an external mutex,
> though.)

The risk with this is that if the hash function in a specific situation
is bad, not spreading enough, we may form clusters anyway. With a 
risk that we will reallocate hash tables much too often.

> Another disadvantage is that the tables wouldn't shrink.  Is this an
> issue?

For most of our applications it isn't because we will reallocate all
hash tables after each run.

In all my code over the years hash-remove! is very very rare.

	Best regards
	Rolan





_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-10 16:52                 ` Mikael Djurfeldt
  2003-02-10 17:09                   ` Roland Orre
@ 2003-02-10 17:11                   ` Mikael Djurfeldt
  2003-02-11 13:59                     ` Resizing hash tables in Guile Mikael Djurfeldt
  1 sibling, 1 reply; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-10 17:11 UTC (permalink / raw)
  Cc: Greg Troxel

Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

> Also, hash-tables are used throughout Guile so we need them in the
> core, not as a loadable library.

Actually, considering the invasive changes we'd have to make on the
hash-table representation, the vector re-sizing operation (which we
once removed from Guile for reasons of design) not the least, we
should probably provide rehashing hash-tables as a separate library
and not touch the current Guile ones...


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 15:31         ` Mikael Djurfeldt
@ 2003-02-11 11:14           ` Joris van der Hoeven
  2003-02-11 11:28             ` Joris van der Hoeven
  0 siblings, 1 reply; 32+ messages in thread
From: Joris van der Hoeven @ 2003-02-11 11:14 UTC (permalink / raw)
  Cc: Roland Orre


> > So does there exist an adaptive-number-of-slots solution in guile?
> 
> No.  We'd be happy to include adaptation in the standard Guile
> hash-tables, though.  Would it be possible for you to contribute a
> patch?

I don't have much knowledge about how to write low-level guile functions.
In the meantime, you might want to use the following code.
I only use the equal?-based access methods, but it would be easy
to add the eq?- and eqv?-based methods.

The 'adaptive hash tables' are stored as a pair with a usual hashtable and
its number of entries. I have tested the code a bit and it seems to work.
If you find any bugs, then please let me know.

Best wishes, Joris

-----------------------------------------------------------
Joris van der Hoeven <vdhoeven@texmacs.org>
http://www.texmacs.org: GNU TeXmacs scientific text editor
http://www.math.u-psud.fr/~vdhoeven: personal homepage
-----------------------------------------------------------



;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; MODULE      : ahash-table.scm (part of GNU TeXmacs)
;; DESCRIPTION : adaptive hash tables
;; COPYRIGHT   : (C) 2003  Joris van der Hoeven
;;
;; This software falls under the GNU general public license and comes WITHOUT
;; ANY WARRANTY WHATSOEVER. See the file $TEXMACS_PATH/LICENSE for details.
;; If you don't have this file, write to the Free Software Foundation, Inc.,
;; 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Adaptive hash tables
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(define (make-ahash-table)
  (cons (make-hash-table 1) 0))

(define (ahash-ref h key)
  (hash-ref (car h) key))

(define (ahash-size h)
  (cdr h))

(define (ahash-slots! h new-size)
  (let ((new-h (make-hash-table new-size)))
    (hash-fold (lambda (key value dummy) (hash-set! new-h key value))
	       #f (car h))
    (set-car! h new-h)))

(define (ahash-set! h key value)
  (if (hash-ref (car h) key)
      (hash-set! (car h) key value)
      (begin
	(if (>= (cdr h) (vector-length (car h)))
	    (ahash-slots! h (+ (* 2 (vector-length (car h))) 1)))
	(set-cdr! h (+ (cdr h) 1))
	(hash-set! (car h) key value))))

(define (ahash-remove! h key)
  (let ((removed (hash-remove! (car h) key)))
    (if removed
	(begin
	  (set-cdr! h (- (cdr h) 1))
	  (if (< (+ (* 4 (cdr h)) 1) (vector-length (car h)))
	      (ahash-slots! h (quotient (vector-length (car h)) 2)))))
    removed))

(define (ahash-fold h)
  (hash-fold (car h)))



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-11 11:14           ` Joris van der Hoeven
@ 2003-02-11 11:28             ` Joris van der Hoeven
  2003-02-11 12:50               ` Mikael Djurfeldt
  0 siblings, 1 reply; 32+ messages in thread
From: Joris van der Hoeven @ 2003-02-11 11:28 UTC (permalink / raw)
  Cc: Roland Orre


> The 'adaptive hash tables' are stored as a pair with a usual hashtable and
> its number of entries. I have tested the code a bit and it seems to work.
> If you find any bugs, then please let me know.

Here follows a slight improvement of the code for the case that
you want to use boolean values for the entries of the hash table.
Notice also that the hash value of the key is computed twice
whenever you want to assign a value to a key;
in a more low-level implementation, the second call can be avoided.

Best wishes, Joris


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; MODULE      : ahash-table.scm
;; DESCRIPTION : adaptive hash tables
;; COPYRIGHT   : (C) 2003  Joris van der Hoeven
;;
;; This software falls under the GNU general public license and comes WITHOUT
;; ANY WARRANTY WHATSOEVER. See the file $TEXMACS_PATH/LICENSE for details.
;; If you don't have this file, write to the Free Software Foundation, Inc.,
;; 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Adaptive hash tables
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(define (make-ahash-table)
  (cons (make-hash-table 1) 0))

(define (ahash-ref h key)
  (hash-ref (car h) key))

(define (ahash-get-handle h key)
  (hash-get-handle (car h) key))

(define (ahash-size h)
  (cdr h))

(define (ahash-slots! h new-size)
  (let ((new-h (make-hash-table new-size)))
    (hash-fold (lambda (key value dummy) (hash-set! new-h key value))
	       #f (car h))
    (set-car! h new-h)))

(define (ahash-set! h key value)
  (if (hash-get-handle (car h) key)
      (hash-set! (car h) key value)
      (begin
	(if (>= (cdr h) (vector-length (car h)))
	    (ahash-slots! h (+ (* 2 (vector-length (car h))) 1)))
	(set-cdr! h (+ (cdr h) 1))
	(hash-set! (car h) key value))))

(define (ahash-remove! h key)
  (let ((removed (hash-remove! (car h) key)))
    (if removed
	(begin
	  (set-cdr! h (- (cdr h) 1))
	  (if (< (+ (* 4 (cdr h)) 1) (vector-length (car h)))
	      (ahash-slots! h (quotient (vector-length (car h)) 2)))))
    removed))

(define (ahash-fold h)
  (hash-fold (car h)))



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-11 11:28             ` Joris van der Hoeven
@ 2003-02-11 12:50               ` Mikael Djurfeldt
  0 siblings, 0 replies; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-11 12:50 UTC (permalink / raw)
  Cc: Roland Orre

Thanks!

(I'm actually putting together a suggestion for how to extend Guile's
standard hash tables.  I'll post a patch or commit it soon.)

Mikael


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Resizing hash tables in Guile
  2003-02-10 17:11                   ` Mikael Djurfeldt
@ 2003-02-11 13:59                     ` Mikael Djurfeldt
  2003-02-11 17:34                       ` Roland Orre
                                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-11 13:59 UTC (permalink / raw)
  Cc: Greg Troxel

I've just committed resizing hash table functionality to the Guile
core hashtable functions (in hashtab.c).

Currently, the only thing which has changed in the API is that the
size argument to the hash table constructors is now optional.  If
omitted, you get a resizing table.

Please comment on this API and also take a look at the implementation
in hashtab.c.

Questions:

Is this a good thing?  Should we keep it?

I made the hash tables thread-safe (locking/unlocking a mutex at hash
table access and rehashing).  Is that good?  An alternativ is to
require of the programmer to make sure the hash tables aren't accessed
in parallel.

The implementation isn't quite finished.  Removal should be rewritten
for the new tables.  Also, weak hash tables need to be handled in
hashtab.c in order to maintain a correct item count.

M


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-11 13:59                     ` Resizing hash tables in Guile Mikael Djurfeldt
@ 2003-02-11 17:34                       ` Roland Orre
  2003-02-12 11:41                         ` Marius Vollmer
  2003-02-12 16:10                       ` Marius Vollmer
  2003-02-12 20:55                       ` Rob Browning
  2 siblings, 1 reply; 32+ messages in thread
From: Roland Orre @ 2003-02-11 17:34 UTC (permalink / raw)
  Cc: Greg Troxel

On Tue, 2003-02-11 at 14:59, Mikael Djurfeldt wrote:
> I've just committed resizing hash table functionality to the Guile
> core hashtable functions (in hashtab.c).
> 
> Currently, the only thing which has changed in the API is that the
> size argument to the hash table constructors is now optional.  If
> omitted, you get a resizing table.
> 
> Please comment on this API and also take a look at the implementation
> in hashtab.c.

I've studied the code and consider this a perfect minimalistic solution.
It works without disturbing any old code. I'll test drive the code
tonight.

> Questions:
> 
> Is this a good thing?  Should we keep it?
It's definitely a good thing. In our application we sometimes need just
a few tens of items hashed but often many millions of items without
really knowing beforehand.
It's  definitely a good thing to not need to bother about analyzing
optimal table sizes any longer.

We hope it will stay and we will start using it in our application as
soon as possible.

> I made the hash tables thread-safe (locking/unlocking a mutex at hash
> table access and rehashing).  Is that good?  An alternativ is to
> require of the programmer to make sure the hash tables aren't accessed
> in parallel.

In our case I think we will most likely set up the different processes
to use different hash tables but we have still not parallellized our
applications. Could/should the mutex protection maybe be optional?

	Best regards
	Roland




_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-11 17:34                       ` Roland Orre
@ 2003-02-12 11:41                         ` Marius Vollmer
  0 siblings, 0 replies; 32+ messages in thread
From: Marius Vollmer @ 2003-02-12 11:41 UTC (permalink / raw)


Roland Orre <orre@nada.kth.se> writes:

> > I made the hash tables thread-safe (locking/unlocking a mutex at hash
> > table access and rehashing).  Is that good?  An alternativ is to
> > require of the programmer to make sure the hash tables aren't accessed
> > in parallel.
> 
> In our case I think we will most likely set up the different processes
> to use different hash tables but we have still not parallellized our
> applications. Could/should the mutex protection maybe be optional?

Since users must consciously ask for resizable hash tables, we can
require them to do the locking themselves.  Should we?  I would say
yes, since people need to be aware of thread issues anyway, and maybe
they have a better scheme for ensuring thread safe access to a hash
table.  Removing the locking from the core code should improve
performance, right?

However, the hash tables should be somewhat thread safe: they might
not work as advertised when multiple threads access them, but the
application might not crash.  That is, adding an element from one
thread while another thread is doing the same thing might make one of
the elements disappear, but it must leave the hash table in a valid
state.

Do the non-resizing hash tables behgave that way, incidentally?


(Ahh, I just love cooperative threading.  These things are so much
easier and efficient when there is only one running thread with
defined switch points... :-)


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-11 13:59                     ` Resizing hash tables in Guile Mikael Djurfeldt
  2003-02-11 17:34                       ` Roland Orre
@ 2003-02-12 16:10                       ` Marius Vollmer
  2003-02-12 17:53                         ` Mikael Djurfeldt
  2003-02-12 20:55                       ` Rob Browning
  2 siblings, 1 reply; 32+ messages in thread
From: Marius Vollmer @ 2003-02-12 16:10 UTC (permalink / raw)
  Cc: Greg Troxel

Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

> Is this a good thing?  Should we keep it?

I think resizing hash tables are important.  Thanks for writing them!

> The implementation isn't quite finished.  Removal should be rewritten
> for the new tables.  Also, weak hash tables need to be handled in
> hashtab.c in order to maintain a correct item count.

If you are not going to finish this soonish, please make a note in
workbook/tasks/TODO about what needs to still be done.

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 16:10                       ` Marius Vollmer
@ 2003-02-12 17:53                         ` Mikael Djurfeldt
  2003-02-12 20:17                           ` Roland Orre
  0 siblings, 1 reply; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-12 17:53 UTC (permalink / raw)
  Cc: Greg Troxel

Marius Vollmer <mvo@zagadka.de> writes:

> Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:
>
>> Is this a good thing?  Should we keep it?
>
> I think resizing hash tables are important.

I've been thinking that maybe we should continue the move and let the
resizing tables entirely replace the fixed size ones.  It seems a
little silly to have to explain to Guile users that there are two
(sorry, eight) different kinds of hash tables...  Also, I think the
opacity of the resizing table objects is an advantage rather than a
disadvantage.  If they are opaque, we can any time modify the
underlying implementation (the well-known data abstraction argument).

What do you say?

>> The implementation isn't quite finished.  Removal should be rewritten
>> for the new tables.  Also, weak hash tables need to be handled in
>> hashtab.c in order to maintain a correct item count.
>
> If you are not going to finish this soonish, please make a note in
> workbook/tasks/TODO about what needs to still be done.

Right.  Removal is fixed now.  Weak tables could be fixed on a train
ride during the weekend. :)

(BTW, at that time it would be good if we have come to a decision with
regard to dropping the fixed-size vector of alist tables.)

M


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 17:53                         ` Mikael Djurfeldt
@ 2003-02-12 20:17                           ` Roland Orre
  2003-02-13  9:35                             ` Mikael Djurfeldt
  2003-02-13  9:52                             ` Joris van der Hoeven
  0 siblings, 2 replies; 32+ messages in thread
From: Roland Orre @ 2003-02-12 20:17 UTC (permalink / raw)
  Cc: Greg Troxel

On Wed, 2003-02-12 at 18:53, Mikael Djurfeldt wrote:
> I've been thinking that maybe we should continue the move and let the
> resizing tables entirely replace the fixed size ones.  It seems a
> little silly to have to explain to Guile users that there are two
> (sorry, eight) different kinds of hash tables...  Also, I think the
> opacity of the resizing table objects is an advantage rather than a
> disadvantage.  If they are opaque, we can any time modify the
> underlying implementation (the well-known data abstraction argument).
> 
> What do you say?

I think that this is a a good idea, but I would like the optional size
argument to still be there as an initial guess about the number of
items to expect to avoid resizing and to optimize performance when
we have advance information about how many items to expect.

Of course this additional level of abstraction is nice as it also gives
a future option to use trees if the user e.g gives an optional
comparision operator.

By the way, Joris van der Hoeven mentioned that this type of resizing
hash tables are faster than trees, which have a time complexity
of O(log n)+reshuffling time if they are balanced. Do you have any
numbers showing how much faster and if possible if there are any
conditions? The reshuffling time will grow at least O(n) when the
size of our linear tables increases if I have understood right.

	Best regards
	Roland Orre



_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-08 14:55     ` Roland Orre
  2003-02-08 15:14       ` Joris van der Hoeven
@ 2003-02-12 20:47       ` Paul Jarc
  2003-02-12 21:58         ` Roland Orre
  1 sibling, 1 reply; 32+ messages in thread
From: Paul Jarc @ 2003-02-12 20:47 UTC (permalink / raw)
  Cc: Joris van der Hoeven

Roland Orre <orre@nada.kth.se> wrote:
> The disadvantage with the balanced tree solution though is that it's
> not enough with a hash function (and equal?), it also needs a
> comparision (<) function.

(< (hash key1 0) (hash key2 0))
It's easy enough to modify hash.c to support this, by skipping the
modulo operation.  (It'd be a good idea anyway, since currently (in
1.6.3) guile crashes.)


paul


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-11 13:59                     ` Resizing hash tables in Guile Mikael Djurfeldt
  2003-02-11 17:34                       ` Roland Orre
  2003-02-12 16:10                       ` Marius Vollmer
@ 2003-02-12 20:55                       ` Rob Browning
  2003-02-13 10:43                         ` Mikael Djurfeldt
  2 siblings, 1 reply; 32+ messages in thread
From: Rob Browning @ 2003-02-12 20:55 UTC (permalink / raw)
  Cc: Greg Troxel

Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

> I've just committed resizing hash table functionality to the Guile
> core hashtable functions (in hashtab.c).
>
> Currently, the only thing which has changed in the API is that the
> size argument to the hash table constructors is now optional.  If
> omitted, you get a resizing table.

Looks good to me.

If you have time, could you (very) briefly summarize the behavior?
Mostly I mean wrt how the implementation handles resizes (i.e. when
does it decide to resize, how does it pick the new size, etc.).
That'd be nice to have in comments in the file so that you can get an
overview without having to read the code in detail first.

Thanks

-- 
Rob Browning
rlb @defaultvalue.org, @linuxdevel.com, and @debian.org
Previously @cs.utexas.edu
GPG starting 2002-11-03 = 14DD 432F AE39 534D B592  F9A0 25C8 D377 8C7E 73A4


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Efficiency and flexibility of hash-tables
  2003-02-12 20:47       ` Efficiency and flexibility of hash-tables Paul Jarc
@ 2003-02-12 21:58         ` Roland Orre
  0 siblings, 0 replies; 32+ messages in thread
From: Roland Orre @ 2003-02-12 21:58 UTC (permalink / raw)
  Cc: Joris van der Hoeven

On Wed, 2003-02-12 at 21:47, Paul Jarc wrote:
> > comparision (<) function.
> 
> (< (hash key1 0) (hash key2 0))
> It's easy enough to modify hash.c to support this, by skipping the
> modulo operation.  (It'd be a good idea anyway, since currently (in
> 1.6.3) guile crashes.)

That was smart ! OK, you can not utilize the order of the tree in any
way for extraction traversing, but that is not important in this case.

Strange, I get "Floating point exception" when hashing with a zero key.
I would expect another error from an integer operation.

	Best regards
	Roland





_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 20:17                           ` Roland Orre
@ 2003-02-13  9:35                             ` Mikael Djurfeldt
  2003-02-13 13:55                               ` Harvey J. Stein
  2003-02-13  9:52                             ` Joris van der Hoeven
  1 sibling, 1 reply; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-13  9:35 UTC (permalink / raw)
  Cc: Greg Troxel

Roland Orre <orre@nada.kth.se> writes:

> On Wed, 2003-02-12 at 18:53, Mikael Djurfeldt wrote:
>> I've been thinking that maybe we should continue the move and let the
>> resizing tables entirely replace the fixed size ones.
[...]

> I think that this is a a good idea, but I would like the optional size
> argument to still be there as an initial guess about the number of
> items to expect
[...]

Agreed.

> Of course this additional level of abstraction is nice as it also gives
> a future option to use trees if the user e.g gives an optional
> comparision operator.
>
> By the way, Joris van der Hoeven mentioned that this type of resizing
> hash tables are faster than trees, which have a time complexity
> of O(log n)+reshuffling time if they are balanced. Do you have any
> numbers showing how much faster and if possible if there are any
> conditions? The reshuffling time will grow at least O(n) when the
> size of our linear tables increases if I have understood right.

I doubt that we'll ever use trees under the hash table API, the reason
being that the user expects O(1) for hash table operations.  Even
though O(log n) grows slowly, it is substantially different for large
enough data sets.

Regarding reshuffling time: Yes, rehuffling means that every operation
isn't O(1), but it *does* mean that they are O(1) on average.  You can
understand this by a trick sometimes used in algorithm run-time
analysis called "amortization":

The idea is that for every operation on the hash table you "pay" some
extra virtual time, and the sum of this time can then be used to
account for the reshuffling time.  In my implementation, the hash
table is roughly doubled for each rehash.  Before rehashing occurs,
you have inserted N elements.  This has cost you less than cN seconds.
Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
seconds.  If we now pay off d seconds per operation in advance, and
note that the argument above holds equally well for each rehashing
point, we realize that each operation costs less than c + d seconds on
average.  This means that, on average, operations are O(1).

Mikael


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 20:17                           ` Roland Orre
  2003-02-13  9:35                             ` Mikael Djurfeldt
@ 2003-02-13  9:52                             ` Joris van der Hoeven
  1 sibling, 0 replies; 32+ messages in thread
From: Joris van der Hoeven @ 2003-02-13  9:52 UTC (permalink / raw)
  Cc: Greg Troxel


> > I've been thinking that maybe we should continue the move and let the
> > resizing tables entirely replace the fixed size ones.  It seems a
> > little silly to have to explain to Guile users that there are two
> > (sorry, eight) different kinds of hash tables...  Also, I think the
> > opacity of the resizing table objects is an advantage rather than a
> > disadvantage.  If they are opaque, we can any time modify the
> > underlying implementation (the well-known data abstraction argument).
> > 
> > What do you say?
> 
> I think that this is a a good idea, but I would like the optional size
> argument to still be there as an initial guess about the number of
> items to expect to avoid resizing and to optimize performance when
> we have advance information about how many items to expect.

In fact, it is even better to specify the resizing behaviour
using optional arguments. For instance:

	up-ratio   : size up when size/slots >= up-ratio
	up-factor  : new nr slots := old nr slots * up-factor
	down-ratio : size up when size/slots < down-ratio
	down-factor: new nr slots := old nr slots / down-factor

Using a small up-ratio will improve the constant factor in the lookup speed,
but require the usage of a larger number of slots.

> Of course this additional level of abstraction is nice as it also gives
> a future option to use trees if the user e.g gives an optional
> comparision operator.

Yes.

> By the way, Joris van der Hoeven mentioned that this type of resizing
> hash tables are faster than trees, which have a time complexity
> of O(log n)+reshuffling time if they are balanced. Do you have any
> numbers showing how much faster and if possible if there are any
> conditions? The reshuffling time will grow at least O(n) when the
> size of our linear tables increases if I have understood right.

Well: lookup time in a balanced search tree is O(log n),
while lookup time in a table is O(1)...



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-12 20:55                       ` Rob Browning
@ 2003-02-13 10:43                         ` Mikael Djurfeldt
  0 siblings, 0 replies; 32+ messages in thread
From: Mikael Djurfeldt @ 2003-02-13 10:43 UTC (permalink / raw)
  Cc: Greg Troxel

Rob Browning <rlb@defaultvalue.org> writes:

> If you have time, could you (very) briefly summarize the behavior?
> Mostly I mean wrt how the implementation handles resizes (i.e. when
> does it decide to resize, how does it pick the new size, etc.).

Good idea.

Done.

M


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13  9:35                             ` Mikael Djurfeldt
@ 2003-02-13 13:55                               ` Harvey J. Stein
  2003-02-13 14:24                                 ` Joris van der Hoeven
  0 siblings, 1 reply; 32+ messages in thread
From: Harvey J. Stein @ 2003-02-13 13:55 UTC (permalink / raw)
  Cc: Greg Troxel


Mikael Djurfeldt <djurfeldt@nada.kth.se> writes:

 > Regarding reshuffling time: Yes, rehuffling means that every operation
 > isn't O(1), but it *does* mean that they are O(1) on average.  You can
 > understand this by a trick sometimes used in algorithm run-time
 > analysis called "amortization":
 > 
 > The idea is that for every operation on the hash table you "pay" some
 > extra virtual time, and the sum of this time can then be used to
 > account for the reshuffling time.  In my implementation, the hash
 > table is roughly doubled for each rehash.  Before rehashing occurs,
 > you have inserted N elements.  This has cost you less than cN seconds.
 > Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
 > seconds.  If we now pay off d seconds per operation in advance, and
 > note that the argument above holds equally well for each rehashing
 > point, we realize that each operation costs less than c + d seconds on
 > average.  This means that, on average, operations are O(1).

Inserts are, but lookups aren't necessarily.  Lookups being O(1)
requires uniformity of bucket sizes.  Worst case hash table lookup
time is still O(N).  And good hashing functions are still hard to
write.

People overestimate log(N) and overuse O().  When comparing an O(1)
algorithm to an O(log(N)) algorithm, it really comes down to the
actual functions involved, and actual problem size, not just the
asymptotic behavior.  2^32 is over 4,000,000,000.  With this many
items, log(N) is still just 32, so an O(log(N)) algorithm will still
beat an O(1) algorithm if it's really log_2(N) vs 32.  And you can't
have 2^32 items in your hash table unless this is on a 64bit machine.
And with this many items in your hash table, you'd need 16gb (32gb on
a 64bit machine) just to store the addresses, let alone the keys and
the items themselves, so by this time you're probably hitting
the swap so hard that the whole question is moot.

Also, if a person's relying on O(1) for hash table performance, it might be
not because they need that on average, but because they need an upper
bound on the operation time, in which case automatic resizing would
also violate this, even though it maintains O(1) on average.

Trees also sort the data for you, which hash tables don't give you.

So, ideally, one would have a hash table object with & without
resizing, and various sorts of tree (AVL, red/black, B*, etc) objects.
insert and delete and map would be methods that work on all of the
above, with map on trees returning the data in sorted order.  For that
matter, insert & delete might as well also work on lists...

BTW, the hash table implementation in Tcl seems to be quite good.

-- 
Harvey Stein
Bloomberg LP
hjstein@bloomberg.com


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13 13:55                               ` Harvey J. Stein
@ 2003-02-13 14:24                                 ` Joris van der Hoeven
  2003-02-13 18:30                                   ` Harvey J. Stein
  0 siblings, 1 reply; 32+ messages in thread
From: Joris van der Hoeven @ 2003-02-13 14:24 UTC (permalink / raw)
  Cc: Greg Troxel


>  > Regarding reshuffling time: Yes, rehuffling means that every operation
>  > isn't O(1), but it *does* mean that they are O(1) on average.  You can
>  > understand this by a trick sometimes used in algorithm run-time
>  > analysis called "amortization":
>  >
>  > The idea is that for every operation on the hash table you "pay" some
>  > extra virtual time, and the sum of this time can then be used to
>  > account for the reshuffling time.  In my implementation, the hash
>  > table is roughly doubled for each rehash.  Before rehashing occurs,
>  > you have inserted N elements.  This has cost you less than cN seconds.
>  > Rehashing is O(2N) = O(N), so we can say it will cost us less than dN
>  > seconds.  If we now pay off d seconds per operation in advance, and
>  > note that the argument above holds equally well for each rehashing
>  > point, we realize that each operation costs less than c + d seconds on
>  > average.  This means that, on average, operations are O(1).
> 
> Inserts are, but lookups aren't necessarily.

Both aren't necessarily, because inserting requires looking up too.

> Lookups being O(1) requires uniformity of bucket sizes.
> Worst case hash table lookup time is still O(N).

You can also store a binary search tree in each of the buckets,
if you think that your hash function is bad.

> And good hashing functions are still hard to write.

I do not really agree. A good hash algorithm for lists (or strings),
which I use in TeXmacs, is to rotate the 32 bit integer hash values of
each of the members by a prime number like 3, 5, 7 or 11 and progressively
take the exclusive or. This seems to lead to bucket sizes as
predicted by probability theory, even for hash tables of size 2^p.

> People overestimate log(N) and overuse O().  When comparing an O(1)
> algorithm to an O(log(N)) algorithm, it really comes down to the
> actual functions involved, and actual problem size, not just the
> asymptotic behavior.  2^32 is over 4,000,000,000.

A factor 10 is still a factor 10 though.
(2^10 ~~ 1000).

> With this many
> items, log(N) is still just 32, so an O(log(N)) algorithm will still
> beat an O(1) algorithm if it's really log_2(N) vs 32.

Yes, but the O(1) is really *table lookup* multiplied by a small
constant here, so this is *fast*. You may adjust the small constant
by choosing an appropriate threshold for "size/nr buckets".

> Also, if a person's relying on O(1) for hash table performance, it might be
> not because they need that on average, but because they need an upper
> bound on the operation time, in which case automatic resizing would
> also violate this, even though it maintains O(1) on average.

This is a more serious drawback of standard hash tables, but,
as I said before, we already have garbage collection in Guile anyway...

> Trees also sort the data for you, which hash tables don't give you.

But you need a compairison operation for that,
which may be even less natural than a hash function.

> So, ideally, one would have a hash table object with & without
> resizing, and various sorts of tree (AVL, red/black, B*, etc) objects.
> insert and delete and map would be methods that work on all of the
> above, with map on trees returning the data in sorted order.  For that
> matter, insert & delete might as well also work on lists...

Agreed: ideally, we have everything :^)



_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13 14:24                                 ` Joris van der Hoeven
@ 2003-02-13 18:30                                   ` Harvey J. Stein
  2003-02-13 20:02                                     ` Paul Jarc
  0 siblings, 1 reply; 32+ messages in thread
From: Harvey J. Stein @ 2003-02-13 18:30 UTC (permalink / raw)
  Cc: Greg Troxel

Joris van der Hoeven <TeXmacs@math.u-psud.fr> writes:

 > >  > average.  This means that, on average, operations are O(1).
 > > 
 > > Inserts are, but lookups aren't necessarily.
 > 
 > Both aren't necessarily, because inserting requires looking up too.

Yes, if you want to return an error for inserting 2 items with the
same key.  No, if you allow multiple items with the same key.

 > > And good hashing functions are still hard to write.
 > 
 > I do not really agree. A good hash algorithm for lists (or strings),
 > which I use in TeXmacs, is ...

I guess "hard" isn't quite the word for it.  What I meant was that the
obvious ones don't work well.  You basically need to look up a good
one to get good behavior.  Tcl's fcn works pretty well too.  I forget
what it does.

 > > With this many items, log(N) is still just 32, so an O(log(N))
 > > algorithm will still beat an O(1) algorithm if it's really
 > > log_2(N) vs 32.
 > 
 > Yes, but the O(1) is really *table lookup* multiplied by a small
 > constant here, so this is *fast*. You may adjust the small constant
 > by choosing an appropriate threshold for "size/nr buckets".

Not just table lookup.  Also hash calculation from the key, which can
be more time consuming than a key comparison, and key comparisons for
everything in the bucket.

 > > Trees also sort the data for you, which hash tables don't give you.
 > 
 > But you need a compairison operation for that,
 > which may be even less natural than a hash function.

Yes.  For hash tables you just need a key equality test.  For trees
you need to be able to order the keys.

 > > So, ideally, one would have a hash table object with & without
 > > resizing, and various sorts of tree (AVL, red/black, B*, etc) objects.
 > > insert and delete and map would be methods that work on all of the
 > > above, with map on trees returning the data in sorted order.  For that
 > > matter, insert & delete might as well also work on lists...
 > 
 > Agreed: ideally, we have everything :^)

I think we're on the same page here...

STk had a nice scheme interface for hash tables.  The code uses the
Tcl hash tables, they autoresize and if I recall correctly, many of
the important parameters can be modified.  You might want to model the
guile stuff on this, or utilize this work.  I advocated this when the
discussion came up years ago on the guile mailing list.

BTW, there was a long discussion of hash tables on the guile list some
number of years ago.  At the time I didn't like liked the guile (aka
scm) hash tables.  They were a very minimal implementation.  The user
had to do too much to use them.  I don't know what it's like these
days, but it'd certainly be nice to have a good implementation
available which includes autosizing.

-- 
Harvey Stein
Bloomberg LP
hjstein@bloomberg.com


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Resizing hash tables in Guile
  2003-02-13 18:30                                   ` Harvey J. Stein
@ 2003-02-13 20:02                                     ` Paul Jarc
  0 siblings, 0 replies; 32+ messages in thread
From: Paul Jarc @ 2003-02-13 20:02 UTC (permalink / raw)


HJSTEIN@bloomberg.net (Harvey J. Stein) wrote:
> Joris van der Hoeven <TeXmacs@math.u-psud.fr> writes:
>> inserting requires looking up too.
>
> Yes, if you want to return an error for inserting 2 items with the
> same key.

Or if you want to replace existing entries, as Guile's current hash
tables do.

>> But you need a compairison operation for that,
>> which may be even less natural than a hash function.
>
> Yes.  For hash tables you just need a key equality test.  For trees
> you need to be able to order the keys.

It's easy enough to support this: (< (hash key1 0) (hash key2 0))
But this is probably only useful in the general case.  It should be
possible to substitute a less expensive ordering function when one is
available.


paul


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2003-02-13 20:02 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-02-08 11:00 Efficiency and flexibility of hash-tables Joris van der Hoeven
2003-02-08 13:57 ` Roland Orre
2003-02-08 14:14   ` Joris van der Hoeven
2003-02-08 14:55     ` Roland Orre
2003-02-08 15:14       ` Joris van der Hoeven
2003-02-08 15:31         ` Mikael Djurfeldt
2003-02-11 11:14           ` Joris van der Hoeven
2003-02-11 11:28             ` Joris van der Hoeven
2003-02-11 12:50               ` Mikael Djurfeldt
2003-02-08 15:44         ` Roland Orre
2003-02-10  9:55           ` Andreas Rottmann
2003-02-10 14:24             ` Greg Troxel
2003-02-10 15:00               ` Roland Orre
2003-02-10 16:52                 ` Mikael Djurfeldt
2003-02-10 17:09                   ` Roland Orre
2003-02-10 17:11                   ` Mikael Djurfeldt
2003-02-11 13:59                     ` Resizing hash tables in Guile Mikael Djurfeldt
2003-02-11 17:34                       ` Roland Orre
2003-02-12 11:41                         ` Marius Vollmer
2003-02-12 16:10                       ` Marius Vollmer
2003-02-12 17:53                         ` Mikael Djurfeldt
2003-02-12 20:17                           ` Roland Orre
2003-02-13  9:35                             ` Mikael Djurfeldt
2003-02-13 13:55                               ` Harvey J. Stein
2003-02-13 14:24                                 ` Joris van der Hoeven
2003-02-13 18:30                                   ` Harvey J. Stein
2003-02-13 20:02                                     ` Paul Jarc
2003-02-13  9:52                             ` Joris van der Hoeven
2003-02-12 20:55                       ` Rob Browning
2003-02-13 10:43                         ` Mikael Djurfeldt
2003-02-12 20:47       ` Efficiency and flexibility of hash-tables Paul Jarc
2003-02-12 21:58         ` Roland Orre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).