Proposal for a new hash interface

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

* Proposal for a new hash interface
@ 2022-02-24  0:07 Stefan Israelsson Tampe
  2022-02-26  9:33 ` Stefan Israelsson Tampe
  0 siblings, 1 reply; 2+ messages in thread
From: Stefan Israelsson Tampe @ 2022-02-24  0:07 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 5588 bytes --]

High all,

I have taken the task to study a possible hash table implementation mostly
in scheme but still speedy. I am still experimenting with different
approaches. With managing code and iteration higher order functions in
scheme and lower order implemented in C. I can match current hash
implementations but with hash-fold and hash-for-each 10-20X faster and
better semantics as it is not going via C.

I suppose the implementation is more memory intensive than the current
implementation. But much much safer and featurefull.

So what are the features:
* goops based organisation

* allows one to type a hash table, e.g. disallow the use of
different versions
  of the assessors for the same hash.

* surprisingly fast although we define goops methods for the accessors

* A flag that indicates that (hash i) = i, for integers, which is used in
python

* Possible to freeze has hash e.g. it will not grow or shrink, when cleared
only
  all the values will be cleared.

* A possibility to use (hash x (ash 1 60)) as a key, e.g. when checking if
a slot
  is equal, we compare only the long hash.

* A proper separation between general code that can be pushed to C for
speed    and scheme for nice integration and power

Code:

scheme:
https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm

C:
https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/resize.c

C/Scheme integration
Each handle is of the form
(hash key val)

So when we store the key value pair we also store the hash values which are
60bit wide. This means that we can dispatch the search for an item using
only the hash value and many operations can be done in a general way that
can be pushed to C (if we want). So typically you first try to use C and
when it finds a match you need to do the final check in scheme using the
correct equal predicate. This means that we can achieve a very extensible
system that you can extend without losing much performance. These
primitives make the scheme code beautiful and well organized so we should
probably keep this semantic even if we decide to do everything in guile
scheme.

Example, consider that we have designed a new hash table semantics. The
easiest is to supply a hash function and a equal predicate like so

(use-modules (ice-9 hashish))
(use-modules (oop goops))

(define (myhash    k s) ...)
(define (myequal? x y) ...)

(define-class <my-hash-map> (<stis-hash-map>))
(stis-new-hash-type <my-hash-map> make-my-hash-table myhash myequal?
     my-hash-ref  my-hash-set!  my-hash-remove!)

Preferably when we do not need to know the type of the hashtable we can
either use the methods (slow) stis-hash-ref stis-hash-set! stis-hash-remove!

The fast way is to take advantage of the following construct,
(define H (make-my-hash-table))
(stis-with-hash-accessors (H : ref set remove)
    ...
    (set k1 (ref k2 #f))
    ...
    (remove k3)
    ...)

Both these methods are safe.

There are some extra tools available.

> stis-hash-resize!
One can resize a hashtable as one likes

> stis-hash-define-world!
One can fill the hash table and then specify that this is the world. Then
If all long hashes that are stored are unique, no equal method will be used
for identity and the hashmap will be freezed

> stis-hash-unworld-it
Unset the world property (not yet coded)

> stis-hash-copy
Create a new hashtable (this is not yet coded)

IMPLEMENTATION
The implementation is not yet defined, we need to test it and try
different versions. But One thing that one can notice are that by reducing
the
size if the hashtable one can speed up all operations. So the idea is to
store
a list of hash/index pairs in the hash and use varying sized cells for
this. The
smallest cell is 2 bytes and we can use more bytes for the index (size of
hash table) and fewer for the hash itself) this means that we could use a
size of 64 for the hash value and allow 1024 items in the hash table.
Similarly as the hash  grows we will use 32bit cells and split the maximal
size of the hashtable between the hash value and the hash table size. The
next size is 64bit cells and the last one are 128 bit cells. There will be
a fallback hash table which is the usual a-list bin table But that one will
be smaller memory wise and not in the hot path.

I am in the process of doing the implementation and will see if I can find
the optimal setting for all sizes which I can test. (basically to pin down
the strategy for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the
size of the hash can only be in those sizes.

Oh also, for small hashes the hash is essentially a single association list
and superfast.

Maybe the dispatching overhead is too much, donough, perhaps trimming down
the features might be needed.

I will see if we can find some use of special assembler operations,
especially for small hashtables.

PRIMITIVES,
There are essentially three primitives
1. search        =  find a matching hash, return index or handle or #f
                           possible starting from a known position

2. add             =  add a new element that we now is not included
                           in the hashtable

3. search-list   = fast alist search in case of a small assoc list case
                          returns the key-val handle or #f possibly restarts
                          from a known index.

Possibly one may define a search-and-set-some primitive, e.g. one can
in hot cases not need to fhe the add afterwards and add is only used in the
slow path in a set operation. (For other operations it is in the fast path
though)

Happy Hacking!

Regards
Stefan

[-- Attachment #2: Type: text/html, Size: 7298 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Proposal for a new hash interface
  2022-02-24  0:07 Proposal for a new hash interface Stefan Israelsson Tampe
@ 2022-02-26  9:33 ` Stefan Israelsson Tampe
  0 siblings, 0 replies; 2+ messages in thread
From: Stefan Israelsson Tampe @ 2022-02-26  9:33 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 6292 bytes --]

This is autogenerated code,

https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/macro.c

Not sure how well uint128_t works on my system, but the compiler should be
able to make this code run really fast. Anyhow it is quick matters to
change to uint64_t due to things not being hand coded. Can you see what I
am doing?

On Thu, Feb 24, 2022 at 1:07 AM Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> High all,
>
> I have taken the task to study a possible hash table implementation mostly
> in scheme but still speedy. I am still experimenting with different
> approaches. With managing code and iteration higher order functions in
> scheme and lower order implemented in C. I can match current hash
> implementations but with hash-fold and hash-for-each 10-20X faster and
> better semantics as it is not going via C.
>
> I suppose the implementation is more memory intensive than the current
> implementation. But much much safer and featurefull.
>
> So what are the features:
> * goops based organisation
>
> * allows one to type a hash table, e.g. disallow the use of
> different versions
>   of the assessors for the same hash.
>
> * surprisingly fast although we define goops methods for the accessors
>
> * A flag that indicates that (hash i) = i, for integers, which is used in
> python
>
> * Possible to freeze has hash e.g. it will not grow or shrink, when
> cleared only
>   all the values will be cleared.
>
> * A possibility to use (hash x (ash 1 60)) as a key, e.g. when checking if
> a slot
>   is equal, we compare only the long hash.
>
> * A proper separation between general code that can be pushed to C for
> speed    and scheme for nice integration and power
>
> Code:
>
> scheme:
> https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm
>
> C:
> https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/resize.c
>
> C/Scheme integration
> Each handle is of the form
> (hash key val)
>
> So when we store the key value pair we also store the hash values which are
> 60bit wide. This means that we can dispatch the search for an item using
> only the hash value and many operations can be done in a general way that
> can be pushed to C (if we want). So typically you first try to use C and
> when it finds a match you need to do the final check in scheme using the
> correct equal predicate. This means that we can achieve a very extensible
> system that you can extend without losing much performance. These
> primitives make the scheme code beautiful and well organized so we should
> probably keep this semantic even if we decide to do everything in guile
> scheme.
>
> Example, consider that we have designed a new hash table semantics. The
> easiest is to supply a hash function and a equal predicate like so
>
> (use-modules (ice-9 hashish))
> (use-modules (oop goops))
>
> (define (myhash    k s) ...)
> (define (myequal? x y) ...)
>
> (define-class <my-hash-map> (<stis-hash-map>))
> (stis-new-hash-type <my-hash-map> make-my-hash-table myhash myequal?
>      my-hash-ref  my-hash-set!  my-hash-remove!)
>
> Preferably when we do not need to know the type of the hashtable we can
> either use the methods (slow) stis-hash-ref stis-hash-set!
> stis-hash-remove!
>
> The fast way is to take advantage of the following construct,
> (define H (make-my-hash-table))
> (stis-with-hash-accessors (H : ref set remove)
>     ...
>     (set k1 (ref k2 #f))
>     ...
>     (remove k3)
>     ...)
>
> Both these methods are safe.
>
> There are some extra tools available.
>
> > stis-hash-resize!
> One can resize a hashtable as one likes
>
> > stis-hash-define-world!
> One can fill the hash table and then specify that this is the world. Then
> If all long hashes that are stored are unique, no equal method will be used
> for identity and the hashmap will be freezed
>
> > stis-hash-unworld-it
> Unset the world property (not yet coded)
>
> > stis-hash-copy
> Create a new hashtable (this is not yet coded)
>
> IMPLEMENTATION
> The implementation is not yet defined, we need to test it and try
> different versions. But One thing that one can notice are that by reducing
> the
> size if the hashtable one can speed up all operations. So the idea is to
> store
> a list of hash/index pairs in the hash and use varying sized cells for
> this. The
> smallest cell is 2 bytes and we can use more bytes for the index (size of
> hash table) and fewer for the hash itself) this means that we could use a
> size of 64 for the hash value and allow 1024 items in the hash table.
> Similarly as the hash  grows we will use 32bit cells and split the maximal
> size of the hashtable between the hash value and the hash table size. The
> next size is 64bit cells and the last one are 128 bit cells. There will be
> a fallback hash table which is the usual a-list bin table But that one will
> be smaller memory wise and not in the hot path.
>
> I am in the process of doing the implementation and will see if I can find
> the optimal setting for all sizes which I can test. (basically to pin down
> the strategy for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the
> size of the hash can only be in those sizes.
>
> Oh also, for small hashes the hash is essentially a single association
> list and superfast.
>
> Maybe the dispatching overhead is too much, donough, perhaps trimming down
> the features might be needed.
>
> I will see if we can find some use of special assembler operations,
> especially for small hashtables.
>
> PRIMITIVES,
> There are essentially three primitives
> 1. search        =  find a matching hash, return index or handle or #f
>                            possible starting from a known position
>
> 2. add             =  add a new element that we now is not included
>                            in the hashtable
>
> 3. search-list   = fast alist search in case of a small assoc list case
>                           returns the key-val handle or #f possibly
> restarts
>                           from a known index.
>
> Possibly one may define a search-and-set-some primitive, e.g. one can
> in hot cases not need to fhe the add afterwards and add is only used in the
> slow path in a set operation. (For other operations it is in the fast path
> though)
>
> Happy Hacking!
>
> Regards
> Stefan
>
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 8171 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2022-02-26  9:33 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-24  0:07 Proposal for a new hash interface Stefan Israelsson Tampe
2022-02-26  9:33 ` Stefan Israelsson Tampe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).