unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
From: Stefan Israelsson Tampe <stefan.itampe@gmail.com>
To: guile-devel <guile-devel@gnu.org>
Subject: Re: Proposal for a new hash interface
Date: Sat, 26 Feb 2022 10:33:55 +0100	[thread overview]
Message-ID: <CAGua6m2kR84K7WCepZzG85RRWM19-my-4fRuR0ggWBf99SkN4Q@mail.gmail.com> (raw)
In-Reply-To: <CAGua6m1iXtXZMsWU-VEX-FZBpOVYQj7CGQHrNk3HfVfVbJ-j-g@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6292 bytes --]

This is autogenerated code,

https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/macro.c

Not sure how well uint128_t works on my system, but the compiler should be
able to make this code run really fast. Anyhow it is quick matters to
change to uint64_t due to things not being hand coded. Can you see what I
am doing?

On Thu, Feb 24, 2022 at 1:07 AM Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> High all,
>
> I have taken the task to study a possible hash table implementation mostly
> in scheme but still speedy. I am still experimenting with different
> approaches. With managing code and iteration higher order functions in
> scheme and lower order implemented in C. I can match current hash
> implementations but with hash-fold and hash-for-each 10-20X faster and
> better semantics as it is not going via C.
>
> I suppose the implementation is more memory intensive than the current
> implementation. But much much safer and featurefull.
>
> So what are the features:
> * goops based organisation
>
> * allows one to type a hash table, e.g. disallow the use of
> different versions
>   of the assessors for the same hash.
>
> * surprisingly fast although we define goops methods for the accessors
>
> * A flag that indicates that (hash i) = i, for integers, which is used in
> python
>
> * Possible to freeze has hash e.g. it will not grow or shrink, when
> cleared only
>   all the values will be cleared.
>
> * A possibility to use (hash x (ash 1 60)) as a key, e.g. when checking if
> a slot
>   is equal, we compare only the long hash.
>
> * A proper separation between general code that can be pushed to C for
> speed    and scheme for nice integration and power
>
> Code:
>
> scheme:
> https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm
>
> C:
> https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/resize.c
>
> C/Scheme integration
> Each handle is of the form
> (hash key val)
>
> So when we store the key value pair we also store the hash values which are
> 60bit wide. This means that we can dispatch the search for an item using
> only the hash value and many operations can be done in a general way that
> can be pushed to C (if we want). So typically you first try to use C and
> when it finds a match you need to do the final check in scheme using the
> correct equal predicate. This means that we can achieve a very extensible
> system that you can extend without losing much performance. These
> primitives make the scheme code beautiful and well organized so we should
> probably keep this semantic even if we decide to do everything in guile
> scheme.
>
> Example, consider that we have designed a new hash table semantics. The
> easiest is to supply a hash function and a equal predicate like so
>
> (use-modules (ice-9 hashish))
> (use-modules (oop goops))
>
> (define (myhash    k s) ...)
> (define (myequal? x y) ...)
>
> (define-class <my-hash-map> (<stis-hash-map>))
> (stis-new-hash-type <my-hash-map> make-my-hash-table myhash myequal?
>      my-hash-ref  my-hash-set!  my-hash-remove!)
>
> Preferably when we do not need to know the type of the hashtable we can
> either use the methods (slow) stis-hash-ref stis-hash-set!
> stis-hash-remove!
>
> The fast way is to take advantage of the following construct,
> (define H (make-my-hash-table))
> (stis-with-hash-accessors (H : ref set remove)
>     ...
>     (set k1 (ref k2 #f))
>     ...
>     (remove k3)
>     ...)
>
> Both these methods are safe.
>
> There are some extra tools available.
>
> > stis-hash-resize!
> One can resize a hashtable as one likes
>
> > stis-hash-define-world!
> One can fill the hash table and then specify that this is the world. Then
> If all long hashes that are stored are unique, no equal method will be used
> for identity and the hashmap will be freezed
>
> > stis-hash-unworld-it
> Unset the world property (not yet coded)
>
> > stis-hash-copy
> Create a new hashtable (this is not yet coded)
>
> IMPLEMENTATION
> The implementation is not yet defined, we need to test it and try
> different versions. But One thing that one can notice are that by reducing
> the
> size if the hashtable one can speed up all operations. So the idea is to
> store
> a list of hash/index pairs in the hash and use varying sized cells for
> this. The
> smallest cell is 2 bytes and we can use more bytes for the index (size of
> hash table) and fewer for the hash itself) this means that we could use a
> size of 64 for the hash value and allow 1024 items in the hash table.
> Similarly as the hash  grows we will use 32bit cells and split the maximal
> size of the hashtable between the hash value and the hash table size. The
> next size is 64bit cells and the last one are 128 bit cells. There will be
> a fallback hash table which is the usual a-list bin table But that one will
> be smaller memory wise and not in the hot path.
>
> I am in the process of doing the implementation and will see if I can find
> the optimal setting for all sizes which I can test. (basically to pin down
> the strategy for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the
> size of the hash can only be in those sizes.
>
> Oh also, for small hashes the hash is essentially a single association
> list and superfast.
>
> Maybe the dispatching overhead is too much, donough, perhaps trimming down
> the features might be needed.
>
> I will see if we can find some use of special assembler operations,
> especially for small hashtables.
>
> PRIMITIVES,
> There are essentially three primitives
> 1. search        =  find a matching hash, return index or handle or #f
>                            possible starting from a known position
>
> 2. add             =  add a new element that we now is not included
>                            in the hashtable
>
> 3. search-list   = fast alist search in case of a small assoc list case
>                           returns the key-val handle or #f possibly
> restarts
>                           from a known index.
>
> Possibly one may define a search-and-set-some primitive, e.g. one can
> in hot cases not need to fhe the add afterwards and add is only used in the
> slow path in a set operation. (For other operations it is in the fast path
> though)
>
> Happy Hacking!
>
> Regards
> Stefan
>
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 8171 bytes --]

      reply	other threads:[~2022-02-26  9:33 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-24  0:07 Proposal for a new hash interface Stefan Israelsson Tampe
2022-02-26  9:33 ` Stefan Israelsson Tampe [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGua6m2kR84K7WCepZzG85RRWM19-my-4fRuR0ggWBf99SkN4Q@mail.gmail.com \
    --to=stefan.itampe@gmail.com \
    --cc=guile-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).