From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Israelsson Tampe Newsgroups: gmane.lisp.guile.devel Subject: Proposal for a new hash interface Date: Thu, 24 Feb 2022 01:07:45 +0100 Message-ID: Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000dae6bb05d8b85f59" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="8339"; mail-complaints-to="usenet@ciao.gmane.io" To: guile-devel Original-X-From: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Thu Feb 24 01:08:15 2022 Return-path: Envelope-to: guile-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1nN1g6-00023M-PV for guile-devel@m.gmane-mx.org; Thu, 24 Feb 2022 01:08:14 +0100 Original-Received: from localhost ([::1]:55226 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nN1g5-0006Pr-7X for guile-devel@m.gmane-mx.org; Wed, 23 Feb 2022 19:08:13 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:48530) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nN1ft-0006Pi-CM for guile-devel@gnu.org; Wed, 23 Feb 2022 19:08:01 -0500 Original-Received: from [2607:f8b0:4864:20::531] (port=42898 helo=mail-pg1-x531.google.com) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1nN1fr-0007IW-5t for guile-devel@gnu.org; Wed, 23 Feb 2022 19:08:01 -0500 Original-Received: by mail-pg1-x531.google.com with SMTP id o8so258603pgf.9 for ; Wed, 23 Feb 2022 16:07:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:from:date:message-id:subject:to; bh=otkpS0C9k1oBnbUFr3x4GuFpsTF37KkwoyIH1iX/RtI=; b=lFxYzsDA2a1pO6mYb2Q3ZzW8pv7x67/8BFwkJbj3p4Yqpg42d544oTkROPgHMLEl6s HZrQjnYpYYyKzn8yBa+lIQdGDZRKt5DAkq4QKrUnw4yBK1czL8uEOHmqHsLdwUfPP3n8 3E3OCAixZCKXnlX8rAhqbuc+ZBVwXvEgm4HxXKk92GN9ADMPR595PsacoUY+SDaqQSL2 waqQeoufVFjd2TjKKrbaODDzcabDZs2jQYm3pwQxt1NXsJ2FW015O7x3eCPRrngM64FK itiiCE4P9E2EXnVa2VJWNCoO00JD3oeB2foMZQP9P2wQN1WkfUFocEzXf751tPed839a irxA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=otkpS0C9k1oBnbUFr3x4GuFpsTF37KkwoyIH1iX/RtI=; b=bfYS/WyJO8KzivspEVk+2SOle0pit4pDkTX3SVFtP6J5Zvc2T0liRt0Vw6t1+4cDuE u9toMWuBQzC1445uzhXl7B9Erq7ZStqVFBTM/Vmb6GFzTGFyA8HF8DH5OudLOinLnJnc jDEgQFLVWJqgiEi5L+VdlDIX7HqR1dE+8Ztm05sQreaHCAfTGK9ET5v5Zv4RtouQB6Ld dPaUqi0l7RczScgdq6UpbcjZIXtzAvOIj4OYE6t0IE92kdj3xNJHMjP7F38NVLBC7UqX wG/RF1we+UL+X+kBLDiKcnOsiLuQp+jXXpblzIU1hDdqi/nWmJ4hW4RLKNQ6GfOYioQq sLrQ== X-Gm-Message-State: AOAM531sqXDzfIMn5r18LVNouvWlo4lBLUw2FiK+iN0K2VrgiaR39sNb 5VXyM1PsG7ZNxNxM1PQJi+jJUqN9olwCEP8dWWNNStTsHgE= X-Google-Smtp-Source: ABdhPJxBVIT/NEumkO2g/Ef6ObI97PCjE14WGCwjuDjYlcTqEy87lr+FiREGb06KLPbNb+GP8cN3iuHm1euHuBbX3HM= X-Received: by 2002:a05:6a00:124f:b0:4c0:6242:c14e with SMTP id u15-20020a056a00124f00b004c06242c14emr94374pfi.83.1645661277315; Wed, 23 Feb 2022 16:07:57 -0800 (PST) X-Host-Lookup-Failed: Reverse DNS lookup failed for 2607:f8b0:4864:20::531 (failed) Received-SPF: pass client-ip=2607:f8b0:4864:20::531; envelope-from=stefan.itampe@gmail.com; helo=mail-pg1-x531.google.com X-Spam_score_int: -6 X-Spam_score: -0.7 X-Spam_bar: / X-Spam_report: (-0.7 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, PDS_HP_HELO_NORDNS=0.659, RCVD_IN_DNSWL_NONE=-0.0001, RDNS_NONE=0.793, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Original-Sender: "guile-devel" Xref: news.gmane.io gmane.lisp.guile.devel:21148 Archived-At: --000000000000dae6bb05d8b85f59 Content-Type: text/plain; charset="UTF-8" High all, I have taken the task to study a possible hash table implementation mostly in scheme but still speedy. I am still experimenting with different approaches. With managing code and iteration higher order functions in scheme and lower order implemented in C. I can match current hash implementations but with hash-fold and hash-for-each 10-20X faster and better semantics as it is not going via C. I suppose the implementation is more memory intensive than the current implementation. But much much safer and featurefull. So what are the features: * goops based organisation * allows one to type a hash table, e.g. disallow the use of different versions of the assessors for the same hash. * surprisingly fast although we define goops methods for the accessors * A flag that indicates that (hash i) = i, for integers, which is used in python * Possible to freeze has hash e.g. it will not grow or shrink, when cleared only all the values will be cleared. * A possibility to use (hash x (ash 1 60)) as a key, e.g. when checking if a slot is equal, we compare only the long hash. * A proper separation between general code that can be pushed to C for speed and scheme for nice integration and power Code: scheme: https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm C: https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/resize.c C/Scheme integration Each handle is of the form (hash key val) So when we store the key value pair we also store the hash values which are 60bit wide. This means that we can dispatch the search for an item using only the hash value and many operations can be done in a general way that can be pushed to C (if we want). So typically you first try to use C and when it finds a match you need to do the final check in scheme using the correct equal predicate. This means that we can achieve a very extensible system that you can extend without losing much performance. These primitives make the scheme code beautiful and well organized so we should probably keep this semantic even if we decide to do everything in guile scheme. Example, consider that we have designed a new hash table semantics. The easiest is to supply a hash function and a equal predicate like so (use-modules (ice-9 hashish)) (use-modules (oop goops)) (define (myhash k s) ...) (define (myequal? x y) ...) (define-class ()) (stis-new-hash-type make-my-hash-table myhash myequal? my-hash-ref my-hash-set! my-hash-remove!) Preferably when we do not need to know the type of the hashtable we can either use the methods (slow) stis-hash-ref stis-hash-set! stis-hash-remove! The fast way is to take advantage of the following construct, (define H (make-my-hash-table)) (stis-with-hash-accessors (H : ref set remove) ... (set k1 (ref k2 #f)) ... (remove k3) ...) Both these methods are safe. There are some extra tools available. > stis-hash-resize! One can resize a hashtable as one likes > stis-hash-define-world! One can fill the hash table and then specify that this is the world. Then If all long hashes that are stored are unique, no equal method will be used for identity and the hashmap will be freezed > stis-hash-unworld-it Unset the world property (not yet coded) > stis-hash-copy Create a new hashtable (this is not yet coded) IMPLEMENTATION The implementation is not yet defined, we need to test it and try different versions. But One thing that one can notice are that by reducing the size if the hashtable one can speed up all operations. So the idea is to store a list of hash/index pairs in the hash and use varying sized cells for this. The smallest cell is 2 bytes and we can use more bytes for the index (size of hash table) and fewer for the hash itself) this means that we could use a size of 64 for the hash value and allow 1024 items in the hash table. Similarly as the hash grows we will use 32bit cells and split the maximal size of the hashtable between the hash value and the hash table size. The next size is 64bit cells and the last one are 128 bit cells. There will be a fallback hash table which is the usual a-list bin table But that one will be smaller memory wise and not in the hot path. I am in the process of doing the implementation and will see if I can find the optimal setting for all sizes which I can test. (basically to pin down the strategy for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the size of the hash can only be in those sizes. Oh also, for small hashes the hash is essentially a single association list and superfast. Maybe the dispatching overhead is too much, donough, perhaps trimming down the features might be needed. I will see if we can find some use of special assembler operations, especially for small hashtables. PRIMITIVES, There are essentially three primitives 1. search = find a matching hash, return index or handle or #f possible starting from a known position 2. add = add a new element that we now is not included in the hashtable 3. search-list = fast alist search in case of a small assoc list case returns the key-val handle or #f possibly restarts from a known index. Possibly one may define a search-and-set-some primitive, e.g. one can in hot cases not need to fhe the add afterwards and add is only used in the slow path in a set operation. (For other operations it is in the fast path though) Happy Hacking! Regards Stefan --000000000000dae6bb05d8b85f59 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
High all,

I have taken the task to stud= y a possible hash table implementation=C2=A0mostly in scheme but still spee= dy. I am still experimenting with different approaches. With managing code = and iteration=C2=A0higher order functions in scheme and lower order impleme= nted in C. I can match current hash implementations but with hash-fold and = hash-for-each 10-20X faster and better semantics as it is not going via C.<= /div>

I suppose the implementation is more memory intens= ive than the current implementation. But much much safer and featurefull.

So what are the features:
* goops based o= rganisation

* allows one to type a hash table, e.g= . disallow=C2=A0the use of different=C2=A0versions
=C2=A0 of the = assessors=C2=A0for the same hash.

* surprisingly f= ast although we define goops methods=C2=A0for the accessors

<= /div>
* A flag that indicates that (hash i) =3D i, for integers, which= =C2=A0is used in python

* Possible to freeze has h= ash e.g. it will not grow or shrink, when cleared only
=C2=A0 all= the values will be cleared.

* A possibility=C2=A0= to use (hash x (ash 1 60)) as a key, e.g. when checking if a slot=C2=A0
=C2=A0 is equal, we compare only the long hash.

=
* A proper separation between general=C2=A0code that can be pushed to = C for speed=C2=A0 =C2=A0 and scheme for nice integration and power

Code:

scheme:

C:

=
C/Scheme integration
Each handle is of the form
(h= ash key val)

So when we store the key value pair w= e also store the hash values which are
60bit wide. This means tha= t we can dispatch the search for an item using only the hash value and many= operations can be done in a general way that can be pushed to C (if we wan= t). So typically you first try to use C and when it finds a match you need = to do the final check in scheme using the correct equal predicate. This mea= ns that we can achieve a very extensible system that you can extend without= losing much performance. These primitives make=C2=A0the scheme code beauti= ful and well organized so we should probably keep this semantic even if we = decide to do everything in guile scheme.

Example, = consider that we have designed a new hash table semantics. The easiest is t= o supply a hash function and a equal predicate like so

=
(use-modules (ice-9 hashish))
(use-modules (oop goops))<= br>

(define (myhash=C2=A0 =C2=A0 k s) ...)
(define (myequal? x y) ...)

(define-class <m= y-hash-map> (<stis-hash-map>))
(stis-new-hash-type &= lt;my-hash-map> make-my-hash-table myhash=C2=A0myequal?
=C2=A0 =C2=A0= =C2=A0my-hash-ref=C2=A0 my-hash-set!=C2=A0 my-hash-remove!)
=
Preferably when we do not need to know the type of the hasht= able we can
either use the methods (slow) stis-hash-ref stis-hash= -set! stis-hash-remove!

The fast way is to take ad= vantage of the following construct,
(define H (make-my-hash-table= ))
(stis-with-hash-accessors (H : ref set remove)
= =C2=A0 =C2=A0 ...=C2=A0
=C2=A0 =C2=A0 (set k1 (ref k2 #f))
<= div>=C2=A0 =C2=A0 ...
=C2=A0 =C2=A0 (remove k3)
=C2=A0 = =C2=A0 ...)

Both these methods are safe.

There are some extra tools available.

<= div>> stis-hash-resize!
One can resize a hashtable as one = likes

> stis-hash-define-world!
One c= an fill the hash table and then specify that this is the world. Then=C2=A0<= /div>
If all long hashes that are stored are unique, no equal method wi= ll be used
for identity and the hashmap will be freezed

> stis-hash-unworld-it
Unset the world propert= y (not yet coded)

> stis-hash-copy
Cr= eate a new hashtable (this is not yet coded)

IMPLE= MENTATION
The implementation is not yet defined, we need to test = it and try different=C2=A0versions. But One thing that one can notice are t= hat by reducing the
size if the hashtable one can speed up all op= erations. So the idea is to store
a list of hash/index pairs in t= he hash and use varying sized cells for this. The
smallest cell i= s 2 bytes and we can use more bytes for the index (size of hash table) and = fewer for the hash itself) this means that we could use a size of 64 for th= e hash value and allow 1024 items in the hash table. Similarly as the hash= =C2=A0 grows we will use 32bit cells and split the maximal size of the hash= table between the hash value and the hash table size. The next size is 64bi= t cells and the last one are 128 bit cells. There will be a fallback hash t= able which is the usual a-list bin table But that one will be smaller memor= y wise and not in the hot path.=C2=A0

I am in the = process of doing the implementation and will see if I can find the optimal = setting for all sizes which I can test. (basically to pin down the strategy= =C2=A0for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the size of th= e hash can only be in those sizes.

Oh also, for sm= all hashes the hash is essentially a single association list and superfast.=

Maybe the dispatching overhead is too much, donou= gh, perhaps trimming down the features might be needed.

I will see if we can find some use of special assembler operations, e= specially for small hashtables.

PRIMITIVES,
<= div>There are essentially three primitives
1. search=C2=A0 =C2=A0= =C2=A0 =C2=A0 =3D=C2=A0 find a matching hash, return index or handle or #f=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0possible starting from a known position

2. add=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0=3D=C2=A0 add a new element that we now is not included=C2=A0
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0in the hashtable

3. search= -list=C2=A0 =C2=A0=3D fast alist search in case of a small assoc list case<= /div>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 returns the key-val handle or #f possibly resta= rts
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 from a known index.

Possibly one may define a search-and-set-some primitive, e.g. one can
in hot cases not need to fhe the add afterwards and add is only used= in the
slow path in a set operation. (For other operations it is= in the fast path though)

Happy Hacking!

Regards
Stefan




--000000000000dae6bb05d8b85f59--