From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Stefan Israelsson Tampe <stefan.itampe@gmail.com>
Newsgroups: gmane.lisp.guile.devel
Subject: Proposal for a new hash interface
Date: Thu, 24 Feb 2022 01:07:45 +0100
Message-ID: <CAGua6m1iXtXZMsWU-VEX-FZBpOVYQj7CGQHrNk3HfVfVbJ-j-g@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000dae6bb05d8b85f59"
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="8339"; mail-complaints-to="usenet@ciao.gmane.io"
To: guile-devel <guile-devel@gnu.org>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Thu Feb 24 01:08:15 2022
Return-path: <guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org>
Envelope-to: guile-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org>)
	id 1nN1g6-00023M-PV
	for guile-devel@m.gmane-mx.org; Thu, 24 Feb 2022 01:08:14 +0100
Original-Received: from localhost ([::1]:55226 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org>)
	id 1nN1g5-0006Pr-7X
	for guile-devel@m.gmane-mx.org; Wed, 23 Feb 2022 19:08:13 -0500
Original-Received: from eggs.gnu.org ([209.51.188.92]:48530)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <stefan.itampe@gmail.com>)
 id 1nN1ft-0006Pi-CM
 for guile-devel@gnu.org; Wed, 23 Feb 2022 19:08:01 -0500
Original-Received: from [2607:f8b0:4864:20::531] (port=42898
 helo=mail-pg1-x531.google.com)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <stefan.itampe@gmail.com>)
 id 1nN1fr-0007IW-5t
 for guile-devel@gnu.org; Wed, 23 Feb 2022 19:08:01 -0500
Original-Received: by mail-pg1-x531.google.com with SMTP id o8so258603pgf.9
 for <guile-devel@gnu.org>; Wed, 23 Feb 2022 16:07:58 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
 h=mime-version:from:date:message-id:subject:to;
 bh=otkpS0C9k1oBnbUFr3x4GuFpsTF37KkwoyIH1iX/RtI=;
 b=lFxYzsDA2a1pO6mYb2Q3ZzW8pv7x67/8BFwkJbj3p4Yqpg42d544oTkROPgHMLEl6s
 HZrQjnYpYYyKzn8yBa+lIQdGDZRKt5DAkq4QKrUnw4yBK1czL8uEOHmqHsLdwUfPP3n8
 3E3OCAixZCKXnlX8rAhqbuc+ZBVwXvEgm4HxXKk92GN9ADMPR595PsacoUY+SDaqQSL2
 waqQeoufVFjd2TjKKrbaODDzcabDZs2jQYm3pwQxt1NXsJ2FW015O7x3eCPRrngM64FK
 itiiCE4P9E2EXnVa2VJWNCoO00JD3oeB2foMZQP9P2wQN1WkfUFocEzXf751tPed839a
 irxA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=otkpS0C9k1oBnbUFr3x4GuFpsTF37KkwoyIH1iX/RtI=;
 b=bfYS/WyJO8KzivspEVk+2SOle0pit4pDkTX3SVFtP6J5Zvc2T0liRt0Vw6t1+4cDuE
 u9toMWuBQzC1445uzhXl7B9Erq7ZStqVFBTM/Vmb6GFzTGFyA8HF8DH5OudLOinLnJnc
 jDEgQFLVWJqgiEi5L+VdlDIX7HqR1dE+8Ztm05sQreaHCAfTGK9ET5v5Zv4RtouQB6Ld
 dPaUqi0l7RczScgdq6UpbcjZIXtzAvOIj4OYE6t0IE92kdj3xNJHMjP7F38NVLBC7UqX
 wG/RF1we+UL+X+kBLDiKcnOsiLuQp+jXXpblzIU1hDdqi/nWmJ4hW4RLKNQ6GfOYioQq
 sLrQ==
X-Gm-Message-State: AOAM531sqXDzfIMn5r18LVNouvWlo4lBLUw2FiK+iN0K2VrgiaR39sNb
 5VXyM1PsG7ZNxNxM1PQJi+jJUqN9olwCEP8dWWNNStTsHgE=
X-Google-Smtp-Source: ABdhPJxBVIT/NEumkO2g/Ef6ObI97PCjE14WGCwjuDjYlcTqEy87lr+FiREGb06KLPbNb+GP8cN3iuHm1euHuBbX3HM=
X-Received: by 2002:a05:6a00:124f:b0:4c0:6242:c14e with SMTP id
 u15-20020a056a00124f00b004c06242c14emr94374pfi.83.1645661277315; Wed, 23 Feb
 2022 16:07:57 -0800 (PST)
X-Host-Lookup-Failed: Reverse DNS lookup failed for 2607:f8b0:4864:20::531
 (failed)
Received-SPF: pass client-ip=2607:f8b0:4864:20::531;
 envelope-from=stefan.itampe@gmail.com; helo=mail-pg1-x531.google.com
X-Spam_score_int: -6
X-Spam_score: -0.7
X-Spam_bar: /
X-Spam_report: (-0.7 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001,
 HTML_MESSAGE=0.001, PDS_HP_HELO_NORDNS=0.659, RCVD_IN_DNSWL_NONE=-0.0001,
 RDNS_NONE=0.793, SPF_HELO_NONE=0.001, SPF_PASS=-0.001,
 T_SCC_BODY_TEXT_LINE=-0.01 autolearn=no autolearn_force=no
X-Spam_action: no action
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Developers list for Guile,
 the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-devel>,
 <mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-devel>,
 <mailto:guile-devel-request@gnu.org?subject=subscribe>
Errors-To: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org
Original-Sender: "guile-devel" <guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.lisp.guile.devel:21148
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/21148>

--000000000000dae6bb05d8b85f59
Content-Type: text/plain; charset="UTF-8"

High all,

I have taken the task to study a possible hash table implementation mostly
in scheme but still speedy. I am still experimenting with different
approaches. With managing code and iteration higher order functions in
scheme and lower order implemented in C. I can match current hash
implementations but with hash-fold and hash-for-each 10-20X faster and
better semantics as it is not going via C.

I suppose the implementation is more memory intensive than the current
implementation. But much much safer and featurefull.

So what are the features:
* goops based organisation

* allows one to type a hash table, e.g. disallow the use of
different versions
  of the assessors for the same hash.

* surprisingly fast although we define goops methods for the accessors

* A flag that indicates that (hash i) = i, for integers, which is used in
python

* Possible to freeze has hash e.g. it will not grow or shrink, when cleared
only
  all the values will be cleared.

* A possibility to use (hash x (ash 1 60)) as a key, e.g. when checking if
a slot
  is equal, we compare only the long hash.

* A proper separation between general code that can be pushed to C for
speed    and scheme for nice integration and power

Code:

scheme:
https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm

C:
https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/resize.c

C/Scheme integration
Each handle is of the form
(hash key val)

So when we store the key value pair we also store the hash values which are
60bit wide. This means that we can dispatch the search for an item using
only the hash value and many operations can be done in a general way that
can be pushed to C (if we want). So typically you first try to use C and
when it finds a match you need to do the final check in scheme using the
correct equal predicate. This means that we can achieve a very extensible
system that you can extend without losing much performance. These
primitives make the scheme code beautiful and well organized so we should
probably keep this semantic even if we decide to do everything in guile
scheme.

Example, consider that we have designed a new hash table semantics. The
easiest is to supply a hash function and a equal predicate like so

(use-modules (ice-9 hashish))
(use-modules (oop goops))

(define (myhash    k s) ...)
(define (myequal? x y) ...)

(define-class <my-hash-map> (<stis-hash-map>))
(stis-new-hash-type <my-hash-map> make-my-hash-table myhash myequal?
     my-hash-ref  my-hash-set!  my-hash-remove!)

Preferably when we do not need to know the type of the hashtable we can
either use the methods (slow) stis-hash-ref stis-hash-set! stis-hash-remove!

The fast way is to take advantage of the following construct,
(define H (make-my-hash-table))
(stis-with-hash-accessors (H : ref set remove)
    ...
    (set k1 (ref k2 #f))
    ...
    (remove k3)
    ...)

Both these methods are safe.

There are some extra tools available.

> stis-hash-resize!
One can resize a hashtable as one likes

> stis-hash-define-world!
One can fill the hash table and then specify that this is the world. Then
If all long hashes that are stored are unique, no equal method will be used
for identity and the hashmap will be freezed

> stis-hash-unworld-it
Unset the world property (not yet coded)

> stis-hash-copy
Create a new hashtable (this is not yet coded)

IMPLEMENTATION
The implementation is not yet defined, we need to test it and try
different versions. But One thing that one can notice are that by reducing
the
size if the hashtable one can speed up all operations. So the idea is to
store
a list of hash/index pairs in the hash and use varying sized cells for
this. The
smallest cell is 2 bytes and we can use more bytes for the index (size of
hash table) and fewer for the hash itself) this means that we could use a
size of 64 for the hash value and allow 1024 items in the hash table.
Similarly as the hash  grows we will use 32bit cells and split the maximal
size of the hashtable between the hash value and the hash table size. The
next size is 64bit cells and the last one are 128 bit cells. There will be
a fallback hash table which is the usual a-list bin table But that one will
be smaller memory wise and not in the hot path.

I am in the process of doing the implementation and will see if I can find
the optimal setting for all sizes which I can test. (basically to pin down
the strategy for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the
size of the hash can only be in those sizes.

Oh also, for small hashes the hash is essentially a single association list
and superfast.

Maybe the dispatching overhead is too much, donough, perhaps trimming down
the features might be needed.

I will see if we can find some use of special assembler operations,
especially for small hashtables.

PRIMITIVES,
There are essentially three primitives
1. search        =  find a matching hash, return index or handle or #f
                           possible starting from a known position

2. add             =  add a new element that we now is not included
                           in the hashtable

3. search-list   = fast alist search in case of a small assoc list case
                          returns the key-val handle or #f possibly restarts
                          from a known index.

Possibly one may define a search-and-set-some primitive, e.g. one can
in hot cases not need to fhe the add afterwards and add is only used in the
slow path in a set operation. (For other operations it is in the fast path
though)

Happy Hacking!

Regards
Stefan

--000000000000dae6bb05d8b85f59
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">High all,<div><br></div><div>I have taken the task to stud=
y a possible hash table implementation=C2=A0mostly in scheme but still spee=
dy. I am still experimenting with different approaches. With managing code =
and iteration=C2=A0higher order functions in scheme and lower order impleme=
nted in C. I can match current hash implementations but with hash-fold and =
hash-for-each 10-20X faster and better semantics as it is not going via C.<=
/div><div><br></div><div>I suppose the implementation is more memory intens=
ive than the current implementation. But much much safer and featurefull.</=
div><div><br></div><div>So what are the features:</div><div>* goops based o=
rganisation</div><div><br></div><div>* allows one to type a hash table, e.g=
. disallow=C2=A0the use of different=C2=A0versions</div><div>=C2=A0 of the =
assessors=C2=A0for the same hash.</div><div><br></div><div>* surprisingly f=
ast although we define goops methods=C2=A0for the accessors</div><div><br><=
/div><div>* A flag that indicates that (hash i) =3D i, for integers, which=
=C2=A0is used in python</div><div><br></div><div>* Possible to freeze has h=
ash e.g. it will not grow or shrink, when cleared only</div><div>=C2=A0 all=
 the values will be cleared.</div><div><br></div><div>* A possibility=C2=A0=
to use (hash x (ash 1 60)) as a key, e.g. when checking if a slot=C2=A0</di=
v><div>=C2=A0 is equal, we compare only the long hash.</div><div><br></div>=
<div>* A proper separation between general=C2=A0code that can be pushed to =
C for speed=C2=A0 =C2=A0 and scheme for nice integration and power</div><di=
v><br></div><div>Code:</div><div><br></div><div>scheme:</div><div><a href=
=3D"https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm"=
>https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm</a>=
<br></div><div><br></div><div>C:</div><div><a href=3D"https://gitlab.com/ta=
mpe/guile-persist/-/blob/master/src/hash/resize.c">https://gitlab.com/tampe=
/guile-persist/-/blob/master/src/hash/resize.c</a><br></div><div><br></div>=
<div>C/Scheme integration</div><div>Each handle is of the form</div><div>(h=
ash key val)</div><div><br></div><div>So when we store the key value pair w=
e also store the hash values which are</div><div>60bit wide. This means tha=
t we can dispatch the search for an item using only the hash value and many=
 operations can be done in a general way that can be pushed to C (if we wan=
t). So typically you first try to use C and when it finds a match you need =
to do the final check in scheme using the correct equal predicate. This mea=
ns that we can achieve a very extensible system that you can extend without=
 losing much performance. These primitives make=C2=A0the scheme code beauti=
ful and well organized so we should probably keep this semantic even if we =
decide to do everything in guile scheme.</div><div><br></div><div>Example, =
consider that we have designed a new hash table semantics. The easiest is t=
o supply a hash function and a equal predicate like so</div><div><br></div>=
<div>(use-modules (ice-9 hashish))<br></div><div>(use-modules (oop goops))<=
br></div><div><br></div><div>(define (myhash=C2=A0 =C2=A0 k s) ...)</div><d=
iv>(define (myequal? x y) ...)</div><div><br></div><div>(define-class &lt;m=
y-hash-map&gt; (&lt;stis-hash-map&gt;))<br></div><div>(stis-new-hash-type &=
lt;my-hash-map&gt; make-my-hash-table myhash=C2=A0myequal?<br>=C2=A0 =C2=A0=
 =C2=A0my-hash-ref=C2=A0 my-hash-set!=C2=A0 my-hash-remove!)<br></div><div>=
<br></div><div>Preferably when we do not need to know the type of the hasht=
able we can</div><div>either use the methods (slow) stis-hash-ref stis-hash=
-set! stis-hash-remove!</div><div><br></div><div>The fast way is to take ad=
vantage of the following construct,</div><div>(define H (make-my-hash-table=
))</div><div>(stis-with-hash-accessors (H : ref set remove)<br></div><div>=
=C2=A0 =C2=A0 ...=C2=A0</div><div>=C2=A0 =C2=A0 (set k1 (ref k2 #f))</div><=
div>=C2=A0 =C2=A0 ...</div><div>=C2=A0 =C2=A0 (remove k3)</div><div>=C2=A0 =
=C2=A0 ...)</div><div><br></div><div>Both these methods are safe.</div><div=
><br></div><div>There are some extra tools available.</div><div><br></div><=
div>&gt; stis-hash-resize!<br></div><div>One can resize a hashtable as one =
likes</div><div><br></div><div>&gt; stis-hash-define-world!</div><div>One c=
an fill the hash table and then specify that this is the world. Then=C2=A0<=
/div><div>If all long hashes that are stored are unique, no equal method wi=
ll be used</div><div>for identity and the hashmap will be freezed</div><div=
><br></div><div>&gt; stis-hash-unworld-it</div><div>Unset the world propert=
y (not yet coded)</div><div><br></div><div>&gt; stis-hash-copy</div><div>Cr=
eate a new hashtable (this is not yet coded)</div><div><br></div><div>IMPLE=
MENTATION</div><div>The implementation is not yet defined, we need to test =
it and try different=C2=A0versions. But One thing that one can notice are t=
hat by reducing the</div><div>size if the hashtable one can speed up all op=
erations. So the idea is to store</div><div>a list of hash/index pairs in t=
he hash and use varying sized cells for this. The</div><div>smallest cell i=
s 2 bytes and we can use more bytes for the index (size of hash table) and =
fewer for the hash itself) this means that we could use a size of 64 for th=
e hash value and allow 1024 items in the hash table. Similarly as the hash=
=C2=A0 grows we will use 32bit cells and split the maximal size of the hash=
table between the hash value and the hash table size. The next size is 64bi=
t cells and the last one are 128 bit cells. There will be a fallback hash t=
able which is the usual a-list bin table But that one will be smaller memor=
y wise and not in the hot path.=C2=A0</div><div><br></div><div>I am in the =
process of doing the implementation and will see if I can find the optimal =
setting for all sizes which I can test. (basically to pin down the strategy=
=C2=A0for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the size of th=
e hash can only be in those sizes.</div><div><br></div><div>Oh also, for sm=
all hashes the hash is essentially a single association list and superfast.=
</div><div><br></div><div>Maybe the dispatching overhead is too much, donou=
gh, perhaps trimming down the features might be needed.</div><div><br></div=
><div>I will see if we can find some use of special assembler operations, e=
specially for small hashtables.</div><div><br></div><div>PRIMITIVES,</div><=
div>There are essentially three primitives</div><div>1. search=C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =3D=C2=A0 find a matching hash, return index or handle or #f=
</div><div>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0possible starting from a known position</=
div><div><br></div><div>2. add=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0=3D=C2=A0 add a new element that we now is not included=C2=A0</div><div>=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0in the hashtable</div><div><br></div><div>3. search=
-list=C2=A0 =C2=A0=3D fast alist search in case of a small assoc list case<=
/div><div>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 returns the key-val handle or #f possibly resta=
rts</div><div>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 from a known index.</div><div><br></div><di=
v>Possibly one may define a search-and-set-some primitive, e.g. one can</di=
v><div>in hot cases not need to fhe the add afterwards and add is only used=
 in the</div><div>slow path in a set operation. (For other operations it is=
 in the fast path though)</div><div><br></div><div>Happy Hacking!</div><div=
><br></div><div>Regards</div><div>Stefan</div><div><br></div><div><br></div=
><div><br></div><div><br></div></div>

--000000000000dae6bb05d8b85f59--