From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Israelsson Tampe Newsgroups: gmane.lisp.guile.devel Subject: Re: Proposal for a new hash interface Date: Sat, 26 Feb 2022 10:33:55 +0100 Message-ID: References: Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000484a6805d8e8846e" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="29769"; mail-complaints-to="usenet@ciao.gmane.io" To: guile-devel Original-X-From: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Sat Feb 26 10:34:37 2022 Return-path: Envelope-to: guile-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1nNtTI-0007WM-UJ for guile-devel@m.gmane-mx.org; Sat, 26 Feb 2022 10:34:37 +0100 Original-Received: from localhost ([::1]:51486 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nNtTH-0006wB-2v for guile-devel@m.gmane-mx.org; Sat, 26 Feb 2022 04:34:35 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:39380) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nNtSv-0006vS-CG for guile-devel@gnu.org; Sat, 26 Feb 2022 04:34:15 -0500 Original-Received: from [2607:f8b0:4864:20::62a] (port=34673 helo=mail-pl1-x62a.google.com) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1nNtSs-0003IB-Nb for guile-devel@gnu.org; Sat, 26 Feb 2022 04:34:12 -0500 Original-Received: by mail-pl1-x62a.google.com with SMTP id ay5so3917766plb.1 for ; Sat, 26 Feb 2022 01:34:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=OUWe5SnU67agTnU/e+EJRtpE8leMfeFkaAG6US1qjok=; b=DA8ZMFRDf/Q+AsSADil7apdwX3Nlmit2//qW5E6xMuzW5y1Atrnvqk/RxlzYXIo241 HPv+YLdT2vFa5vnCgOczzxvbP58kdmTiU6Z3Sx9h4uoDqLzdd5QhDrTN23OLJH9PqFEE 3TlXfjuRVM9s3LHJzZttJhtMWOXWU0wpNcKqY85hEh+zwKm3IQ0gSjyYg+lBNYyq2K2n sr0W+GAtqM43EaOfoDGrfJ/TI6XT1YXXLYZHO7DUdUPrnmRxuT+c1tf4wEs96WLeCd4u SemW0waqsSJbfYpMPmqcmKtlxv59Q/Q1fGc1v48SAEGjTAacdlTUNlUl8hLc/ovN5/S3 CgzQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=OUWe5SnU67agTnU/e+EJRtpE8leMfeFkaAG6US1qjok=; b=lHaLqzeAeoq983bNvCaxWKkYu7dsVF1phHRwFShjtADEu6yw7GKmpH0fpH5ip9aG+3 RFU2qfRfOQVZbBKlpNsPu6eRcykjW6/MfLhdoOgc3xxBvcYmBddxLOU2XoUuBw8ic+TS pTzktOSNRIHo4CzDUqXmEIA6pNTHz7Oei7bfHrrgVoVz2r+Tb07E8L/JZijE8dBScwDy fAEUcZqqL9njskE+RNc/IRbs15M2aGTc65uejHINsH1l/YOZOD5q1RK18XamooiRWkkn uyLgCPMscvrL8mxArYbl8B5ckzHQXJgxyMS8/JWrwyAmppLZN1FALpHQXuwZn8tV9DsP IEog== X-Gm-Message-State: AOAM533xQQideoHnoZ8OKKapsKUM3Ee2U2nyuSXQW52WlR6u96rlJJvO Wd0A4AlqwbFLLEyG6pzUcfTQmMNwq/la4zcXm8kQnHOZoPw= X-Google-Smtp-Source: ABdhPJw5iseY2C282r9eH58yO/pa/z+KdAWKFepUmIbMlse9C0RzkhBVLx9CWucuEJFfgdBKi08zl2XOaobmpLj52qE= X-Received: by 2002:a17:902:7246:b0:151:49e7:d4fc with SMTP id c6-20020a170902724600b0015149e7d4fcmr2263154pll.88.1645868046894; Sat, 26 Feb 2022 01:34:06 -0800 (PST) In-Reply-To: X-Host-Lookup-Failed: Reverse DNS lookup failed for 2607:f8b0:4864:20::62a (failed) Received-SPF: pass client-ip=2607:f8b0:4864:20::62a; envelope-from=stefan.itampe@gmail.com; helo=mail-pl1-x62a.google.com X-Spam_score_int: -6 X-Spam_score: -0.7 X-Spam_bar: / X-Spam_report: (-0.7 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, PDS_HP_HELO_NORDNS=0.659, RCVD_IN_DNSWL_NONE=-0.0001, RDNS_NONE=0.793, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Original-Sender: "guile-devel" Xref: news.gmane.io gmane.lisp.guile.devel:21154 Archived-At: --000000000000484a6805d8e8846e Content-Type: text/plain; charset="UTF-8" This is autogenerated code, https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/macro.c Not sure how well uint128_t works on my system, but the compiler should be able to make this code run really fast. Anyhow it is quick matters to change to uint64_t due to things not being hand coded. Can you see what I am doing? On Thu, Feb 24, 2022 at 1:07 AM Stefan Israelsson Tampe < stefan.itampe@gmail.com> wrote: > High all, > > I have taken the task to study a possible hash table implementation mostly > in scheme but still speedy. I am still experimenting with different > approaches. With managing code and iteration higher order functions in > scheme and lower order implemented in C. I can match current hash > implementations but with hash-fold and hash-for-each 10-20X faster and > better semantics as it is not going via C. > > I suppose the implementation is more memory intensive than the current > implementation. But much much safer and featurefull. > > So what are the features: > * goops based organisation > > * allows one to type a hash table, e.g. disallow the use of > different versions > of the assessors for the same hash. > > * surprisingly fast although we define goops methods for the accessors > > * A flag that indicates that (hash i) = i, for integers, which is used in > python > > * Possible to freeze has hash e.g. it will not grow or shrink, when > cleared only > all the values will be cleared. > > * A possibility to use (hash x (ash 1 60)) as a key, e.g. when checking if > a slot > is equal, we compare only the long hash. > > * A proper separation between general code that can be pushed to C for > speed and scheme for nice integration and power > > Code: > > scheme: > https://gitlab.com/tampe/guile-persist/-/blob/master/ice-9/hashish.scm > > C: > https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/resize.c > > C/Scheme integration > Each handle is of the form > (hash key val) > > So when we store the key value pair we also store the hash values which are > 60bit wide. This means that we can dispatch the search for an item using > only the hash value and many operations can be done in a general way that > can be pushed to C (if we want). So typically you first try to use C and > when it finds a match you need to do the final check in scheme using the > correct equal predicate. This means that we can achieve a very extensible > system that you can extend without losing much performance. These > primitives make the scheme code beautiful and well organized so we should > probably keep this semantic even if we decide to do everything in guile > scheme. > > Example, consider that we have designed a new hash table semantics. The > easiest is to supply a hash function and a equal predicate like so > > (use-modules (ice-9 hashish)) > (use-modules (oop goops)) > > (define (myhash k s) ...) > (define (myequal? x y) ...) > > (define-class ()) > (stis-new-hash-type make-my-hash-table myhash myequal? > my-hash-ref my-hash-set! my-hash-remove!) > > Preferably when we do not need to know the type of the hashtable we can > either use the methods (slow) stis-hash-ref stis-hash-set! > stis-hash-remove! > > The fast way is to take advantage of the following construct, > (define H (make-my-hash-table)) > (stis-with-hash-accessors (H : ref set remove) > ... > (set k1 (ref k2 #f)) > ... > (remove k3) > ...) > > Both these methods are safe. > > There are some extra tools available. > > > stis-hash-resize! > One can resize a hashtable as one likes > > > stis-hash-define-world! > One can fill the hash table and then specify that this is the world. Then > If all long hashes that are stored are unique, no equal method will be used > for identity and the hashmap will be freezed > > > stis-hash-unworld-it > Unset the world property (not yet coded) > > > stis-hash-copy > Create a new hashtable (this is not yet coded) > > IMPLEMENTATION > The implementation is not yet defined, we need to test it and try > different versions. But One thing that one can notice are that by reducing > the > size if the hashtable one can speed up all operations. So the idea is to > store > a list of hash/index pairs in the hash and use varying sized cells for > this. The > smallest cell is 2 bytes and we can use more bytes for the index (size of > hash table) and fewer for the hash itself) this means that we could use a > size of 64 for the hash value and allow 1024 items in the hash table. > Similarly as the hash grows we will use 32bit cells and split the maximal > size of the hashtable between the hash value and the hash table size. The > next size is 64bit cells and the last one are 128 bit cells. There will be > a fallback hash table which is the usual a-list bin table But that one will > be smaller memory wise and not in the hot path. > > I am in the process of doing the implementation and will see if I can find > the optimal setting for all sizes which I can test. (basically to pin down > the strategy for each of the sizes 1,2,4,6,16,32,64,128,256, ... As the > size of the hash can only be in those sizes. > > Oh also, for small hashes the hash is essentially a single association > list and superfast. > > Maybe the dispatching overhead is too much, donough, perhaps trimming down > the features might be needed. > > I will see if we can find some use of special assembler operations, > especially for small hashtables. > > PRIMITIVES, > There are essentially three primitives > 1. search = find a matching hash, return index or handle or #f > possible starting from a known position > > 2. add = add a new element that we now is not included > in the hashtable > > 3. search-list = fast alist search in case of a small assoc list case > returns the key-val handle or #f possibly > restarts > from a known index. > > Possibly one may define a search-and-set-some primitive, e.g. one can > in hot cases not need to fhe the add afterwards and add is only used in the > slow path in a set operation. (For other operations it is in the fast path > though) > > Happy Hacking! > > Regards > Stefan > > > > > --000000000000484a6805d8e8846e Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
This is autogenerated code,

https://gitlab.com/tampe/guile-persist/-/blob/master/src/hash/macro.c=

Not sure how well uint128_t works on my system, but= the compiler=C2=A0should be able to make this code run really fast. Anyhow= it is quick matters to change to uint64_t due to things not being=C2=A0han= d coded. Can you see what I am doing?

On Thu, Feb 24, 2022 at 1:07 AM = Stefan Israelsson Tampe <stef= an.itampe@gmail.com> wrote:
High all,

I have tak= en the task to study a possible hash table implementation=C2=A0mostly in sc= heme but still speedy. I am still experimenting with different approaches. = With managing code and iteration=C2=A0higher order functions in scheme and = lower order implemented in C. I can match current hash implementations but = with hash-fold and hash-for-each 10-20X faster and better semantics as it i= s not going via C.

I suppose the implementation is= more memory intensive than the current implementation. But much much safer= and featurefull.

So what are the features:
<= div>* goops based organisation

* allows one to typ= e a hash table, e.g. disallow=C2=A0the use of different=C2=A0versions
=
=C2=A0 of the assessors=C2=A0for the same hash.

* surprisingly fast although we define goops methods=C2=A0for the access= ors

* A flag that indicates that (hash i) =3D i, f= or integers, which=C2=A0is used in python

* Possib= le to freeze has hash e.g. it will not grow or shrink, when cleared only
=C2=A0 all the values will be cleared.

* A= possibility=C2=A0to use (hash x (ash 1 60)) as a key, e.g. when checking i= f a slot=C2=A0
=C2=A0 is equal, we compare only the long hash.

* A proper separation between general=C2=A0code that= can be pushed to C for speed=C2=A0 =C2=A0 and scheme for nice integration = and power

Code:


C/Scheme integrati= on
Each handle is of the form
(hash key val)
=
So when we store the key value pair we also store the hash v= alues which are
60bit wide. This means that we can dispatch the s= earch for an item using only the hash value and many operations can be done= in a general way that can be pushed to C (if we want). So typically you fi= rst try to use C and when it finds a match you need to do the final check i= n scheme using the correct equal predicate. This means that we can achieve = a very extensible system that you can extend without losing much performanc= e. These primitives make=C2=A0the scheme code beautiful and well organized = so we should probably keep this semantic even if we decide to do everything= in guile scheme.

Example, consider that we have d= esigned a new hash table semantics. The easiest is to supply a hash functio= n and a equal predicate like so

(use-modules (ice-= 9 hashish))
(use-modules (oop goops))

(define (myhash=C2=A0 =C2=A0 k s) ...)
(define (myequal? x = y) ...)

(define-class <my-hash-map> (<sti= s-hash-map>))
(stis-new-hash-type <my-hash-map> make= -my-hash-table myhash=C2=A0myequal?
=C2=A0 =C2=A0 =C2=A0my-hash-ref=C2= =A0 my-hash-set!=C2=A0 my-hash-remove!)

Prefer= ably when we do not need to know the type of the hashtable we can
either use the methods (slow) stis-hash-ref stis-hash-set! stis-hash-remov= e!

The fast way is to take advantage of the follow= ing construct,
(define H (make-my-hash-table))
(stis-wi= th-hash-accessors (H : ref set remove)
=C2=A0 =C2=A0 ...=C2= =A0
=C2=A0 =C2=A0 (set k1 (ref k2 #f))
=C2=A0 =C2=A0 ..= .
=C2=A0 =C2=A0 (remove k3)
=C2=A0 =C2=A0 ...)

Both these methods are safe.

Ther= e are some extra tools available.

> stis-hash-r= esize!
One can resize a hashtable as one likes

=
> stis-hash-define-world!
One can fill the hash tab= le and then specify that this is the world. Then=C2=A0
If all lon= g hashes that are stored are unique, no equal method will be used
for identity and the hashmap will be freezed

>= stis-hash-unworld-it
Unset the world property (not yet coded)

> stis-hash-copy
Create a new hashtable= (this is not yet coded)

IMPLEMENTATION
= The implementation is not yet defined, we need to test it and try different= =C2=A0versions. But One thing that one can notice are that by reducing the<= /div>
size if the hashtable one can speed up all operations. So the ide= a is to store
a list of hash/index pairs in the hash and use vary= ing sized cells for this. The
smallest cell is 2 bytes and we can= use more bytes for the index (size of hash table) and fewer for the hash i= tself) this means that we could use a size of 64 for the hash value and all= ow 1024 items in the hash table. Similarly as the hash=C2=A0 grows we will = use 32bit cells and split the maximal size of the hashtable between the has= h value and the hash table size. The next size is 64bit cells and the last = one are 128 bit cells. There will be a fallback hash table which is the usu= al a-list bin table But that one will be smaller memory wise and not in the= hot path.=C2=A0

I am in the process of doing the = implementation and will see if I can find the optimal setting for all sizes= which I can test. (basically to pin down the strategy=C2=A0for each of the= sizes 1,2,4,6,16,32,64,128,256, ... As the size of the hash can only be in= those sizes.

Oh also, for small hashes the hash i= s essentially a single association list and superfast.

=
Maybe the dispatching overhead is too much, donough, perhaps trimming = down the features might be needed.

I will see if w= e can find some use of special assembler operations, especially for small h= ashtables.

PRIMITIVES,
There are essenti= ally three primitives
1. search=C2=A0 =C2=A0 =C2=A0 =C2=A0 =3D=C2= =A0 find a matching hash, return index or handle or #f
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0possible starting from a known position

2. add=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=3D=C2=A0 add a ne= w element that we now is not included=C2=A0
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0in the hashtable

3. search-list=C2=A0 =C2=A0=3D= fast alist search in case of a small assoc list case
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 returns the key-val handle or #f possibly restarts
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 from a known index.

Possibly one may de= fine a search-and-set-some primitive, e.g. one can
in hot cases n= ot need to fhe the add afterwards and add is only used in the
slo= w path in a set operation. (For other operations it is in the fast path tho= ugh)

Happy Hacking!

Regar= ds
Stefan



=
--000000000000484a6805d8e8846e--