unofficial mirror of guile-devel@gnu.org 
* more advanced bytevector => supervectors
@ 2021-09-02 15:45 Stefan Israelsson Tampe
  2021-09-02 15:54 ` Stefan Israelsson Tampe
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Stefan Israelsson Tampe @ 2021-09-02 15:45 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2196 bytes --]

Hi guilers!

My next project is to explore a more advanced bytevector structure than
today's bytevectors. I think we should have this infrastructure in Guile,
with core code taking advantage of it, for example building strings on top
of it and letting our string library use those (the difficult case is
probably getting regexps to work properly).

Anyhow, this is my initial comment in the code:

#|
The idea of this data structure is to note that when we perform full
scans of large data structures, we can afford a much richer and more
featureful data structure than a simple bytevector. The reason is that we
can divide the data into byte chunks and spend most of the time
scanning/copying/mapping those areas with the usual methods; even
optimized C code taking advantage of advanced CPU opcodes is possible
here. And by dividing the data into chunks we get a lot of new features
at essentially no cost beyond complexity, which we manage mostly in
Scheme. We gain many things:

1. Many new features that can vary per page

2. Less memory hogging, as
   a. it can use copy-on-write semantics
   b. it does not need to find 1 GB contiguous blocks

3. Potentially faster operations, as we can fast-forward the zero-on-write
   pages compared to pure bytevectors

4. We can allow swapping and referential data structures to speed up
   copying even further

5. Better fiber behavior than C programs that spend seconds or minutes
   performing an operation, because it will just spend a microsecond or
   so in C-land and then swap back to Scheme. Ctrl-C will also work nicely.

6. We could potentially build a string library on top of these data
   structures, and we could also add features to pages that lead to a
   much richer interface.

7. Resizing is much faster and more efficient

8. Reversing is super quick

9. Queues and stacks of byte data can have very efficient implementations

Drawbacks:
1. More complexity, as we need to consider chunk boundaries
2. Slower one-off operations like bytevector-u8-ref, since Guile compiles
   the core operations to quite effective JIT CPU encodings. But maybe we
   can design some caching to make those operations much faster, and even
   have supporting JIT operations.

|#
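To make the chunking idea concrete, here is a minimal sketch (not the
actual supervector API; `make-supervec`, `supervec-u8-ref`, and
`supervec-u8-set!` are hypothetical names): a vector of fixed-size
bytevector chunks, with ref/set! doing a quotient/remainder to find the
chunk.

```scheme
(use-modules (rnrs bytevectors))

(define chunk-size 4096)

;; Allocate enough chunks to cover n bytes; no single large block needed.
(define (make-supervec n)
  (let* ((k (quotient (+ n chunk-size -1) chunk-size))
         (v (make-vector k #f)))
    (let lp ((i 0))
      (when (< i k)
        (vector-set! v i (make-bytevector chunk-size 0))
        (lp (+ i 1))))
    v))

;; Index: find the chunk, then the offset within it.
(define (supervec-u8-ref sv i)
  (bytevector-u8-ref (vector-ref sv (quotient i chunk-size))
                     (remainder i chunk-size)))

(define (supervec-u8-set! sv i x)
  (bytevector-u8-set! (vector-ref sv (quotient i chunk-size))
                      (remainder i chunk-size) x))
```

Because chunks are independent, resizing just appends or drops chunks
instead of reallocating, and a read-only chunk can be shared between
copies until it is written to.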

WDYT ?

[-- Attachment #2: Type: text/html, Size: 2492 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-02 15:45 more advanced bytevector => supervectors Stefan Israelsson Tampe
@ 2021-09-02 15:54 ` Stefan Israelsson Tampe
  2021-09-06  2:10   ` Stefan Israelsson Tampe
  2021-09-02 19:11 ` Matt Wette
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Stefan Israelsson Tampe @ 2021-09-02 15:54 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2497 bytes --]

Oh I just created the project, you can follow it here:

https://gitlab.com/tampe/stis-supervectors

On Thu, Sep 2, 2021 at 5:45 PM Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> [...]

[-- Attachment #2: Type: text/html, Size: 3049 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-02 15:45 more advanced bytevector => supervectors Stefan Israelsson Tampe
  2021-09-02 15:54 ` Stefan Israelsson Tampe
@ 2021-09-02 19:11 ` Matt Wette
  2021-09-02 19:47   ` tomas
  2021-09-08  2:04 ` Stefan Israelsson Tampe
  2021-09-15  0:12 ` Stefan Israelsson Tampe
  3 siblings, 1 reply; 11+ messages in thread
From: Matt Wette @ 2021-09-02 19:11 UTC (permalink / raw)
  To: guile-devel

maybe guile could consider regexp's in scheme
see  https://ds26gte.github.io/pregexp/index.html

On 9/2/21 8:45 AM, Stefan Israelsson Tampe wrote:
> [...]




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-02 19:11 ` Matt Wette
@ 2021-09-02 19:47   ` tomas
  0 siblings, 0 replies; 11+ messages in thread
From: tomas @ 2021-09-02 19:47 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 261 bytes --]

On Thu, Sep 02, 2021 at 12:11:12PM -0700, Matt Wette wrote:
> maybe guile could consider regexp's in scheme
> see  https://ds26gte.github.io/pregexp/index.html

This would have other advantages (e.g. suspendable match
off a port), right?

Cheers
 - t

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-02 15:54 ` Stefan Israelsson Tampe
@ 2021-09-06  2:10   ` Stefan Israelsson Tampe
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Israelsson Tampe @ 2021-09-06  2:10 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 5237 bytes --]

Further developments are now in. The features are (guided somewhat by how
memory is managed in Linux/Unix, and by current Guile strings):

1.  Chunked vectors, i.e. we divide the vector into areas with different
    properties.
2.  Optionally typed, i.e. we can view the vector as u32 etc., so there is
    a notion of object length as well as byte length.
3.  Unaligned data transfers can be mitigated (controlled by a depth
    parameter) by using a tree, i.e. a chunk is itself a supervector of
    the same size as the old chunk.
4.  Growing and shrinking can be efficient.
5.  We have a pre side and a post side, so queues and stacks are well
    modeled by this.
6.  Fast reverse operation (useful for queues).
7.  Supervectors can be read-only.
8.  Read-only supervectors allow COW (copy-on-write) sharing.
9.  Types can grow, e.g. we can start with a general u8 layout and have
    some chunks in u16 and some in u32, all tree-like and controlled by
    an allowed depth.
10. We can share data, meaning a shared copy can write through to the
    original vector it copied from.
11. Zero chunks.
12. All in Scheme, but hot paths can take advantage of superfast C code
    that may use advanced processor ops not possible in Guile. The
    advantage is that i) the chunks mean better fibers integration, with
    no long stalls, because control returns frequently to Scheme, and
    ii) the supervector library has an API that makes implementing the
    operations a pleasant programming experience.
13. Good default chunk sizes, to make sure they are large enough to
    amortise the per-chunk overhead.
14. Fast append and append!.
15. Pretty fast indexing: looking up a position goes from basically a
    vector lookup to a tree lookup, controlled by the level parameter.
16. Auto-aligns supervector chunks to allow fast operations.
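Feature 8 (COW sharing of read-only data) can be sketched roughly as
follows. This is a hypothetical illustration, not the library's actual
representation: a chunk is a mutable pair of an ownership flag and a
bytevector, and writing to a shared chunk first detaches it by copying.

```scheme
(use-modules (rnrs bytevectors))

;; A chunk is (owned? . data); owned? = #f means the bytevector may be
;; shared with another chunk and must be copied before writing.
(define (make-chunk bv) (cons #t bv))
(define (chunk-ref c i) (bytevector-u8-ref (cdr c) i))

;; O(1) copy: both sides now share the same bytes, neither owns them.
(define (chunk-share c)
  (set-car! c #f)
  (cons #f (cdr c)))

;; Copy-on-write: detach from the shared bytevector before mutating.
(define (chunk-set! c i x)
  (unless (car c)
    (set-cdr! c (bytevector-copy (cdr c)))
    (set-car! c #t))
  (bytevector-u8-set! (cdr c) i x))
```

The point is that copying a whole supervector is just sharing its chunk
table; the per-byte copy cost is only paid for chunks that actually get
written.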

Next:
a) testing, testing, testing, and debugging
b) a string implementation with Guile's string interface, which will
   guide some utility functions for the general supervector
c) try to find a regexp library that can work with chunked vectors
d) make supervectors work with ports, so that we do not need to go via
   strings to create a large data structure, and can instead build the
   chunked one from moderately sized pages
e) enable user-defined data on chunks
f) some data transfers can be much faster in C, e.g. copying from u8 to u64
g) allow supervectors with SCM objects as elements
h) allow chunked bitvectors and chunked bitvector operations
i) docs, like a README


Project:
https://gitlab.com/tampe/stis-supervectors/-/tree/main



On Thu, Sep 2, 2021 at 5:54 PM Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> [...]

[-- Attachment #2: Type: text/html, Size: 6528 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-02 15:45 more advanced bytevector => supervectors Stefan Israelsson Tampe
  2021-09-02 15:54 ` Stefan Israelsson Tampe
  2021-09-02 19:11 ` Matt Wette
@ 2021-09-08  2:04 ` Stefan Israelsson Tampe
  2021-09-08  7:18   ` lloda
  2021-09-15  0:12 ` Stefan Israelsson Tampe
  3 siblings, 1 reply; 11+ messages in thread
From: Stefan Israelsson Tampe @ 2021-09-08  2:04 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 7751 bytes --]

Today I added structure to support supervectors of Scheme values and
refactored the code with a nice macrology. So now we have an
infrastructure that the compiler can compile to very efficient code. All
loops must be dispatched on the type of the ref and set operations, but
the good part is that this selection is done only once per bin, so it can
be amortised over all the work in the loop. This means that there can be
an explosion in the number of combinations; the design therefore lets one
select the set of variants used through a nice interface. So let's
discuss the setup to illustrate this point, after a small detour into
algebra.
After investigating the structure of bit-like vectors, today I turned to
algebraic operations and wondered what structure we can have regarding
supervectors. First, note that storing an offset in the bin can be
valuable if we only consider + and - operations: for a bin with values X
we store it as a + X. Now adding (a1 + X1) + (a2 + X2) leads to the final
value (a1 + a2) + (X1 + X2), which means that a1 and a2 can be general
values while we try to compress X1 and X2 into a smaller byte structure.

For multiplication we can have a scaling operation, and if we only
consider * and / we have

(b1 * X1) * (b2 * X2) = (b1 * b2) (X1 * X2)

Mixing multiplications will lead to problems, but if you add numbers you
can still gain when the scaling is the same, e.g.

(a1 + b*X1) + (a2 + b*X2) = (a1 + a2) + b (X1 + X2).

Complexity-wise one does not win here, but X1 and X2 could be one type of
number while the scale and offset are of a larger type.

All operations + - * / of a constant bin with a fat bin can be fast if we
store a scaling and an offset.

This is the basic structure when considering basic algebra, and one could
possibly benefit from storing an offset and a scale factor in the bin
information. Not sure of the use cases, though.
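The (a + b*X) bookkeeping above can be sketched in a few lines. This is a
hypothetical toy (bins as plain lists, bulk data as a Scheme list), just
to show that same-scale addition only touches the constant parts and
leaves X1, X2 to a tight elementwise loop:

```scheme
;; A bin (a b xs) represents the values a + b*x for each x in xs.
(define (make-bin a b xs) (list a b xs))

(define (bin-value bin i)
  (+ (car bin) (* (cadr bin) (list-ref (caddr bin) i))))

(define (bin-add bin1 bin2)
  (let ((a1 (car bin1)) (b1 (cadr bin1)) (x1 (caddr bin1))
        (a2 (car bin2)) (b2 (cadr bin2)) (x2 (caddr bin2)))
    (if (= b1 b2)
        ;; (a1 + b X1) + (a2 + b X2) = (a1 + a2) + b (X1 + X2)
        (make-bin (+ a1 a2) b1 (map + x1 x2))
        ;; different scales: fall back to materialising the values
        (make-bin 0 1 (map (lambda (u v) (+ a1 (* b1 u) a2 (* b2 v)))
                           x1 x2)))))
```

In a real supervector, xs would be the narrow packed data and a, b the
wider per-bin metadata.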

Now over to the macrology ...

The basic tool for finding the ref or set of a bin is defined as

(define-syntax-rule (mk-get-setter get-setter do-inv? AS a b c d)
  (define-inlinable (get-setter do-inv2? bits bits1 loop)
    (lambda* (#:key (is-typed?   #f)
                    (is-scm      #f)
                    (is-fixed    #f)
                    (is-float    #f)
                    (is-integer  #f)
                    (is-single   #f)
                    (is-unsigned #f)
                    (is-endian   #f)
                    (is-m        #f)
                    (is-inv      #f))
      ...)))

AS can be AR, AS, or ARS (ref, set!, ref+set!), i.e. depending on the
type of the bin we will find a ref if AR is used, a set! if AS is used,
and both the ref and the set! for the ARS version. do-inv?, if false,
skips the lognot handling of inverted data stored in the bin. get-setter
is the name of the new getter/setter we define, and a, b, c, d are all of
the form

AR:  ((1 ref-u8) (2 ref-u16)) ...
AS:  ((1 set-s8) (2 set-u16)) ...
ARS: ((1 ref-u8 set-s8) (2 ref-u16 set-u16)) ...

so a is the signed versions used, b is the unsigned versions to use, c is
the floating-point operations used, and finally d is the SCM-value
versions used.

It will define an inlinable function like

(define-inlinable (get-setter do-inv? bits bits1 loop) ...)

So using get-setter typically means

((get-setter #f bin1 #f
   (lambda (set) (set v 2 val)))

 #:is-endian 'little   ;; only consider little-endian setters
 #:is-unsigned #t      ;; only use unsigned
 #:is-integer  #t      ;; only use integer representations
 #:is-fixed    #t      ;; do not use the SCM value vector versions
)

i.e. a version where we only consider handling nonnegative integers of up
to 64 bits. The gain is faster compilation, as this idiom will dispatch
between 4 different versions of the loop lambda, and the compiler could
inline all of them, or detect the one that is actually used and
hot-compile that version (a feature we do not have yet in Guile). When
you select between a ref and a set you will similarly end up with 4*4 =
16 different loops. The full version is very large: a double loop with
all features consists of (2*2 + 3*2*2*2 + 4 + 1)^2 = 33*33 ~ 1000
versions of the loop, which is crazy if we were to expand the loop for
all cases at compile time. Instead, Guile would just use a functional
approach and not expand the loop everywhere. We will have parameterised
versions of the libraries so that one can select which versions to
compile for; for example, the general function that transforms one
supervector into another is a generic idiom that would use the full
dispatch, which is not practical.

So for e.g. bitwise operations one could do

(define-values (l-not l-and l-or l-xor)
  (generate-bit-ops
    #:is-endian 'little   ;; only consider little-endian setters
    #:is-unsigned #t      ;; only use unsigned
    #:is-integer  #t      ;; only use integer representations
    #:is-fixed    #t      ;; do not use the SCM value vector versions
  ))

As you realise, although the compiler can inline stuff, the complexity is
too big for the compiler to expand all the loops. In the end this
facility should not be needed, but currently I think it's a good idea.
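The core of the idiom, stripped of the macrology, is just "dispatch once
per bin, then run the whole loop with the chosen accessor". A hedged,
simplified illustration (hypothetical `get-u-setter` and `fill!`, not the
real generated code), restricted to unsigned native-endian integers:

```scheme
(use-modules (rnrs bytevectors))

;; One dispatch per bin, amortised over the whole loop.
(define (get-u-setter bits)
  (case bits
    ((8)  bytevector-u8-set!)
    ((16) (lambda (bv i x) (bytevector-u16-native-set! bv i x)))
    ((32) (lambda (bv i x) (bytevector-u32-native-set! bv i x)))
    (else (lambda (bv i x) (bytevector-u64-native-set! bv i x)))))

;; The loop itself is written once; it never dispatches on bits inside.
(define (fill! bv bits val n)
  (let ((set  (get-u-setter bits))
        (step (quotient bits 8)))
    (let lp ((i 0))
      (when (< i n)
        (set bv (* i step) val)
        (lp (+ i 1))))))
```

With both a ref and a set chosen this way, each extra dispatch axis
multiplies the number of loop specialisations, which is exactly the
33*33 blow-up described above.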

That's the plan!

On Thu, Sep 2, 2021 at 5:45 PM Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> [...]

[-- Attachment #2: Type: text/html, Size: 9660 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-08  2:04 ` Stefan Israelsson Tampe
@ 2021-09-08  7:18   ` lloda
  2021-09-08 11:32     ` Stefan Israelsson Tampe
  2021-09-11 17:03     ` Stefan Israelsson Tampe
  0 siblings, 2 replies; 11+ messages in thread
From: lloda @ 2021-09-08  7:18 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2321 bytes --]



> On 8 Sep 2021, at 04:04, Stefan Israelsson Tampe <stefan.itampe@gmail.com> wrote:
> 

...

> So using get-setter typically means
> ((get-setter #f bin1 #f 
>    (lambda (set) (set v 2 val)))
> 
>    #:is-endian 'little          ;; only consider little endian setters like I know 
>    #:is-unsigned  #t         ;; only use unsigned
>    #:is-integer      #t         ;; only use integer representations
>    #:is-fixed          #t        ;; do not use the scm value vector versions
> )
> i.e. a version where we only consider handling nonnegative integers of up to 64 bits. The gain is faster compilation, as this idiom will dispatch
> between 4 different versions of the loop lambda, and the compiler could inline all of them, or detect the one that is actually used and hot-compile that version
> (a feature we do not have yet in Guile). When you select between a ref and a set you will similarly end up with 4*4 = 16 different loops. The full version
> is very large: a double loop with all features consists of (2*2 + 3*2*2*2 + 4 + 1)^2 = 33*33 ~ 1000 versions of the loop, which is crazy if we were to
> expand the loop for all cases at compile time. Instead, Guile would just use a functional approach and not expand the loop everywhere. We will have
> parameterised versions of the libraries so that one can select which versions to compile for; for example, the general function that transforms one
> supervector into another is a generic idiom that would use the full dispatch, which is not practical.

I'm curious where you're going with this.

I implemented something similar (iiuc) in https://github.com/lloda/guile-newra/, specifically https://github.com/lloda/guile-newra/blob/master/mod/newra/map.scm, where the lookup/set methods are inlined in the loop. The compilation times indeed grow exponentially, so I'm forced to have a default 'generic' case.

The idea for fixing this was to have some kind of run-time compilation cache, so that only a fixed number of type combinations that actually get used would be compiled, instead of the tensor product of all types. But I haven't figured out, or actually tried to do, that yet.

Regards
	
	Daniel


[-- Attachment #2: Type: text/html, Size: 3973 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-08  7:18   ` lloda
@ 2021-09-08 11:32     ` Stefan Israelsson Tampe
  2021-09-11 17:03     ` Stefan Israelsson Tampe
  1 sibling, 0 replies; 11+ messages in thread
From: Stefan Israelsson Tampe @ 2021-09-08 11:32 UTC (permalink / raw)
  To: lloda; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 3156 bytes --]

When I have made this code work, I will start to examine the impact of
the design. I suspect that for complex cases the uncompiled path will be
way slower than the compiled one, and this is really the hot path. For
one thing, we will create 2 closures that allocate memory at each
iteration if they are not inlined. The best solution would be for wingo
to design a very fast compiler targeting this case, so that Guile just
handles it perfectly even with potentially 1000s of cases. A second
possibility is if Guile had a fast compiler such that feeding a lambda to
it would optimise it. We could simulate this by simply passing lists
representing code and using compile to compile them, but my experience is
that doing this is very time consuming. I can experiment a little here to
see the actual timing.
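The "pass lists representing code to compile" idea can be sketched like
this (a toy, with a hypothetical `make-map!`; the real concern is that
`compile` itself is slow, which is the drawback mentioned):

```scheme
(use-modules (system base compile))

;; Build a specialised loop as an s-expression and compile it at run
;; time; op is a symbol naming a binary operation, e.g. '+.
(define (make-map! op)
  (compile `(lambda (v n x)
              (let lp ((i 0))
                (when (< i n)
                  (vector-set! v i (,op (vector-ref v i) x))
                  (lp (+ i 1)))))
           #:env (current-module)))
```

Each call pays the full cost of Guile's compiler, so this only wins if
the compiled loop is reused many times; a cache of compiled variants
would be the natural next step.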

Anyhow, the idea with a fast compiler is that the first compilation could
prepare the setup so that later compiles are really fast compared to
starting from scratch. Advice from wingo would be appreciated here.

A final possibility, which is not too bad speed-wise, is to create one
big dispatch inside the loop, executed at each iteration, like:

(let ((val (if (eq? endian 'little)
               (if float?
                   (if (= m 4) (get-f32 v1 k1 'little) (get-d64 v1 k1 'little))
                   ...)
               ...)))
  (if (eq? endian 'little)
      (if float?
          (if (= m 4) (set-f32 v1 k1 val 'little) (set-d64 v1 k1 val 'little))
          ...)
      ...))

This is ideally what the code should compile to if it can't create all
possible loops.

Now, I do not like to adjust my code to output this, as it makes the
framework less powerful and useful, since every case becomes a special
case. But what if you could mark code as less important? What we want is
a dispatch like

(if (eq? endian 'little) #:level-2 ...)

And in the first pass it will be handled if endian is known (which will
reduce complexity); otherwise the first pass freezes this one and
continues with the whole shebang. Level 2 will be the basic compiler, but
with the #:level-2 tag ignored. Maybe this is a non-issue and the
compiler handles it gracefully.

Also, the compiler could note that endian, nbits, single?, float?, etc.
are really created outside the loop, and prepare the code for handling
all cases: essentially compile all nodes and reserve an area in the code
to modify. Then, before the loop, the code can decide which version to
use (here we can use padding, or a goto in case the padded area is so
large that a goto saves time). This means that the compiler has 33 cases
for the ref and 33 cases for the set! part in my most general version,
which is OK as each is typically small. So if I were the compiler, I
would use the following pseudo layout:

(if ... (copy RefStub1 to StubX) ...)
(if ... (copy SetStub1 to StubY) ...)
goto loop
StubRef1
...
StubRefN

StubSet1
...
StubSetN
loop:
(let lp (...)
  (let ((val StubX))
    StubY)
  (when ... (lp ...)))

This can be quite fast.

Self-modifying code rocks!!!

[-- Attachment #2: Type: text/html, Size: 4151 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-08  7:18   ` lloda
  2021-09-08 11:32     ` Stefan Israelsson Tampe
@ 2021-09-11 17:03     ` Stefan Israelsson Tampe
  2021-09-11 18:21       ` lloda
  1 sibling, 1 reply; 11+ messages in thread
From: Stefan Israelsson Tampe @ 2021-09-11 17:03 UTC (permalink / raw)
  To: lloda; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2831 bytes --]

I did some tests, and wingo's superb compiler is about equally fast for a
hand-made Scheme loop as for the automatic getter/setter dispatch. It
can copy from e.g. u8 to i16 at about 100 ops per second using native
byte order. However, compiling it in C leads to a nasty 2 G ops per
second. So for these kinds of patterns it is still better to work in C,
as it probably vectorises the operation quite well. Supervectors support
pushing busy loops down to C very well, and I will probably enable fast
C code for some simple utility ops.

On Wed, Sep 8, 2021 at 9:18 AM lloda <lloda@sarc.name> wrote:

> [...]

[-- Attachment #2: Type: text/html, Size: 4105 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-11 17:03     ` Stefan Israelsson Tampe
@ 2021-09-11 18:21       ` lloda
  0 siblings, 0 replies; 11+ messages in thread
From: lloda @ 2021-09-11 18:21 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 3615 bytes --]


A problem is that Guile doesn't really provide a good set of fast rank-1 ops. None of them support strides != 1, for example (this is ok for regular vectors, but it hurts for general arrays), and some are missing start/end arguments, or you have to write wrappers yourself, as for the typed vectors (other than u8). So in some cases you have to do the loop in Scheme. That's fine when the body of the loop is Scheme ops, but if it's something like copy or fill it really hurts compared to C.
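A strided rank-1 op of the kind described here can be sketched in Scheme with standard (rnrs bytevectors) procedures; the name f64-fill-strided! and its signature are illustrative assumptions, not an existing Guile API:

```scheme
;; Sketch of a missing primitive: fill an f64 view of a bytevector with x,
;; starting at element index `start`, stepping by `stride` elements,
;; writing `count` elements. Each f64 element is 8 bytes wide.
(use-modules (rnrs bytevectors))

(define (f64-fill-strided! bv start stride count x)
  (let loop ((i 0) (k start))
    (when (< i count)
      (bytevector-ieee-double-native-set! bv (* 8 k) x)
      (loop (+ i 1) (+ k stride)))))
```

A fast version of this would live in C, where the compiler can vectorise the inner loop; in Scheme the per-element overhead dominates exactly as described above.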


> On 11 Sep 2021, at 19:03, Stefan Israelsson Tampe <stefan.itampe@gmail.com> wrote:
> 
> I did some tests, and wingo's superb compiler is about equally fast for a hand-made Scheme loop as for the automatic dispatch for getter and setter. It can e.g. copy from 
> u8 to i16 at about 100 Mops/second using native byte order. However, compiling it in C leads to a nasty 2 Gops/second. So for these kinds of patterns
> it is still better to work in C, as it probably vectorises the operation quite well. Supervectors support pushing busy loops to C very well, and I will probably 
> enable fast C code for some simple utility ops.
> 
> On Wed, Sep 8, 2021 at 9:18 AM lloda <lloda@sarc.name <mailto:lloda@sarc.name>> wrote:
> 
> 
>> On 8 Sep 2021, at 04:04, Stefan Israelsson Tampe <stefan.itampe@gmail.com <mailto:stefan.itampe@gmail.com>> wrote:
>> 
> 
> ...
> 
>> So using get-setter typically means
>> ((get-setter #f bin1 #f 
>>    (lambda (set) (set v 2 val)))
>> 
>>    #:is-endian 'little          ;; only consider little endian setters like I know 
>>    #:is-unsigned  #t         ;; only use unsigned
>>    #:is-integer      #t         ;; only use integer representations
>>    #:is-fixed          #t        ;; do not use the scm value vector versions
>> )
>> So a version where we only consider handling nonnegative integers of up to 64 bits. The gain is faster compilation, as this idiom will dispatch
>> between 4 different versions of the loop lambda, and the compiler could inline all of them or be able to detect the one that is used and hot compile that version
>> (a feature we do not have yet in guile). Now when you select between a ref and a set you will similarly end up with 4*4 = 16 different loop versions. The full version
>> is very large, and a double loop with all features consists of (2*2 + 3*2*2*2 + 4 + 1)**2 = 33*33 ~ 1000 versions of the loop, which is crazy if we should expand the loop
>> for all cases in the compilation. Now guile would just use a functional approach and not expand the loop everywhere. We will have parameterised versions of
>> libraries so that one can select which versions to compile for. For example, the general function that performs a transform from one supervector to another is a general
>> idiom that would use the full dispatch, which is not practical. 
> 
> I'm curious where you're going with this.
> 
> I implemented something similar (iiuc) in https://github.com/lloda/guile-newra/ <https://github.com/lloda/guile-newra/>, specifically https://github.com/lloda/guile-newra/blob/master/mod/newra/map.scm <https://github.com/lloda/guile-newra/blob/master/mod/newra/map.scm> , where the lookup/set methods are inlined in the loop. The compilation times indeed grow exponentially so I'm forced to have a default 'generic' case. 
> 
> The idea for fixing this was to have some kind of run time compilation cache so only a fixed number of type combinations that actually get used would be compiled, instead of the tensor product of all types. But I haven't figured out, or actually tried to do that yet.
> 
> Regards
> 	
> 	Daniel
> 


[-- Attachment #2: Type: text/html, Size: 5907 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: more advanced bytevector => supervectors
  2021-09-02 15:45 more advanced bytevector => supervectors Stefan Israelsson Tampe
                   ` (2 preceding siblings ...)
  2021-09-08  2:04 ` Stefan Israelsson Tampe
@ 2021-09-15  0:12 ` Stefan Israelsson Tampe
  3 siblings, 0 replies; 11+ messages in thread
From: Stefan Israelsson Tampe @ 2021-09-15  0:12 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 4700 bytes --]

hi,

I think I am happy now with the API for binary operators.

To see where I am heading, consider the bitvector implementation. Here is
e.g. how logxor is implemented
(s1 logxor s2 -> s2):

(define (supervector-logxor! s1 s2)
  (supervector-binary
   ;; The general case
   logxor

   ;; The s1 part is a constant bin
   (lambda (x)
     (if (= x 0)
         (values nothing 0)   ;; will leave the s2 part unchanged - fast
         (values invert  0))) ;; will invert the s2 part in case it is a
                              ;; full bin - fast

   ;; The s2 part is a constant bin
   (lambda (x)
     (if (= x 0)
         (values nothing 0)   ;; replace s2 with s1 if referable          (fast)
         (values invert  0))) ;; if s1 is read-only, refer to it with
                              ;; negation if referable                    (fast)

   ;; Both s1 and s2 are constant and s2 is a complete bin
   logxor

   s1
   s2))

We have the following controls:
(values add val)     : add the same val to every element in the bin,
                       e.g. a value possibly stored on the bin
(values scale val)   : scale every element in the bin by the same value
(values nothing _)   : do nothing with the bin
(values replace val) : set the whole bin to val
(values invert _)    : apply lognot to the whole bin
(values negate _)    : apply logical not to the whole bin
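To make the control protocol concrete, here is a hypothetical sketch of how a dispatcher might interpret one tag/value pair for a bin, modelling the bin as a plain list of integers; apply-control and this representation are assumptions for illustration, not the actual supervector internals:

```scheme
;; Hypothetical interpreter for one (values TAG VAL) control pair.
;; The tag names mirror the controls above; a real implementation would
;; act on a whole bin of packed bytes at once instead of a list.
(define (apply-control tag val bin)
  (case tag
    ((add)     (map (lambda (e) (+ e val)) bin)) ; add val to every element
    ((scale)   (map (lambda (e) (* e val)) bin)) ; scale every element by val
    ((nothing) bin)                              ; leave the bin untouched
    ((replace) (map (lambda (e) val) bin))       ; set the whole bin to val
    ((invert)  (map lognot bin))                 ; bitwise not on every element
    ((negate)  (map (lambda (e) (if (zero? e) 1 0)) bin)) ; logical not (0/1)
    (else (error "unknown control tag" tag))))
```

The point of the protocol is that a whole-bin control like `nothing` or `invert` can be applied without touching individual elements at all, which is what makes the constant-bin fast paths cheap.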

With a good alignment of the bins, and the ability to reference bins when s2
is the constant part,
the code will produce a fast application of the binary
operators. This means that
clustered sparse bitvectors are well modeled in this system; in particular,
for advanced matching constructs with very large numbers of clauses it is
possible to make the
matching fast while keeping the amount of memory needed to store the matcher
small.

We have a similar system for algebraic manipulation and for logical
operations.

In the current setup logxor is not implemented as a fast op, however; it is
passed as a function into the
construct in a higher-order manner. This means that busy loops over fat
vectors are a bit slow. When things
are debugged I will move over to macro versions, which allow the busy loop
to be essentially as fast as
a logxor in a single loop over a bytevector, which runs at
100-200 Mops per second. This can go much
faster, like 1-2 Gops per second, if we push it to C.




On Thu, Sep 2, 2021 at 5:45 PM Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> Hi guilers!
>
> My next project is to explore a more advanced bytevector structure than
> today's bytevectors.
> I think that having this infrastructure in guile, with core code taking
> advantage of it, like having strings on top of it and allowing our string
> library to use those, would be valuable (the difficult case is probably to
> get regexps working properly).
>
> Anyhow this is my initial comment in the code:
>
> #|
> The idea of this data structure is to note that when we employ full
> large-datastructure scans we can allow for a much richer and more featureful
> datastructure than a simple bytevector. The reason is that we can divide
> the data into byte chunks and spend most of the time scanning, copying, and
> mapping those areas with the usual methods; even optimized C code taking
> advantage of advanced cpu opcodes is possible here. And by dividing it into
> chunks we get a lot of new features at essentially no cost beyond complexity,
> which we manage mostly in scheme. We gain many things:
>
> 1. Many new features that can vary on the pages
>
> 2. less of a memory hog as
>    a. it can use copy-on-write semantics
>    b. it does not need to find 1GB contiguous blocks
>
> 3. potentially faster operations as we can fast-forward the zero-on-write
>    pages compared to pure bytevectors
>
> 4. we can allow for swapping and referential datastructures to speed up
> copying
>    even further
>
> 5. can get better fiber behavior for C programs that spend seconds or
> minutes on
>    performing an operation, because they will just spend a microsecond or so
>    in C-land and then swap back to Scheme. CTRL-C will also work nicely.
>
> 6. We could potentially build a string library on top of these
> datastructures, and
>    also we could add features to pages that lead to a much richer
> interface.
>
> 7. resizing is much faster and more efficient
>
> 8. reversing is super quick
>
> 9. queues and stacks of byte data can have a very efficient
> implementations
>
> Drawback:
> 1. More complex as we need to consider boudaries
> 2. Slower one off operations like bytevector-u8-get as guile compiles the
>    core operatoins to quite effective JIT CPU encodings. But maybe we can
>    disign some caching to make those operations much faster and even have
>    suporting JIT operations.
>
> |#
>
> WDYT ?
>
>

[-- Attachment #2: Type: text/html, Size: 5745 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-09-15  0:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02 15:45 more advanced bytevector => supervectors Stefan Israelsson Tampe
2021-09-02 15:54 ` Stefan Israelsson Tampe
2021-09-06  2:10   ` Stefan Israelsson Tampe
2021-09-02 19:11 ` Matt Wette
2021-09-02 19:47   ` tomas
2021-09-08  2:04 ` Stefan Israelsson Tampe
2021-09-08  7:18   ` lloda
2021-09-08 11:32     ` Stefan Israelsson Tampe
2021-09-11 17:03     ` Stefan Israelsson Tampe
2021-09-11 18:21       ` lloda
2021-09-15  0:12 ` Stefan Israelsson Tampe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).