* Identifying sources of allocations in a toy Mandelbrot package
@ 2024-01-17 12:39 Psionic K
From: Psionic K @ 2024-01-17 12:39 UTC (permalink / raw)
To: help-gnu-emacs
I was toying around with compilation speed trying to eliminate
allocations and discovered that my package was spending about 40% of
its runtime in GC (observed in `gc-elapsed'). I didn't suspect the
situation could be this intense until turning on GC messages.
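For reference, the 40% figure came from comparing `gc-elapsed' against
wall-clock time around a single run, roughly like this (a sketch, not
the exact wrapper I used):

  (let ((gc0 gc-elapsed)
        (t0 (float-time)))
    (mandelbrot)
    (let ((total (- (float-time) t0)))
      (message "GC took %.1fs of %.1fs (%.0f%%)"
               (- gc-elapsed gc0) total
               (* 100 (/ (- gc-elapsed gc0) total)))))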
I wrapped the command's body in some temporary GC settings and then
found values I was comfortable with in terms of tradeoffs for
resulting resident memory consumption and time spent GC'ing.
By setting absurd values that wouldn't GC until the program finished,
I was able to reproduce freezes I had experienced in the past where
Emacs was evidently GC'ing forever.
All that was fun, but what I don't understand is where this package
manages to consume up to 7GB of memory, which Emacs will not readily
give back afterwards. How can I identify causes of allocations until
I gain familiarity with which expressions typically cause them? To my
naive eyes, there are no allocations in the hot section of the code,
but that is very untrue given the behavior.
I want to also remark that `gc-cons-percentage' does not seem to be a
percentage but a factor. If it is a percentage, I don't understand
its behavior, because any value larger than 2.0 seems to lead Emacs to
use all of the available memory.
Package follows. Enjoy, but as configured, Emacs will be hovering at
around 1.3GB after running:
;;; mandelbrot.el --- Mandelbrot -*- lexical-binding: t; -*-
;;; Commentary:
;; Renders the Mandelbrot set.
;;; Code:
(eval-when-compile (require 'cl-lib))
;;;###autoload
(defun mandelbrot ()
"Render a mandelbrot."
(declare (speed 3))
(interactive)
(pop-to-buffer (get-buffer-create "*mandelbrot*"))
(let ((gc-cons-threshold (* 800000 1024))
(gc-cons-percentage 1.0))
(let* ((cx) (cy) (zr) (zi)
(inhibit-modification-hooks t)
(w 1600) (h 1200) (d 256)
(output (make-vector (* w h 3) 0))
(idx 0)
(x0 -2.5) (y0 -1.5) (dx 4.0) (dy 3.0)
(dxp (/ dx w)) (dyp (/ dy h)))
(fundamental-mode) (erase-buffer)
(set-buffer-multibyte nil)
(insert (format "P6\n%d %d\n255\n" w h))
(dotimes (y h)
(dotimes (x w)
(setq cx (+ x0 (* dxp x))
cy (+ y0 (* dyp y))
zr 0
zi 0)
(let ((v (cl-dotimes (v d d)
(if (> (+ (* zr zr) (* zi zi)) 4) (cl-return v)
(cl-psetq
zr (+ (* zr zr) (- (* zi zi)) cx)
zi (+ (* (* zr zi) 2) cy))))))
;; exponential 0.8 enhances contrast a bit
(let ((out (floor (* 256 (expt (/ (float v) d) 0.8)))))
(aset output idx out)
(aset output (+ idx 1) out)
(aset output (+ idx 2) out)
(setq idx (+ idx 3))))))
(insert (concat output))
(image-mode))))
(provide 'mandelbrot)
;;; mandelbrot.el ends here.
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Eli Zaretskii @ 2024-01-17 12:58 UTC (permalink / raw)
To: help-gnu-emacs
> From: Psionic K <psionik@positron.solutions>
> Date: Wed, 17 Jan 2024 21:39:04 +0900
>
> I want to also remark that `gc-cons-percentage' does not seem to be a
> percentage but a factor.
Yes. As its doc string says:
Portion of the heap used for allocation.
"Portion", not "percent". (The name of the variable is perhaps
unfortunate, but that ship sailed so long ago that we don't even know
where it is by now, probably beyond the termination shock.)
> (cl-psetq
> zr (+ (* zr zr) (- (* zi zi)) cx)
> zi (+ (* (* zr zi) 2) cy))))))
cl-psetq conses.
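One quick way to check where a given macro might allocate is to look at
its expansion, e.g.:

  (require 'cl-lib)
  (macroexpand-1 '(cl-psetq zr (+ (* zr zr) (- (* zi zi)) cx)
                            zi (+ (* (* zr zi) 2) cy)))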
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Emanuel Berg @ 2024-01-17 13:25 UTC (permalink / raw)
To: help-gnu-emacs
Psionic K wrote:
> All that was fun, but what I don't understand is where this
> package manages to consume up to 7GB of memory, which Emacs
> will not readily give back afterwards.
You have a pretty big vector
(* 1600 1200 3) ; 5 760 000
and a lot of nested iteration
(* 1200 1600 256) ; 491 520 000
Maybe you can even compute further how big it should get
without GC?
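For example, assuming every intermediate float result is boxed at
roughly 8 bytes and the inner loop produces on the order of 8 of them
per iteration (rough assumptions, not measurements):

(* 1600 1200 256 8 8) ; 31 457 280 000, i.e. ~31 GB in the worst case

Points that escape early cut that down a lot, but it is the right
order of magnitude for the gigabytes you observe.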
--
underground experts united
https://dataswamp.org/~incal
* Re: Identifying sources of allocations in a toy Mandelbrot package
@ 2024-01-19 9:19 Psionic K
From: Psionic K @ 2024-01-19 9:19 UTC (permalink / raw)
To: help-gnu-emacs; +Cc: incal, Eli Zaretskii
Most importantly to my question, how can I identify sources of consing
without just knowing which functions and macros pathologically cons?
Is there a way to allow usage of stack by the byte or native compiler?
> You have a pretty big vector
The vector is allocated once and not on the same scale as usage,
measured by profiling or by resident memory in top.
If I let Emacs run with no GC, it will consume about 10GB before it
starts swapping and becomes unresponsive until I kill it.
> and a lot of nested iteration
I would have expected this to consume only stack memory, but it seems
that neither native-compiled code nor bytecode can use memory in a
stack-like way? I'm not sure what the Elisp manual on stack-allocated
objects means for native or byte compiled functions.
> cl-psetq conses.
After replacing this with a combination of setq's and then removing two
unnecessary `cl-dotimes` expressions that used no CL features, I've
seen a second-run reduction down to about 8-9s.
The first run is quite a bit longer. Although both runs GC 4-5 times,
the first run of an Emacs session is about 35s versus 8s for a second
run. `gc-elapsed' after the first run goes from near zero to 33.7s.
The result is still on the same order of magnitude in terms of memory
usage. The only behavior consistent with earlier memory profiling I
did is if every single floating point value ever calculated lives on a
unique piece of heap until GC'd. It is as if the heap is consumed in a
straight line and there is no stack memory.
Perhaps concerning is that Emacs continues to consume the same
resident memory, about 2.1GB, indefinitely after the first run. Is
this simply expected from a sprinkling of objects that Emacs cannot GC
and therefore never give back memory?
Is this unique to floating points? Do I still have latent consing in
this version? Thanks in advance.
;;; mandelbrot.el --- Mandelbrot -*- lexical-binding: t; -*-
;;; Commentary:
;; Renders the Mandelbrot set.
;;; Code:
(eval-when-compile (require 'cl-lib))
;;;###autoload
(defun mandelbrot ()
"Render a mandelbrot."
(declare (speed 3))
(interactive)
(pop-to-buffer (get-buffer-create "*mandelbrot*"))
(let ((gc-cons-threshold (* 800000 2048))
(gc-cons-percentage 1.0))
(let* ((cx) (cy) (zr) (zi) (v) (out) (zrt)
(inhibit-modification-hooks t)
(w 1600) (h 1200) (d 256)
(output (make-vector (* w h 3) 0))
(idx 0)
(x0 -2.5) (y0 -1.5) (dx 4.0) (dy 3.0)
(dxp (/ dx w)) (dyp (/ dy h)))
(fundamental-mode) (erase-buffer)
(set-buffer-multibyte nil)
(insert (format "P6\n%d %d\n255\n" w h))
(dotimes (y h)
(dotimes (x w)
(setf cx (+ x0 (* dxp x)))
(setf cy (+ y0 (* dyp y)))
(setq zr 0)
(setq zi 0)
(setq v (cl-dotimes (v d d)
(if (> (+ (* zr zr) (* zi zi)) 4) (cl-return v)
(setq zrt (+ (* zr zr) (- (* zi zi)) cx))
(setq zi (+ (* (* zr zi) 2) cy))
(setq zr zrt))))
;; samples that hit 256 wrap in the graphic to display as zero
;; exponential 0.8 enhances contrast a bit
(setq out (floor (* 256 (expt (/ (float v) d) 0.8))))
(aset output idx out)
(aset output (+ idx 1) out)
(aset output (+ idx 2) out)
(setq idx (+ idx 3))))
(insert (concat output))
(image-mode))))
(provide 'mandelbrot)
;;; mandelbrot.el ends here.
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Tomas Hlavaty @ 2024-01-19 15:33 UTC (permalink / raw)
To: Psionic K, help-gnu-emacs; +Cc: incal, Eli Zaretskii
in lisp, a rough heuristic is that anything can allocate memory anytime
another useful heuristic is not to do things that do not need to be
done (not lisp-specific)
>> You have a pretty big vector
>
> The vector is allocated once and not on the same scale as usage,
> measured by profiling or by resident memory in top.
it does not need to be allocated at all
> (eval-when-compile (require 'cl-lib))
have you tried eliminating cl-lib?
> (let* ((cx) (cy) (zr) (zi) (v) (out) (zrt)
^
those parens are unnecessary
also it is ugly to (let (a) (setq a 42) ...)
it is nice to put let around as few forms as possible
then the binding dependencies look clearer
and lack of setf and setq means
I don't have to think unnecessarily about side effects
> (output (make-vector (* w h 3) 0))
you don't need the output vector at all, you already have an output
buffer so just write there directly
> (setf cx (+ x0 (* dxp x)))
use let instead of setf and setq, everywhere
> (if (> (+ (* zr zr) (* zi zi)) 4) (cl-return v)
avoid cl-return
> (aset output idx out)
> (aset output (+ idx 1) out)
> (aset output (+ idx 2) out)
insert the value into the output buffer here
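for example, at this point in the loop (assuming the buffer is unibyte,
as it is after set-buffer-multibyte nil):

  (insert out out out)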
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Eli Zaretskii @ 2024-01-19 15:44 UTC (permalink / raw)
To: help-gnu-emacs
> From: Psionic K <psionik@positron.solutions>
> Date: Fri, 19 Jan 2024 18:19:55 +0900
> Cc: incal@dataswamp.org, Eli Zaretskii <eliz@gnu.org>
>
> Most importantly to my question, how can I identify sources of consing
> without just knowing which functions and macros pathologically cons?
Insert calls to garbage-collect, and look at the value it returns.
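For example, something like this around the suspect code:

  (let ((before (garbage-collect)))
    (mandelbrot)
    ;; each entry has the form (NAME SIZE USED FREE); compare USED
    (message "before: %S" before)
    (message "after:  %S" (garbage-collect)))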
> Is there a way to allow usage of stack by the byte or native compiler?
Lisp objects are not allocated on the stack.
> > and a lot of nested iteration
>
> I would have expected this to consume only stack memory
Any temporary Lisp objects will allocate memory. Their references
will be on the stack, but the objects themselves will be on the heap.
> Perhaps concerning is that Emacs continues to consume the same
> resident memory, about 2.1GB, indefinitely after the first run. Is
> this simply expected from a sprinkling of objects that Emacs cannot GC
> and therefore never give back memory?
The memory is released and available for subsequent allocations, but
not necessarily given back to the OS. Whether and when to give memory
back to the OS depends on the internals of the C library your Emacs is
using.
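For a rough breakdown of where Emacs itself thinks that memory is
going, you can also try the `memory-report' command (M-x memory-report,
available since Emacs 28).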
> Is this unique to floating points? Do I still have latent consing in
> this version? Thanks in advance.
I don't think I understand what you are asking here, please elaborate.
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Psionic K @ 2024-01-20 3:14 UTC (permalink / raw)
To: Tomas Hlavaty; +Cc: Psionic K, help-gnu-emacs, incal, Eli Zaretskii
> Any temporary Lisp objects will allocate memory. Their references will be
on the stack, but the objects themselves will be on the heap.
I had presumed at a minimum that the compiler can re-use heap locations
when it decides there's no risk. While the values will be in the heap, is
there a way to destructively re-use heap locations so that we are not
burning through all of the heap during iteration? Do function calls re-use
any heap space? Does anything besides GC? As it is, I can guess that
every call to even + or * is creating a unique float that now exists on the
heap until GC'd.
The vector would have a bounded size of about 50-100MB. To avoid us
focusing on it, I have removed it. No material impact.
> have you tried eliminating cl-lib?
No change. After getting rid of cl-return, I'm doing worse: around
16.0s, and it consumes however much memory I provide.
> use let instead of setf and setq, everywhere
Without using recursion to change the iteration structure to one that never
uses setq, I obtained a version that is consistent with all of the GC
tradeoffs I've seen so far, requiring between 10 and 20 seconds depending
on how much memory I allow. Should I consider further modifications?
;;; mandelbrot.el --- Mandelbrot -*- lexical-binding: t; -*-
;;; Commentary:
;; Renders the Mandelbrot set.
;;; Code:
;;;###autoload
(defun mandelbrot ()
"Render a Mandelbrot."
(declare (speed 3))
(interactive)
(pop-to-buffer (get-buffer-create "*Mandelbrot*"))
(fundamental-mode)
(erase-buffer)
(set-buffer-multibyte nil)
(let ((gc-cons-threshold (* 800000 1024))
(gc-cons-percentage 0.6)
(inhibit-modification-hooks t)
(w 1600) (h 1200) (d 256) (dd 256.0)
(x0 -2.5) (y0 -1.5) (dx 4.0) (dy 3.0))
(insert (format "P6\n%d %d\n255\n" w h))
(let* ((dxp (/ dx w))
(dyp (/ dy h)))
(dotimes (y h)
(message "Row: %d" y) ; removal prevents GC
(let ((cy (+ y0 (* dyp y))))
(dotimes (x w)
(let ((cx (+ x0 (* dxp x)))
(zr 0) (zi 0)
(v 0) (not-escaped t))
(while (and not-escaped
(< v d))
(if (> (+ (expt zr 2.0) (expt zi 2.0)) 4)
(setq not-escaped nil)
(let ((zrt (+ (* zr zr) (- (* zi zi)) cx)))
(setq zi (+ (* (* zr zi) 2.0) cy))
(setq zr zrt))
(setq v (1+ v))))
;; samples that hit 256 wrap in the graphic to display as zero
;; exponential 0.8 enhances contrast a bit
(let ((out (floor (* 256 (/ v dd)))))
(insert out out out))))))
(image-mode))))
(provide 'mandelbrot)
;;; mandelbrot.el ends here.
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Psionic K @ 2024-01-20 3:37 UTC (permalink / raw)
To: Tomas Hlavaty; +Cc: Psionic K, help-gnu-emacs, incal, Eli Zaretskii
I need to correct a statement. While some versions I have preserved will
not GC unless I insert a call to `message', the version in my last email
will GC properly without the `message' call.
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Eli Zaretskii @ 2024-01-20 7:29 UTC (permalink / raw)
To: help-gnu-emacs
> From: Psionic K <psionik@positron.solutions>
> Date: Sat, 20 Jan 2024 12:14:53 +0900
> Cc: Psionic K <psionik@positron.solutions>, help-gnu-emacs@gnu.org, incal@dataswamp.org,
> Eli Zaretskii <eliz@gnu.org>
>
> > Any temporary Lisp objects will allocate memory. Their references will be on the stack, but the
> objects themselves will be on the heap.
>
> I had presumed at a minimum that the compiler can re-use heap locations when it decides there's no
> risk. While the values will be in the heap, is there a way to destructively re-use heap locations so that
> we are not burning through all of the heap during iteration? Do function calls re-use any heap space?
> Does anything besides GC? As it is, I can guess that every call to even + or * is creating a unique
> float that now exists on the heap until GC'd.
Yes, every temporary Lisp object other than an integer smaller than
most-positive-fixnum gets allocated from the heap. I don't know
enough about the native compiler and the optimizations it performs to
tell whether the native code changes that in any way, but I would be
surprised if it did.
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Tomas Hlavaty @ 2024-01-20 9:09 UTC (permalink / raw)
To: Psionic K; +Cc: Psionic K, help-gnu-emacs, incal, Eli Zaretskii
On Sat 20 Jan 2024 at 12:14, Psionic K <psionik@positron.solutions> wrote:
> I had presumed at a minimum that the compiler can re-use heap locations
> when it decides there's no risk.
garbage collector does this
> While the values will be in the heap, is
> there a way to destructively re-use heap locations so that we are not
> burning through all of the heap during iteration? Do function calls re-use
> any heap space? Does anything besides GC? As it is, I can guess that
> every call to even + or * is creating a unique float that now exists on the
> heap until GC'd.
every lisp system represents data differently.
if a value is not immediate, it has to be consed.
maybe floating point numbers in emacs are not immediate.
it looks like emacs uses libgmp
and there are no special cases for IEEE representation.
for comparison, you could try your code on sbcl
and see how much better it gets just by using very sophisticated compiler
> Should I consider further modifications?
now it looks much nicer, well done
> (d 256) (dd 256.0)
this is fine, but it's used in a single place,
so you could as well put the values directly in the code
without this let indirection
> (v 0) (not-escaped t))
> (while (and not-escaped
> (< v d))
> (if (> (+ (expt zr 2.0) (expt zi 2.0)) 4)
> (setq not-escaped nil)
not-escaped is not needed
(while (and (< v d)
(<= (+ (expt zr 2.0) (expt zi 2.0)) 4))
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Psionic K @ 2024-01-20 10:03 UTC (permalink / raw)
To: Tomas Hlavaty; +Cc: Psionic K, help-gnu-emacs, incal, Eli Zaretskii
> maybe floating point numbers in emacs are not immediate.
This is my hunch. There isn't a good second guess at this point.
> try your code on sbcl
I think that's where we're going.
I was mainly hoping to start identifying forms that can do okay even
in spite of the Elisp runtime, especially for this kind of iteration
problems that could - but should not - generate lots of garbage. The
first rows go by in the blink of an eye because of the cache locality.
Elisp is almost something. The memory pathologies are likely
affecting every other perceived lack of responsiveness.
A relatively recent transducer library captured my curiosity. I want
to give it a workout alongside dash for a comparison:
https://gist.github.com/NicolasPetton/0a46d907febeef3da9e2
Writing dynamic modules to break out of Elisp is one of my next steps.
Calling out only via process pipes is rather limited.
I haven't worked with SLIME or CIDER etc yet. Definitely need to try them out.
Thanks for helping me calibrate my expectations.
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Eli Zaretskii @ 2024-01-20 10:31 UTC (permalink / raw)
To: help-gnu-emacs
> From: Psionic K <psionik@positron.solutions>
> Date: Sat, 20 Jan 2024 19:03:55 +0900
> Cc: Psionic K <psionik@positron.solutions>, help-gnu-emacs@gnu.org, incal@dataswamp.org,
> Eli Zaretskii <eliz@gnu.org>
>
> > maybe floating point numbers in emacs are not immediate.
>
> This is my hunch. There isn't a good second guess at this point.
It isn't a hunch, it's a fact.
Similarly, integers bigger than most-positive-fixnum or smaller than
most-negative-fixnum (a.k.a. "bignums") are not immediate, Lisp
strings are not immediate, and any other Lisp object that has some data.
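A quick way to see the difference from the scratch buffer (evaluated
interpreted, so nothing is constant-folded):

  (eq (+ 1 1) 2)        ; => t, fixnums are immediate
  (eq (+ 1.0 1.0) 2.0)  ; => nil, each float result is a fresh heap object
  (eql (+ 1.0 1.0) 2.0) ; => t, eql compares float values, not identity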
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Tomas Hlavaty @ 2024-01-26 23:36 UTC (permalink / raw)
To: Psionic K; +Cc: help-gnu-emacs, incal, Eli Zaretskii
[-- Attachment #1: Type: text/plain, Size: 1302 bytes --]
On Sat 20 Jan 2024 at 19:03, Psionic K <psionik@positron.solutions> wrote:
> I was mainly hoping to start identifying forms that can do okay even
> in spite of the Elisp runtime, especially for this kind of iteration
> problems that could - but should not - generate lots of garbage. The
> first rows go by in the blink of an eye because of the cache locality.
> Elisp is almost something. The memory pathologies are likely
> affecting every other perceived lack of responsiveness.
I was curious about:
- How does native compilation improve things?
- How does fixed point arithmetic improve things?
Run:
(benchmark-run 1 (mandelbrot))
with the attached mandelbrot1.el (floating point):
after emacs-lisp-native-compile-and-load:
=> (120.40541861 1947 93.45048212499998)
after emacs-lisp-byte-compile-and-load:
=> (128.362728323 1942 93.44881820700004)
with the attached mandelbrot.el (fixed point):
after emacs-lisp-native-compile-and-load:
(32.898215598 2 0.09366582000001245)
after emacs-lisp-byte-compile-and-load:
(51.629763289 2 0.09165389300000015)
For comparison, the attached forth version test.fs takes about 6sec.
I find it interesting that:
- native compilation improves so little compared to byte compilation.
- floating point calculations are very slow.
- emacs-lisp is very slow
[-- Attachment #2: mandelbrot1.el --]
[-- Type: application/emacs-lisp, Size: 1675 bytes --]
[-- Attachment #3: mandelbrot.el --]
[-- Type: application/emacs-lisp, Size: 2034 bytes --]
[-- Attachment #4: test.fs --]
[-- Type: text/plain, Size: 1122 bytes --]
\ (compile "time gforth test.fs >a.pnm")
1 16 lshift constant 1fx
: fx ( n -- fx ) 1fx * ;
: fx2 ( n1 n2 -- fx ) 1fx swap */ ;
: fx* ( fx1 fx2 -- fx3 ) 1fx */ ;
: fx^2 ( fx -- fx ) dup fx* ;
1600 constant w 0 value cx
1200 constant h 0 value cy
-25 10 fx2 constant x0 0 value zr
-15 10 fx2 constant y0 0 value zi
4 fx w / constant dxp 0 value v
3 fx h / constant dyp
: cx! ( n -- ) dxp * x0 + to cx ;
: cy! ( n -- ) dyp * y0 + to cy ;
: nl newline type ;
: p6 ." P6" nl w . h . nl 255 . nl ;
: rgb ( n -- ) dup emit dup emit emit ;
\ : expt ( r1 r2 -- r3 ) fexp fswap fln f* ;
: mandelbrot1 ( -- n )
0 to v 0 to zr 0 to zi
begin
v 256 < ( f -- )
zr fx^2 zi fx^2 + [ 4 fx ] literal <= ( f1 f2 -- )
and while
v 1+ to v
zr fx^2 zi fx^2 - cx + ( fx -- )
zr zi fx* 2* cy + to zi
to zr
repeat
\ samples that hit 256 wrap in the graphic to display as zero
\ exponential 0.8 enhances contrast a bit
v \ s>f 256e f/ 0.8e expt 256e f* floor f>s
;
: dot mandelbrot1 rgb ;
: row w 0 do i cx! dot loop ;
: mandelbrot h 0 do i cy! row loop ;
p6 mandelbrot bye
* Re: Identifying sources of allocations in a toy Mandelbrot package
From: Psionic K @ 2024-01-27 1:07 UTC (permalink / raw)
To: Tomas Hlavaty; +Cc: Psionic K, help-gnu-emacs, incal, Eli Zaretskii
> I find it interesting that:
> - native compilation improves so little compared to byte compilation.
> - floating point calculations are very slow.
> - emacs-lisp is very slow
From first principles and numbers I have seen so far, I can guess
what's happening with little uncertainty. The Elisp interpreter must
store all created conses and non-immediate values, however
short-lived, into memory, converting every computationally intensive
workload into a memory-intensive workload. None of this is freed for
re-use between GC's, so in about 1/100th of a second, the memory
controller is fully saturated and the CPU grinds to a halt.
There are no workarounds for this using GC settings. The GC is
non-generational, so frequent GC's must mark & sweep a lot of memory
that is irrelevant to the currently running function. Frequent GC's
will saturate memory bandwidth attempting to free amounts small enough
to be relevant to the cache sizes. Furthermore, the GC is non-moving,
resulting in fragmented reads and writes, so even for shorter
computations, much of the loaded cache could be wasted.
The sum of these behaviors is still faster than human speed, but it is
an almost pathologically worst-case use of memory and cache, requiring
multiple trips to RAM for almost all computations and limiting
effective CPU register bandwidth to a small multiple of RAM. The
programmer has few tools to increase computational density and all
workloads are memory bandwidth limited.
To remedy the most egregious behavior, which is that computations
imply writes, we would take advantage of how functions usually throw
away all of their intermediate values except the return value and
implement stack-like behavior by compacting and freeing after cons
thresholds calculated at call boundaries. We can still give this
compacted memory back to our pathologically worst GC, but it will be
less fragmented and there will be much less of it.
A lot of this can be decided at compile time, and the compiler would
write values that need to be preserved into one stack while writing
other values at a different stack and re-setting the pointer. I can
tell that this is not happening at all because otherwise the
Mandelbrot would not be consuming multiple gigabytes of RAM for values
that are thrown away at first use.
Elisp could be exceptionally fast for small runs, especially with
today's generous L1 sizes. That drops off as the interpreter and new
data start hitting L2, L3, and then RAM. Any work on the compiler
should focus on keeping memory from being entrusted to the existing GC
because no amount of doing less work means anything if all work also
consumes memory and freeing memory is hugely wasteful.
Fixed point was neat! I didn't think to try that. Thanks for the code!
* Re: Identifying sources of allocations in a toy Mandelbrot package
@ 2024-01-27 9:25 Psionic K
From: Psionic K @ 2024-01-27 9:25 UTC (permalink / raw)
To: help-gnu-emacs; +Cc: Tomas Hlavaty
[-- Attachment #1: Type: text/plain, Size: 2015 bytes --]
I wanted to dig into the CPU versus memory limitation a bit.
I switched back to writing to a vector, not touching any buffers to
wrap the profile around the hot section. CPU + mem results:
fixed point byte-compiled: 195 samples 3,890,391 bytes
fixed point native compiled: 250 samples 4,205,335 bytes
floating point byte-compiled: 560 samples 221,858,239 bytes
floating point native compiled: 409 samples 221,732,687 bytes
* There is no typo in the native & byte compiled fixed point results.
I made a few runs. It is slower when native compiled.
* I found no other combination of `defsubst' `defun' or declaring
speed or not that could get the native compiled fixed point version to
be faster than byte compiled.
* The floating point version was only measured when the run did not
trigger GC. This inflates its performance because GC cost, about 30%,
is not shown.
While the fixed-point version is fast considering it has to do more
work, it could presumably be faster still if native compilation did not
slow it down.
Regarding my previous statements about memory bandwidth being the
limiting factor: the memory used is on the lower end of an order of
magnitude within the random write bandwidth available. If that amount
were limiting, the compilation method would not be expected to affect
performance much. Since the native compiled floating point version is
faster, it is likely CPU bound. Consumption is just a proxy for
bandwidth used; runtimes use bandwidth without new allocation.
Better tools are needed to answer this conclusively.
I may continue this exercise. My initial idea for creating clarity is
to write a function that just throws away conses holding values that
require 1, 10, 100, or 1000 steps to compute. If the performance
scales inversely by conses but not by steps, we can isolate memory
versus computation as the bottleneck for some workflows and provide a
benchmark for the compiler and GC to be smarter about not generating
garbage in the first place.
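A hypothetical shape for that probe (the function name and the numbers
are mine, just to make the idea concrete):

(require 'benchmark)

(defun alloc-vs-work (iters steps)
  "Time ITERS discarded cons allocations against ITERS x STEPS fixnum adds."
  (list :alloc (benchmark-run 1
                 (let (cell)
                   (dotimes (_ iters)
                     (setq cell (cons 0 0)))
                   cell))
        :work (benchmark-run 1
                (let ((acc 0))
                  (dotimes (_ iters)
                    (dotimes (s steps)
                      (setq acc (+ acc s))))
                  acc))))

;; benchmark-run returns (ELAPSED GC-RUNS GC-SECONDS), so comparing
;; e.g. (alloc-vs-work 1000000 1) with (alloc-vs-work 1000000 100)
;; separates time spent allocating and collecting from time computing.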
[-- Attachment #2: fixed.el --]
[-- Type: text/x-emacs-lisp, Size: 2093 bytes --]
;;; -*- lexical-binding: t -*-
;; samples that hit 256 wrap in the graphic to display as zero
;; exponential 0.8 enhances contrast a bit
(let ((a (make-vector 257 0)))
(dotimes (v 256)
(aset a v (floor (* 256 (expt (/ v 256.0) 0.8)))))
(define-inline contrast (v)
(aref a v)))
(defconst +1fx+ (expt 2 12))
(define-inline fx (x)
(* x +1fx+))
(define-inline fx2 (n d)
(/ (* n +1fx+) d))
(defconst +4fx+ (fx 4))
(define-inline fx* (l r)
(/ (* l r) +1fx+))
(define-inline fx^2 (x)
(fx* x x))
(define-inline mandelbrot1 (cx cy)
(let ((zr 0)
(zi 0)
(v 0)
zr^2
zi^2)
(while (and (< v 256) (<= (+ (setq zr^2 (fx^2 zr)) (setq zi^2 (fx^2 zi))) +4fx+))
(setq v (1+ v))
(let ((tmp (+ zr^2 (- zi^2) cx)))
(setq zi (+ (* (fx* zr zi) 2) cy))
(setq zr tmp)))
v))
(defun mandelbrot (&optional w h)
(declare (speed 3))
(interactive)
(let* ((w (or w 1600)) ; default before allocating, else (* w h) errors on nil
(h (or h 1200))
(output (make-vector (* w h) 0.0))
(idx 0))
(let ((x0 (fx2 -25 10))
(y0 (fx2 -15 10)))
(let ((dxp (/ (fx 4) w))
(dyp (/ (fx 3) h)))
(dotimes (y h)
(let ((cy (+ y0 (* dyp y))))
(dotimes (x w)
(let ((v (contrast (mandelbrot1 (+ x0 (* dxp x)) cy))))
(aset output idx v)
(setq idx (1+ idx))))))))
output))
(defun mandelbrot-bench ()
(interactive)
(let (output)
(profiler-start 'cpu+mem)
(setq output (mandelbrot 320 200))
(profiler-stop)
(profiler-report)
(message "%S" output)))
;;(mandelbrot 320 200)
;;(mandelbrot 640 480)
;;(mandelbrot 1024 768)
;;(mandelbrot)
;; (profiler-start 'mem)
;; (mandelbrot 320 200)
;; (profiler-stop)
;; (profiler-report)
;;(profiler-start 'cpu)
;;(mandelbrot 320 200)
;;(profiler-stop)
;;(profiler-report)
;;(profiler-start 'cpu+mem)
;;(mandelbrot)
;;(profiler-stop)
;;(profiler-report)
;; (benchmark-run 1 (mandelbrot 320 200))
;;(32.898215598 2 0.09366582000001245)
;;(51.629763289 2 0.09165389300000015)
[-- Attachment #3: floating.el --]
[-- Type: text/x-emacs-lisp, Size: 1730 bytes --]
;;; -*- lexical-binding: t -*-
(define-inline mandelbrot1 (cx cy)
(let ((zr 0)
(zi 0)
(v 0))
(while (and (< v 256) (<= (+ (* zr zr) (* zi zi)) 4))
(setq v (1+ v))
(let ((tmp (+ (* zr zr) (- (* zi zi)) cx)))
(setq zi (+ (* (* zr zi) 2) cy))
(setq zr tmp)))
;; samples that hit 256 wrap in the graphic to display as zero
;; exponential 0.8 enhances contrast a bit
(floor (* 256 (expt (/ v 256.0) 0.8)))))
(defun mandelbrot (&optional w h)
(declare (speed 3))
(interactive)
(let* ((w (or w 1600)) ; default before allocating, else (* w h) errors on nil
(h (or h 1200))
(output (make-vector (* w h) 0.0))
(idx 0))
(let ((x0 -2.5)
(y0 -1.5)
(dx 4.0)
(dy 3.0))
(let ((dxp (/ dx w))
(dyp (/ dy h)))
(dotimes (y h)
(let ((cy (+ y0 (* dyp y))))
(dotimes (x w)
(let ((v (mandelbrot1 (+ x0 (* dxp x)) cy)))
(aset output idx v)
(setq idx (1+ idx))))))))
output))
(defun mandelbrot-bench ()
(interactive)
(let (output)
(profiler-start 'cpu+mem)
(setq output (mandelbrot 320 200))
(profiler-stop)
(profiler-report)
(message "%S" output)))
;;(mandelbrot 320 200)
;;(mandelbrot 640 480)
;;(mandelbrot 1024 768)
;;(mandelbrot)
;; (profiler-start 'mem)
;; (mandelbrot 320 200)
;; (profiler-stop)
;; (profiler-report)
;; (profiler-start 'cpu+mem)
;; (mandelbrot 320 200)
;; (profiler-stop)
;; (profiler-report)
;; (profiler-start 'cpu)
;; (mandelbrot 320 200)
;; (profiler-stop)
;; (profiler-report)
;;(profiler-start 'cpu+mem)
;;(mandelbrot)
;;(profiler-stop)
;;(profiler-report)
;; (benchmark-run 1 (mandelbrot))
;; (15.033354357 15 3.9534469799999954)
;;(120.40541861 1947 93.45048212499998)
;;(128.362728323 1942 93.44881820700004)