unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* MPS: a random backtrace while toying with gdb
@ 2024-06-29 19:12 Ihor Radchenko
  2024-06-29 19:19 ` Pip Cet
  0 siblings, 1 reply; 68+ messages in thread
From: Ihor Radchenko @ 2024-06-29 19:12 UTC (permalink / raw)
  To: emacs-devel; +Cc: Gerd Möllmann, Eli Zaretskii, eller.helmut

I just got this while starting up Emacs on the latest scratch/igc (with
native-comp).

lockix.c:126: Emacs fatal error: assertion failed: res == 0

Thread 1 "emacs" hit Breakpoint 1, terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:443
443	{
(gdb) bt
#0  terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:443
#1  0x00005555558634be in set_state (state=IGC_STATE_DEAD) at igc.c:179
#2  igc_assert_fail (file=<optimized out>, line=<optimized out>, msg=<optimized out>) at igc.c:205
#3  0x00005555558f1e19 in mps_lib_assert_fail (condition=0x555555943c4c "res == 0", line=126, file=0x555555943c36 "lockix.c")
    at /home/yantar92/Dist/mps/code/mpsliban.c:87
#4  LockClaim (lock=0x7fffe8000110) at /home/yantar92/Dist/mps/code/lockix.c:126
#5  0x00005555558f204d in ArenaEnterLock (arena=0x7ffff7fbf000, recursive=0) at /home/yantar92/Dist/mps/code/global.c:576
#6  0x000055555591aefe in ArenaEnter (arena=0x7ffff7fbf000) at /home/yantar92/Dist/mps/code/global.c:553
#7  ArenaAccess (addr=0x7fffeb908758, mode=mode@entry=3, context=context@entry=0x7fffffff97d0) at /home/yantar92/Dist/mps/code/global.c:655
#8  0x0000555555926202 in sigHandle (sig=<optimized out>, info=0x7fffffff9af0, uap=0x7fffffff99c0) at /home/yantar92/Dist/mps/code/protsgix.c:97
#9  0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
#10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
#11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
#12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
#13 handle_child_signal (sig=sig@entry=17) at process.c:7660
#14 0x000055555573b771 in deliver_process_signal (sig=17, handler=handler@entry=0x555555827200 <handle_child_signal>) at sysdep.c:1758
#15 0x0000555555820647 in deliver_child_signal (sig=<optimized out>) at process.c:7702
#16 0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
#17 0x000055555585f77b in fix_lisp_obj (ss=ss@entry=0x7fffffffa9a8, pobj=pobj@entry=0x7fffeee7ffe8) at igc.c:841
#18 0x000055555586050d in fix_cons (ss=0x7fffffffa9a8, cons=0x7fffeee7ffe0) at igc.c:1474
#19 dflt_scan_obj (ss=0x7fffffffa9a8, base_start=0x7fffeee7ffd8, base_limit=0x7fffeee80000, closure=0x0) at igc.c:1578
#20 dflt_scanx (ss=ss@entry=0x7fffffffa9a8, base_start=<optimized out>, base_limit=0x7fffeee80000, closure=closure@entry=0x0) at igc.c:1658
#21 0x00005555558613a3 in dflt_scan (ss=0x7fffffffa9a8, base_start=<optimized out>, base_limit=<optimized out>) at igc.c:1669
#22 0x00005555558f163f in TraceScanFormat (limit=0x7fffeee80000, base=0x7fffeee7e000, ss=0x7fffffffa9a0) at /home/yantar92/Dist/mps/code/trace.c:1539
#23 amcSegScan (totalReturn=0x7fffffffa99c, seg=0x7fffe845e4c8, ss=0x7fffffffa9a0) at /home/yantar92/Dist/mps/code/poolamc.c:1440
#24 0x000055555591e7bc in traceScanSegRes (ts=ts@entry=1, rank=rank@entry=1, arena=arena@entry=0x7ffff7fbf000, seg=seg@entry=0x7fffe845e4c8)
    at /home/yantar92/Dist/mps/code/trace.c:1205
#25 0x000055555591e9ca in traceScanSeg (ts=1, rank=1, arena=0x7ffff7fbf000, seg=0x7fffe845e4c8) at /home/yantar92/Dist/mps/code/trace.c:1267
#26 0x000055555591f3a4 in TraceAdvance (trace=trace@entry=0x7ffff7fbfaa8) at /home/yantar92/Dist/mps/code/trace.c:1728
#27 0x000055555591faa4 in TracePoll
    (workReturn=workReturn@entry=0x7fffffffab90, collectWorldReturn=collectWorldReturn@entry=0x7fffffffab8c, globals=globals@entry=0x7ffff7fbf008, collectWorldAllowed=<optimized out>) at /home/yantar92/Dist/mps/code/trace.c:1849
#28 0x000055555591fceb in ArenaPoll (globals=globals@entry=0x7ffff7fbf008) at /home/yantar92/Dist/mps/code/global.c:745
#29 0x00005555559200da in mps_ap_fill (p_o=p_o@entry=0x7fffffffad00, mps_ap=mps_ap@entry=0x7fffe80017f0, size=size@entry=24)
    at /home/yantar92/Dist/mps/code/mpsi.c:1097
#30 0x00005555558601ee in alloc_impl (size=24, type=IGC_OBJ_CONS, ap=0x7fffe80017f0) at igc.c:3330
#31 0x000055555586023c in alloc (size=size@entry=16, type=type@entry=IGC_OBJ_CONS) at igc.c:3358
#32 0x000055555586187a in igc_make_cons (car=XIL(0x133e0), cdr=XIL(0)) at igc.c:3385
#33 0x000055555578e7de in Fcons (car=<optimized out>, cdr=<optimized out>) at alloc.c:2926
#34 Flist (nargs=31, args=0x7fffffffaf38) at alloc.c:3054
#35 0x00007ffff06b13ea in F7365742d666163652d617474726962757465_set_face_attribute_0 ()
--Type <RET> for more, q to quit, c to continue without paging--c
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/faces-b9447c93-32c2609b.eln
#36 0x00005555557bfcbc in funcall_subr (subr=<optimized out>, numargs=numargs@entry=34, args=args@entry=0x7fffffffaf28) at eval.c:3209
#37 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=34, args=args@entry=0x7fffffffaf28) at eval.c:3065
#38 0x00005555557bd2dc in Ffuncall (nargs=nargs@entry=35, args=args@entry=0x7fffffffaf20) at eval.c:3118
#39 0x00005555557bda80 in Fapply (nargs=4, args=0x7fffffffb0e0) at eval.c:2790
#40 0x00007ffff06b6cac in F666163652d737065632d72657365742d66616365_face_spec_reset_face_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/faces-b9447c93-32c2609b.eln
#41 0x00005555557bfbb5 in funcall_subr (subr=<optimized out>, numargs=numargs@entry=2, args=args@entry=0x7fffffffb2d8) at eval.c:3188
#42 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=2, args=args@entry=0x7fffffffb2d8) at eval.c:3065
#43 0x00005555557bd2dc in Ffuncall (nargs=3, args=0x7fffffffb2d0) at eval.c:3118
#44 0x00007ffff06b6fb6 in F666163652d737065632d726563616c63_face_spec_recalc_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/faces-b9447c93-32c2609b.eln
#45 0x00005555557bfbb5 in funcall_subr (subr=<optimized out>, numargs=numargs@entry=2, args=args@entry=0x7fffffffb598) at eval.c:3188
#46 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=2, args=args@entry=0x7fffffffb598) at eval.c:3065
#47 0x00005555557bd2dc in Ffuncall (nargs=3, args=0x7fffffffb590) at eval.c:3118
#48 0x00007ffff06b6e61 in F666163652d737065632d736574_face_spec_set_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/faces-b9447c93-32c2609b.eln
#49 0x00005555557bfbcc in funcall_subr (subr=<optimized out>, numargs=numargs@entry=3, args=args@entry=0x7fffffffb6f8) at eval.c:3190
#50 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=3, args=args@entry=0x7fffffffb6f8) at eval.c:3065
#51 0x00005555557bd2dc in Ffuncall (nargs=4, args=0x7fffffffb6f0) at eval.c:3118
#52 0x00007fffe02fa20c in F637573746f6d2d6465636c6172652d66616365_custom_declare_face_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/cus-face-74f1689e-33ab82e8.eln
#53 0x00005555557c162c in eval_sub (form=form@entry=XIL(0x7fffe5fb38fb)) at /home/yantar92/Git/emacs/src/lisp.h:2267
#54 0x00005555557fce9f in readevalloop
    (readcharfun=readcharfun@entry=XIL(0x98a0), infile0=infile0@entry=0x7fffffffb950, sourcename=sourcename@entry=XIL(0x7fffe5fb20c4), printflag=printflag@entry=false, unibyte=unibyte@entry=XIL(0), readfun=readfun@entry=XIL(0), start=XIL(0), end=<optimized out>) at lread.c:2541
#55 0x00005555557fdc0b in Fload (file=<optimized out>, 
    file@entry=XIL(0x7fffe89a00ac), noerror=noerror@entry=XIL(0x30), nomessage=nomessage@entry=XIL(0x30), nosuffix=nosuffix@entry=XIL(0), must_suffix=<optimized out>, must_suffix@entry=XIL(0x30)) at lread.c:1732
#56 0x00005555557fdda8 in save_match_data_load
    (file=file@entry=XIL(0x7fffe89a00ac), noerror=noerror@entry=XIL(0x30), nomessage=nomessage@entry=XIL(0x30), nosuffix=nosuffix@entry=XIL(0), must_suffix=must_suffix@entry=XIL(0x30)) at lread.c:1781
#57 0x00005555557bd087 in load_with_autoload_queue
    (file=file@entry=XIL(0x7fffe89a00ac), noerror=noerror@entry=XIL(0x30), nomessage=nomessage@entry=XIL(0x30), nosuffix=nosuffix@entry=XIL(0), must_suffix=must_suffix@entry=XIL(0x30)) at eval.c:2401
#58 0x00005555557d573d in Frequire (feature=XIL(0x2aaa929409f8), filename=<optimized out>, noerror=XIL(0x30)) at fns.c:3768
#59 0x00005555557c17bd in eval_sub (form=<optimized out>) at eval.c:2629
#60 0x00005555557c13fe in eval_sub (form=<optimized out>) at eval.c:2611
#61 0x00005555557c2b25 in Fif (args=XIL(0x7fffe5fafa03)) at eval.c:400
#62 0x00005555557c14ed in eval_sub (form=<optimized out>) at eval.c:2574
#63 0x00005555557c1f5e in Fprogn (body=<optimized out>) at eval.c:448
#64 0x00005555557c14ed in eval_sub (form=form@entry=XIL(0x7fffe5f882f3)) at eval.c:2574
#65 0x00005555557c42cb in internal_lisp_condition_case (var=XIL(0x2aaa8c8ec508), bodyform=<optimized out>, handlers=<optimized out>) at eval.c:1583
#66 0x00005555557c4664 in Fcondition_case (args=<optimized out>) at eval.c:1454
#67 0x00005555557c14ed in eval_sub (form=<optimized out>) at eval.c:2574
#68 0x00005555557c1f5e in Fprogn (body=<optimized out>) at eval.c:448
#69 0x00005555557c14ed in eval_sub (form=<optimized out>) at eval.c:2574
#70 0x00005555557c2b4c in Fif (args=<optimized out>) at eval.c:403
#71 0x00005555557c14ed in eval_sub (form=<optimized out>) at eval.c:2574
#72 0x00005555557f48e5 in readevalloop_eager_expand_eval (val=<optimized out>, macroexpand=macroexpand@entry=XIL(0xb3d0)) at lread.c:2357
#73 0x00005555557f4859 in readevalloop_eager_expand_eval (val=XIL(0), val@entry=XIL(0x7fffe5f7d80b), macroexpand=macroexpand@entry=XIL(0xb3d0))
    at /home/yantar92/Git/emacs/src/lisp.h:1522
#74 0x00005555557fce01 in readevalloop
    (readcharfun=readcharfun@entry=XIL(0x7fffe2905d95), infile0=infile0@entry=0x0, sourcename=sourcename@entry=XIL(0x7fffe2905c3c), printflag=printflag@entry=false, unibyte=unibyte@entry=XIL(0), readfun=readfun@entry=XIL(0), start=XIL(0), end=<optimized out>) at lread.c:2539
#75 0x00005555557fdfdb in Feval_buffer
    (buffer=<optimized out>, printflag=XIL(0), filename=XIL(0x7fffe2905c3c), unibyte=XIL(0), do_allow_print=<optimized out>) at lread.c:2614
#76 0x00007fffe01023e4 in F6c6f61642d776974682d636f64652d636f6e76657273696f6e_load_with_code_conversion_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/mule-3352613d-8d84c8e1.eln
#77 0x00005555557bfc06 in funcall_subr (subr=<optimized out>, numargs=numargs@entry=4, args=args@entry=0x7fffffffc678) at eval.c:3194
#78 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=4, args=args@entry=0x7fffffffc678) at eval.c:3065
#79 0x00005555557bd2dc in Ffuncall (nargs=nargs@entry=5, args=args@entry=0x7fffffffc670) at eval.c:3118
#80 0x00005555557fd947 in Fload
    (file=<optimized out>, noerror=<optimized out>, nomessage=<optimized out>, nosuffix=<optimized out>, must_suffix=<optimized out>) at lread.c:1620
#81 0x00005555557c1807 in eval_sub (form=<optimized out>) at eval.c:2637
#82 0x00005555557c1f5e in Fprogn (body=<optimized out>) at eval.c:448
#83 0x00005555557c36f8 in Flet (args=<optimized out>) at /home/yantar92/Git/emacs/src/lisp.h:1522
#84 0x00005555557c14ed in eval_sub (form=<optimized out>) at eval.c:2574
#85 0x00005555557f48e5 in readevalloop_eager_expand_eval (val=<optimized out>, val@entry=XIL(0x7fffe29054e3), macroexpand=macroexpand@entry=XIL(0xb3d0))
    at lread.c:2357
#86 0x00005555557fce01 in readevalloop
    (readcharfun=readcharfun@entry=XIL(0x7fffe29026a5), infile0=infile0@entry=0x0, sourcename=sourcename@entry=XIL(0x7fffe290259c), printflag=printflag@entry=false, unibyte=unibyte@entry=XIL(0), readfun=readfun@entry=XIL(0), start=XIL(0), end=<optimized out>) at lread.c:2539
#87 0x00005555557fdfdb in Feval_buffer
    (buffer=<optimized out>, printflag=XIL(0), filename=XIL(0x7fffe290259c), unibyte=XIL(0), do_allow_print=<optimized out>) at lread.c:2614
#88 0x00007fffe01023e4 in F6c6f61642d776974682d636f64652d636f6e76657273696f6e_load_with_code_conversion_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/mule-3352613d-8d84c8e1.eln
#89 0x00005555557bfc06 in funcall_subr (subr=<optimized out>, numargs=numargs@entry=4, args=args@entry=0x7fffffffcdd8) at eval.c:3194
#90 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=4, args=args@entry=0x7fffffffcdd8) at eval.c:3065
#91 0x00005555557bd2dc in Ffuncall (nargs=nargs@entry=5, args=args@entry=0x7fffffffcdd0) at eval.c:3118
#92 0x00005555557fd947 in Fload
    (file=<optimized out>, noerror=<optimized out>, nomessage=<optimized out>, nosuffix=<optimized out>, must_suffix=<optimized out>) at lread.c:1620
#93 0x00005555557bfc06 in funcall_subr (subr=<optimized out>, numargs=numargs@entry=3, args=args@entry=0x7fffdf200048) at eval.c:3194
#94 0x0000555555811438 in exec_byte_code (fun=<optimized out>, fun@entry=XIL(0x7fffe28f3c55), args_template=<optimized out>, 
    args_template@entry=0, nargs=<optimized out>, nargs@entry=0, args=<optimized out>, args@entry=0x7fffffffd0d0)
    at /home/yantar92/Git/emacs/src/lisp.h:2267
#95 0x00005555557c20cf in funcall_lambda (fun=XIL(0x7fffe28f3c55), nargs=nargs@entry=0, arg_vector=arg_vector@entry=0x7fffffffd0d0) at eval.c:3277
#96 0x00005555557c2a70 in funcall_general (fun=<optimized out>, numargs=numargs@entry=0, args=args@entry=0x7fffffffd0d0) at eval.c:3069
#97 0x00005555557bd2dc in Ffuncall (nargs=1, args=0x7fffffffd0c8) at eval.c:3118
#98 0x00007fffdff374f7 in F737461727475702d2d6c6f61642d757365722d696e69742d66696c65_startup__load_user_init_file_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/startup-bbc6ea72-e56c9510.eln
#99 0x00005555557bfbcc in funcall_subr (subr=<optimized out>, numargs=numargs@entry=3, args=args@entry=0x7fffffffd258) at eval.c:3190
#100 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=3, args=args@entry=0x7fffffffd258) at eval.c:3065
#101 0x00005555557bd2dc in Ffuncall (nargs=4, args=0x7fffffffd250) at eval.c:3118
#102 0x00007fffdff3892b in F636f6d6d616e642d6c696e65_command_line_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/startup-bbc6ea72-e56c9510.eln
#103 0x00005555557bfb96 in funcall_subr (subr=<optimized out>, numargs=numargs@entry=0, args=args@entry=0x7fffffffd3c0) at eval.c:3184
#104 0x00005555557c28bb in funcall_general (fun=<optimized out>, numargs=numargs@entry=0, args=args@entry=0x7fffffffd3c0) at eval.c:3065
#105 0x00005555557bd2dc in Ffuncall (nargs=1, args=0x7fffffffd3b8) at eval.c:3118
#106 0x00007fffdff353f0 in F6e6f726d616c2d746f702d6c6576656c_normal_top_level_0 ()
    at /home/yantar92/Git/emacs/src/../native-lisp/31.0.50-27937d9d/preloaded/startup-bbc6ea72-e56c9510.eln
#107 0x00005555557c1e73 in eval_sub (form=form@entry=XIL(0x7fffe08b357b)) at eval.c:2620
#108 0x00005555557c46ac in Feval (form=XIL(0x7fffe08b357b), lexical=lexical@entry=XIL(0x30)) at eval.c:2482
#109 0x00005555557137d8 in top_level_2 () at keyboard.c:1185
#110 0x00005555557bb469 in internal_condition_case
    (bfun=bfun@entry=0x5555557137b9 <top_level_2>, handlers=handlers@entry=XIL(0x90), hfun=hfun@entry=0x55555571a734 <cmd_error>) at eval.c:1629
#111 0x00005555557134e4 in top_level_1 (ignore=ignore@entry=XIL(0)) at keyboard.c:1197
#112 0x00005555557bb368 in internal_catch (tag=tag@entry=XIL(0x12360), func=func@entry=0x5555557134ba <top_level_1>, arg=arg@entry=XIL(0)) at eval.c:1308
#113 0x0000555555713450 in command_loop () at keyboard.c:1146
#114 0x000055555571a219 in recursive_edit_1 () at keyboard.c:755
#115 0x000055555571a58e in Frecursive_edit () at keyboard.c:838
#116 0x0000555555712f37 in main (argc=1, argv=<optimized out>) at emacs.c:2651

Lisp Backtrace:
"set-face-attribute" (0xffffaf28)
"face-spec-reset-face" (0xffffb2d8)
"face-spec-recalc" (0xffffb598)
"face-spec-set" (0xffffb6f8)
"custom-declare-face" (0xffffb770)
"require" (0xffffbbb0)
"not" (0xffffbc58)
"if" (0xffffbd28)
"progn" (0xffffbdf8)
"condition-case" (0xffffbf38)
"progn" (0xffffc008)
"if" (0xffffc0d8)
"load-with-code-conversion" (0xffffc678)
"load" (0xffffc770)
"let" (0xffffc8c8)
"load-with-code-conversion" (0xffffcdd8)
"load" (0xdf200048)
0xe28f3c50 PVEC_CLOSURE
"startup--load-user-init-file" (0xffffd258)
"command-line" (0xffffd3c0)
"normal-top-level" (0xffffd450)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-29 19:12 MPS: a random backtrace while toying with gdb Ihor Radchenko
@ 2024-06-29 19:19 ` Pip Cet
  2024-06-29 21:46   ` Gerd Möllmann
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-06-29 19:19 UTC (permalink / raw)
  To: Ihor Radchenko
  Cc: emacs-devel, Gerd Möllmann, Eli Zaretskii, eller.helmut

On Saturday, June 29th, 2024 at 19:11, Ihor Radchenko <yantar92@posteo.net> wrote:
> I just got this while starting up Emacs on the latest scratch/igc (with
> native-comp).

Looks like the same double-signal situation we saw with the profiler, only this time it's SIGCHLD interrupting an MPS scan.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-29 19:19 ` Pip Cet
@ 2024-06-29 21:46   ` Gerd Möllmann
  2024-06-30  4:55     ` Eli Zaretskii
                       ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-29 21:46 UTC (permalink / raw)
  To: Pip Cet; +Cc: Ihor Radchenko, emacs-devel, Eli Zaretskii, eller.helmut

Pip Cet <pipcet@protonmail.com> writes:

> On Saturday, June 29th, 2024 at 19:11, Ihor Radchenko <yantar92@posteo.net> wrote:
>> I just got this while starting up Emacs on the latest scratch/igc (with
>> native-comp).
>
> Looks like the same double-signal situation we saw with the profiler, only this time it's SIGCHLD interrupting an MPS scan.
>
> Pip

Indeed

  #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
  #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
  #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
  #13 handle_child_signal (sig=sig@entry=17) at process.c:7660

Someone has an idea what to do with that? And maybe how to reproduce?




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-29 21:46   ` Gerd Möllmann
@ 2024-06-30  4:55     ` Eli Zaretskii
  2024-06-30  5:33       ` Gerd Möllmann
  2024-06-30  9:36     ` Helmut Eller
  2024-06-30  9:59     ` Pip Cet
  2 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30  4:55 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

> From: Gerd Möllmann <gerd.moellmann@gmail.com>
> Cc: Ihor Radchenko <yantar92@posteo.net>,  emacs-devel@gnu.org,  Eli
>  Zaretskii <eliz@gnu.org>,  eller.helmut@gmail.com
> Date: Sat, 29 Jun 2024 23:46:27 +0200
> 
> Pip Cet <pipcet@protonmail.com> writes:
> 
> > On Saturday, June 29th, 2024 at 19:11, Ihor Radchenko <yantar92@posteo.net> wrote:
> >> I just got this while starting up Emacs on the latest scratch/igc (with
> >> native-comp).
> >
> > Looks like the same double-signal situation we saw with the profiler, only this time it's SIGCHLD interrupting an MPS scan.
> >
> > Pip
> 
> Indeed
> 
>   #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
>   #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
>   #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
>   #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
> 
> Someone has an idea what to do with that?

Call sigblock at the beginning of dflt_scan (and friends) and
sigunblock before it returns?

Are there any guidance in the MPS docs for handling such signals in
these situations?  If not, can we ask them for guidance?  It is
unrealistic to expect any real-life program to do nothing in signal
handlers except setting flags.  And even flags could be subject to MPS
moving GC, at least in some cases.  So there must be some way of
dealing with that in a safe way.

> And maybe how to reproduce?

Run for enough time with subprocesses that start and terminate, I
guess?



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  4:55     ` Eli Zaretskii
@ 2024-06-30  5:33       ` Gerd Möllmann
  2024-06-30  6:16         ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30  5:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

Eli Zaretskii <eliz@gnu.org> writes:

>> Indeed
>> 
>>   #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
>>   #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
>>   #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
>>   #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
>> 
>> Someone has an idea what to do with that?
>
> Call sigblock at the beginning of dflt_scan (and friends) and
> sigunblock before it returns?

Yuk.

> Are there any guidance in the MPS docs for handling such signals in
> these situations?  If not, can we ask them for guidance?  It is
> unrealistic to expect any real-life program to do nothing in signal
> handlers except setting flags.  And even flags could be subject to MPS
> moving GC, at least in some cases.  So there must be some way of
> dealing with that in a safe way.

It's Emacs' fault. MPS cannot reasonably be expected to assume that a
client program uses MPS managed memory in a signal handler. My 2 cents.

>
>> And maybe how to reproduce?
>
> Run for enough time with subprocesses that start and terminate, I
> guess?

Just remembered that I won't be able to reproduce this anyway an macOS,
where barriers don't use signals.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  5:33       ` Gerd Möllmann
@ 2024-06-30  6:16         ` Eli Zaretskii
  2024-06-30  6:43           ` Gerd Möllmann
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30  6:16 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

> From: Gerd Möllmann <gerd.moellmann@gmail.com>
> Cc: pipcet@protonmail.com,  yantar92@posteo.net,  emacs-devel@gnu.org,
>   eller.helmut@gmail.com
> Date: Sun, 30 Jun 2024 07:33:54 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> Indeed
> >> 
> >>   #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
> >>   #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
> >>   #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
> >>   #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
> >> 
> >> Someone has an idea what to do with that?
> >
> > Call sigblock at the beginning of dflt_scan (and friends) and
> > sigunblock before it returns?
> 
> Yuk.

Why "yuck"?  That's a valid solution, IMO, especially if there are no
better ones.  I'm talking about blocking only a small number of
signals: SIGCHLD, SIGPROF, SIGALRM, and maybe also SIGIO and SIGINT.
(We should also consider SIGUSR1 and SIGUSR2.)

> > Are there any guidance in the MPS docs for handling such signals in
> > these situations?  If not, can we ask them for guidance?  It is
> > unrealistic to expect any real-life program to do nothing in signal
> > handlers except setting flags.  And even flags could be subject to MPS
> > moving GC, at least in some cases.  So there must be some way of
> > dealing with that in a safe way.
> 
> It's Emacs' fault. MPS cannot reasonably be expected to assume that a
> client program uses MPS managed memory in a signal handler. My 2 cents.

That's unreasonable, IMNSHO.  Programs manage memory for various
purposes, and nothing in principle should prevent a program from
accessing MPS-managed memory from a signal handler.

Are you saying that a program that manages _all_ of its heap-allocated
memory via MPS cannot also handle signals in a safe way?

> >> And maybe how to reproduce?
> >
> > Run for enough time with subprocesses that start and terminate, I
> > guess?
> 
> Just remembered that I won't be able to reproduce this anyway an macOS,
> where barriers don't use signals.

AFAIU, this scenario is not necessarily related to barrier-related
signals.  SIGCHLD caused us to access MPS-managed memory, which
violated some assertion in MPS, because the arena lock was already
taken.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  6:16         ` Eli Zaretskii
@ 2024-06-30  6:43           ` Gerd Möllmann
  2024-06-30  8:52             ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30  6:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

Eli Zaretskii <eliz@gnu.org> writes:

> Why "yuck"?  That's a valid solution, IMO, especially if there are no
> better ones.  I'm talking about blocking only a small number of
> signals: SIGCHLD, SIGPROF, SIGALRM, and maybe also SIGIO and SIGINT.
> (We should also consider SIGUSR1 and SIGUSR2.)

I find that really ugly. Of course, should there be no other solution on
some platforms, we can try that. It's not needed for macOS, for example.

>> > Are there any guidance in the MPS docs for handling such signals in
>> > these situations?  If not, can we ask them for guidance?  It is
>> > unrealistic to expect any real-life program to do nothing in signal
>> > handlers except setting flags.  And even flags could be subject to MPS
>> > moving GC, at least in some cases.  So there must be some way of
>> > dealing with that in a safe way.
>> 
>> It's Emacs' fault. MPS cannot reasonably be expected to assume that a
>> client program uses MPS managed memory in a signal handler. My 2 cents.
>
> That's unreasonable, IMNSHO.  Programs manage memory for various
> purposes, and nothing in principle should prevent a program from
> accessing MPS-managed memory from a signal handler.

I disagree.

> Are you saying that a program that manages _all_ of its heap-allocated
> memory via MPS cannot also handle signals in a safe way?

The program can do both, but only if the signal handler behaves. There
is not much a signal handler is allowed to do. Portably it can do almost
nothing. But even non-portably, it's never safe to call non-reentrant
code. On Linux, there is a list of async-signal-safe functions one can
call in a signal handler, others cannot be used.

I don't expect MPS to be async-signal-safe. I find that unreasonble.
Why would it be? Emacs's isn't either. Almost nothing is.

>> >> And maybe how to reproduce?
>> >
>> > Run for enough time with subprocesses that start and terminate, I
>> > guess?
>> 
>> Just remembered that I won't be able to reproduce this anyway an macOS,
>> where barriers don't use signals.
>
> AFAIU, this scenario is not necessarily related to barrier-related
> signals.  SIGCHLD caused us to access MPS-managed memory, which
> violated some assertion in MPS, because the arena lock was already
> taken.

I would have to see an example where no barrier is involved. It is in
this case. MPS is doing work, therefore holds the lock, receives SIGxy
which it let's the client handle. The client hits a barrier, which
invokes MPS's signal handler, which tries to acquire the lock which is
already taken.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  6:43           ` Gerd Möllmann
@ 2024-06-30  8:52             ` Eli Zaretskii
  2024-06-30  9:43               ` Gerd Möllmann
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30  8:52 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

> From: Gerd Möllmann <gerd.moellmann@gmail.com>
> Cc: pipcet@protonmail.com,  yantar92@posteo.net,  emacs-devel@gnu.org,
>   eller.helmut@gmail.com
> Date: Sun, 30 Jun 2024 08:43:02 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Why "yuck"?  That's a valid solution, IMO, especially if there are no
> > better ones.  I'm talking about blocking only a small number of
> > signals: SIGCHLD, SIGPROF, SIGALRM, and maybe also SIGIO and SIGINT.
> > (We should also consider SIGUSR1 and SIGUSR2.)
> 
> I find that really ugly. Of course, should there be no other solution on
> some platforms, we can try that. It's not needed for macOS, for example.

We should only block signals where that is necessary, of course.

> >> > Are there any guidance in the MPS docs for handling such signals in
> >> > these situations?  If not, can we ask them for guidance?  It is
> >> > unrealistic to expect any real-life program to do nothing in signal
> >> > handlers except setting flags.  And even flags could be subject to MPS
> >> > moving GC, at least in some cases.  So there must be some way of
> >> > dealing with that in a safe way.
> >> 
> >> It's Emacs' fault. MPS cannot reasonably be expected to assume that a
> >> client program uses MPS managed memory in a signal handler. My 2 cents.
> >
> > That's unreasonable, IMNSHO.  Programs manage memory for various
> > purposes, and nothing in principle should prevent a program from
> > accessing MPS-managed memory from a signal handler.
> 
> I disagree.
> 
> > Are you saying that a program that manages _all_ of its heap-allocated
> > memory via MPS cannot also handle signals in a safe way?
> 
> The program can do both, but only if the signal handler behaves. There
> is not much a signal handler is allowed to do. Portably it can do almost
> nothing. But even non-portably, it's never safe to call non-reentrant
> code. On Linux, there is a list of async-signal-safe functions one can
> call in a signal handler, others cannot be used.

Which unsafe function did we call in this case, though?  AFAICT, we
simply accessed memory that happens to be behind the barrier:

  #9  0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
  #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
  #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
  #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
  #13 handle_child_signal (sig=sig@entry=17) at process.c:7660

Line 7660 of process.c is this:

      struct Lisp_Process *p = XPROCESS (proc);

What "crime" did we commit here?

> I don't expect MPS to be async-signal-safe. I find that unreasonble.
> Why would it be? Emacs's isn't either. Almost nothing is.

See above: I'm not sure this is about async-signal-safety.

> >> Just remembered that I won't be able to reproduce this anyway an macOS,
> >> where barriers don't use signals.
> >
> > AFAIU, this scenario is not necessarily related to barrier-related
> > signals.  SIGCHLD caused us to access MPS-managed memory, which
> > violated some assertion in MPS, because the arena lock was already
> > taken.
> 
> I would have to see an example where no barrier is involved. It is in
> this case. MPS is doing work, therefore holds the lock, receives SIGxy
> which it let's the client handle. The client hits a barrier, which
> invokes MPS's signal handler, which tries to acquire the lock which is
> already taken.

Wouldn't the equivalent mechanism used on macOS also want to acquire
the arena lock in this case?  If not, what will it do?



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-29 21:46   ` Gerd Möllmann
  2024-06-30  4:55     ` Eli Zaretskii
@ 2024-06-30  9:36     ` Helmut Eller
  2024-06-30 10:00       ` Eli Zaretskii
  2024-06-30 11:05       ` Gerd Möllmann
  2024-06-30  9:59     ` Pip Cet
  2 siblings, 2 replies; 68+ messages in thread
From: Helmut Eller @ 2024-06-30  9:36 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: Pip Cet, Ihor Radchenko, emacs-devel, Eli Zaretskii

On Sat, Jun 29 2024, Gerd Möllmann wrote:

> Someone has an idea what to do with that? And maybe how to reproduce?

I can reproduce it with

/src/emacs -Q -batch -eval '(dotimes (_ 100)
  (start-process "foo" nil "sleep" "0.2")
  (igc--collect))'

I would move most of the work out of signal handlers.  The signal
handler could put the work in a "async work queue" and then return.  The
actual work is then processed by maybe_quit.

Sending the work to a helper thread would be nicer, but probably not a
realistic option with Emacs.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  8:52             ` Eli Zaretskii
@ 2024-06-30  9:43               ` Gerd Möllmann
  2024-06-30 10:05                 ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30  9:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

Eli Zaretskii <eliz@gnu.org> writes:

>> The program can do both, but only if the signal handler behaves. There
>> is not much a signal handler is allowed to do. Portably it can do almost
>> nothing. But even non-portably, it's never safe to call non-reentrant
>> code. On Linux, there is a list of async-signal-safe functions one can
>> call in a signal handler, others cannot be used.
>
> Which unsafe function did we call in this case, though?  AFAICT, we
> simply accessed memory that happens to be behind the barrier:
>
>   #9  0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
>   #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
>   #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
>   #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
>   #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
>
> Line 7660 of process.c is this:
>
>       struct Lisp_Process *p = XPROCESS (proc);
>
> What "crime" did we commit here?

Not a crime, but we are using a lib that we know uses barriers and
functions of which get invoked when we hit such a barrier. We can't just
ignore that.

>> I don't expect MPS to be async-signal-safe. I find that unreasonble.
>> Why would it be? Emacs's isn't either. Almost nothing is.
>
> See above: I'm not sure this is about async-signal-safety.
>
>> >> Just remembered that I won't be able to reproduce this anyway an macOS,
>> >> where barriers don't use signals.
>> >
>> > AFAIU, this scenario is not necessarily related to barrier-related
>> > signals.  SIGCHLD caused us to access MPS-managed memory, which
>> > violated some assertion in MPS, because the arena lock was already
>> > taken.
>> 
>> I would have to see an example where no barrier is involved. It is in
>> this case. MPS is doing work, therefore holds the lock, receives SIGxy
>> which it let's the client handle. The client hits a barrier, which
>> invokes MPS's signal handler, which tries to acquire the lock which is
>> already taken.
>
> Wouldn't the equivalent mechanism used on macOS also want to acquire
> the arena lock in this case?  If not, what will it do?

I can't tell you exactly how MPS does this on macOS because I'm refusing
to study their code, where possible at least. Fact is that barrier hits
on macOS don't result in a signal, but are handled as Mach exceptions.
Maybe their design docs have something explaining this, don't know.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-29 21:46   ` Gerd Möllmann
  2024-06-30  4:55     ` Eli Zaretskii
  2024-06-30  9:36     ` Helmut Eller
@ 2024-06-30  9:59     ` Pip Cet
  2024-06-30 10:09       ` Eli Zaretskii
  2024-06-30 11:10       ` Gerd Möllmann
  2 siblings, 2 replies; 68+ messages in thread
From: Pip Cet @ 2024-06-30  9:59 UTC (permalink / raw)
  To: Gerd Möllmann
  Cc: Ihor Radchenko, emacs-devel, Eli Zaretskii, eller.helmut

On Saturday, June 29th, 2024 at 21:46, Gerd Möllmann <gerd.moellmann@gmail.com> wrote:
> Pip Cet pipcet@protonmail.com writes:
> 
> > On Saturday, June 29th, 2024 at 19:11, Ihor Radchenko yantar92@posteo.net wrote:
> > 
> > > I just got this while starting up Emacs on the latest scratch/igc (with
> > > native-comp).
> > 
> > Looks like the same double-signal situation we saw with the profiler, only this time it's SIGCHLD interrupting an MPS scan.
> > 
> > Pip
> 
> 
> Indeed
> 
> #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
> #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
> #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
> #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
> 
> Someone has an idea what to do with that? And maybe how to reproduce?

We could always "steal" the SIGSEGV handler and reinstall it with a sigmask (without modifying the MPS source). My guess is that's equivalent to what the macOS code does, essentially, by using a different class of signal that blocks POSIX signals...

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  9:36     ` Helmut Eller
@ 2024-06-30 10:00       ` Eli Zaretskii
  2024-06-30 10:24         ` Helmut Eller
  2024-06-30 11:06         ` Gerd Möllmann
  2024-06-30 11:05       ` Gerd Möllmann
  1 sibling, 2 replies; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 10:00 UTC (permalink / raw)
  To: Helmut Eller; +Cc: gerd.moellmann, pipcet, yantar92, emacs-devel

> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: Pip Cet <pipcet@protonmail.com>,  Ihor Radchenko <yantar92@posteo.net>,
>  emacs-devel@gnu.org,  Eli Zaretskii <eliz@gnu.org>
> Date: Sun, 30 Jun 2024 11:36:34 +0200
> 
> On Sat, Jun 29 2024, Gerd Möllmann wrote:
> 
> > Someone has an idea what to do with that? And maybe how to reproduce?
> 
> I can reproduce it with
> 
> /src/emacs -Q -batch -eval '(dotimes (_ 100)
>   (start-process "foo" nil "sleep" "0.2")
>   (igc--collect))'
> 
> I would move most of the work out of signal handlers.  The signal
> handler could put the work in a "async work queue" and then return.  The
> actual work is then processed by maybe_quit.

But in this case we hit the barrier just by accessing the process
object.  How can you do anything useful in a SIGCHLD handler if you
cannot even find out which process exited and how did it exit?
Surely, something like that is needed in order to put any work on some
queue?  And I'm not even talking about the effects of deferring this
job "for later" on the speed of working with sub-processes in Emacs.

IMO, we must find a way of touching MPS-managed memory safely from a
signal handler, otherwise writing signal handlers will become an
impossible job.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  9:43               ` Gerd Möllmann
@ 2024-06-30 10:05                 ` Eli Zaretskii
  2024-06-30 11:20                   ` Gerd Möllmann
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 10:05 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

> From: Gerd Möllmann <gerd.moellmann@gmail.com>
> Cc: pipcet@protonmail.com,  yantar92@posteo.net,  emacs-devel@gnu.org,
>   eller.helmut@gmail.com
> Date: Sun, 30 Jun 2024 11:43:56 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Which unsafe function did we call in this case, though?  AFAICT, we
> > simply accessed memory that happens to be behind the barrier:
> >
> >   #9  0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
> >   #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
> >   #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
> >   #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
> >   #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
> >
> > Line 7660 of process.c is this:
> >
> >       struct Lisp_Process *p = XPROCESS (proc);
> >
> > What "crime" did we commit here?
> 
> Not a crime, but we are using a lib that we know uses barriers and
> functions of which get invoked when we hit such a barrier. We can't just
> ignore that.

What alternatives do we have?  Emacs code deals with Lisp objects and
memory managed by MPS all over the place.  In many case, the
programmer doesn't necessarily know whether some data is or isn't in
memory managed by MPS.

You are asking to do the impossible, IMO.  Refraining from touching
MPS-managed memory from signal handlers is simply not practical in
Emacs.  E.g., it would make the profiler useless, because the minimum
you must do from the signal handler is to see which Lisp function is
being run; if you defer this for later, you will see a completely
skewed profile.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  9:59     ` Pip Cet
@ 2024-06-30 10:09       ` Eli Zaretskii
  2024-06-30 10:16         ` Pip Cet
  2024-06-30 11:10       ` Gerd Möllmann
  1 sibling, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 10:09 UTC (permalink / raw)
  To: Pip Cet; +Cc: gerd.moellmann, yantar92, emacs-devel, eller.helmut

> Date: Sun, 30 Jun 2024 09:59:17 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: Ihor Radchenko <yantar92@posteo.net>, emacs-devel@gnu.org, Eli Zaretskii <eliz@gnu.org>, eller.helmut@gmail.com
> 
> On Saturday, June 29th, 2024 at 21:46, Gerd Möllmann <gerd.moellmann@gmail.com> wrote:
> > Pip Cet pipcet@protonmail.com writes:
> > 
> > > On Saturday, June 29th, 2024 at 19:11, Ihor Radchenko yantar92@posteo.net wrote:
> > > 
> > > > I just got this while starting up Emacs on the latest scratch/igc (with
> > > > native-comp).
> > > 
> > > Looks like the same double-signal situation we saw with the profiler, only this time it's SIGCHLD interrupting an MPS scan.
> > > 
> > > Pip
> > 
> > 
> > Indeed
> > 
> > #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
> > #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
> > #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
> > #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
> > 
> > Someone has an idea what to do with that? And maybe how to reproduce?
> 
> We could always "steal" the SIGSEGV handler and reinstall it with a sigmask (without modifying the MPS source). My guess is that's equivalent to what the macOS code does, essentially, by using a different class of signal that blocks POSIX signals...

And this is less "ugly" (let alone more safe) than using sigblock?
What am I missing?



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:09       ` Eli Zaretskii
@ 2024-06-30 10:16         ` Pip Cet
  2024-06-30 10:34           ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-06-30 10:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gerd.moellmann, yantar92, emacs-devel, eller.helmut






On Sunday, June 30th, 2024 at 10:09, Eli Zaretskii <eliz@gnu.org> wrote:

> 
> 
> > Date: Sun, 30 Jun 2024 09:59:17 +0000
> 
> > From: Pip Cet pipcet@protonmail.com
> > Cc: Ihor Radchenko yantar92@posteo.net, emacs-devel@gnu.org, Eli Zaretskii eliz@gnu.org, eller.helmut@gmail.com
> > 
> > On Saturday, June 29th, 2024 at 21:46, Gerd Möllmann gerd.moellmann@gmail.com wrote:
> > 
> > > Pip Cet pipcet@protonmail.com writes:
> > > 
> > > > On Saturday, June 29th, 2024 at 19:11, Ihor Radchenko yantar92@posteo.net wrote:
> > > > 
> > > > > I just got this while starting up Emacs on the latest scratch/igc (with
> > > > > native-comp).
> > > > 
> > > > Looks like the same double-signal situation we saw with the profiler, only this time it's SIGCHLD interrupting an MPS scan.
> > > > 
> > > > Pip
> > > 
> > > Indeed
> > > 
> > > #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
> > > #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
> > > #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
> > > #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
> > > 
> > > Someone has an idea what to do with that? And maybe how to reproduce?
> > 
> > We could always "steal" the SIGSEGV handler and reinstall it with a sigmask (without modifying the MPS source). My guess is that's equivalent to what the macOS code does, essentially, by using a different class of signal that blocks POSIX signals...
> 
> 
> And this is less "ugly" (let alone more safe) than using sigblock?
> What am I missing?

Mostly, I think, the race condition where the barrier is installed (by MPS) some time before dflt_scan is even called (and, symmetrically, we'd unblock it too early while the barrier is still in effect). IOW, sigblock wouldn't work.

(I should point out that you've convinced me as far as SIGPROF goes. At least for the initial stage, it makes more sense to count hits in GC (which we do, with false positives, using the current scratch/igc code) than to delay signal processing until GC is over).

Unfortunately, I can't reproduce the SIGCHLD problem with Helmut's code, which doesn't cause a single SIGSEGV here...

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:00       ` Eli Zaretskii
@ 2024-06-30 10:24         ` Helmut Eller
  2024-06-30 10:43           ` Eli Zaretskii
  2024-06-30 11:07           ` Gerd Möllmann
  2024-06-30 11:06         ` Gerd Möllmann
  1 sibling, 2 replies; 68+ messages in thread
From: Helmut Eller @ 2024-06-30 10:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gerd.moellmann, pipcet, yantar92, emacs-devel

On Sun, Jun 30 2024, Eli Zaretskii wrote:

>> I would move most of the work out of signal handlers.  The signal
>> handler could put the work in a "async work queue" and then return.  The
>> actual work is then processed by maybe_quit.
>
> But in this case we hit the barrier just by accessing the process
> object.  How can you do anything useful in a SIGCHLD handler if you
> cannot even find out which process exited and how did it exit?

You don't do anything useful other than packing up the arguments that
the signal handler receives and put them in the queue.

> Surely, something like that is needed in order to put any work on some
> queue?  And I'm not even talking about the effects of deferring this
> job "for later" on the speed of working with sub-processes in Emacs.

The queue is not in MPS-managed memory; so we can access it.

> IMO, we must find a way of touching MPS-managed memory safely from a
> signal handler, otherwise writing signal handlers will become an
> impossible job.

The MPS documentation says quite clearly that a scanner must not access
MPS-managed memory other than the block it receives as argument:

  https://memory-pool-system.readthedocs.io/en/latest/topic/format.html#cautions

5. Format methods must not:

   a. call library code;
   b. access MPS-managed memory in pools that protect their contents;
   c. perform a non-local exit (for example, by throwing an exception, or
      calling longjmp());
   d. call any functions or macros in the MPS other than
      MPS_SCAN_BEGIN, MPS_SCAN_END, MPS_FIX1(), MPS_FIX12(), MPS_FIX2(), and
      MPS_FIX_CALL.

   It’s permissible to call other functions in the client program, but
   see MPS_FIX_CALL for a restriction on passing the scan state.

6. Subject to the above constraints, format methods can freely access:

   a. memory inside the object or block that they have been asked to
      look at;
   b. MPS-managed memory in pools that do not protect their contents;
   c. memory not managed by the MPS.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:16         ` Pip Cet
@ 2024-06-30 10:34           ` Eli Zaretskii
  2024-06-30 13:06             ` Pip Cet
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 10:34 UTC (permalink / raw)
  To: Pip Cet; +Cc: gerd.moellmann, yantar92, emacs-devel, eller.helmut

> Date: Sun, 30 Jun 2024 10:16:42 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org, eller.helmut@gmail.com
> 
> > > We could always "steal" the SIGSEGV handler and reinstall it with a sigmask (without modifying the MPS source). My guess is that's equivalent to what the macOS code does, essentially, by using a different class of signal that blocks POSIX signals...
> > 
> > 
> > And this is less "ugly" (let alone more safe) than using sigblock?
> > What am I missing?
> 
> Mostly, I think, the race condition where the barrier is installed (by MPS) some time before dflt_scan is even called (and, symmetrically, we'd unblock it too early while the barrier is still in effect). IOW, sigblock wouldn't work.

Then we should block signals earlier, e.g., in igc_alloc (and any
other place which could cause our objects moved).

> (I should point out that you've convinced me as far as SIGPROF goes. At least for the initial stage, it makes more sense to count hits in GC (which we do, with false positives, using the current scratch/igc code) than to delay signal processing until GC is over).

SIGPROF is not the only one.  We have other useful signals, and we
need to think about a solution that will cover most if not all of
them.  I've listed the signals that came to mind a few messages ago.

I still cannot believe MPS doesn't have a canonical solution for these
cases.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:24         ` Helmut Eller
@ 2024-06-30 10:43           ` Eli Zaretskii
  2024-06-30 18:42             ` Helmut Eller
  2024-06-30 11:07           ` Gerd Möllmann
  1 sibling, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 10:43 UTC (permalink / raw)
  To: Helmut Eller; +Cc: gerd.moellmann, pipcet, yantar92, emacs-devel

> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: gerd.moellmann@gmail.com,  pipcet@protonmail.com,  yantar92@posteo.net,
>   emacs-devel@gnu.org
> Date: Sun, 30 Jun 2024 12:24:58 +0200
> 
> On Sun, Jun 30 2024, Eli Zaretskii wrote:
> 
> > But in this case we hit the barrier just by accessing the process
> > object.  How can you do anything useful in a SIGCHLD handler if you
> > cannot even find out which process exited and how did it exit?
> 
> You don't do anything useful other than packing up the arguments that
> the signal handler receives and put them in the queue.

What arguments are those?  SIGCHLD doesn't tell us the PID of the
process (or any other data that could be used to identify the
process), AFAICT.  What if two or more sub-processes exited while we
are in MPS-land?

> > Surely, something like that is needed in order to put any work on some
> > queue?  And I'm not even talking about the effects of deferring this
> > job "for later" on the speed of working with sub-processes in Emacs.
> 
> The queue is not in MPS-managed memory; so we can access it.

I'm not talking about the queue, I'm talking about accessing the data
which needs to be put on the queue.  The deferred job must be
accompanied with enough meta-data to identify what happened, otherwise
the deferred handler will not know how to do the job.

> > IMO, we must find a way of touching MPS-managed memory safely from a
> > signal handler, otherwise writing signal handlers will become an
> > impossible job.
> 
> The MPS documentation says quite clearly that a scanner must not access
> MPS-managed memory other than the block it receives as argument:

What do you mean by "scanner"?  I was talking about our signal
handlers.  If they qualify as "scanner", please explain why.

Anyway, if a signal handler cannot do anything useful, the only
alternative is to block select signals around code which could violate
those restrictions.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  9:36     ` Helmut Eller
  2024-06-30 10:00       ` Eli Zaretskii
@ 2024-06-30 11:05       ` Gerd Möllmann
  1 sibling, 0 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30 11:05 UTC (permalink / raw)
  To: Helmut Eller; +Cc: Pip Cet, Ihor Radchenko, emacs-devel, Eli Zaretskii

Helmut Eller <eller.helmut@gmail.com> writes:

> On Sat, Jun 29 2024, Gerd Möllmann wrote:
>
>> Someone has an idea what to do with that? And maybe how to reproduce?
>
> I can reproduce it with
>
> /src/emacs -Q -batch -eval '(dotimes (_ 100)
>   (start-process "foo" nil "sleep" "0.2")
>   (igc--collect))'
>
> I would move most of the work out of signal handlers.  The signal
> handler could put the work in a "async work queue" and then return.  The
> actual work is then processed by maybe_quit.
>
> Sending the work to a helper thread would be nicer, but probably not a
> realistic option with Emacs.

+1 in every aspect. For once, let Emacs do what all others do :-).



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:00       ` Eli Zaretskii
  2024-06-30 10:24         ` Helmut Eller
@ 2024-06-30 11:06         ` Gerd Möllmann
  1 sibling, 0 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30 11:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Helmut Eller, pipcet, yantar92, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Helmut Eller <eller.helmut@gmail.com>
>> Cc: Pip Cet <pipcet@protonmail.com>,  Ihor Radchenko <yantar92@posteo.net>,
>>  emacs-devel@gnu.org,  Eli Zaretskii <eliz@gnu.org>
>> Date: Sun, 30 Jun 2024 11:36:34 +0200
>> 
>> On Sat, Jun 29 2024, Gerd Möllmann wrote:
>> 
>> > Someone has an idea what to do with that? And maybe how to reproduce?
>> 
>> I can reproduce it with
>> 
>> /src/emacs -Q -batch -eval '(dotimes (_ 100)
>>   (start-process "foo" nil "sleep" "0.2")
>>   (igc--collect))'
>> 
>> I would move most of the work out of signal handlers.  The signal
>> handler could put the work in a "async work queue" and then return.  The
>> actual work is then processed by maybe_quit.
>
> But in this case we hit the barrier just by accessing the process
> object.  How can you do anything useful in a SIGCHLD handler if you
> cannot even find out which process exited and how did it exit?
> Surely, something like that is needed in order to put any work on some
> queue?  And I'm not even talking about the effects of deferring this
> job "for later" on the speed of working with sub-processes in Emacs.
>
> IMO, we must find a way of touching MPS-managed memory safely from a
> signal handler, otherwise writing signal handlers will become an
> impossible job.

FWIW, I disagree but I guess that was already clear.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:24         ` Helmut Eller
  2024-06-30 10:43           ` Eli Zaretskii
@ 2024-06-30 11:07           ` Gerd Möllmann
  1 sibling, 0 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30 11:07 UTC (permalink / raw)
  To: Helmut Eller; +Cc: Eli Zaretskii, pipcet, yantar92, emacs-devel

Helmut Eller <eller.helmut@gmail.com> writes:

> On Sun, Jun 30 2024, Eli Zaretskii wrote:
>
>>> I would move most of the work out of signal handlers.  The signal
>>> handler could put the work in a "async work queue" and then return.  The
>>> actual work is then processed by maybe_quit.
>>
>> But in this case we hit the barrier just by accessing the process
>> object.  How can you do anything useful in a SIGCHLD handler if you
>> cannot even find out which process exited and how did it exit?
>
> You don't do anything useful other than packing up the arguments that
> the signal handler receives and put them in the queue.
>
>> Surely, something like that is needed in order to put any work on some
>> queue?  And I'm not even talking about the effects of deferring this
>> job "for later" on the speed of working with sub-processes in Emacs.
>
> The queue is not in MPS-managed memory; so we can access it.
>
>> IMO, we must find a way of touching MPS-managed memory safely from a
>> signal handler, otherwise writing signal handlers will become an
>> impossible job.
>
> The MPS documentation says quite clearly that a scanner must not access
> MPS-managed memory other than the block it receives as argument:
>
>   https://memory-pool-system.readthedocs.io/en/latest/topic/format.html#cautions
>
> 5. Format methods must not:
>
>    a. call library code;
>    b. access MPS-managed memory in pools that protect their contents;
>    c. perform a non-local exit (for example, by throwing an exception, or
>       calling longjmp());
>    d. call any functions or macros in the MPS other than
>       MPS_SCAN_BEGIN, MPS_SCAN_END, MPS_FIX1(), MPS_FIX12(), MPS_FIX2(), and
>       MPS_FIX_CALL.
>
>    It’s permissible to call other functions in the client program, but
>    see MPS_FIX_CALL for a restriction on passing the scan state.
>
> 6. Subject to the above constraints, format methods can freely access:
>
>    a. memory inside the object or block that they have been asked to
>       look at;
>    b. MPS-managed memory in pools that do not protect their contents;
>    c. memory not managed by the MPS.

Amen! That's how I see it too.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30  9:59     ` Pip Cet
  2024-06-30 10:09       ` Eli Zaretskii
@ 2024-06-30 11:10       ` Gerd Möllmann
  1 sibling, 0 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30 11:10 UTC (permalink / raw)
  To: Pip Cet; +Cc: Ihor Radchenko, emacs-devel, Eli Zaretskii, eller.helmut

Pip Cet <pipcet@protonmail.com> writes:

> My guess is that's equivalent to what the macOS code does,
> essentially, by using a different class of signal that blocks POSIX
> signals...

I've read somewhere that in macOS signals are implemented on Mach
exceptions, but I don't know if that's correct. The truth is somewhere
in the sources I guess

  https://github.com/apple-oss-distributions/xnu



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:05                 ` Eli Zaretskii
@ 2024-06-30 11:20                   ` Gerd Möllmann
  2024-06-30 12:16                     ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30 11:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

Eli Zaretskii <eliz@gnu.org> writes:

>> Not a crime, but we are using a lib that we know uses barriers and
>> functions of which get invoked when we hit such a barrier. We can't just
>> ignore that.
>
> What alternatives do we have?  Emacs code deals with Lisp objects and
> memory managed by MPS all over the place.  In many case, the
> programmer doesn't necessarily know whether some data is or isn't in
> memory managed by MPS.

We know what we do in signal handlers, don't we? We'd not call the
existing GC for example, and we have to be awware that GC mark bits ay
be set. It's not that this is a problem all over the place in Emacs,
outside of signal handlers.

> You are asking to do the impossible, IMO.  Refraining from touching
> MPS-managed memory from signal handlers is simply not practical in
> Emacs.  E.g., it would make the profiler useless, because the minimum
> you must do from the signal handler is to see which Lisp function is
> being run; if you defer this for later, you will see a completely
> skewed profile.

The profiler is a special case that is true, but we could get away with
doing something special here: detecting, with igc_busy, that we
shouldn't do anything with Lisp atm, and instead do whatever goes.

If you insist on making MPS usuable in signal handlers, please go ahead
of course.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 11:20                   ` Gerd Möllmann
@ 2024-06-30 12:16                     ` Eli Zaretskii
  2024-06-30 12:43                       ` Gerd Möllmann
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 12:16 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

> From: Gerd Möllmann <gerd.moellmann@gmail.com>
> Cc: pipcet@protonmail.com,  yantar92@posteo.net,  emacs-devel@gnu.org,
>   eller.helmut@gmail.com
> Date: Sun, 30 Jun 2024 13:20:30 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > What alternatives do we have?  Emacs code deals with Lisp objects and
> > memory managed by MPS all over the place.  In many case, the
> > programmer doesn't necessarily know whether some data is or isn't in
> > memory managed by MPS.
> 
> We know what we do in signal handlers, don't we? We'd not call the
> existing GC for example, and we have to be awware that GC mark bits ay
> be set. It's not that this is a problem all over the place in Emacs,
> outside of signal handlers.

Reality check: please re-read the code of our signal handlers to see
what they really do.

> > You are asking to do the impossible, IMO.  Refraining from touching
> > MPS-managed memory from signal handlers is simply not practical in
> > Emacs.  E.g., it would make the profiler useless, because the minimum
> > you must do from the signal handler is to see which Lisp function is
> > being run; if you defer this for later, you will see a completely
> > skewed profile.
> 
> The profiler is a special case that is true, but we could get away with
> doing something special here: detecting, with igc_busy, that we
> shouldn't do anything with Lisp atm, and instead do whatever goes.

The profiler is not a special case in any way.  The other signals I
mentioned represent similar issues.  And what about SIGUSR1/SIGUSR2?
And SIGALRM, which is used for atimers (that are the basis for
hourglass cursor)?  Heck, in TTY mode C-g triggers SIGINT, and we
process the key from the signal handler!

> If you insist on making MPS usuable in signal handlers, please go ahead
> of course.

I can't, I have too much stuff on my plate.  And implying that any
disagreement must mean go code it by yourself is at least unfair.

If you are unwilling to try what I suggest, I guess the future will
tell who is right and who isn't.  My fear is that going your way will
mean we'd need to disable many useful Emacs features or make them much
less useful.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 12:16                     ` Eli Zaretskii
@ 2024-06-30 12:43                       ` Gerd Möllmann
  0 siblings, 0 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30 12:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pipcet, yantar92, emacs-devel, eller.helmut

Eli Zaretskii <eliz@gnu.org> writes:

> If you are unwilling to try what I suggest, I guess the future will
> tell who is right and who isn't.  My fear is that going your way will
> mean we'd need to disable many useful Emacs features or make them much
> less useful.

You could also say I'm unable to work ono this because I don't have that
problem on macOS to begin with. And I'm am unwilling to set up such a
system so that I can do more work. I'd actually rather let others take
over scratch/igc. And finally I find the idea of making MPS usable in
signal handlers wrong, I said that already.

So, sorry, this is outside of my field of interest.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:34           ` Eli Zaretskii
@ 2024-06-30 13:06             ` Pip Cet
  0 siblings, 0 replies; 68+ messages in thread
From: Pip Cet @ 2024-06-30 13:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gerd.moellmann, yantar92, emacs-devel, eller.helmut

On Sunday, June 30th, 2024 at 10:34, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Sun, 30 Jun 2024 10:16:42 +0000
> > From: Pip Cet pipcet@protonmail.com
> > Cc: gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org, eller.helmut@gmail.com
> >
> > > > We could always "steal" the SIGSEGV handler and reinstall it with a sigmask (without modifying the MPS source). My guess is that's equivalent to what the macOS code does, essentially, by using a different class of signal that blocks POSIX signals...
> > >
> > > And this is less "ugly" (let alone more safe) than using sigblock?
> > > What am I missing?

And, yes, as far as I can tell my proposed solution is safer.

> > Mostly, I think, the race condition where the barrier is installed (by MPS) some time before dflt_scan is even called (and, symmetrically, we'd unblock it too early while the barrier is still in effect). IOW, sigblock wouldn't work.
>
>
> Then we should block signals earlier, e.g., in igc_alloc (and any
> other place which could cause our objects moved).

That would be a very expensive and error-prone way of achieving the same result as simply blocking the signals while handling SIGSEGV.

> > (I should point out that you've convinced me as far as SIGPROF goes. At least for the initial stage, it makes more sense to count hits in GC (which we do, with false positives, using the current scratch/igc code) than to delay signal processing until GC is over).
>
>
> SIGPROF is not the only one.

It's the only one where I am not convinced the right thing is to simply block it using sa_sigmask.

> We have other useful signals, and we
> need to think about a solution that will cover most if not all of
> them. I've listed the signals that came to mind a few messages ago.

Those would be handled just fine by blocking all but the known-to-be-safe SIGPROF while handling SIGSEGV.

> I still cannot believe MPS doesn't have a canonical solution for these
> cases.

I haven't been able to find one in the docs. It still seems a simple oversight in the MPS code to me that they don't simply block all signals while handling SIGSEGV, or provide an API to set the mask.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 10:43           ` Eli Zaretskii
@ 2024-06-30 18:42             ` Helmut Eller
  2024-06-30 18:59               ` Gerd Möllmann
                                 ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Helmut Eller @ 2024-06-30 18:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gerd.moellmann, pipcet, yantar92, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 927 bytes --]

On Sun, Jun 30 2024, Eli Zaretskii wrote:
>> You don't do anything useful other than packing up the arguments that
>> the signal handler receives and put them in the queue.
>
> What arguments are those?  SIGCHLD doesn't tell us the PID of the
> process (or any other data that could be used to identify the
> process), AFAICT.  What if two or more sub-processes exited while we
> are in MPS-land?

This patch below implements the idea I was thinking about.

>> The MPS documentation says quite clearly that a scanner must not access
>> MPS-managed memory other than the block it receives as argument:
>
> What do you mean by "scanner"?  I was talking about our signal
> handlers.  If they qualify as "scanner", please explain why.

I mean dflt_scan.

> Anyway, if a signal handler cannot do anything useful, the only
> alternative is to block select signals around code which could violate
> those restrictions.

We will see.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Fix-igc_check_fwd.patch --]
[-- Type: text/x-diff, Size: 694 bytes --]

From f4080ec23a0630bd3c5c37c853cd2528faec92c4 Mon Sep 17 00:00:00 2001
From: Helmut Eller <eller.helmut@gmail.com>
Date: Sun, 30 Jun 2024 20:15:51 +0200
Subject: [PATCH 1/2] Fix igc_check_fwd

* src/igc.c (igc_check_fwd): Add missing argument to has_header.
---
 src/igc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/igc.c b/src/igc.c
index 324bcfa112f..2165a69cacf 100644
--- a/src/igc.c
+++ b/src/igc.c
@@ -598,7 +598,7 @@ alloc_hash (void)
 void
 igc_check_fwd (void *client)
 {
-  if (has_header (client))
+  if (has_header (client, true))
     {
       struct igc_header *h = client_to_base (client);
       igc_assert (h->obj_type != IGC_OBJ_FWD);
-- 
2.39.2


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-Process-SIGCHLD-asynchronously.patch --]
[-- Type: text/x-diff, Size: 7535 bytes --]

From 7b8715470bdec2dd0808c530c19ca4b749c6178b Mon Sep 17 00:00:00 2001
From: Helmut Eller <eller.helmut@gmail.com>
Date: Sun, 30 Jun 2024 20:17:05 +0200
Subject: [PATCH 2/2] Process SIGCHLD asynchronously

Introduce a async_work_queue that is processed as part of
process_pending_signals.

* src/keyboard.h (struct async_work_item)
(enum async_work_type): New types.
(enqueue_async_work): New function.
* src/keyboard.c (struct async_work_queue): New type.
(async_work_queue): New global variable.
(async_work_queue_len, async_work_queue_pop_front)
(do_async_work, do_async_work_1): New helpers.
(enqueue_async_work): Implementation.
(process_pending_signals): Call do_async_work;
* src/process.h (process_sigchld_async): New function.
* src/process.c (process_sigchld_async): Factored out from
handle_child_signal.
(handle_child_signal): Use enqueue_async_work.
---
 src/keyboard.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 src/keyboard.h | 16 +++++++++++++
 src/process.c  | 62 +++++++++++++++++++++++++++----------------------
 src/process.h  |  2 ++
 4 files changed, 116 insertions(+), 27 deletions(-)

diff --git a/src/keyboard.c b/src/keyboard.c
index 06599ed0fdf..71286df8915 100644
--- a/src/keyboard.c
+++ b/src/keyboard.c
@@ -374,6 +374,15 @@ #define GROW_RAW_KEYBUF							\
    menu bar.  */
 static Lisp_Object menu_bar_touch_id;
 
+#define ASYNC_WORK_QUEUE_CAPACITY 64
+
+struct async_work_queue {
+  uint8_t start, end;
+  struct async_work_item items[ASYNC_WORK_QUEUE_CAPACITY];
+};
+
+static struct async_work_queue async_work_queue;
+
 \f
 /* Global variable declarations.  */
 
@@ -415,6 +424,7 @@ #define READABLE_EVENTS_IGNORE_SQUEEZABLES	(1 << 2)
 static char *find_user_signal_name (int);
 static void store_user_signal_events (void);
 static bool is_ignored_event (union buffered_input_event *);
+static void do_async_work (void);
 
 /* Advance or retreat a buffered input event pointer.  */
 
@@ -8176,6 +8186,7 @@ process_pending_signals (void)
   pending_signals = false;
   handle_async_input ();
   do_pending_atimers ();
+  do_async_work ();
 }
 
 /* Undo any number of BLOCK_INPUT calls down to level LEVEL,
@@ -12845,6 +12856,58 @@ is_ignored_event (union buffered_input_event *event)
   return !NILP (Fmemq (ignore_event, Vwhile_no_input_ignore_events));
 }
 
+static uint8_t
+async_work_queue_len (struct async_work_queue *q)
+{
+  uint8_t cap = ASYNC_WORK_QUEUE_CAPACITY;
+  return (q->start <= q->end
+	  ? q->end - q->start
+	  : cap - q->start + q->end);
+}
+
+static struct async_work_item
+async_work_queue_pop_front (struct async_work_queue *q)
+{
+  if (async_work_queue_len (q) == 0)
+    emacs_abort ();
+  struct async_work_item item = q->items[q->start];
+  uint8_t cap = ASYNC_WORK_QUEUE_CAPACITY;
+  q->start = (q->start + 1) % cap;
+  return item;
+}
+
+void
+enqueue_async_work (struct async_work_item item)
+{
+  struct async_work_queue *q = &async_work_queue;
+  uint8_t cap = ASYNC_WORK_QUEUE_CAPACITY;
+  if (async_work_queue_len (q) == cap)
+    emacs_abort ();
+  q->items[q->end] = item;
+  q->end = (q->end + 1) % cap;
+  pending_signals = 1;
+}
+
+static void
+do_async_work_1 (struct async_work_item item)
+{
+  switch (item.type)
+    {
+    case ASYNCWORK_SIGCHLD:
+      process_sigchld_async (item.u.sigchld);
+      return;
+    }
+  emacs_abort ();
+}
+
+static void
+do_async_work (void)
+{
+  struct async_work_queue *q = &async_work_queue;
+  while (async_work_queue_len (q) > 0)
+    do_async_work_1 (async_work_queue_pop_front (q));
+}
+
 static void syms_of_keyboard_for_pdumper (void);
 
 void
diff --git a/src/keyboard.h b/src/keyboard.h
index c7ae1f7f0fa..8fb592dacff 100644
--- a/src/keyboard.h
+++ b/src/keyboard.h
@@ -527,4 +527,20 @@ kbd_buffer_store_event_hold (struct input_event *event,
 
 INLINE_HEADER_END
 
+enum async_work_type
+{
+  ASYNCWORK_SIGCHLD
+};
+
+struct async_work_item
+{
+  enum async_work_type type;
+  union
+  {
+    int sigchld;
+  } u;
+};
+
+extern void enqueue_async_work (struct async_work_item item);
+
 #endif /* EMACS_KEYBOARD_H */
diff --git a/src/process.c b/src/process.c
index a3b2d2e54bd..5939d16c37e 100644
--- a/src/process.c
+++ b/src/process.c
@@ -7584,33 +7584,8 @@ child_signal_notify (void)
 static void dummy_handler (int sig) {}
 static signal_handler_t volatile lib_child_handler;
 
-/* Handle a SIGCHLD signal by looking for known child processes of
-   Emacs whose status have changed.  For each one found, record its
-   new status.
-
-   All we do is change the status; we do not run sentinels or print
-   notifications.  That is saved for the next time keyboard input is
-   done, in order to avoid timing errors.
-
-   ** WARNING: this can be called during garbage collection.
-   Therefore, it must not be fooled by the presence of mark bits in
-   Lisp objects.
-
-   ** USG WARNING: Although it is not obvious from the documentation
-   in signal(2), on a USG system the SIGCLD handler MUST NOT call
-   signal() before executing at least one wait(), otherwise the
-   handler will be called again, resulting in an infinite loop.  The
-   relevant portion of the documentation reads "SIGCLD signals will be
-   queued and the signal-catching function will be continually
-   reentered until the queue is empty".  Invoking signal() causes the
-   kernel to reexamine the SIGCLD queue.  Fred Fish, UniSoft Systems
-   Inc.
-
-   ** Malloc WARNING: This should never call malloc either directly or
-   indirectly; if it does, that is a bug.  */
-
-static void
-handle_child_signal (int sig)
+void
+process_sigchld_async (int sig)
 {
   Lisp_Object tail, proc;
   bool changed = false;
@@ -7688,6 +7663,39 @@ handle_child_signal (int sig)
     /* Wake up `wait_reading_process_output'.  */
     child_signal_notify ();
 
+}
+
+/* Handle a SIGCHLD signal by looking for known child processes of
+   Emacs whose status have changed.  For each one found, record its
+   new status.
+
+   All we do is change the status; we do not run sentinels or print
+   notifications.  That is saved for the next time keyboard input is
+   done, in order to avoid timing errors.
+
+   ** WARNING: this can be called during garbage collection.
+   Therefore, it must not be fooled by the presence of mark bits in
+   Lisp objects.
+
+   ** USG WARNING: Although it is not obvious from the documentation
+   in signal(2), on a USG system the SIGCLD handler MUST NOT call
+   signal() before executing at least one wait(), otherwise the
+   handler will be called again, resulting in an infinite loop.  The
+   relevant portion of the documentation reads "SIGCLD signals will be
+   queued and the signal-catching function will be continually
+   reentered until the queue is empty".  Invoking signal() causes the
+   kernel to reexamine the SIGCLD queue.  Fred Fish, UniSoft Systems
+   Inc.
+
+   ** Malloc WARNING: This should never call malloc either directly or
+   indirectly; if it does, that is a bug.  */
+
+static void
+handle_child_signal (int sig)
+{
+  struct async_work_item item = {ASYNCWORK_SIGCHLD, .u.sigchld = sig};
+  enqueue_async_work (item);
+
   lib_child_handler (sig);
 #ifdef NS_IMPL_GNUSTEP
   /* NSTask in GNUstep sets its child handler each time it is called.
diff --git a/src/process.h b/src/process.h
index f497a64c3d1..1d59a658d95 100644
--- a/src/process.h
+++ b/src/process.h
@@ -308,4 +308,6 @@ pset_gnutls_cred_type (struct Lisp_Process *p, Lisp_Object val)
 
 INLINE_HEADER_END
 
+extern void process_sigchld_async (int sig);
+
 #endif /* EMACS_PROCESS_H */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 18:42             ` Helmut Eller
@ 2024-06-30 18:59               ` Gerd Möllmann
  2024-06-30 19:25               ` Pip Cet
  2024-06-30 19:58               ` Eli Zaretskii
  2 siblings, 0 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-06-30 18:59 UTC (permalink / raw)
  To: Helmut Eller; +Cc: Eli Zaretskii, pipcet, yantar92, emacs-devel

Helmut Eller <eller.helmut@gmail.com> writes:

> We will see.

Thanks and pushed.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 18:42             ` Helmut Eller
  2024-06-30 18:59               ` Gerd Möllmann
@ 2024-06-30 19:25               ` Pip Cet
  2024-06-30 19:49                 ` Ihor Radchenko
                                   ` (3 more replies)
  2024-06-30 19:58               ` Eli Zaretskii
  2 siblings, 4 replies; 68+ messages in thread
From: Pip Cet @ 2024-06-30 19:25 UTC (permalink / raw)
  To: Helmut Eller; +Cc: Eli Zaretskii, gerd.moellmann, yantar92, emacs-devel

On Sunday, June 30th, 2024 at 18:42, Helmut Eller <eller.helmut@gmail.com> wrote:
> On Sun, Jun 30 2024, Eli Zaretskii wrote:
> 
> > > You don't do anything useful other than packing up the arguments that
> > > the signal handler receives and put them in the queue.
> > 
> > What arguments are those? SIGCHLD doesn't tell us the PID of the
> > process (or any other data that could be used to identify the
> > process), AFAICT. What if two or more sub-processes exited while we
> > are in MPS-land?
> 
> This patch below implements the idea I was thinking about.

I don't think it's sufficiently careful about modifying C structures from signal handlers, though. The signal can occur between any two CPU instructions, and that opens up (rare, but possible) race conditions. IIUC the consensus is you can set an "int" or "bool" to true, or write to a self-pipe, but that's about as much as you should ever do in a signal handler...

As Eli points out, most of our signals don't have arguments, so how about simply having one flag per signal? Even a global signal flag would require us to block signals while we clear it.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 19:25               ` Pip Cet
@ 2024-06-30 19:49                 ` Ihor Radchenko
  2024-06-30 20:09                 ` Eli Zaretskii
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 68+ messages in thread
From: Ihor Radchenko @ 2024-06-30 19:49 UTC (permalink / raw)
  To: Pip Cet; +Cc: Helmut Eller, Eli Zaretskii, gerd.moellmann, emacs-devel

Pip Cet <pipcet@protonmail.com> writes:

>> This patch below implements the idea I was thinking about.
>
> I don't think it's sufficiently careful about modifying C structures from signal handlers, though. The signal can occur between any two CPU instructions, and that opens up (rare, but possible) race conditions. IIUC the consensus is you can set an "int" or "bool" to true, or write to a self-pipe, but that's about as much as you should ever do in a signal handler...

At least, I am no longer seeing crashes on my side with this patch.

Of course, even more reliable handling of signals will be better.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 18:42             ` Helmut Eller
  2024-06-30 18:59               ` Gerd Möllmann
  2024-06-30 19:25               ` Pip Cet
@ 2024-06-30 19:58               ` Eli Zaretskii
  2024-06-30 21:08                 ` Ihor Radchenko
  2 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 19:58 UTC (permalink / raw)
  To: Helmut Eller; +Cc: gerd.moellmann, pipcet, yantar92, emacs-devel

> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: gerd.moellmann@gmail.com,  pipcet@protonmail.com,  yantar92@posteo.net,
>   emacs-devel@gnu.org
> Date: Sun, 30 Jun 2024 20:42:03 +0200
> 
> +#define ASYNC_WORK_QUEUE_CAPACITY 64
> +
> +struct async_work_queue {
> +  uint8_t start, end;
> +  struct async_work_item items[ASYNC_WORK_QUEUE_CAPACITY];
> +};

What happens if more than 64 sub-processes exit while we are under
block_input?  64 is not such a large number, and we sometimes
block_input for prolonged times.

In any case, the 64 thing (or any other fixed value) is against the
GNU Coding Standards that frown on arbitrary limits.

Frankly, if this is what we must do, then this branch is not very
interesting.  Too bad, I thought this was a promising development in
Emacs.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 19:25               ` Pip Cet
  2024-06-30 19:49                 ` Ihor Radchenko
@ 2024-06-30 20:09                 ` Eli Zaretskii
  2024-06-30 20:32                   ` Pip Cet
  2024-07-01  2:33                 ` Eli Zaretskii
  2024-07-01  6:05                 ` Helmut Eller
  3 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-06-30 20:09 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Sun, 30 Jun 2024 19:25:36 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: Eli Zaretskii <eliz@gnu.org>, gerd.moellmann@gmail.com, yantar92@posteo.net,
>  emacs-devel@gnu.org
> 
> IIUC the consensus is you can set an "int" or "bool" to true, or write to a self-pipe, but that's about as much as you should ever do in a signal handler...

That's not true, AFAIU: you can safely call any of the dozens of
functions listed by the signal-safety(7) man page.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 20:09                 ` Eli Zaretskii
@ 2024-06-30 20:32                   ` Pip Cet
  2024-07-01 11:07                     ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-06-30 20:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

On Sunday, June 30th, 2024 at 20:09, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Sun, 30 Jun 2024 19:25:36 +0000
> 
> > From: Pip Cet pipcet@protonmail.com
> > Cc: Eli Zaretskii eliz@gnu.org, gerd.moellmann@gmail.com, yantar92@posteo.net,
> > emacs-devel@gnu.org
> > 
> > IIUC the consensus is you can set an "int" or "bool" to true, or write to a self-pipe, but that's about as much as you should ever do in a signal handler...
> 
> That's not true, AFAIU: you can safely call any of the dozens of
> functions listed by the signal-safety(7) man page.

I think the implication is reversed there: functions that aren't on that man page definitely aren't safe to call, but that doesn't mean that any old C code modifying complicated structures using only the listed functions is safe, at all.

My point is that complicated C structures, such as the queue Helmut implemented, cannot usually be safely modified from signal handlers unless the non-signal code takes care to block signals while it is modifying the structure.

All that said, it's not like our existing code looks safe, either, so the patch is definitely an improvement!

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 19:58               ` Eli Zaretskii
@ 2024-06-30 21:08                 ` Ihor Radchenko
  2024-07-01  2:35                   ` Eli Zaretskii
  2024-07-01 11:13                   ` Eli Zaretskii
  0 siblings, 2 replies; 68+ messages in thread
From: Ihor Radchenko @ 2024-06-30 21:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Helmut Eller, gerd.moellmann, pipcet, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Frankly, if this is what we must do, then this branch is not very
> interesting.  Too bad, I thought this was a promising development in
> Emacs.

What about going back to the alternative proposal with setting up signal
masks:

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Pip Cet <pipcet@protonmail.com>
>> ...
>> The main difference between the two approaches, both of which require meddling with MPS internals, is that if SIGPROF is blocked, it will be delivered with a delay after the SIGSEGV handler is done running. If the SIGPROF handler runs and doesn't do "anything unsafe", on the other hand, it can record the unusable signal but won't get a new one. I don't see a clear advantage of either, to be honest. Blocking signals, of course, is more reliable than "just knowing" we won't do anything unsafe.
>
> I simply don't believe MPS is so poorly designed as to disallow
> signals while its own SIGSEGV handler runs.  SIGPROF is just one
> example; there are also SIGALRM, SIGUSR1, SIGUSR2, and maybe others.
> It makes little sense to me to expect MPS to block these signals.  I
> rather expect them to DTRT with those signals.

I am looking at
https://memory-pool-system.readthedocs.io/en/latest/topic/thread.html#thread-interface

And it says that

    In a multi-threaded environment where incremental garbage collection
    is used, threads must be registered with the MPS by calling
    mps_thread_reg() so that the MPS can suspend them as necessary in
    order to have exclusive access to their state.

If my understanding is correct, the situation we have is when MPS is
performing root scan while also receiving another signal. During the
root scan, all other Emacs threads are suspended, and no state changes
are taking place - everything is locked by MPS thread, which is the only
thread running.

Emacs signal handlers are, of course, installed for all the threads. And
since the only thread being running is MPS thread - it will be the one
handling the received signal. And the handler thus cannot run normally,
reaching out to the MPS-managed memory as usual.

So, we should somehow arrange the handler to wait until the MPS scan
finishes and the actual Emacs threads are resumed.
Such waiting should not lose any information as the actual Emacs state
is not going to change during the scan.

The simplest way to do it is setting up a mask for MPS
thread. (Although, we may or may not want to do it for profiler signals
specifically, depending on whether these signals will be queued by the OS
or merged into a single signal before unblocking - merging will make us
loose some samples)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 19:25               ` Pip Cet
  2024-06-30 19:49                 ` Ihor Radchenko
  2024-06-30 20:09                 ` Eli Zaretskii
@ 2024-07-01  2:33                 ` Eli Zaretskii
  2024-07-01  6:05                 ` Helmut Eller
  3 siblings, 0 replies; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01  2:33 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Sun, 30 Jun 2024 19:25:36 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: Eli Zaretskii <eliz@gnu.org>, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> 
> As Eli points out, most of our signals don't have arguments, so how about simply having one flag per signal? Even a global signal flag would require us to block signals while we clear it.

How is this different from blocking the signals using the signal mask?
What do you think the system does when you block a signal?  It does
the same.  The differences are just two:

  . blocking signals uses system calls, which are well-documented,
    their semantics is well-understood, and their implementation is
    reliable.  By contrast, doing that in our own home-grown code will
    definitely have windows of race conditions, and will generally be
    less reliable, as any tricky signal-related code in userland;
  . instead of blocking signals only around MPS calls, we will be
    blocking signals for much longer times, thus making the
    above-mentioned windows larger and disabling more useful Emacs
    features

And that is even before we consider the fact that we have no plan for
the profiler, the SIGINT on TTY frames, and SIGUSR1/2, which are all
very useful features and their lack will be noticed.

And the only reason for not blocking signals as I suggested was
"yuck", with no further explanation.  I frankly cannot see any good
sense in not liking that, but then going ahead and reimplementing the
same idea in our own code.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 21:08                 ` Ihor Radchenko
@ 2024-07-01  2:35                   ` Eli Zaretskii
  2024-07-01 11:13                   ` Eli Zaretskii
  1 sibling, 0 replies; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01  2:35 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: Helmut Eller <eller.helmut@gmail.com>, gerd.moellmann@gmail.com,
>  pipcet@protonmail.com, emacs-devel@gnu.org
> Date: Sun, 30 Jun 2024 21:08:48 +0000
> 
> If my understanding is correct, the situation we have is when MPS is
> performing root scan while also receiving another signal. During the
> root scan, all other Emacs threads are suspended, and no state changes
> are taking place - everything is locked by MPS thread, which is the only
> thread running.

To be sure we are on the same page: the aborts we are discussing
happened when MPS code was running in the context of the Emacs Lisp
(main) thread, not when it was running in a separate MPS thread.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 19:25               ` Pip Cet
                                   ` (2 preceding siblings ...)
  2024-07-01  2:33                 ` Eli Zaretskii
@ 2024-07-01  6:05                 ` Helmut Eller
  3 siblings, 0 replies; 68+ messages in thread
From: Helmut Eller @ 2024-07-01  6:05 UTC (permalink / raw)
  To: Pip Cet; +Cc: Eli Zaretskii, gerd.moellmann, yantar92, emacs-devel

On Sun, Jun 30 2024, Pip Cet wrote:

> On Sunday, June 30th, 2024 at 18:42, Helmut Eller
>> This patch below implements the idea I was thinking about.
>
> I don't think it's sufficiently careful about modifying C structures
> from signal handlers, though. The signal can occur between any two CPU
> instructions, and that opens up (rare, but possible) race
> conditions. IIUC the consensus is you can set an "int" or "bool" to
> true, or write to a self-pipe, but that's about as much as you should
> ever do in a signal handler...

Hm, indeed that's a problem.  Though, probably not as long as
enqueue_async_work is used only for SIGCHLD.

> As Eli points out, most of our signals don't have arguments, so how
> about simply having one flag per signal? Even a global signal flag
> would require us to block signals while we clear it.

Maybe the safest would be to use a pipe to implement the queue.  That
would also work from different threads.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 20:32                   ` Pip Cet
@ 2024-07-01 11:07                     ` Eli Zaretskii
  2024-07-01 17:27                       ` Pip Cet
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 11:07 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Sun, 30 Jun 2024 20:32:54 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> 
> On Sunday, June 30th, 2024 at 20:09, Eli Zaretskii <eliz@gnu.org> wrote:
> > > Date: Sun, 30 Jun 2024 19:25:36 +0000
> > 
> > > From: Pip Cet pipcet@protonmail.com
> > > Cc: Eli Zaretskii eliz@gnu.org, gerd.moellmann@gmail.com, yantar92@posteo.net,
> > > emacs-devel@gnu.org
> > > 
> > > IIUC the consensus is you can set an "int" or "bool" to true, or write to a self-pipe, but that's about as much as you should ever do in a signal handler...
> > 
> > That's not true, AFAIU: you can safely call any of the dozens of
> > functions listed by the signal-safety(7) man page.
> 
> I think the implication is reversed there: functions that aren't on that man page definitely aren't safe to call, but that doesn't mean that any old C code modifying complicated structures using only the listed functions is safe, at all.

I didn't say "any code"; obviously, one can write unsafe code even
without calling any function!  This is a strawman if ever there was
one.

That man page is very clear:

       An async-signal-safe function is one that can be safely called
       from within a signal handler.  Many functions are not async-
       signal-safe.  In particular, nonreentrant functions are generally
       unsafe to call from a signal handler.
       [...]
       To avoid problems with unsafe functions, there are two possible
       choices:

       (a)  Ensure that (1) the signal handler calls only async-signal-
            safe functions, and (2) the signal handler itself is
            reentrant with respect to global variables in the main
            program.

       (b)  Block signal delivery in the main program when calling
            functions that are unsafe or operating on global data that
            is also accessed by the signal handler.

       Generally, the second choice is difficult in programs of any
       complexity, so the first choice is taken.
       [...]
       The set of functions required to be async-signal-safe by POSIX.1
       is shown in the following table.  The functions not otherwise
       noted were required to be async-signal-safe in POSIX.1-2001; the
       table details changes in the subsequent standards.

       Function               Notes
       abort(3)               Added in POSIX.1-2001 TC1
       accept(2)
       access(2)
       [...]

> My point is that complicated C structures, such as the queue Helmut implemented, cannot usually be safely modified from signal handlers unless the non-signal code takes care to block signals while it is modifying the structure.

It goes without saying that updating a static data structure makes the
signal handler non-reentrant, and it should at least block SIGCHLD.

> All that said, it's not like our existing code looks safe, either, so the patch is definitely an improvement!

I don't think I agree with the "improvement" part.  The challenge is
to prevent crashes without giving up useful features, not by means of
giving them up.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-06-30 21:08                 ` Ihor Radchenko
  2024-07-01  2:35                   ` Eli Zaretskii
@ 2024-07-01 11:13                   ` Eli Zaretskii
  2024-07-01 11:47                     ` Ihor Radchenko
  1 sibling, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 11:13 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: Helmut Eller <eller.helmut@gmail.com>, gerd.moellmann@gmail.com,
>  pipcet@protonmail.com, emacs-devel@gnu.org
> Date: Sun, 30 Jun 2024 21:08:48 +0000
> 
> If my understanding is correct, the situation we have is when MPS is
> performing root scan while also receiving another signal. During the
> root scan, all other Emacs threads are suspended, and no state changes
> are taking place - everything is locked by MPS thread, which is the only
> thread running.
> 
> Emacs signal handlers are, of course, installed for all the threads. And
> since the only thread being running is MPS thread - it will be the one
> handling the received signal. And the handler thus cannot run normally,
> reaching out to the MPS-managed memory as usual.
> 
> So, we should somehow arrange the handler to wait until the MPS scan
> finishes and the actual Emacs threads are resumed.
> Such waiting should not lose any information as the actual Emacs state
> is not going to change during the scan.
> 
> The simplest way to do it is setting up a mask for MPS
> thread. (Although, we may or may not want to do it for profiler signals
> specifically, depending on whether these signals will be queued by the OS
> or merged into a single signal before unblocking - merging will make us
> loose some samples)

As I wrote earlier, these last crashes happened when the only thread
running was the Emacs main thread, and MPS code was called from that
thread, as side effect of allocating memory (AFAIU).  So the situation
with multiple threads is not what we see here.

Nevertheless, please note that Emacs has machinery to deliver signals
on the main thread, see deliver_process_signal (and be sure to read
the commentary of that function).  So if this is needed, we seem to
already have the code in place to deal with it.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 11:13                   ` Eli Zaretskii
@ 2024-07-01 11:47                     ` Ihor Radchenko
  2024-07-01 12:33                       ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-01 11:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> As I wrote earlier, these last crashes happened when the only thread
> running was the Emacs main thread, and MPS code was called from that
> thread, as side effect of allocating memory (AFAIU).  So the situation
> with multiple threads is not what we see here.

May you point me to what in the backtrace is memory allocation?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 11:47                     ` Ihor Radchenko
@ 2024-07-01 12:33                       ` Eli Zaretskii
  2024-07-01 17:17                         ` Ihor Radchenko
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 12:33 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, pipcet@protonmail.com,
>  emacs-devel@gnu.org
> Date: Mon, 01 Jul 2024 11:47:01 +0000
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > As I wrote earlier, these last crashes happened when the only thread
> > running was the Emacs main thread, and MPS code was called from that
> > thread, as side effect of allocating memory (AFAIU).  So the situation
> > with multiple threads is not what we see here.
> 
> May you point me to what in the backtrace is memory allocation?

Here's the top of your original backtrace:

> Thread 1 "emacs" hit Breakpoint 1, terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:443
> 443	{
> (gdb) bt
> #0  terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:443
> #1  0x00005555558634be in set_state (state=IGC_STATE_DEAD) at igc.c:179
> #2  igc_assert_fail (file=<optimized out>, line=<optimized out>, msg=<optimized out>) at igc.c:205
> #3  0x00005555558f1e19 in mps_lib_assert_fail (condition=0x555555943c4c "res == 0", line=126, file=0x555555943c36 "lockix.c")
>     at /home/yantar92/Dist/mps/code/mpsliban.c:87
> #4  LockClaim (lock=0x7fffe8000110) at /home/yantar92/Dist/mps/code/lockix.c:126
> #5  0x00005555558f204d in ArenaEnterLock (arena=0x7ffff7fbf000, recursive=0) at /home/yantar92/Dist/mps/code/global.c:576
> #6  0x000055555591aefe in ArenaEnter (arena=0x7ffff7fbf000) at /home/yantar92/Dist/mps/code/global.c:553
> #7  ArenaAccess (addr=0x7fffeb908758, mode=mode@entry=3, context=context@entry=0x7fffffff97d0) at /home/yantar92/Dist/mps/code/global.c:655
> #8  0x0000555555926202 in sigHandle (sig=<optimized out>, info=0x7fffffff9af0, uap=0x7fffffff99c0) at /home/yantar92/Dist/mps/code/protsgix.c:97
> #9  0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
> #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
> #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
> #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
> #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
> #14 0x000055555573b771 in deliver_process_signal (sig=17, handler=handler@entry=0x555555827200 <handle_child_signal>) at sysdep.c:1758
> #15 0x0000555555820647 in deliver_child_signal (sig=<optimized out>) at process.c:7702
> #16 0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
> #17 0x000055555585f77b in fix_lisp_obj (ss=ss@entry=0x7fffffffa9a8, pobj=pobj@entry=0x7fffeee7ffe8) at igc.c:841
> #18 0x000055555586050d in fix_cons (ss=0x7fffffffa9a8, cons=0x7fffeee7ffe0) at igc.c:1474
> #19 dflt_scan_obj (ss=0x7fffffffa9a8, base_start=0x7fffeee7ffd8, base_limit=0x7fffeee80000, closure=0x0) at igc.c:1578
> #20 dflt_scanx (ss=ss@entry=0x7fffffffa9a8, base_start=<optimized out>, base_limit=0x7fffeee80000, closure=closure@entry=0x0) at igc.c:1658
> #21 0x00005555558613a3 in dflt_scan (ss=0x7fffffffa9a8, base_start=<optimized out>, base_limit=<optimized out>) at igc.c:1669
> #22 0x00005555558f163f in TraceScanFormat (limit=0x7fffeee80000, base=0x7fffeee7e000, ss=0x7fffffffa9a0) at /home/yantar92/Dist/mps/code/trace.c:1539
> #23 amcSegScan (totalReturn=0x7fffffffa99c, seg=0x7fffe845e4c8, ss=0x7fffffffa9a0) at /home/yantar92/Dist/mps/code/poolamc.c:1440
> #24 0x000055555591e7bc in traceScanSegRes (ts=ts@entry=1, rank=rank@entry=1, arena=arena@entry=0x7ffff7fbf000, seg=seg@entry=0x7fffe845e4c8)
>     at /home/yantar92/Dist/mps/code/trace.c:1205
> #25 0x000055555591e9ca in traceScanSeg (ts=1, rank=1, arena=0x7ffff7fbf000, seg=0x7fffe845e4c8) at /home/yantar92/Dist/mps/code/trace.c:1267
> #26 0x000055555591f3a4 in TraceAdvance (trace=trace@entry=0x7ffff7fbfaa8) at /home/yantar92/Dist/mps/code/trace.c:1728
> #27 0x000055555591faa4 in TracePoll
>     (workReturn=workReturn@entry=0x7fffffffab90, collectWorldReturn=collectWorldReturn@entry=0x7fffffffab8c, globals=globals@entry=0x7ffff7fbf008, collectWorldAllowed=<optimized out>) at /home/yantar92/Dist/mps/code/trace.c:1849
> #28 0x000055555591fceb in ArenaPoll (globals=globals@entry=0x7ffff7fbf008) at /home/yantar92/Dist/mps/code/global.c:745
> #29 0x00005555559200da in mps_ap_fill (p_o=p_o@entry=0x7fffffffad00, mps_ap=mps_ap@entry=0x7fffe80017f0, size=size@entry=24)
>     at /home/yantar92/Dist/mps/code/mpsi.c:1097
> #30 0x00005555558601ee in alloc_impl (size=24, type=IGC_OBJ_CONS, ap=0x7fffe80017f0) at igc.c:3330
> #31 0x000055555586023c in alloc (size=size@entry=16, type=type@entry=IGC_OBJ_CONS) at igc.c:3358
> #32 0x000055555586187a in igc_make_cons (car=XIL(0x133e0), cdr=XIL(0)) at igc.c:3385
> #33 0x000055555578e7de in Fcons (car=<optimized out>, cdr=<optimized out>) at alloc.c:2926

As you can see, it all started when we called Fcons, to produce a cons
cell.  That called igc_make_cons, which eventually called MPS
memory-allocation stuff (starting from frame #29).  MPS called us back
(frame #21), and while in fix_lisp_obj, we got delivered SIGPROF, and
attempted to access some memory that was evidently behind the barrier.
The rest is history, because MPS kicked in and tried to take the arena
lock the second time.

There's no other thread anywhere in sight here, this all happens in
the context of our own main (a.k.a. "Lisp") thread.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 12:33                       ` Eli Zaretskii
@ 2024-07-01 17:17                         ` Ihor Radchenko
  2024-07-01 17:44                           ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-01 17:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> As you can see, it all started when we called Fcons, to produce a cons
> cell.  That called igc_make_cons, which eventually called MPS
> memory-allocation stuff (starting from frame #29).  MPS called us back
> (frame #21), and while in fix_lisp_obj, we got delivered SIGPROF, and
> attempted to access some memory that was evidently behind the barrier.
> The rest is history, because MPS kicked in and tried to take the arena
> lock the second time.
>
> There's no other thread anywhere in sight here, this all happens in
> the context of our own main (a.k.a. "Lisp") thread.

I see. So, it is the other way around than what I thought.

I looked a bit into mps sources and what I observe is
mps_ap_fill -> ArenaEnter -> ArenaEnterLock -> LockClaim.
This is active for the duration of our dflt_scan.
Nothing can access the MPS arena and when something tries to, we get
assertion failure.

It looks like we can test whether the Arena is locked via

/* mps_arena_busy -- is the arena part way through an operation? */

mps_bool_t mps_arena_busy(mps_arena_t arena)

Although it does not help with the underlying problem with signal
arriving while we are in the process of allocating a new object in the
pool.

I am wondering what happens in the same situation when using normal
malloc. (Or may it never happen when using normal malloc?)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 11:07                     ` Eli Zaretskii
@ 2024-07-01 17:27                       ` Pip Cet
  2024-07-01 17:42                         ` Ihor Radchenko
  2024-07-01 18:08                         ` Eli Zaretskii
  0 siblings, 2 replies; 68+ messages in thread
From: Pip Cet @ 2024-07-01 17:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

On Monday, July 1st, 2024 at 11:07, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Sun, 30 Jun 2024 20:32:54 +0000
> 
> > From: Pip Cet pipcet@protonmail.com
> > Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> > 
> > On Sunday, June 30th, 2024 at 20:09, Eli Zaretskii eliz@gnu.org wrote:
> > 
> > > > Date: Sun, 30 Jun 2024 19:25:36 +0000
> > > 
> > > > From: Pip Cet pipcet@protonmail.com
> > > > Cc: Eli Zaretskii eliz@gnu.org, gerd.moellmann@gmail.com, yantar92@posteo.net,
> > > > emacs-devel@gnu.org
> > > > 
> > > > IIUC the consensus is you can set an "int" or "bool" to true, or write to a self-pipe, but that's about as much as you should ever do in a signal handler...
> > > 
> > > That's not true, AFAIU: you can safely call any of the dozens of
> > > functions listed by the signal-safety(7) man page.
> > 
> > I think the implication is reversed there: functions that aren't on that man page definitely aren't safe to call, but that doesn't mean that any old C code modifying complicated structures using only the listed functions is safe, at all.
> 
> I didn't say "any code"; obviously, one can write unsafe code even
> without calling any function! This is a strawman if ever there was
> one.

I'm sorry, I must have misunderstood you. I thought you said I could safely call any of dozens of functions when what I talked about was what signal handlers can actually, effectively, do. There's no contradiction there.

> > My point is that complicated C structures, such as the queue Helmut implemented, cannot usually be safely modified from signal handlers unless the non-signal code takes care to block signals while it is modifying the structure.
> 
> It goes without saying that updating a static data structure makes the
> signal handler non-reentrant, and it should at least block SIGCHLD.

Blocking SIGCHLD in a SIGCHLD handler is redundant unless SA_NODEFER is in use. Are we talking about a different handler here?

> > All that said, it's not like our existing code looks safe, either, so the patch is definitely an improvement!
> 
> I don't think I agree with the "improvement" part. The challenge is
> to prevent crashes without giving up useful features, not by means of
> giving them up.

One last suggestion: how about blocking those signals for most of Emacs' lifetime, only unblocking them in maybe_quit or at similar points? That would allow us to keep the existing signal handlers and make them safe...

I still think this is a simple oversight on the part of MPS, FWIW. You shouldn't allow other signals when handling SIGSEGV, or at least give the client program an option to specify a signal mask.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 17:27                       ` Pip Cet
@ 2024-07-01 17:42                         ` Ihor Radchenko
  2024-07-01 18:08                         ` Eli Zaretskii
  1 sibling, 0 replies; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-01 17:42 UTC (permalink / raw)
  To: Pip Cet; +Cc: Eli Zaretskii, eller.helmut, gerd.moellmann, emacs-devel

Pip Cet <pipcet@protonmail.com> writes:

> I still think this is a simple oversight on the part of MPS, FWIW. You shouldn't allow other signals when handling SIGSEGV, or at least give the client program an option to specify a signal mask.

They talk a little about why they do not block other signals.

https://memory-pool-system.readthedocs.io/en/latest/design/protix.html#threads

    .threads.suspend: The SIGSEGV signal handler does not mask out any
    signals, so a thread may be suspended while the handler is active,
    as required by the design (see
    design.mps.pthreadext.req.suspend.protection). The signal handlers
    simply nest at top of stack.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 17:17                         ` Ihor Radchenko
@ 2024-07-01 17:44                           ` Eli Zaretskii
  2024-07-01 18:01                             ` Ihor Radchenko
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 17:44 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, pipcet@protonmail.com,
>  emacs-devel@gnu.org
> Date: Mon, 01 Jul 2024 17:17:06 +0000
> 
> I looked a bit into mps sources and what I observe is
> mps_ap_fill -> ArenaEnter -> ArenaEnterLock -> LockClaim.
> This is active for the duration of our dflt_scan.
> Nothing can access the MPS arena and when something tries to, we get
> assertion failure.

Yes.

> It looks like we can test whether the Arena is locked via
> 
> /* mps_arena_busy -- is the arena part way through an operation? */
> 
> mps_bool_t mps_arena_busy(mps_arena_t arena)
> 
> Although it does not help with the underlying problem with signal
> arriving while we are in the process of allocating a new object in the
> pool.

Blocking signals whose handlers access the data of the Lisp machine
will prevent that.

> I am wondering what happens in the same situation when using normal
> malloc. (Or may it never happen when using normal malloc?)

You mean, when the current GC runs?  Signal handlers are free to
access Lisp data even if the signal happens while we are in GC, they
just need to be careful to mask off the mark bit (see, for example,
the commentary to handle_child_signal).

If you really mean malloc, then there usually is not problem because
signal handlers cannot safely call malloc.  And if malloc causes GC,
see above.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 17:44                           ` Eli Zaretskii
@ 2024-07-01 18:01                             ` Ihor Radchenko
  2024-07-01 18:16                               ` Eli Zaretskii
  2024-07-01 18:19                               ` Gerd Möllmann
  0 siblings, 2 replies; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-01 18:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> I am wondering what happens in the same situation when using normal
>> malloc. (Or may it never happen when using normal malloc?)
>
> You mean, when the current GC runs?

Nope. AFAIU, the problem we are facing is not when MPS GC runs, but when
we query MPS to allocate memory in the managed arena.

> If you really mean malloc, then there usually is not problem because
> signal handlers cannot safely call malloc.  And if malloc causes GC,
> see above.

I do not mean calling malloc. I mean, what happens if a signal arrives
_while_ malloc is still in the process of allocating data?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 17:27                       ` Pip Cet
  2024-07-01 17:42                         ` Ihor Radchenko
@ 2024-07-01 18:08                         ` Eli Zaretskii
  2024-07-02  7:55                           ` Pip Cet
  1 sibling, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 18:08 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Mon, 01 Jul 2024 17:27:49 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> 
> > It goes without saying that updating a static data structure makes the
> > signal handler non-reentrant, and it should at least block SIGCHLD.
> 
> Blocking SIGCHLD in a SIGCHLD handler is redundant unless SA_NODEFER is in use. Are we talking about a different handler here?

First, that queue was intended for more than just SIGCHLD.

And second, depending on the OS, nested signals can or cannot be
possible, so we need to take that into consideration.

> One last suggestion: how about blocking those signals for most of Emacs' lifetime, only unblocking them in maybe_quit or at similar points? That would allow us to keep the existing signal handlers and make them safe...

IMO, this would be the opposite of what we should do.  We should have
these signals blocked for as little time as possible, because
otherwise the features built on them will be much less useful.  For
example, SIGUSR1/2 are a means of forcing Emacs out of some infinite
loop which is otherwise uninterruptible -- if we let these signals be
unblocked only in maybe_quit, we will have lost this useful feature.

Which is why I suggested to block the signals before calling MPS and
unblock them immediately when we return from an MPS call.  All of
these calls are in igc.c, so the job of adding these blocks, while
mundane and boring, is not impossible.

But if people who have time to work on that disagree, I have no means
of making them do what they don't agree with.

> I still think this is a simple oversight on the part of MPS, FWIW. You shouldn't allow other signals when handling SIGSEGV, or at least give the client program an option to specify a signal mask.

That's not the problem, AFAIU.  The problem is that a signal handler
which accesses Lisp data or the state of the Lisp machine could
trigger an MPS call, which will try taking the arena lock, and that
cannot be nested, by MPS design.  And our handlers do access the Lisp
machine, albeit cautiously and as little as necessary.  So when the
signal happens in the middle of an MPS call which already took the
arena lock, we cannot safely access our data.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:01                             ` Ihor Radchenko
@ 2024-07-01 18:16                               ` Eli Zaretskii
  2024-07-01 18:24                                 ` Ihor Radchenko
  2024-07-01 18:19                               ` Gerd Möllmann
  1 sibling, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 18:16 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, pipcet@protonmail.com,
>  emacs-devel@gnu.org
> Date: Mon, 01 Jul 2024 18:01:49 +0000
> 
> > If you really mean malloc, then there usually is not problem because
> > signal handlers cannot safely call malloc.  And if malloc causes GC,
> > see above.
> 
> I do not mean calling malloc. I mean, what happens if a signal arrives
> _while_ malloc is still in the process of allocating data?

That's what I mean by "you mean malloc".

If a signal happens while malloc is in progress, it is not a problem,
because signal handlers should not re-enter malloc.

And if that doesn't answer your question, maybe I don't understand
what is it exactly that you are asking.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:01                             ` Ihor Radchenko
  2024-07-01 18:16                               ` Eli Zaretskii
@ 2024-07-01 18:19                               ` Gerd Möllmann
  2024-07-01 18:23                                 ` Eli Zaretskii
  1 sibling, 1 reply; 68+ messages in thread
From: Gerd Möllmann @ 2024-07-01 18:19 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Eli Zaretskii, eller.helmut, pipcet, emacs-devel

Ihor Radchenko <yantar92@posteo.net> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>
>>> I am wondering what happens in the same situation when using normal
>>> malloc. (Or may it never happen when using normal malloc?)
>>
>> You mean, when the current GC runs?
>
> Nope. AFAIU, the problem we are facing is not when MPS GC runs, but when
> we query MPS to allocate memory in the managed arena.

AFAIU, there are 2 scenarios:

1. The client allocates memory from MPS, via an allocation point. APs
normally allocate from a block they reserve, which is why this is
usually surprisingly fast. More must done when the reserved block is
used up. Then, MPS scans mempory, tries to free some, and so on. This
happens synchronously in the client thread.

2. MPS concurrently does incrementally work. It scans some part of the
meoory, for example, and functions of the object format are called
(those in igc.c). This happens in the MPS thread.

>> If you really mean malloc, then there usually is not problem because
>> signal handlers cannot safely call malloc.  And if malloc causes GC,
>> see above.
>
> I do not mean calling malloc. I mean, what happens if a signal arrives
> _while_ malloc is still in the process of allocating data?

One would have to consult the malloc implementation. Portably, it's not
safe to call malloc in a signal handler.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:19                               ` Gerd Möllmann
@ 2024-07-01 18:23                                 ` Eli Zaretskii
  0 siblings, 0 replies; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 18:23 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: yantar92, eller.helmut, pipcet, emacs-devel

> From: Gerd Möllmann <gerd.moellmann@gmail.com>
> Cc: Eli Zaretskii <eliz@gnu.org>,  eller.helmut@gmail.com,
>   pipcet@protonmail.com,  emacs-devel@gnu.org
> Date: Mon, 01 Jul 2024 20:19:09 +0200
> 
> Ihor Radchenko <yantar92@posteo.net> writes:
> 
> >> If you really mean malloc, then there usually is not problem because
> >> signal handlers cannot safely call malloc.  And if malloc causes GC,
> >> see above.
> >
> > I do not mean calling malloc. I mean, what happens if a signal arrives
> > _while_ malloc is still in the process of allocating data?
> 
> One would have to consult the malloc implementation. Portably, it's not
> safe to call malloc in a signal handler.

Right.  Also, keep in mind that as long as malloc didn't finish its
job of serving a call, the memory it was requested to allocate is not
yet available to the program, and thus no one except malloc itself can
access that memory.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:16                               ` Eli Zaretskii
@ 2024-07-01 18:24                                 ` Ihor Radchenko
  2024-07-01 18:31                                   ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-01 18:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> I do not mean calling malloc. I mean, what happens if a signal arrives
>> _while_ malloc is still in the process of allocating data?
>
> That's what I mean by "you mean malloc".
>
> If a signal happens while malloc is in progress, it is not a problem,
> because signal handlers should not re-enter malloc.

So, maybe it is going to be enough to make sure that we block signals
while running `alloc_impl' (in igc.c)?

> And if that doesn't answer your question, maybe I don't understand
> what is it exactly that you are asking.

I was mostly asking about technical implementation details of how malloc
is preventing signals from being handled while it is doing its
job. Maybe we can reuse it or have a more clear idea what is supposed to
be done.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:24                                 ` Ihor Radchenko
@ 2024-07-01 18:31                                   ` Eli Zaretskii
  2024-07-01 18:51                                     ` Ihor Radchenko
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 18:31 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, pipcet@protonmail.com,
>  emacs-devel@gnu.org
> Date: Mon, 01 Jul 2024 18:24:20 +0000
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> I do not mean calling malloc. I mean, what happens if a signal arrives
> >> _while_ malloc is still in the process of allocating data?
> >
> > That's what I mean by "you mean malloc".
> >
> > If a signal happens while malloc is in progress, it is not a problem,
> > because signal handlers should not re-enter malloc.
> 
> So, maybe it is going to be enough to make sure that we block signals
> while running `alloc_impl' (in igc.c)?

Something like that.  I'm not familiar enough with the MPS functions
we call in igc.c, but blocking signals should be done around any calls
that could take the arena lock.

> > And if that doesn't answer your question, maybe I don't understand
> > what is it exactly that you are asking.
> 
> I was mostly asking about technical implementation details of how malloc
> is preventing signals from being handled while it is doing its
> job.

It doesn't.  It's the responsibility of the signal handlers not to
re-enter malloc.

We cannot do the same with MPS, because MPS can kick in outside of our
control -- for example, if accessing some Lisp object hits the
barrier.  So we have no idea when MPS will take the arena lock, and
this cannot prevent recursive locks, except by preventing our code
which touches Lisp data from running when MPS is active because we
called it.

To say this in different words: MPS must be in control of all the
threads which can touch the data it manages, and code that runs from a
signal handler is a kind-of separate thread, except that we have no
way of registering such a thread with MPS as we do with real threads.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:31                                   ` Eli Zaretskii
@ 2024-07-01 18:51                                     ` Ihor Radchenko
  2024-07-01 19:05                                       ` Eli Zaretskii
  2024-07-01 19:34                                       ` Gerd Möllmann
  0 siblings, 2 replies; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-01 18:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> I was mostly asking about technical implementation details of how malloc
>> is preventing signals from being handled while it is doing its
>> job.
>
> It doesn't.  It's the responsibility of the signal handlers not to
> re-enter malloc.
>
> We cannot do the same with MPS, because MPS can kick in outside of our
> control -- for example, if accessing some Lisp object hits the
> barrier.  So we have no idea when MPS will take the arena lock, and
> this cannot prevent recursive locks, except by preventing our code
> which touches Lisp data from running when MPS is active because we
> called it.

AFAIU, the situation discussed in this thread is different - MPS kicks
in _in our control_. Fcons requests MPS to allocate memory for a new
cons, MPS does it, not super quickly, and blocks the whole arena
(all the lisp memory) in the process. At the same time, signal arrives,
and we try to access the (blocked) arena.

Note that our SIGCHLD signal handler itself does not re-enter malloc
(igc_malloc in our case). It just tries to access the memory arena that
is in transient state (in process of allocation).

I do not see why the same cannot happen with vanilla malloc, unless, of
course, it blocks memory much more granularly, guaranteeing that live
objects are always reachable.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:51                                     ` Ihor Radchenko
@ 2024-07-01 19:05                                       ` Eli Zaretskii
  2024-07-01 19:34                                       ` Gerd Möllmann
  1 sibling, 0 replies; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-01 19:05 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: eller.helmut, gerd.moellmann, pipcet, emacs-devel

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, pipcet@protonmail.com,
>  emacs-devel@gnu.org
> Date: Mon, 01 Jul 2024 18:51:13 +0000
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> I was mostly asking about technical implementation details of how malloc
> >> is preventing signals from being handled while it is doing its
> >> job.
> >
> > It doesn't.  It's the responsibility of the signal handlers not to
> > re-enter malloc.
> >
> > We cannot do the same with MPS, because MPS can kick in outside of our
> > control -- for example, if accessing some Lisp object hits the
> > barrier.  So we have no idea when MPS will take the arena lock, and
> > this cannot prevent recursive locks, except by preventing our code
> > which touches Lisp data from running when MPS is active because we
> > called it.
> 
> AFAIU, the situation discussed in this thread is different - MPS kicks
> in _in our control_.

Yes.  But we had other backtraces where what I described happened.  So
the situation we need to solve is more general.

> Note that our SIGCHLD signal handler itself does not re-enter malloc
> (igc_malloc in our case). It just tries to access the memory arena that
> is in transient state (in process of allocation).

Yes.  And any other signal handler that does something except raising
a flag will do something similar.

> I do not see why the same cannot happen with vanilla malloc

Because vanilla malloc doesn't move objects in memory, and this
doesn't need to lock the arena.  It only needs to lock the data
structure it itself uses for managing the memory, so if the handler
doesn't re-enter malloc, it will never cause any trouble to vanilla
malloc.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:51                                     ` Ihor Radchenko
  2024-07-01 19:05                                       ` Eli Zaretskii
@ 2024-07-01 19:34                                       ` Gerd Möllmann
  2024-07-01 20:00                                         ` Ihor Radchenko
  1 sibling, 1 reply; 68+ messages in thread
From: Gerd Möllmann @ 2024-07-01 19:34 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Eli Zaretskii, eller.helmut, pipcet, emacs-devel

Ihor Radchenko <yantar92@posteo.net> writes:

> Note that our SIGCHLD signal handler itself does not re-enter malloc
> (igc_malloc in our case). It just tries to access the memory arena that
> is in transient state (in process of allocation).

I think you are missing the read/write barriers that MPS puts on its
memory.

Consider a single object O. At some point in time, MPS copies O to a new
address O', as part of the copying GC algorithm. In O it leaves a
"tombstone" which tells where O' is now found. Then it puts a barrier on
O, so that any access to O let's MPS fix the reference. (All via
functions in igc.c. See dflt_fwd which creates the tombstone.)

> I do not see why the same cannot happen with vanilla malloc, unless, of
> course, it blocks memory much more granularly, guaranteeing that live
> objects are always reachable.

I can't follow, but I guess you missed that/why MPS puts barriers on
memory, and why it gets invoked because of that, necessarily.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 19:34                                       ` Gerd Möllmann
@ 2024-07-01 20:00                                         ` Ihor Radchenko
  2024-07-02  4:33                                           ` Gerd Möllmann
  0 siblings, 1 reply; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-01 20:00 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: Eli Zaretskii, eller.helmut, pipcet, emacs-devel

Gerd Möllmann <gerd.moellmann@gmail.com> writes:

> Ihor Radchenko <yantar92@posteo.net> writes:
>
>> Note that our SIGCHLD signal handler itself does not re-enter malloc
>> (igc_malloc in our case). It just tries to access the memory arena that
>> is in transient state (in process of allocation).
>
> I think you are missing the read/write barriers that MPS puts on its
> memory.
>
> Consider a single object O. At some point in time, MPS copies O to a new
> address O', as part of the copying GC algorithm. In O it leaves a
> "tombstone" which tells where O' is now found. Then it puts a barrier on
> O, so that any access to O let's MPS fix the reference. (All via
> functions in igc.c. See dflt_fwd which creates the tombstone.)

It is actually clear for me.
The problem is not with the barrier. The problem is that MPS' barrier
handler cannot work while the arena is locked to allocate a new
(completely unrelated) object.

My reading of
https://memory-pool-system.readthedocs.io/en/latest/design/protix.html#threads
is that the MPS' barrier handler itself is re-entrant:

    .threads.async.protection: If the signal handler is invoked because
    of an MPS access, then we know the access must have been caused by
    client code, because the client is not allowed to permit access to
    protectable memory to arbitrary foreign code. In these
    circumstances, it’s OK to call arbitrary POSIX functions inside the
    handler.

AFAIU, if memory barrier handler were not re-entrant, multi-threading
would simply not be possible.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 20:00                                         ` Ihor Radchenko
@ 2024-07-02  4:33                                           ` Gerd Möllmann
  2024-07-02  7:05                                             ` Ihor Radchenko
  0 siblings, 1 reply; 68+ messages in thread
From: Gerd Möllmann @ 2024-07-02  4:33 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Eli Zaretskii, eller.helmut, pipcet, emacs-devel

Ihor Radchenko <yantar92@posteo.net> writes:

> Gerd Möllmann <gerd.moellmann@gmail.com> writes:
>
>> Ihor Radchenko <yantar92@posteo.net> writes:
>>
>>> Note that our SIGCHLD signal handler itself does not re-enter malloc
>>> (igc_malloc in our case). It just tries to access the memory arena that
>>> is in transient state (in process of allocation).
>>
>> I think you are missing the read/write barriers that MPS puts on its
>> memory.
>>
>> Consider a single object O. At some point in time, MPS copies O to a new
>> address O', as part of the copying GC algorithm. In O it leaves a
>> "tombstone" which tells where O' is now found. Then it puts a barrier on
>> O, so that any access to O let's MPS fix the reference. (All via
>> functions in igc.c. See dflt_fwd which creates the tombstone.)
>
> It is actually clear for me.
> The problem is not with the barrier. The problem is that MPS' barrier
> handler cannot work while the arena is locked to allocate a new
> (completely unrelated) object.
>
> My reading of
> https://memory-pool-system.readthedocs.io/en/latest/design/protix.html#threads
> is that the MPS' barrier handler itself is re-entrant:

The paragraph above what you cited is

  .threads.async: POSIX imposes some restrictions on signal handler
  functions (see design.mps.pthreadext.analysis.signal.safety). Basically
  the rules say the behaviour of almost all POSIX functions inside a
  signal handler is undefined, except for a handful of functions which are
  known to be “async-signal safe”. However, if it’s known that the signal
  didn’t happen inside a POSIX function, then it is safe to call arbitrary
  POSIX functions inside a handler.

I read this as them being thinking about whether their SIGSEGV handler can
use POSIX functions or not.

>     .threads.async.protection: If the signal handler is invoked because
>     of an MPS access, then we know the access must have been caused by
>     client code, because the client is not allowed to permit access to
>     protectable memory to arbitrary foreign code. In these
>     circumstances, it’s OK to call arbitrary POSIX functions inside the
>     handler.

And the above I read as saying they can because the client is not
allowed, and blahblah...

> AFAIU, if memory barrier handler were not re-entrant, multi-threading
> would simply not be possible.

I don't understand how you come to that conclusion, sorry.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-02  4:33                                           ` Gerd Möllmann
@ 2024-07-02  7:05                                             ` Ihor Radchenko
  2024-07-02  7:06                                               ` Gerd Möllmann
  0 siblings, 1 reply; 68+ messages in thread
From: Ihor Radchenko @ 2024-07-02  7:05 UTC (permalink / raw)
  To: Gerd Möllmann; +Cc: Eli Zaretskii, eller.helmut, pipcet, emacs-devel

Gerd Möllmann <gerd.moellmann@gmail.com> writes:

>> AFAIU, if memory barrier handler were not re-entrant, multi-threading
>> would simply not be possible.
>
> I don't understand how you come to that conclusion, sorry.

I just got lost I guess.

Best,
Ihor



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-02  7:05                                             ` Ihor Radchenko
@ 2024-07-02  7:06                                               ` Gerd Möllmann
  0 siblings, 0 replies; 68+ messages in thread
From: Gerd Möllmann @ 2024-07-02  7:06 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Eli Zaretskii, eller.helmut, pipcet, emacs-devel

Ihor Radchenko <yantar92@posteo.net> writes:

> Gerd Möllmann <gerd.moellmann@gmail.com> writes:
>
>>> AFAIU, if memory barrier handler were not re-entrant, multi-threading
>>> would simply not be possible.
>>
>> I don't understand how you come to that conclusion, sorry.
>
> I just got lost I guess.

No problem :-)



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-01 18:08                         ` Eli Zaretskii
@ 2024-07-02  7:55                           ` Pip Cet
  2024-07-02 13:10                             ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-07-02  7:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

On Monday, July 1st, 2024 at 18:08, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Mon, 01 Jul 2024 17:27:49 +0000
>
> > From: Pip Cet pipcet@protonmail.com
> > Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
>
> First, that queue was intended for more than just SIGCHLD.
>
> And second, depending on the OS, nested signals can or cannot be
> possible, so we need to take that into consideration.

I understand now, I think. Thank you.

> > One last suggestion: how about blocking those signals for most of Emacs' lifetime, only unblocking them in maybe_quit or at similar points? That would allow us to keep the existing signal handlers and make them safe...
>
> IMO, this would be the opposite of what we should do. We should have
> these signals blocked for as little time as possible, because
> otherwise the features built on them will be much less useful. For
> example, SIGUSR1/2 are a means of forcing Emacs out of some infinite
> loop which is otherwise uninterruptible -- if we let these signals be
> unblocked only in maybe_quit, we will have lost this useful feature.

Thanks for making that clear. That is a very useful feature and I would like to keep it. I'm also quite fond of the hourglass cursor :-)

> Which is why I suggested to block the signals before calling MPS and
> unblock them immediately when we return from an MPS call. All of
> these calls are in igc.c, so the job of adding these blocks, while
> mundane and boring, is not impossible.

And it adds two syscalls to what should be a very fast operation. I'm not convinced it's necessary.

> But if people who have time to work on that disagree, I have no means
> of making them do what they don't agree with.
>
> > I still think this is a simple oversight on the part of MPS, FWIW. You shouldn't allow other signals when handling SIGSEGV, or at least give the client program an option to specify a signal mask.
>
> That's not the problem, AFAIU. The problem is that a signal handler
> which accesses Lisp data or the state of the Lisp machine could
> trigger an MPS call, which will try taking the arena lock, and that
> cannot be nested, by MPS design. And our handlers do access the Lisp
> machine, albeit cautiously and as little as necessary. So when the
> signal happens in the middle of an MPS call which already took the
> arena lock, we cannot safely access our data.

I've tried quite hard to make this happen, but I didn't manage it. It seems that whenever MPS puts up a protection barrier for existing allocated memory, the arena lock has already been released. As signal handlers cannot allocate memory directly, there's no deadlock, either.

I don't understand MPS as well as you apparently do, so could you help me and tell where to put a kill(getpid(), SIGWHATEVER) with an appropriate signal handler which will cause a crash (without, in the signal handler, allocating memory)?

I'm seriously tempted to suggest that until we can produce such a crash, we can work on the assumption that blocking signals while handling SIGSEGV is enough, but, again, I don't fully understand MPS and its complicated locking scheme.

To expand a little on what I'm doing:

* install a handler for SIGUSR2 which dereferences a pointer stored in a global variable (and remove the old SIGUSR2 handler)
* modify MPS's locking functions to kill(getpid(), SIGUSR2) right after acquiring the lock
* in gdb, wait for a SIGSEGV to find a protected address/segment. Store that in the pointer variable.
* there should now be a crash when the SIGUSR2 handler runs and memory protection for the pointer is in effect
* no crashes observed so far.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-02  7:55                           ` Pip Cet
@ 2024-07-02 13:10                             ` Eli Zaretskii
  2024-07-02 14:24                               ` Pip Cet
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-02 13:10 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Tue, 02 Jul 2024 07:55:26 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> 
> > Which is why I suggested to block the signals before calling MPS and
> > unblock them immediately when we return from an MPS call. All of
> > these calls are in igc.c, so the job of adding these blocks, while
> > mundane and boring, is not impossible.
> 
> And it adds two syscalls to what should be a very fast operation. I'm not convinced it's necessary.

We should time these syscalls if we are afraid they could slow us
down.

> > That's not the problem, AFAIU. The problem is that a signal handler
> > which accesses Lisp data or the state of the Lisp machine could
> > trigger an MPS call, which will try taking the arena lock, and that
> > cannot be nested, by MPS design. And our handlers do access the Lisp
> > machine, albeit cautiously and as little as necessary. So when the
> > signal happens in the middle of an MPS call which already took the
> > arena lock, we cannot safely access our data.
> 
> I've tried quite hard to make this happen, but I didn't manage it. It seems that whenever MPS puts up a protection barrier for existing allocated memory, the arena lock has already been released. As signal handlers cannot allocate memory directly, there's no deadlock, either.
> 
> I don't understand MPS as well as you apparently do, so could you help me and tell where to put a kill(getpid(), SIGWHATEVER) with an appropriate signal handler which will cause a crash (without, in the signal handler, allocating memory)?

I thought using the profiler would trigger these easily enough?  I
think someone (Helmut?) posted a simple recipe for reproducing that
some time ago?

Also, there was a recipe with SIGCHLD not long ago (you'd need to undo
Helmut's fixes for that, I believe, to be able to reproduce that).

> I'm seriously tempted to suggest that until we can produce such a crash, we can work on the assumption that blocking signals while handling SIGSEGV is enough, but, again, I don't fully understand MPS and its complicated locking scheme.

I agree that having a reproduction recipe is a necessary condition for
trying to fix this.

> To expand a little on what I'm doing:
> 
> * install a handler for SIGUSR2 which dereferences a pointer stored in a global variable (and remove the old SIGUSR2 handler)
> * modify MPS's locking functions to kill(getpid(), SIGUSR2) right after acquiring the lock
> * in gdb, wait for a SIGSEGV to find a protected address/segment. Store that in the pointer variable.
> * there should now be a crash when the SIGUSR2 handler runs and memory protection for the pointer is in effect
> * no crashes observed so far.

Why not simply bind the sigusr2 event to some function (see the node
"Misc Events" in the ELisp manual for how), and then use "kill -USR2"
outside of Emacs?  IOW, I guess I don't understand why you'd need all
that complexity just to reproduce the crashes.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-02 13:10                             ` Eli Zaretskii
@ 2024-07-02 14:24                               ` Pip Cet
  2024-07-02 14:57                                 ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-07-02 14:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

On Tuesday, July 2nd, 2024 at 13:10, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Tue, 02 Jul 2024 07:55:26 +0000
> 
> > From: Pip Cet pipcet@protonmail.com
> > Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> > 
> > > Which is why I suggested to block the signals before calling MPS and
> > > unblock them immediately when we return from an MPS call. All of
> > > these calls are in igc.c, so the job of adding these blocks, while
> > > mundane and boring, is not impossible.
> > 
> > And it adds two syscalls to what should be a very fast operation. I'm not convinced it's necessary.
> 
> We should time these syscalls if we are afraid they could slow us
> down.

Agreed.

> > > That's not the problem, AFAIU. The problem is that a signal handler
> > > which accesses Lisp data or the state of the Lisp machine could
> > > trigger an MPS call, which will try taking the arena lock, and that
> > > cannot be nested, by MPS design. And our handlers do access the Lisp
> > > machine, albeit cautiously and as little as necessary. So when the
> > > signal happens in the middle of an MPS call which already took the
> > > arena lock, we cannot safely access our data.
> > 
> > I've tried quite hard to make this happen, but I didn't manage it. It seems that whenever MPS puts up a protection barrier for existing allocated memory, the arena lock has already been released. As signal handlers cannot allocate memory directly, there's no deadlock, either.
> > 
> > I don't understand MPS as well as you apparently do, so could you help me and tell where to put a kill(getpid(), SIGWHATEVER) with an appropriate signal handler which will cause a crash (without, in the signal handler, allocating memory)?
> 
> I thought using the profiler would trigger these easily enough? I
> think someone (Helmut?) posted a simple recipe for reproducing that
> some time ago?

Those were all signals interrupting MPS's SIGSEGV handler. You were talking about signals interrupting MPS code that runs outside of a signal handler, weren't you?

> Also, there was a recipe with SIGCHLD not long ago (you'd need to undo
> Helmut's fixes for that, I believe, to be able to reproduce that).

Same thing.

> > I'm seriously tempted to suggest that until we can produce such a crash, we can work on the assumption that blocking signals while handling SIGSEGV is enough, but, again, I don't fully understand MPS and its complicated locking scheme.
> 
> I agree that having a reproduction recipe is a necessary condition for
> trying to fix this.

Excellent.

> > To expand a little on what I'm doing:
> > 
> > * install a handler for SIGUSR2 which dereferences a pointer stored in a global variable (and remove the old SIGUSR2 handler)
> > * modify MPS's locking functions to kill(getpid(), SIGUSR2) right after acquiring the lock
> > * in gdb, wait for a SIGSEGV to find a protected address/segment. Store that in the pointer variable.
> > * there should now be a crash when the SIGUSR2 handler runs and memory protection for the pointer is in effect
> > * no crashes observed so far.
> 
> Why not simply bind the sigusr2 event to some function (see the node
> "Misc Events" in the ELisp manual for how), and then use "kill -USR2"
> outside of Emacs? IOW, I guess I don't understand why you'd need all
> that complexity just to reproduce the crashes.

Because I wanted to be sure to hit the tiny window while a global lock was taken.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-02 14:24                               ` Pip Cet
@ 2024-07-02 14:57                                 ` Eli Zaretskii
  2024-07-02 17:06                                   ` Pip Cet
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-02 14:57 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Tue, 02 Jul 2024 14:24:33 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> 
> > > > That's not the problem, AFAIU. The problem is that a signal handler
> > > > which accesses Lisp data or the state of the Lisp machine could
> > > > trigger an MPS call, which will try taking the arena lock, and that
> > > > cannot be nested, by MPS design. And our handlers do access the Lisp
> > > > machine, albeit cautiously and as little as necessary. So when the
> > > > signal happens in the middle of an MPS call which already took the
> > > > arena lock, we cannot safely access our data.
> > > 
> > > I've tried quite hard to make this happen, but I didn't manage it. It seems that whenever MPS puts up a protection barrier for existing allocated memory, the arena lock has already been released. As signal handlers cannot allocate memory directly, there's no deadlock, either.
> > > 
> > > I don't understand MPS as well as you apparently do, so could you help me and tell where to put a kill(getpid(), SIGWHATEVER) with an appropriate signal handler which will cause a crash (without, in the signal handler, allocating memory)?
> > 
> > I thought using the profiler would trigger these easily enough? I
> > think someone (Helmut?) posted a simple recipe for reproducing that
> > some time ago?
> 
> Those were all signals interrupting MPS's SIGSEGV handler. You were talking about signals interrupting MPS code that runs outside of a signal handler, weren't you?

I don't think they all were interrupting MPS's SIGSEGV handler.  I
think it's the other way around: we interrupted MPS code, and our
signal handler accessed memory which triggered MPS's SIGSEGV.

But even if I'm wrong, why is that important?  We need to solve both
kinds of situations, don't we?

> > Also, there was a recipe with SIGCHLD not long ago (you'd need to undo
> > Helmut's fixes for that, I believe, to be able to reproduce that).
> 
> Same thing.

Not AFAICT.  Look:

  Thread 1 "emacs" hit Breakpoint 1, terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:443
  443	{
  (gdb) bt
  #0  terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:443
  #1  0x00005555558634be in set_state (state=IGC_STATE_DEAD) at igc.c:179
  #2  igc_assert_fail (file=<optimized out>, line=<optimized out>, msg=<optimized out>) at igc.c:205
  #3  0x00005555558f1e19 in mps_lib_assert_fail (condition=0x555555943c4c "res == 0", line=126, file=0x555555943c36 "lockix.c")
      at /home/yantar92/Dist/mps/code/mpsliban.c:87
  #4  LockClaim (lock=0x7fffe8000110) at /home/yantar92/Dist/mps/code/lockix.c:126
  #5  0x00005555558f204d in ArenaEnterLock (arena=0x7ffff7fbf000, recursive=0) at /home/yantar92/Dist/mps/code/global.c:576
  #6  0x000055555591aefe in ArenaEnter (arena=0x7ffff7fbf000) at /home/yantar92/Dist/mps/code/global.c:553
  #7  ArenaAccess (addr=0x7fffeb908758, mode=mode@entry=3, context=context@entry=0x7fffffff97d0) at /home/yantar92/Dist/mps/code/global.c:655
  #8  0x0000555555926202 in sigHandle (sig=<optimized out>, info=0x7fffffff9af0, uap=0x7fffffff99c0) at /home/yantar92/Dist/mps/code/protsgix.c:97
  #9  0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
  #10 0x0000555555827385 in PSEUDOVECTORP (a=XIL(0x7fffeb90875d), code=9) at /home/yantar92/Git/emacs/src/lisp.h:1105
  #11 PROCESSP (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:212
  #12 XPROCESS (a=XIL(0x7fffeb90875d)) at /home/yantar92/Git/emacs/src/process.h:224
  #13 handle_child_signal (sig=sig@entry=17) at process.c:7660
  #14 0x000055555573b771 in deliver_process_signal (sig=17, handler=handler@entry=0x555555827200 <handle_child_signal>) at sysdep.c:1758
  #15 0x0000555555820647 in deliver_child_signal (sig=<optimized out>) at process.c:7702
  #16 0x00007ffff3048050 in <signal handler called> () at /lib64/libc.so.6
  #17 0x000055555585f77b in fix_lisp_obj (ss=ss@entry=0x7fffffffa9a8, pobj=pobj@entry=0x7fffeee7ffe8) at igc.c:841
  #18 0x000055555586050d in fix_cons (ss=0x7fffffffa9a8, cons=0x7fffeee7ffe0) at igc.c:1474
  #19 dflt_scan_obj (ss=0x7fffffffa9a8, base_start=0x7fffeee7ffd8, base_limit=0x7fffeee80000, closure=0x0) at igc.c:1578
  #20 dflt_scanx (ss=ss@entry=0x7fffffffa9a8, base_start=<optimized out>, base_limit=0x7fffeee80000, closure=closure@entry=0x0) at igc.c:1658
  #21 0x00005555558613a3 in dflt_scan (ss=0x7fffffffa9a8, base_start=<optimized out>, base_limit=<optimized out>) at igc.c:1669
  #22 0x00005555558f163f in TraceScanFormat (limit=0x7fffeee80000, base=0x7fffeee7e000, ss=0x7fffffffa9a0) at /home/yantar92/Dist/mps/code/trace.c:1539
  #23 amcSegScan (totalReturn=0x7fffffffa99c, seg=0x7fffe845e4c8, ss=0x7fffffffa9a0) at /home/yantar92/Dist/mps/code/poolamc.c:1440
  #24 0x000055555591e7bc in traceScanSegRes (ts=ts@entry=1, rank=rank@entry=1, arena=arena@entry=0x7ffff7fbf000, seg=seg@entry=0x7fffe845e4c8)
      at /home/yantar92/Dist/mps/code/trace.c:1205
  #25 0x000055555591e9ca in traceScanSeg (ts=1, rank=1, arena=0x7ffff7fbf000, seg=0x7fffe845e4c8) at /home/yantar92/Dist/mps/code/trace.c:1267
  #26 0x000055555591f3a4 in TraceAdvance (trace=trace@entry=0x7ffff7fbfaa8) at /home/yantar92/Dist/mps/code/trace.c:1728
  #27 0x000055555591faa4 in TracePoll
      (workReturn=workReturn@entry=0x7fffffffab90, collectWorldReturn=collectWorldReturn@entry=0x7fffffffab8c, globals=globals@entry=0x7ffff7fbf008, collectWorldAllowed=<optimized out>) at /home/yantar92/Dist/mps/code/trace.c:1849
  #28 0x000055555591fceb in ArenaPoll (globals=globals@entry=0x7ffff7fbf008) at /home/yantar92/Dist/mps/code/global.c:745
  #29 0x00005555559200da in mps_ap_fill (p_o=p_o@entry=0x7fffffffad00, mps_ap=mps_ap@entry=0x7fffe80017f0, size=size@entry=24)
      at /home/yantar92/Dist/mps/code/mpsi.c:1097
  #30 0x00005555558601ee in alloc_impl (size=24, type=IGC_OBJ_CONS, ap=0x7fffe80017f0) at igc.c:3330
  #31 0x000055555586023c in alloc (size=size@entry=16, type=type@entry=IGC_OBJ_CONS) at igc.c:3358
  #32 0x000055555586187a in igc_make_cons (car=XIL(0x133e0), cdr=XIL(0)) at igc.c:3385
  #33 0x000055555578e7de in Fcons (car=<optimized out>, cdr=<optimized out>) at alloc.c:2926
  #34 Flist (nargs=31, args=0x7fffffffaf38) at alloc.c:3054
  #35 0x00007ffff06b13ea in F7365742d666163652d617474726962757465_set_face_attribute_0 ()

This says:

  . we called Fcons (from a "normal" Emacs Lisp program, which called
    set-face-attribute)
  . that entered MPS by way of igc_make_cons
  . MPS called our scanning code in dflt_scan
  . while in fix_* functions called by dflt_scan, we got SIGCHLD
  . the SIGCHLD handler accessed Lisp data of the process object(s),
    which triggered MPS SIGSEGV handler
  . the MPS handler tried to take the arena lock and aborted

IOW, SIGCHLD did NOT interrupt the MPS SIGSEGV handler, it interrupted
the "normal" MPS code when it called our scanning callbacks.

> > Why not simply bind the sigusr2 event to some function (see the node
> > "Misc Events" in the ELisp manual for how), and then use "kill -USR2"
> > outside of Emacs? IOW, I guess I don't understand why you'd need all
> > that complexity just to reproduce the crashes.
> 
> Because I wanted to be sure to hit the tiny window while a global lock was taken.

I think the scenario above with SIGCHLD does precisely that, no?



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-02 14:57                                 ` Eli Zaretskii
@ 2024-07-02 17:06                                   ` Pip Cet
  2024-07-03 11:31                                     ` Pip Cet
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-07-02 17:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

On Tuesday, July 2nd, 2024 at 14:57, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Tue, 02 Jul 2024 14:24:33 +0000
> > From: Pip Cet pipcet@protonmail.com
> > Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> > 
> > > > > That's not the problem, AFAIU. The problem is that a signal handler
> > > > > which accesses Lisp data or the state of the Lisp machine could
> > > > > trigger an MPS call, which will try taking the arena lock, and that
> > > > > cannot be nested, by MPS design. And our handlers do access the Lisp
> > > > > machine, albeit cautiously and as little as necessary. So when the
> > > > > signal happens in the middle of an MPS call which already took the
> > > > > arena lock, we cannot safely access our data.
> > > > 
> > > > I've tried quite hard to make this happen, but I didn't manage it. It seems that whenever MPS puts up a protection barrier for existing allocated memory, the arena lock has already been released. As signal handlers cannot allocate memory directly, there's no deadlock, either.

I finally figured out what I was doing wrong. I was allocating a few very large objects, but I needed many very small ones.

./emacs --batch -Q --eval "(progn (setq list nil) (keymap-set special-event-map \"<sigusr1>\" (lambda () (interactive) (length list))) (while t (push nil list)))" & while sleep .1; do kill -USR1 %%; done

works.

> > Those were all signals interrupting MPS's SIGSEGV handler. You were talking about signals interrupting MPS code that runs outside of a signal handler, weren't you?
> 
> 
> I don't think they all were interrupting MPS's SIGSEGV handler. I
> think it's the other way around: we interrupted MPS code, and our
> signal handler accessed memory which triggered MPS's SIGSEGV.

You're correct.

> But even if I'm wrong, why is that important? We need to solve both
> kinds of situations, don't we?

Now that we have a way to reproducibly make it happen, yes, I agree.

> > > Also, there was a recipe with SIGCHLD not long ago (you'd need to undo
> > > Helmut's fixes for that, I believe, to be able to reproduce that).
> > 
> > Same thing.
> 
> Not AFAICT. Look:

I did, now. You're right.

> . we called Fcons (from a "normal" Emacs Lisp program, which called
> set-face-attribute)
> . that entered MPS by way of igc_make_cons
> . MPS called our scanning code in dflt_scan
> . while in fix_* functions called by dflt_scan, we got SIGCHLD
> . the SIGCHLD handler accessed Lisp data of the process object(s),
> which triggered MPS SIGSEGV handler
> . the MPS handler tried to take the arena lock and aborted
> 
> IOW, SIGCHLD did NOT interrupt the MPS SIGSEGV handler, it interrupted
> the "normal" MPS code when it called our scanning callbacks.

You're right, again.

> > > Why not simply bind the sigusr2 event to some function (see the node
> > > "Misc Events" in the ELisp manual for how), and then use "kill -USR2"
> > > outside of Emacs? IOW, I guess I don't understand why you'd need all
> > > that complexity just to reproduce the crashes.
> > 
> > Because I wanted to be sure to hit the tiny window while a global lock was taken.
> 
> I think the scenario above with SIGCHLD does precisely that, no?

It does.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-02 17:06                                   ` Pip Cet
@ 2024-07-03 11:31                                     ` Pip Cet
  2024-07-03 11:50                                       ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-07-03 11:31 UTC (permalink / raw)
  To: Pip Cet; +Cc: Eli Zaretskii, eller.helmut, gerd.moellmann, yantar92,
	emacs-devel

On Tuesday, July 2nd, 2024 at 17:06, Pip Cet <pipcet@protonmail.com> wrote:
> > But even if I'm wrong, why is that important? We need to solve both
> > kinds of situations, don't we?
> 
> Now that we have a way to reproducibly make it happen, yes, I agree.

So what are the proposed solutions that are still on the table? I can think of these:

1. block signals around MPS calls and block signals while the MPS SIGSEGV handler is running
2. handle signals in a special thread
3. switch Emacs to an event loop model and handle all signals asynchronously

(1) is hard to do because we'd need to hijack the SIGSEGV handler and/or modify MPS
(3) is a very major change which would mean permanently losing features we want
(2) is very easy to do, but gets complicated for SIGPROF (which needs to be on the thread it's profiling) and emergency breakout signals which need to modify the main thread's state.

I've tried (2) and it fixes the specific reproducer I posted yesterday, as it should do in theory. I believe it's the most flexible approach which still gives us well-defined semantics for signal handlers. We'd need to add per-thread signals for the specific cases mentioned above. And, of course, its pthread-specific.

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-03 11:31                                     ` Pip Cet
@ 2024-07-03 11:50                                       ` Eli Zaretskii
  2024-07-03 14:35                                         ` Pip Cet
  0 siblings, 1 reply; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-03 11:50 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Wed, 03 Jul 2024 11:31:48 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: Eli Zaretskii <eliz@gnu.org>, eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> 
> So what are the proposed solutions that are still on the table? I can think of these:
> 
> 1. block signals around MPS calls and block signals while the MPS SIGSEGV handler is running
> 2. handle signals in a special thread
> 3. switch Emacs to an event loop model and handle all signals asynchronously

I'd start with the first half of (1).  It is not clear to me that the
other part is needed, and in any case we need a reproducer for it
first.  Most of the crashes we've seen until now were not when the MPS
handler was running.  Also, didn't someone say that when the MPS
SIGSEGV handler is active, we could detect that from our code and
return doing nothing?

> (1) is hard to do because we'd need to hijack the SIGSEGV handler and/or modify MPS
> (3) is a very major change which would mean permanently losing features we want
> (2) is very easy to do, but gets complicated for SIGPROF (which needs to be on the thread it's profiling) and emergency breakout signals which need to modify the main thread's state.
> 
> I've tried (2) and it fixes the specific reproducer I posted yesterday, as it should do in theory. I believe it's the most flexible approach which still gives us well-defined semantics for signal handlers. We'd need to add per-thread signals for the specific cases mentioned above. And, of course, its pthread-specific.

Doing (2) adds a whole lot of complexity to Emacs.  Most importantly,
we will be unable to access Lisp data safely, unwind-protect and the
entire specpdl stuff generally cannot be used, and signaling an error
would be fatal.  So I'd rather avoid that.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-03 11:50                                       ` Eli Zaretskii
@ 2024-07-03 14:35                                         ` Pip Cet
  2024-07-03 15:41                                           ` Eli Zaretskii
  0 siblings, 1 reply; 68+ messages in thread
From: Pip Cet @ 2024-07-03 14:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

On Wednesday, July 3rd, 2024 at 11:50, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Wed, 03 Jul 2024 11:31:48 +0000
> 
> > From: Pip Cet pipcet@protonmail.com
> > Cc: Eli Zaretskii eliz@gnu.org, eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> > 
> > So what are the proposed solutions that are still on the table? I can think of these:
> > 
> > 1. block signals around MPS calls and block signals while the MPS SIGSEGV handler is running
> > 2. handle signals in a special thread
> > 3. switch Emacs to an event loop model and handle all signals asynchronously
> 
> 
> I'd start with the first half of (1). It is not clear to me that the
> other part is needed, and in any case we need a reproducer for it
> first.

Ihor's original SIGPROF-based reproducer works if you revert the (SIGPROF-specific) workaround by Helmut. What makes you think it can't happen with other signals (which, naturally, aren't as frequent or badly-timed as SIGPROF, which strikes precisely when the CPU is active)?

Obviously a reproducer is highly desirable in this case, but we shouldn't leave known and understood bugs in the code.

> Most of the crashes we've seen until now were not when the MPS
> handler was running. Also, didn't someone say that when the MPS

Ihor's first crash backlog (MPS: profiler) clearly shows that was the case: https://lists.gnu.org/archive/html/emacs-devel/2024-06/msg00568.html

> SIGSEGV handler is active, we could detect that from our code and
> return doing nothing?

Dropping the signal in the process.

> > (1) is hard to do because we'd need to hijack the SIGSEGV handler and/or modify MPS
> > (3) is a very major change which would mean permanently losing features we want
> > (2) is very easy to do, but gets complicated for SIGPROF (which needs to be on the thread it's profiling) and emergency breakout signals which need to modify the main thread's state.
> > 
> > I've tried (2) and it fixes the specific reproducer I posted yesterday, as it should do in theory. I believe it's the most flexible approach which still gives us well-defined semantics for signal handlers. We'd need to add per-thread signals for the specific cases mentioned above. And, of course, its pthread-specific.
> 
> Doing (2) adds a whole lot of complexity to Emacs.

You're right.

> Most importantly,
> we will be unable to access Lisp data safely, unwind-protect and the
> entire specpdl stuff generally cannot be used, and signaling an error
> would be fatal. So I'd rather avoid that.

Very good points, though I wonder to what extent our current code is safe...

Pip



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: MPS: a random backtrace while toying with gdb
  2024-07-03 14:35                                         ` Pip Cet
@ 2024-07-03 15:41                                           ` Eli Zaretskii
  0 siblings, 0 replies; 68+ messages in thread
From: Eli Zaretskii @ 2024-07-03 15:41 UTC (permalink / raw)
  To: Pip Cet; +Cc: eller.helmut, gerd.moellmann, yantar92, emacs-devel

> Date: Wed, 03 Jul 2024 14:35:16 +0000
> From: Pip Cet <pipcet@protonmail.com>
> Cc: eller.helmut@gmail.com, gerd.moellmann@gmail.com, yantar92@posteo.net, emacs-devel@gnu.org
> 
> > I'd start with the first half of (1). It is not clear to me that the
> > other part is needed, and in any case we need a reproducer for it
> > first.
> 
> Ihor's original SIGPROF-based reproducer works if you revert the (SIGPROF-specific) workaround by Helmut. What makes you think it can't happen with other signals (which, naturally, aren't as frequent or badly-timed as SIGPROF, which strikes precisely when the CPU is active)?
> 
> Obviously a reproducer is highly desirable in this case, but we shouldn't leave known and understood bugs in the code.

I agree, but you yourself said, and I agree, that without a test case
we cannot test any fixes in that area.  So let's have the text case
first and analyze it, before discussing solution.

> > Most of the crashes we've seen until now were not when the MPS
> > handler was running. Also, didn't someone say that when the MPS
> 
> Ihor's first crash backlog (MPS: profiler) clearly shows that was the case: https://lists.gnu.org/archive/html/emacs-devel/2024-06/msg00568.html
> 
> > SIGSEGV handler is active, we could detect that from our code and
> > return doing nothing?
> 
> Dropping the signal in the process.

No, setting a flag to 'raise' the same signal when we are back in our
code and in safe environment (i.e., not called from the MPS SIGSEGV
handler).

Btw, an alternative would be to block the signals we care about while
in the MPS SIGSEGV handler in some way.  The way they install the
handler is very simple, see protsgix.c:ProtSetup.  Their code masks no
signals; we could instead mask the signals we care about.  As a POC,
I'd simply modify their code, but if that works, we could later
override their handler setup with ours or something.

> > Doing (2) adds a whole lot of complexity to Emacs.
> 
> You're right.
> 
> > Most importantly,
> > we will be unable to access Lisp data safely, unwind-protect and the
> > entire specpdl stuff generally cannot be used, and signaling an error
> > would be fatal. So I'd rather avoid that.
> 
> Very good points, though I wonder to what extent our current code is safe...

Well, I know from the MS-Windows experience that this is fraught with
pitfalls which took us some years to find and fix.  (On Windows, Emacs
uses additional threads for GUI I/O and for emulating SIGALRM and
SIGPROF.)



^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2024-07-03 15:41 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-29 19:12 MPS: a random backtrace while toying with gdb Ihor Radchenko
2024-06-29 19:19 ` Pip Cet
2024-06-29 21:46   ` Gerd Möllmann
2024-06-30  4:55     ` Eli Zaretskii
2024-06-30  5:33       ` Gerd Möllmann
2024-06-30  6:16         ` Eli Zaretskii
2024-06-30  6:43           ` Gerd Möllmann
2024-06-30  8:52             ` Eli Zaretskii
2024-06-30  9:43               ` Gerd Möllmann
2024-06-30 10:05                 ` Eli Zaretskii
2024-06-30 11:20                   ` Gerd Möllmann
2024-06-30 12:16                     ` Eli Zaretskii
2024-06-30 12:43                       ` Gerd Möllmann
2024-06-30  9:36     ` Helmut Eller
2024-06-30 10:00       ` Eli Zaretskii
2024-06-30 10:24         ` Helmut Eller
2024-06-30 10:43           ` Eli Zaretskii
2024-06-30 18:42             ` Helmut Eller
2024-06-30 18:59               ` Gerd Möllmann
2024-06-30 19:25               ` Pip Cet
2024-06-30 19:49                 ` Ihor Radchenko
2024-06-30 20:09                 ` Eli Zaretskii
2024-06-30 20:32                   ` Pip Cet
2024-07-01 11:07                     ` Eli Zaretskii
2024-07-01 17:27                       ` Pip Cet
2024-07-01 17:42                         ` Ihor Radchenko
2024-07-01 18:08                         ` Eli Zaretskii
2024-07-02  7:55                           ` Pip Cet
2024-07-02 13:10                             ` Eli Zaretskii
2024-07-02 14:24                               ` Pip Cet
2024-07-02 14:57                                 ` Eli Zaretskii
2024-07-02 17:06                                   ` Pip Cet
2024-07-03 11:31                                     ` Pip Cet
2024-07-03 11:50                                       ` Eli Zaretskii
2024-07-03 14:35                                         ` Pip Cet
2024-07-03 15:41                                           ` Eli Zaretskii
2024-07-01  2:33                 ` Eli Zaretskii
2024-07-01  6:05                 ` Helmut Eller
2024-06-30 19:58               ` Eli Zaretskii
2024-06-30 21:08                 ` Ihor Radchenko
2024-07-01  2:35                   ` Eli Zaretskii
2024-07-01 11:13                   ` Eli Zaretskii
2024-07-01 11:47                     ` Ihor Radchenko
2024-07-01 12:33                       ` Eli Zaretskii
2024-07-01 17:17                         ` Ihor Radchenko
2024-07-01 17:44                           ` Eli Zaretskii
2024-07-01 18:01                             ` Ihor Radchenko
2024-07-01 18:16                               ` Eli Zaretskii
2024-07-01 18:24                                 ` Ihor Radchenko
2024-07-01 18:31                                   ` Eli Zaretskii
2024-07-01 18:51                                     ` Ihor Radchenko
2024-07-01 19:05                                       ` Eli Zaretskii
2024-07-01 19:34                                       ` Gerd Möllmann
2024-07-01 20:00                                         ` Ihor Radchenko
2024-07-02  4:33                                           ` Gerd Möllmann
2024-07-02  7:05                                             ` Ihor Radchenko
2024-07-02  7:06                                               ` Gerd Möllmann
2024-07-01 18:19                               ` Gerd Möllmann
2024-07-01 18:23                                 ` Eli Zaretskii
2024-06-30 11:07           ` Gerd Möllmann
2024-06-30 11:06         ` Gerd Möllmann
2024-06-30 11:05       ` Gerd Möllmann
2024-06-30  9:59     ` Pip Cet
2024-06-30 10:09       ` Eli Zaretskii
2024-06-30 10:16         ` Pip Cet
2024-06-30 10:34           ` Eli Zaretskii
2024-06-30 13:06             ` Pip Cet
2024-06-30 11:10       ` Gerd Möllmann

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).