From: Andrea Corallo
Newsgroups: gmane.emacs.devel
Subject: Re: MPS: Please check if scratch/igc builds with native compilation
Date: Tue, 21 May 2024 14:17:19 -0400
To: Gerd Möllmann
Cc: Emacs Devel, Eli Zaretskii, Helmut Eller
In-Reply-To: Gerd Möllmann's message of "Tue, 21 May 2024 20:09:50 +0200"
User-Agent: Gnus/5.13 (Gnus v5.13)
List-Id: "Emacs development discussions."
Xref: news.gmane.io gmane.emacs.devel:319450

Gerd Möllmann writes:

> Andrea Corallo writes:
>
>> At least here the error seems reproducible.  Bootstrapping with -j1
>> makes native compiling leim/ja-dic/ja-dic.el always fail.
>>
>> And if I run it under gdb I see we get a SIGSEGV in
>> 'maybe_resize_hash_table' at fns.c:4987
>>
>> memcpy (key, h->key, old_size * sizeof *key);
>
> That's a new one for me. Maybe you are hitting a read/write barrier?

Ah right maybe, interesting!

> I think Eli & Helmut can help here with what to do for the signals in
> GDB. (On macOS, MPS is using Mach exceptions, not signals.)
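
For reference, a minimal sketch of why a barrier hit would show up as a SIGSEGV on GNU/Linux: a page is protected, the program touches it, and a signal handler lifts the protection so that the faulting access is retried.  This is not MPS code; barrier_page, on_segv and barrier_demo are made-up names, and the sketch assumes Linux with mmap/mprotect.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* All names here are hypothetical; this only imitates a read barrier.  */
static char *barrier_page;
static size_t barrier_page_size;
static volatile sig_atomic_t barrier_hits;

/* SIGSEGV handler: if the fault is on the protected page, unprotect it
   and return, so that the faulting instruction is retried.  */
static void
on_segv (int sig, siginfo_t *info, void *ctx)
{
  (void) sig;
  (void) ctx;
  char *addr = (char *) info->si_addr;
  if (addr >= barrier_page && addr < barrier_page + barrier_page_size)
    {
      barrier_hits++;
      mprotect (barrier_page, barrier_page_size, PROT_READ | PROT_WRITE);
      return;
    }
  /* Not our page: restore the default action, the fault will recur.  */
  signal (SIGSEGV, SIG_DFL);
}

/* Returns 0 if the protected read faulted exactly once and then
   delivered the expected data, -1 otherwise.  */
static int
barrier_demo (void)
{
  barrier_hits = 0;
  barrier_page_size = (size_t) sysconf (_SC_PAGESIZE);
  barrier_page = mmap (NULL, barrier_page_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (barrier_page == MAP_FAILED)
    return -1;
  strcpy (barrier_page, "key");

  struct sigaction sa;
  memset (&sa, 0, sizeof sa);
  sa.sa_sigaction = on_segv;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset (&sa.sa_mask);
  if (sigaction (SIGSEGV, &sa, NULL) != 0)
    return -1;

  /* "Raise the barrier": any access to the page now faults.  */
  if (mprotect (barrier_page, barrier_page_size, PROT_NONE) != 0)
    return -1;

  char copy[4];
  memcpy (copy, barrier_page, sizeof copy); /* faults, handler fixes it */

  assert (barrier_hits == 1);
  return strcmp (copy, "key") == 0 ? 0 : -1;
}
```

If that is what happens at fns.c:4987, GDB merely stops on the signal before the program's handler gets to run; `handle SIGSEGV nostop noprint pass` should let it through.
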
>
>>
>> with the following bt
>
>>
>> (gdb) bt
>> #0  maybe_resize_hash_table (h=0x7fffe7dabd48) at fns.c:4987
>> #1  hash_put (h=0x7fffe7dabd48, key=XIL(0x7fffe4fc297b), value=XIL(0x30), hash=1644298) at fns.c:5162
>> #2  0x0000555555817fc0 in Fputhash (key=XIL(0x7fffe4fc297b), value=XIL(0x30), table=) at fns.c:5993
>> #3  0x00007ffff14f6313 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #4  0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc010) at eval.c:3032
>> #5  0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #6  0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc0d0) at eval.c:3032
>> #7  0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #8  0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc190) at eval.c:3032
>> #9  0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #10 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc250) at eval.c:3032
>> #11 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #12 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc310) at eval.c:3032
>> #13 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #14 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc3d0) at eval.c:3032
>> #15 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #16 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc490) at eval.c:3032
>> #17 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #18 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc550) at eval.c:3032
>> #19 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #20 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc610) at eval.c:3032
>> #21 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #22 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc6d0) at eval.c:3032
>> #23 0x00007ffff14f6476 in F627974652d72756e2d2d73747269702d6c697374_byte_run__strip_list_0 () at /home/andcor03/emacs4/src/../native-lisp/30.0.50-00c2e4a4/preloaded/byte-run-79ff048e-d52588ab.eln
>> #24 0x00005555557fdbac in Ffuncall (nargs=2, args=0x7fffffffc760) at eval.c:3032
>> #25 0x00007ffff14f692c in F627974652d72756e2d73747269702d73796d626f6c2d706f736974696f6e73_byte_run_strip_symbol_positions_0 ()
>> [...]
>>
>> Which is admittedly different to what I saw from command line.
>>
>>> To debug this, I changed the check in igc.c to not assert, but print
>>> the PID, and enter an endless loop sleeping. This makes it possible to
>>> attach to the process with LLDB.
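
The "print the PID and sleep" trick described above could look roughly like the sketch below.  This is not the actual change made in igc.c; wait_for_debugger and debugger_may_continue are hypothetical names, and the max_seconds bound is my own addition (the description above uses an endless loop) so that the helper can be exercised without a debugger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical flag, not part of igc.c.  Once attached, a debugger can
   set it to true to let the process continue.  */
static volatile bool debugger_may_continue = false;

/* Instead of aborting on a failed check, print the PID and sleep until
   a debugger releases us or MAX_SECONDS have passed.  Returns the
   number of one-second sleeps performed.  */
static int
wait_for_debugger (const char *what, int max_seconds)
{
  assert (max_seconds >= 0);
  fprintf (stderr, "%s: pid %ld waiting for debugger\n",
           what, (long) getpid ());
  int waited = 0;
  while (!debugger_may_continue && waited < max_seconds)
    {
      sleep (1);
      waited++;
    }
  return waited;
}
```

After attaching with the printed PID, something like `set var debugger_may_continue = 1` in GDB (or `expr debugger_may_continue = true` in LLDB) would let the process go on from the point of the failed check.
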
>>>
>>> In all cases I investigated in this way, I'm seeing a pattern: What is
>>> happening is that a function in the Emacs core is called from a
>>> native-compiled function. Things look like, simplified,
>>>
>>>   /* In some .eln */
>>>   Lisp_Object d_reloc[100];
>>>
>>>   Lisp_Object some_native_compiled_lisp_function ()
>>>   {
>>>     Lisp_Object frame[2];
>>>     frame[0] = d_reloc[17]; // some symbol
>>>     frame[1] = ...
>>>     f_reloc->funcall (2, frame);
>>>   }
>>>
>>> where f_reloc is a large struct with function pointer members for
>>> function being called from the .eln. Doesn't matter. We then land in
>>> Ffuncall in the Emacs core, and the first element of its args vector,
>>> a symbol, is found to be forwarded which leads to the assertion.
>>>
>>> d_reloc in the .eln is scanned in igc.c, and it being on the control
>>> stack, in frame[], or in a register, should pin it, one would assume.
>>> So how comes Ffuncall in Emacs receives an invalid symbol?
>>>
>>> I've checked that d_reloc is indeed scanned by fix_comp_unit. The
>>> check gives me reasonable confidence that this "should work". But as
>>> an alternative, I also made all the things like d_reloc in the .elns
>>> ambiguous roots, so that they cannot possibly be moved, if all works as
>>> expected.
>>>
>>> - No change, it still asserts in the same way.
>>>
>>> - Changing optimization levels - no change.
>>> - Changing from arm64 to x86_64 - no change.
>>
>> That's very bizarre, I've hard time believing we are hitting such a bug :/
>> Hope we are missing something.
>
> Yes, bizarre is a good description. I'm out of ideas.

Do you think it is very difficult to debug MPS to understand why a
certain object is being moved (while it should not be)?  On GNU/Linux we
can record an rr trace (so that everything is reproducible) and step
backward and forward through it to try to shed some light on this,
maybe?

  Andrea