* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function @ 2018-10-11 5:30 Gemini Lasswell 2018-10-12 8:12 ` Eli Zaretskii 2018-10-31 4:49 ` Paul Eggert 0 siblings, 2 replies; 34+ messages in thread From: Gemini Lasswell @ 2018-10-11 5:30 UTC (permalink / raw) To: 33014 [-- Attachment #1: Type: text/plain, Size: 684 bytes --] When I run some byte-compiled code which creates some threads, and then, while a thread is blocked, interactively evaluate the function which was used to create that thread, Emacs has a fatal error or segmentation fault when the thread becomes unblocked. To reproduce: Build Emacs from master with this patch, in which I've pasted some excerpts from my current project onto the end of lisp/thread.el. It's going to be like ERT but designed to run benchmarks instead of tests, but right now all it does is to create a buffer and three threads, set the threads up to communicate with each other, log their progress to *Messages*, and update the buffer when they finish: [-- Attachment #2: 0001-Reproduce-Bswitch-segfault.patch --] [-- Type: text/plain, Size: 17777 bytes --] From 4b9e3ccb2376ea532051d5ff7d2d308df56576c9 Mon Sep 17 00:00:00 2001 From: Gemini Lasswell <gazally@runbox.com> Date: Tue, 9 Oct 2018 09:27:54 -0700 Subject: [PATCH] Reproduce Bswitch segfault --- lisp/thread.el | 454 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 454 insertions(+) diff --git a/lisp/thread.el b/lisp/thread.el index 7974a2603c..9c4774e961 100644 --- a/lisp/thread.el +++ b/lisp/thread.el @@ -196,5 +196,459 @@ thread-list--name (and (eq thread main-thread) "Main") (prin1-to-string thread))) +;;; Thread-safe messages + +(cl-defstruct + (thread--message + (:constructor + thread-make-message (&optional name + &aux + (mutex (make-mutex name)) + (condition + (make-condition-variable mutex name))))) + name mutex value condition) + +(defun thread-message-available-p (message) + "Return the value of MESSAGE." + (thread--message-value message)) + +(defun thread-message-send (message value) + "Set the VALUE of MESSAGE, and awaken all threads waiting for it." + (with-mutex (thread--message-mutex message) + (setf (thread--message-value message) value) + (condition-notify (thread--message-condition message) t))) + +(defun thread-message-cancel (message) + "Cancel MESSAGE by setting its value to nil." + (with-mutex (thread--message-mutex message) + (setf (thread--message-value message) nil))) + +(defun thread-message-wait (message &optional cancel) + "If MESSAGE's value is nil, block until it is set to something else. +Return the value of MESSAGE. If CANCEL is non-nil, clear MESSAGE +by setting its value to nil. If multiple threads are waiting on +the same message, and all pass a non-nil CANCEL, then only one +thread will unblock and receive the message's value, and the +others will continue to block." + (with-mutex (thread--message-mutex message) + (while (not (thread--message-value message)) + (condition-wait (thread--message-condition message))) + (let ((value (thread--message-value message))) + (when cancel + (setf (thread--message-value message) nil)) + value))) + +;;; Thread-safe queues + +(cl-defstruct (thread--queue + (:constructor + thread-make-queue (&optional + size-limit + type + &aux + (fifo (eq type 'fifo)) + (limit (when (natnump size-limit) size-limit)) + (mutex (make-mutex)) + (not-full (make-condition-variable mutex)) + (not-empty (make-condition-variable mutex))))) + fifo + limit + items + mutex + not-full + not-empty) + +(defun thread-queue-empty-p (queue) + "Return non-nil if QUEUE is empty. +There is no guarantee that QUEUE will contain the same number of +items the next time you access it." + (with-mutex (thread--queue-mutex queue) + (null (thread--queue-items queue)))) + +(defun thread-queue-full-p (queue) + "Return non-nil if QUEUE is full. +There is no guarantee that QUEUE will contain the same number of +items the next time you access it." + (when (thread--queue-limit queue) + (with-mutex (thread--queue-mutex queue) + (= (length (thread--queue-items queue)) (thread--queue-limit queue))))) + +(defun thread-queue-length (queue) + "Return the number of items in QUEUE. +There is no guarantee that QUEUE will contain the same number of +items the next time you access it." + (with-mutex (thread--queue-mutex queue) + (length (thread--queue-items queue)))) + +(defun thread-queue-remove-all (queue) + "Discard any items in QUEUE." + (with-mutex (thread--queue-mutex queue) + (setf (thread--queue-items queue) nil) + (condition-notify (thread--queue-not-full queue)))) + +(defun thread-queue-put (item queue) + "Put ITEM into QUEUE. +If QUEUE was created with a size limit, and already contains that many items, +block until one is removed." + (with-mutex (thread--queue-mutex queue) + (while (and (thread--queue-limit queue) + (= (length (thread--queue-items queue)) (thread--queue-limit queue))) + (condition-wait (thread--queue-not-full queue))) + (if (thread--queue-fifo queue) + (setf (thread--queue-items queue) + (nconc (thread--queue-items queue) (list item))) + (push item (thread--queue-items queue))) + (condition-notify (thread--queue-not-empty queue)))) + +(defun thread-queue-get (queue) + "Remove an item from QUEUE and return it. +If there are no items in QUEUE, block until one is added." + (with-mutex (thread--queue-mutex queue) + (while (null (thread--queue-items queue)) + (condition-wait (thread--queue-not-empty queue))) + (let ((item (pop (thread--queue-items queue)))) + (condition-notify (thread--queue-not-full queue)) + item))) + +;;; Bswitch bug + +(require 'seq) + +(defvar erb--job (thread-make-message) + "This contains all the information needed about what benchmark job to run. +It is created by `erb-run-start' and cleared when the benchmark job is +finished by `erb-run--benchmarker'.") + +(defvar erb--status nil + "The state of the benchmark runner. +Possible values are `idle', `building', `benchmarking' and +`done'. Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-commits nil + "The commits being processed by the benchmark runner.") + +(defvar erb--status-waiting-to-build nil + "The list of commits that have not yet started building. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-building nil + "The list of commits that are currently being built. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-built nil + "The list of commits that have been built. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-failed-builds nil + "The list of commits that failed to build. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-waiting-to-benchmark nil + "The commits that are built and waiting to be benchmarked. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-benchmarking nil + "The commit that is currently being benchmarked. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-benchmark-failures nil + "The commits that had task failures during benchmarking. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb--status-finished nil + "The commits that are finished benchmarking. +Maintained by `erb-run--monitor' and used to update the +`erb-run' buffer.") + +(defvar erb-run--cancel-request (thread-make-message) + "Signal the user's request that a benchmark job be stopped. +Cleared when the benchmark job is cleaned up, by +`erb-run-benchmarker'.") + + +;;; ERB Run Benchmarks mode + +(defvar erb-run-refresh-seconds 0.2 + "Delay between updates of `erb-run' buffers.") + +;; Options settable in the erb-run-mode buffer. +(defvar-local erb-run--commit-range '("commitA" "commitB" "commitC")) +(defvar-local erb-run--count-to-select nil) + +(defvar erb-run-mode-map + (let ((map (copy-keymap special-mode-map))) + (set-keymap-parent map button-buffer-map) + (define-key map "n" 'next-line) + (define-key map "p" 'previous-line) + (define-key map "s" 'erb-run-start) + (define-key map "c" 'erb-run-cancel) + (define-key map [follow-link] 'mouse-face) + (define-key map [mouse-2] 'mouse-select-window) + + map) + "Local keymap for `erb-run-mode' buffers.") + +(define-derived-mode erb-run-mode special-mode "ERB-run" + "Mode for configuring and running benchmarks. +\\<erb-run-mode-map> +ERB is documented in info node `(erb)'." + :group 'erb-mode + (setq-local revert-buffer-function #'erb-run-revert-buffer)) + +;; TODO make only one buffer? +(defun erb-run-generate-new-buffer () + (let* ((name (format "*ERB-run: %s*" (file-name-nondirectory + (directory-file-name default-directory)))) + (buffer (generate-new-buffer name))) + (with-current-buffer buffer + (erb-run-mode) + (run-at-time erb-run-refresh-seconds nil + #'erb-run--timer-func buffer) + (add-to-list 'uniquify-list-buffers-directory-modes 'erb-run-mode)) + buffer)) + +;; TODO autoload is just for now +;;;###autoload +(defun erb-summary-run () + "Switch to or create an `erb-run-mode' buffer for running benchmarks." + (interactive) + (let* ((dir default-directory) + (buffer (or (seq-find (lambda (buf) + (with-current-buffer buf + (eq major-mode 'erb-run-mode))) + (buffer-list)) + (erb-run-generate-new-buffer)))) + (switch-to-buffer buffer) + (erb-run-revert-buffer buffer))) + +(defun erb-run-revert-buffer (&rest _ignored) + (let ((inhibit-read-only t)) + (erase-buffer) + ;; (erb--update-config-cache) + (insert + (format "Commit range: %s\n" erb-run--commit-range) + (format "Number to select: %s\n" (if erb-run--count-to-select + erb-run--count-to-select "All")) + (if (memq erb--status '(idle done)) + "\nStart running benchmarks\n" + "\nCancel\n") + + "\n" + (format "Benchmarked: %s\n" (if erb--status-finished + (length erb--status-finished) "")) + (format "Build Failures: %s\n" (if (not (eq erb--status 'idle)) + (length erb--status-failed-builds) "")) + (format "Task Failures: %s\n" (if (not (eq erb--status 'idle)) + (length erb--status-benchmark-failures) "")) + "Total: %s/%s\n\n" + + "Started at: \n" + (cl-case erb--status + (building "Building.") + (benchmarking "Benchmarking.") + (idle "Ready.") + (done "Finished.") + (t (format "erb--status=%s" erb--status))) + "\nFinished at: \n\n") + + (unless (eq erb--status 'idle) + (dolist (commit erb--status-commits) + (insert + (cond + ((memq commit erb--status-finished) ".") + ((memq commit erb--status-building) "B") + ((memq commit erb--status-benchmark-failures) "F") + ((memq commit erb--status-failed-builds) "E") + ((memq commit erb--status-benchmarking) "@") + ((memq commit erb--status-built) "b") + (t "w")))) + (insert "\n")) + + (set-buffer-modified-p nil))) + +(defun erb-run--timer-func (buffer) + "Revert BUFFER and set a timer to do it again." + (when (buffer-live-p buffer) + (with-current-buffer buffer + (revert-buffer)) + (run-at-time erb-run-refresh-seconds nil + #'erb-run--timer-func buffer))) + +(defun erb-run-start () + "Start running benchmarks." + (interactive) + (when (thread-message-available-p erb--job) + (user-error "Benchmarks are already running")) + + (erb--start-benchmark-controller-thread) + (erb--start-builder-threads) + (erb--start-benchmark-monitor-thread) + + (unless erb-run--commit-range + (user-error "Choose a commit or range of commits to benchmark")) + + (erb--status-clear) + (thread-message-send erb--job erb-run--commit-range)) + +(defun erb-run-cancel () + "Stop running benchmarks." + (interactive) + (erb--status-clear)) + +;;; The benchmark runner: the controller thread + +(defvar erb-simultaneous-build-count 1 + "The number of builds to run simultaneously.") + +(defvar erb--builders nil + "The list of threads which have been created to run builds.") + +(defvar erb--unbuilt-commits (thread-make-queue) + "A thread-safe queue of commits waiting to be built.") +(defvar erb--built-commits (thread-make-queue) + "A thread-safe queue of commits which have been built.") + +(defvar erb--benchmark-controller nil) + +(defun erb--start-benchmark-controller-thread () + "Start the benchmark controller thread if it is not already started." + (unless erb--benchmark-controller + (setq erb--benchmark-controller + (make-thread #'erb--benchmark-control-func "control")))) + +(defun erb--benchmark-control-func () + "Process benchmark jobs. +Watch for incoming jobs arriving by a thread-safe message in +`erb--job'. When a job becomes available, build and +benchmark all the commits and then clear the message." + (while t + (condition-case err + (let* ((job (thread-message-wait erb--job)) + (count (length job)) + builds) + + ;; First, do all the builds. There may be more than one + ;; builder thread. + (erb--status-change 'erb--status 'building) + (dolist (commit job) + (erb--status-add commit 'erb--status-waiting-to-build) + (thread-queue-put commit erb--unbuilt-commits)) + + (while (> count 0) + (let ((build (thread-queue-get erb--built-commits))) + (push build builds)) + (cl-decf count)) + + (erb--status-change 'erb--status 'benchmarking) + ;; Then benchmark the build results, one at a time. + + (erb--status-change 'erb--status 'done) + (thread-message-cancel erb--job)) + ((error quit) (message "Error in ERB benchmark control thread: %s" err))))) + +;;; The benchmark runner: the builder threads + +;; TODO Watch for a signal to shut down + +(defun erb--start-builder-threads () + "Create the desired number of commit-building threads. +Get the number from `erb-run-simultaneous-build-count'. +TODO: adjust the number down as well as up." + (let ((needed (- erb-simultaneous-build-count + (length erb--builders)))) + (while (> needed 0) + (push (make-thread #'erb--builder-func + (format "builder %s" (length erb--builders))) + erb--builders) + (cl-decf needed)))) + +(defun erb--builder-func () + "Build commits from `erb-run--commmits-to-build'." + (while t + (condition-case err + (let ((commit (thread-queue-get erb--unbuilt-commits))) + (message "building %s" commit) + (erb--status-remove commit 'erb--status-waiting-to-build) + (erb--status-add commit 'erb--status-building) + + (unwind-protect + (sleep-for (random 10)) + + (erb--status-remove commit 'erb--status-building) + (erb--status-add commit 'erb--status-built) + (thread-queue-put commit erb--built-commits))) + + ((error quit) (message "Error in ERB benchmark build thread: %s" err))))) + +;;; The benchmark runner: the status monitor thread + +(defvar erb--benchmark-monitor nil + "The thread which keeps track of build and benchmark job status.") +(defvar erb--status-updates (thread-make-queue nil 'fifo) + "A thread-safe queue of status updates yet to be processed.") + +(defun erb--status-clear () + (thread-queue-put '(clear) erb--status-updates)) +(defun erb--status-change (symbol value) + (thread-queue-put `(change ,symbol ,value) erb--status-updates)) +(defun erb--status-add (value symbol) + (thread-queue-put `(add ,value ,symbol) erb--status-updates)) +(defun erb--status-remove (value symbol) + (thread-queue-put `(remove ,value ,symbol) erb--status-updates)) + +(defun erb--start-benchmark-monitor-thread () + "Start the benchmark monitor thread if it is not already started." + (unless erb--benchmark-monitor + (setq erb--benchmark-monitor + (make-thread #'erb--benchmark-monitor-func "monitor")))) + +(defvar erb--logging t) +(defun erb--benchmark-monitor-func () + "Process benchmark job status update. +Collect status update requests from `erb--status-updates' +and update the various global variables accordingly." + (while t + (condition-case err + (pcase (thread-queue-get erb--status-updates) + (`(clear) + (when erb--logging (message "** Clear **")) + (erb--clear-status)) + (`(add ,value ,variable) + (when erb--logging + (message "** Add %s to %s **" value variable)) + (push value (symbol-value variable))) + (`(remove ,value ,variable) + (when erb--logging + (message "** Remove %s from %s **" value variable)) + (setf (symbol-value variable) + (delq value (symbol-value variable)))) + (`(change ,variable ,value) + (when erb--logging + (message "** Set %s to %s **" variable value)) + (setf (symbol-value variable) value)) + (update (error "Unrecognized status update: %s" update))) + ((error quit) (message "Error in ERB benchmark control thread: %s" err))))) + +(defun erb--clear-status () + "Reset all the ERB benchmarking status variables to their initial state." + (setq erb--status 'idle + erb--status-waiting-to-build nil + erb--status-building nil + erb--status-built nil + erb--status-failed-builds nil + erb--status-waiting-to-benchmark nil + erb--status-benchmarking nil + erb--status-finished nil)) + + + (provide 'thread) ;;; thread.el ends here -- 2.16.4 [-- Attachment #3: Type: text/plain, Size: 5589 bytes --] Run Emacs with -Q, and then type: M-x erb-summary-run RET s Wait several seconds for the second to the last line in the buffer to change to "Finished". There will also be an echo area message about erb--status being set to done. Navigate to lisp/thread.el, select everything from the definition of erb--benchmark-monitor to the end of the file, and use: M-x eval-region RET Return to the buffer created by erb-summary-run, and type 's' again. Result: lisp.h:2241: Emacs fatal error: assertion failed: HASH_TABLE_P (a) My suspicion is that the garbage collector is freeing something needed by the blocked thread. Setting gc-cons-threshold to 500M before doing the steps above stops the error from happening. Here's the backtrace. While trying to sort out how to reproduce this, I also saw it segfault in Ffuncall, in styled_format, and in the Bswitch case of exec_byte_code just past where this error occurs, when it tries to access h->count. Thread 7 (Thread 0x7f1cd4dec700 (LWP 21837)): #0 terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:369 #1 0x00000000005a4d99 in die (msg=msg@entry=0x678d52 "HASH_TABLE_P (a)", file=file@entry=0x6768a5 "lisp.h", line=line@entry=2241) at alloc.c:7094 #2 0x00000000006122b5 in XHASH_TABLE (a=...) at lisp.h:2241 #3 exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=0, args=<optimized out>, args@entry=0x16eac38 <bss_sbrk_buffer+9926040>) at bytecode.c:1403 #4 0x00000000005cb972 in funcall_lambda (fun=..., nargs=nargs@entry=0, arg_vector=0x16eac38 <bss_sbrk_buffer+9926040>, arg_vector@entry=0x158ec58 <bss_sbrk_buffer+8500664>) at eval.c:3057 #5 0x00000000005c818b in Ffuncall (nargs=nargs@entry=1, args=args@entry=0x158ec50 <bss_sbrk_buffer+8500656>) at eval.c:2870 #6 0x000000000064443b in invoke_thread_function () at thread.c:684 #7 0x00000000005c728f in internal_condition_case ( bfun=bfun@entry=0x644400 <invoke_thread_function>, handlers=..., handlers@entry=XIL(0xc3c0), hfun=hfun@entry=0x644320 <record_thread_error>) at eval.c:1373 #8 0x0000000000644dd1 in run_thread (state=0x158ec30 <bss_sbrk_buffer+8500624>) at thread.c:723 #9 0x00007f1cebf602a7 in start_thread () from /nix/store/hwwqshlmazzjzj7yhrkyjydxamvvkfd3-glibc-2.26-131/lib/libpthread.so.0 #10 0x00007f1ceb5fd57f in clone () from /nix/store/hwwqshlmazzjzj7yhrkyjydxamvvkfd3-glibc-2.26-131/lib/libc.so.6 Thread 7 (Thread 0x7f1cd4dec700 (LWP 21837)): "erb--benchmark-monitor-func" (0x158ec58) In GNU Emacs 27.0.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.22.28) of 2018-10-09 built on sockeye Repository revision: 708444efad7a2ce1e309532898b844527e2d9c64 Windowing system distributor 'The X.Org Foundation', version 11.0.11906000 System Description: NixOS 18.03.git.bd06547 (Impala) Recent messages: For information about GNU Emacs and the GNU system, type C-h C-a. Configured using: 'configure --prefix=/home/gem/src/emacs/master/bin --with-modules --with-x-toolkit=gtk3 --with-xft --config-cache --enable-checking=yes,glyphs --enable-check-lisp-object-type' Configured features: XPM JPEG TIFF GIF PNG RSVG SOUND DBUS GSETTINGS GLIB NOTIFY LIBSELINUX GNUTLS LIBXML2 FREETYPE XFT ZLIB TOOLKIT_SCROLL_BARS GTK3 X11 XDBE XIM MODULES THREADS GMP Important settings: value of $EMACSLOADPATH: value of $LANG: en_US.UTF-8 locale-coding-system: utf-8-unix Major mode: Lisp Interaction Minor modes in effect: tooltip-mode: t global-eldoc-mode: t eldoc-mode: t electric-indent-mode: t mouse-wheel-mode: t tool-bar-mode: t menu-bar-mode: t file-name-shadow-mode: t global-font-lock-mode: t font-lock-mode: t blink-cursor-mode: t auto-composition-mode: t auto-encryption-mode: t auto-compression-mode: t line-number-mode: t transient-mark-mode: t Load-path shadows: None found. Features: (shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv bytecomp byte-compile cconv dired dired-loaddefs format-spec rfc822 mml easymenu mml-sec password-cache epa derived epg epg-config gnus-util rmail rmail-loaddefs time-date mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader cl-loaddefs cl-lib sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils elec-pair mule-util tooltip eldoc electric uniquify ediff-hook vc-hooks lisp-float-type mwheel term/x-win x-win term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe tabulated-list replace newcomment text-mode elisp-mode lisp-mode prog-mode register page menu-bar rfn-eshadow isearch timer select scroll-bar mouse jit-lock font-lock syntax facemenu font-core term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european ethiopic indian cyrillic chinese composite charscript charprop case-table epa-hook jka-cmpr-hook help simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs button faces cus-face macroexp files text-properties overlay sha1 md5 base64 format env code-pages mule custom widget hashtable-print-readable backquote threads dbusbind inotify dynamic-setting system-font-setting font-render-setting move-toolbar gtk x-toolkit x multi-tty make-network-process emacs) Memory information: ((conses 16 94967 9472) (symbols 48 20045 1) (strings 32 28456 1769) (string-bytes 1 816313) (vectors 16 14265) (vector-slots 8 504082 12268) (floats 8 47 70) (intervals 56 213 0) (buffers 992 11)) ^ permalink raw reply related [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-11 5:30 bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function Gemini Lasswell @ 2018-10-12 8:12 ` Eli Zaretskii 2018-10-12 20:02 ` Gemini Lasswell 2018-10-31 4:49 ` Paul Eggert 1 sibling, 1 reply; 34+ messages in thread From: Eli Zaretskii @ 2018-10-12 8:12 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014 > From: Gemini Lasswell <gazally@runbox.com> > Date: Wed, 10 Oct 2018 22:30:29 -0700 > > When I run some byte-compiled code which creates some threads, and then, > while a thread is blocked, interactively evaluate the function which > was used to create that thread, Emacs has a fatal error or segmentation > fault when the thread becomes unblocked. Can you please make a smaller stand-alone test case, which doesn't require patching Emacs? That will make it much easier to try reproducing the problem. > Thread 7 (Thread 0x7f1cd4dec700 (LWP 21837)): > #0 terminate_due_to_signal (sig=sig@entry=6, > backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:369 > #1 0x00000000005a4d99 in die (msg=msg@entry=0x678d52 "HASH_TABLE_P (a)", > file=file@entry=0x6768a5 "lisp.h", line=line@entry=2241) at alloc.c:7094 > #2 0x00000000006122b5 in XHASH_TABLE (a=...) at lisp.h:2241 > #3 exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., > nargs=nargs@entry=0, args=<optimized out>, > args@entry=0x16eac38 <bss_sbrk_buffer+9926040>) at bytecode.c:1403 > #4 0x00000000005cb972 in funcall_lambda (fun=..., nargs=nargs@entry=0, > arg_vector=0x16eac38 <bss_sbrk_buffer+9926040>, > arg_vector@entry=0x158ec58 <bss_sbrk_buffer+8500664>) at eval.c:3057 > #5 0x00000000005c818b in Ffuncall (nargs=nargs@entry=1, > args=args@entry=0x158ec50 <bss_sbrk_buffer+8500656>) at eval.c:2870 > #6 0x000000000064443b in invoke_thread_function () at thread.c:684 > #7 0x00000000005c728f in internal_condition_case ( > bfun=bfun@entry=0x644400 <invoke_thread_function>, handlers=..., > handlers@entry=XIL(0xc3c0), hfun=hfun@entry=0x644320 <record_thread_error>) > at eval.c:1373 > #8 0x0000000000644dd1 in run_thread (state=0x158ec30 <bss_sbrk_buffer+8500624>) Can you show the Lisp backtrace of this thread? Also, what is the offending object 'a' in this frame: > #2 0x00000000006122b5 in XHASH_TABLE (a=...) at lisp.h:2241 and what was its parent object in the calling frame? Thanks. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-12 8:12 ` Eli Zaretskii @ 2018-10-12 20:02 ` Gemini Lasswell 2018-10-13 6:23 ` Eli Zaretskii 0 siblings, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-12 20:02 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014 Eli Zaretskii <eliz@gnu.org> writes: > Can you please make a smaller stand-alone test case, which doesn't > require patching Emacs? That will make it much easier to try > reproducing the problem. I've tried to do that without success. The bug won't reproduce if I put all the code added to thread.el by the patch into its own file and load it with C-u M-x byte-compile-file, and it also doesn't work to put the resulting .elc on my load-path and load it with require. I've determined today that having -O2 in CFLAGS is necessary to reproduce the bug, and that -O1 or -O0 won't do it. > Can you show the Lisp backtrace of this thread? Also, what is the > offending object 'a' in this frame: The Lisp backtrace is really short: Thread 7 (Thread 0x7f1cd4dec700 (LWP 21837)): "erb--benchmark-monitor-func" (0x158ec58) >> #2 0x00000000006122b5 in XHASH_TABLE (a=...) at lisp.h:2241 > > and what was its parent object in the calling frame? Those are both optimized out with -O2. I recompiled bytecode.c with "volatile" on the declaration of jmp_table, and got this: (gdb) up 3 #3 exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=0, args=<optimized out>, args@entry=0x16eacf8 <bss_sbrk_buffer+9926232>) at bytecode.c:1403 1403 struct Lisp_Hash_Table *h = XHASH_TABLE (jmp_table); (gdb) p jmp_table $1 = make_number(514) (gdb) p *top $3 = XIL(0x42b4d0) (gdb) pp *top remove Then I started looking at other variables in exec_byte_code, and found this which didn't look right: (gdb) p *vectorp $13 = XIL(0x7f4934009523) (gdb) pr (((help-menu "Help" keymap (emacs-tutorial menu-item "Emacs Tutorial" help-with-tutorial :help "Lear ?\207" [yank-menu kill-ring buffer-read-only gui-backend-selection-exists-p CLIPBOARD featurep ns] 2 \205^Q^@ÅÆ!\207" [visual-line-mode word-wrap truncate-lines 0 nil toggle-truncate-lines -1] 2 nil ni (I've truncated the result of printing *vectorp since each line is over 5000 characters long.) Since that looked like it was unlikely to be the original value of *vectorp, I started a new debugging session and stepped through Thread 7's call to exec_byte_code for erb--benchmark-monitor-func, and determined that *vectorp's initial value was erb--status-updates, which matches the first element of the constants vector in (symbol-function 'erb--benchmark-monitor-func). The value of vectorp was 0x16eac38 so I set a watchpoint on *(EMACS_INT *) 0x16eac38 and continued, and then during the execution of eval-region it triggered here: Thread 1 "monitor" hit Hardware watchpoint 7: *(EMACS_INT *) 0x16eac38 Old value = 60897760 New value = 24075314 setup_on_free_list (v=v@entry=0x16eac30 <bss_sbrk_buffer+9926032>, nbytes=nbytes@entry=272) at alloc.c:3060 3060 total_free_vector_slots += nbytes / word_size; (gdb) bt 10 #0 setup_on_free_list (v=v@entry=0x16eac30 <bss_sbrk_buffer+9926032>, nbytes=nbytes@entry=272) at alloc.c:3060 #1 0x00000000005a9a24 in sweep_vectors () at alloc.c:3297 #2 0x00000000005adb2e in gc_sweep () at alloc.c:6872 #3 garbage_collect_1 (end=<optimized out>) at alloc.c:5860 #4 Fgarbage_collect () at alloc.c:5989 #5 0x00000000005ca478 in maybe_gc () at lisp.h:4804 #6 Ffuncall (nargs=4, args=args@entry=0x7fff210a3bc8) at eval.c:2838 #7 0x0000000000611e00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=2, args=<optimized out>, args@entry=0x9bd128 <pure+781288>) at bytecode.c:632 #8 0x00000000005cdd32 in funcall_lambda (fun=XIL(0x7fff210a3bc8), nargs=nargs@entry=2, arg_vector=0x9bd128 <pure+781288>, arg_vector@entry=0x7fff210a3f00) at eval.c:3057 #9 0x00000000005ca54b in Ffuncall (nargs=3, args=args@entry=0x7fff210a3ef8) at eval.c:2870 (More stack frames follow...) Note that just as was happening when we were working through bug#32357, the thread names which gdb prints are wrong, which I verified with: (gdb) p current_thread $21 = (struct thread_state *) 0xd73480 <main_thread> (gdb) p current_thread->name $22 = XIL(0) Am I correct that the next step is to figure out why the garbage collector is not marking this vector? Presumably it's no longer attached to the function definition for erb--benchmark-monitor-func by the time the garbage collector runs, but it's supposed to be found by mark_stack when called from mark_one_thread for Thread 7, right? ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-12 20:02 ` Gemini Lasswell @ 2018-10-13 6:23 ` Eli Zaretskii 2018-10-13 17:17 ` Gemini Lasswell 0 siblings, 1 reply; 34+ messages in thread From: Eli Zaretskii @ 2018-10-13 6:23 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014 > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org > Date: Fri, 12 Oct 2018 13:02:56 -0700 > > I've tried to do that without success. The bug won't reproduce if I put > all the code added to thread.el by the patch into its own file and load > it with C-u M-x byte-compile-file, and it also doesn't work to put the > resulting .elc on my load-path and load it with require. Did you try loading it as a .el file? Anyway, it's too bad that the reproduction is so Heisenbug-like. It probably won't reproduce on my system anyway. > I've determined today that having -O2 in CFLAGS is necessary to > reproduce the bug, and that -O1 or -O0 won't do it. One more reason why reproduction elsewhere is probably hard. > The Lisp backtrace is really short: > > Thread 7 (Thread 0x7f1cd4dec700 (LWP 21837)): > "erb--benchmark-monitor-func" (0x158ec58) If you succeed in reproducing this when this code is loaded uncompiled, the backtrace might be more helpful. > >> #2 0x00000000006122b5 in XHASH_TABLE (a=...) at lisp.h:2241 > > > > and what was its parent object in the calling frame? > > Those are both optimized out with -O2. I recompiled bytecode.c with > "volatile" on the declaration of jmp_table, and got this: > > (gdb) up 3 > #3 exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., > nargs=nargs@entry=0, args=<optimized out>, > args@entry=0x16eacf8 <bss_sbrk_buffer+9926232>) at bytecode.c:1403 > 1403 struct Lisp_Hash_Table *h = XHASH_TABLE (jmp_table); > (gdb) p jmp_table > $1 = make_number(514) > (gdb) p *top > $3 = XIL(0x42b4d0) > (gdb) pp *top > remove Which one of these is the one that triggers the assertion violation? > Thread 1 "monitor" hit Hardware watchpoint 7: *(EMACS_INT *) 0x16eac38 > > Old value = 60897760 > New value = 24075314 > setup_on_free_list (v=v@entry=0x16eac30 <bss_sbrk_buffer+9926032>, > nbytes=nbytes@entry=272) at alloc.c:3060 > 3060 total_free_vector_slots += nbytes / word_size; > (gdb) bt 10 > #0 setup_on_free_list (v=v@entry=0x16eac30 <bss_sbrk_buffer+9926032>, > nbytes=nbytes@entry=272) at alloc.c:3060 > #1 0x00000000005a9a24 in sweep_vectors () at alloc.c:3297 > #2 0x00000000005adb2e in gc_sweep () at alloc.c:6872 > #3 garbage_collect_1 (end=<optimized out>) at alloc.c:5860 > #4 Fgarbage_collect () at alloc.c:5989 > #5 0x00000000005ca478 in maybe_gc () at lisp.h:4804 > #6 Ffuncall (nargs=4, args=args@entry=0x7fff210a3bc8) at eval.c:2838 > #7 0x0000000000611e00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., > args_template=..., nargs=nargs@entry=2, args=<optimized out>, > args@entry=0x9bd128 <pure+781288>) at bytecode.c:632 > #8 0x00000000005cdd32 in funcall_lambda (fun=XIL(0x7fff210a3bc8), > nargs=nargs@entry=2, arg_vector=0x9bd128 <pure+781288>, > arg_vector@entry=0x7fff210a3f00) at eval.c:3057 > #9 0x00000000005ca54b in Ffuncall (nargs=3, args=args@entry=0x7fff210a3ef8) > at eval.c:2870 > (More stack frames follow...) Can you show the Lisp backtrace for the above? > Note that just as was happening when we were working through bug#32357, > the thread names which gdb prints are wrong, which I verified with: Looks like a bug in pthreads version of sys_thread_create: it calls prctl with first arg PR_SET_NAME, but my reading of the documentation is that such a call gives the name to the _calling_ thread, which is not the thread just created. We should instead call pthread_setname_np, I think (but I'm not an expert on pthreads). > Am I correct that the next step is to figure out why the garbage > collector is not marking this vector? Presumably it's no longer > attached to the function definition for erb--benchmark-monitor-func by > the time the garbage collector runs, but it's supposed to be found by > mark_stack when called from mark_one_thread for Thread 7, right? Is this vector the byte-code of erb--benchmark-monitor-func? If so, how come it is no longer attached to the function, as long as the function does exist? And if this vector isn't the byte-code of erb--benchmark-monitor-func, then what is it? IMO, we cannot reason about what GC does or doesn't do until we understand what data structure it processes, and what is the relation of that data structure to the symbols in your program and in Emacs. Thanks. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-13 6:23 ` Eli Zaretskii @ 2018-10-13 17:17 ` Gemini Lasswell 2018-10-13 18:04 ` Eli Zaretskii 0 siblings, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-13 17:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014 [-- Attachment #1: Type: text/plain, Size: 553 bytes --] Eli Zaretskii <eliz@gnu.org> writes: > Did you try loading it as a .el file? Yes, but I couldn't reproduce the bug. >> The Lisp backtrace is really short: >> >> Thread 7 (Thread 0x7f1cd4dec700 (LWP 21837)): >> "erb--benchmark-monitor-func" (0x158ec58) > > If you succeed in reproducing this when this code is loaded > uncompiled, the backtrace might be more helpful. The assertion happens in the Bswitch case of exec_byte_code when it's running erb--benchmark-monitor-func, and there's only one 'switch' in the disassembled bytecode, see line 16: [-- Attachment #2: bytecode.txt --] [-- Type: text/plain, Size: 4386 bytes --] byte code for erb--benchmark-monitor-func: doc: Process benchmark job status update. ... args: nil 0:1 constant (error quit) 1 pushconditioncase 34 4 constant thread-queue-get 5 varref erb--status-updates 6 call 1 7 dup 8 consp 9 goto-if-nil 32 12 dup 13 car 14 dup 15 constant <jump-table-eq (clear 2 add 6 remove 14 change 22)> 16 switch 17 goto 30 20:2 stack-ref 1 21 cdr 22 dup 23 goto-if-not-nil 4 26 varref erb--logging 27 goto-if-nil 3 30 constant message 31 constant "** Clear **" 32 call 1 33 discard 34:3 constant erb--clear-status 35 call 0 36 goto 5 39:4 stack-ref 2 40 constant error 41 constant "Unrecognized status update: %s" 42 stack-ref 2 43 call 2 44 stack-set 1 46:5 stack-set 1 48 goto 31 51:6 stack-ref 1 52 cdr 53 dup 54 consp 55 goto-if-nil 12 58 dup 59 car 60 stack-ref 1 61 cdr 62 dup 63 consp 64 goto-if-nil 10 67 dup 68 car 69 stack-ref 1 70 cdr 71 dup 72 goto-if-not-nil 8 75 stack-ref 1 76 stack-ref 4 77 varref erb--logging 78 goto-if-nil 7 81 constant message 82 constant "** Add %s to %s **" 83 stack-ref 2 84 stack-ref 4 85 call 3 86 discard 87:7 stack-ref 1 88 dup 89 stack-ref 2 90 stack-ref 2 91 symbol-value 92 cons 93 set 94 stack-set 1 96 discardN-preserve-tos 2 98 goto 9 101:8 stack-ref 6 103 constant error 104 constant "Unrecognized status update: %s" 105 stack-ref 2 106 call 2 107 stack-set 1 109:9 discardN-preserve-tos 2 111 goto 11 114:10 stack-ref 4 115 constant error 116 constant "Unrecognized status update: %s" 117 stack-ref 2 118 call 2 119 stack-set 1 121:11 discardN-preserve-tos 2 123 goto 13 126:12 stack-ref 2 127 constant error 128 constant "Unrecognized status update: %s" 129 stack-ref 2 130 call 2 131 stack-set 1 133:13 stack-set 1 135 goto 31 138:14 stack-ref 1 139 cdr 140 dup 141 consp 142 goto-if-nil 20 145 dup 146 car 147 stack-ref 1 148 cdr 149 dup 150 consp 151 goto-if-nil 18 154 dup 155 car 156 stack-ref 1 157 cdr 158 dup 159 goto-if-not-nil 16 162 stack-ref 1 163 stack-ref 4 164 varref erb--logging 165 goto-if-nil 15 168 constant message 169 constant "** Remove %s from %s **" 170 stack-ref 2 171 stack-ref 4 172 call 3 173 discard 174:15 stack-ref 1 175 dup 176 constant delq 177 stack-ref 3 178 stack-ref 5 179 symbol-value 180 call 2 181 set 182 stack-set 1 184 discardN-preserve-tos 2 186 goto 17 189:16 stack-ref 6 191 constant error 192 constant "Unrecognized status update: %s" 193 stack-ref 2 194 call 2 195 stack-set 1 197:17 discardN-preserve-tos 2 199 goto 19 202:18 stack-ref 4 203 constant error 204 constant "Unrecognized status update: %s" 205 stack-ref 2 206 call 2 207 stack-set 1 209:19 discardN-preserve-tos 2 211 goto 21 214:20 stack-ref 2 215 constant error 216 constant "Unrecognized status update: %s" 217 stack-ref 2 218 call 2 219 stack-set 1 221:21 stack-set 1 223 goto 31 226:22 stack-ref 1 227 cdr 228 dup 229 consp 230 goto-if-nil 28 233 dup 234 car 235 stack-ref 1 236 cdr 237 dup 238 consp 239 goto-if-nil 26 242 dup 243 car 244 stack-ref 1 245 cdr 246 dup 247 goto-if-not-nil 24 250 stack-ref 1 251 stack-ref 4 252 varref erb--logging 253 goto-if-nil 23 256 constant message 257 constant "** Set %s to %s **" 258 stack-ref 2 259 stack-ref 4 260 call 3 261 discard 262:23 dup 263 dup 264 stack-ref 3 265 set 266 stack-set 1 268 discardN-preserve-tos 2 270 goto 25 273:24 stack-ref 6 275 constant error 276 constant "Unrecognized status update: %s" 277 stack-ref 2 278 call 2 279 stack-set 1 281:25 discardN-preserve-tos 2 283 goto 27 286:26 stack-ref 4 287 constant error 288 constant "Unrecognized status update: %s" 289 stack-ref 2 290 call 2 291 stack-set 1 293:27 discardN-preserve-tos 2 295 goto 29 298:28 stack-ref 2 299 constant error 300 constant "Unrecognized status update: %s" 301 stack-ref 2 302 call 2 303 stack-set 1 305:29 stack-set 1 307 goto 31 310:30 stack-ref 1 311 constant error 312 constant "Unrecognized status update: %s" 313 stack-ref 2 314 call 2 315 stack-set 1 317:31 stack-set 1 319 goto 33 322:32 dup 323 constant error 324 constant "Unrecognized status update: %s" 325 stack-ref 2 326 call 2 327 stack-set 1 329:33 stack-set 1 331 pophandler 332 goto 35 335:34 constant message 336 constant "Error in ERB benchmark control thread: %s" 337 stack-ref 2 338 call 2 339 stack-set 1 341:35 discard 342 goto 1 345 return [-- Attachment #3: Type: text/plain, Size: 3123 bytes --] >> (gdb) p jmp_table >> $1 = make_number(514) >> (gdb) p *top >> $3 = XIL(0x42b4d0) >> (gdb) pp *top >> remove > > Which one of these is the one that triggers the assertion violation? jmp_table. The assertion violation is at line 1403 in bytecode.c: struct Lisp_Hash_Table *h = XHASH_TABLE (jmp_table); >> Thread 1 "monitor" hit Hardware watchpoint 7: *(EMACS_INT *) 0x16eac38 >> >> Old value = 60897760 >> New value = 24075314 >> setup_on_free_list (v=v@entry=0x16eac30 <bss_sbrk_buffer+9926032>, >> nbytes=nbytes@entry=272) at alloc.c:3060 >> 3060 total_free_vector_slots += nbytes / word_size; >> (gdb) bt 10 >> #0 setup_on_free_list (v=v@entry=0x16eac30 <bss_sbrk_buffer+9926032>, >> nbytes=nbytes@entry=272) at alloc.c:3060 >> #1 0x00000000005a9a24 in sweep_vectors () at alloc.c:3297 >> #2 0x00000000005adb2e in gc_sweep () at alloc.c:6872 >> #3 garbage_collect_1 (end=<optimized out>) at alloc.c:5860 >> #4 Fgarbage_collect () at alloc.c:5989 >> #5 0x00000000005ca478 in maybe_gc () at lisp.h:4804 >> #6 Ffuncall (nargs=4, args=args@entry=0x7fff210a3bc8) at eval.c:2838 >> #7 0x0000000000611e00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., >> args_template=..., nargs=nargs@entry=2, args=<optimized out>, >> args@entry=0x9bd128 <pure+781288>) at bytecode.c:632 >> #8 0x00000000005cdd32 in funcall_lambda (fun=XIL(0x7fff210a3bc8), >> nargs=nargs@entry=2, arg_vector=0x9bd128 <pure+781288>, >> arg_vector@entry=0x7fff210a3f00) at eval.c:3057 >> #9 0x00000000005ca54b in Ffuncall (nargs=3, args=args@entry=0x7fff210a3ef8) >> at eval.c:2870 >> (More stack frames follow...) > > Can you show the Lisp backtrace for the above? (gdb) xbacktrace "Automatic GC" (0x0) "string-match" (0x210a3bd0) "completion-pcm--string->pattern" (0x210a3f00) "completion-pcm--find-all-completions" (0x210a43a0) "completion-pcm-try-completion" (0x210a4668) 0x1723c30 PVEC_COMPILED "completion--some" (0x210a4b60) "completion--nth-completion" (0x210a4e68) "completion-try-completion" (0x210a50f0) "execute-extended-command--shorter" (0x210a5390) "execute-extended-command" (0x210a5760) "funcall-interactively" (0x210a5758) "call-interactively" (0x210a5a90) "command-execute" (0x210a5d48) >> Am I correct that the next step is to figure out why the garbage >> collector is not marking this vector? Presumably it's no longer >> attached to the function definition for erb--benchmark-monitor-func by >> the time the garbage collector runs, but it's supposed to be found by >> mark_stack when called from mark_one_thread for Thread 7, right? > > Is this vector the byte-code of erb--benchmark-monitor-func? If so, > how come it is no longer attached to the function, as long as the > function does exist? This vector is the constants vector for the byte-code of erb--benchmark-monitor-func. When eval-region evaluates the defun for erb--benchmark-monitor-func, it replaces the symbol's function definition, so it removes that reference to the byte-code. AFAIK the only other reference to the byte-code is on the stack of Thread 7, which is running the byte-code. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-13 17:17 ` Gemini Lasswell @ 2018-10-13 18:04 ` Eli Zaretskii 2018-10-14 19:29 ` Gemini Lasswell 2018-10-14 19:46 ` Andreas Schwab 0 siblings, 2 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-13 18:04 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014 > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org > Date: Sat, 13 Oct 2018 10:17:10 -0700 > > When eval-region evaluates the defun for erb--benchmark-monitor-func, it > replaces the symbol's function definition, so it removes that reference > to the byte-code. AFAIK the only other reference to the byte-code > is on the stack of Thread 7, which is running the byte-code. So you are saying that the call to mark_stack inside mark_one_thread doesn't do its job well enough? AFAIU, it's supposed to scan the stack of each and every thread, and mark Lisp objects referenced from those stacks. How do we know there's a reference to that vector on thread 7's stack? Could it be that there is no reference at all? ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-13 18:04 ` Eli Zaretskii @ 2018-10-14 19:29 ` Gemini Lasswell 2018-10-15 2:37 ` Eli Zaretskii 2018-10-14 19:46 ` Andreas Schwab 1 sibling, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-14 19:29 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014 Eli Zaretskii <eliz@gnu.org> writes: > How do we know there's a reference to that vector on thread 7's stack? > Could it be that there is no reference at all? Yes it could be that the reference is getting optimized out. I asked gdb for more detail on the stack frames for exec_byte_code and funcall_lambda, and the arguments referring to the byte-code object and its components do appear to be optimized out, see below. I also tried adding 'volatile' to the declaration of the local variable 'fun' in Ffuncall, and that made the bug go away. Is there anything else I should be looking at before concluding that this is the problem? And if it is, what is the best way to fix it? lisp.h:2241: Emacs fatal error: assertion failed: HASH_TABLE_P (a) [Switching to Thread 0x7f6eca4b9700 (LWP 17729)] Thread 7 "builder 0" hit Breakpoint 1, terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:369 369 { (gdb) bt #0 terminate_due_to_signal (sig=sig@entry=6, backtrace_limit=backtrace_limit@entry=2147483647) at emacs.c:369 #1 0x00000000005a7159 in die (msg=msg@entry=0x67b132 "HASH_TABLE_P (a)", file=file@entry=0x678c85 "lisp.h", line=line@entry=2241) at alloc.c:7094 #2 0x0000000000614685 in XHASH_TABLE (a=...) at lisp.h:2241 #3 exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=0, args=<optimized out>, args@entry=0x16eac38 <bss_sbrk_buffer+9926040>) at bytecode.c:1403 #4 0x00000000005cdd32 in funcall_lambda (fun=XIL(0x7f6eca4b8470), nargs=nargs@entry=0, arg_vector=0x16eac38 <bss_sbrk_buffer+9926040>, arg_vector@entry=0x1574c58 <bss_sbrk_buffer+8394168>) at eval.c:3057 #5 0x00000000005ca54b in Ffuncall (nargs=nargs@entry=1, args=args@entry=0x1574c50 <bss_sbrk_buffer+8394160>) at eval.c:2870 #6 0x000000000064680b in invoke_thread_function () at thread.c:684 #7 0x00000000005c964f in internal_condition_case ( bfun=bfun@entry=0x6467d0 <invoke_thread_function>, handlers=..., handlers@entry=XIL(0xc3c0), hfun=hfun@entry=0x6466f0 <record_thread_error>) at eval.c:1373 #8 0x00000000006471a1 in run_thread (state=0x1574c30 <bss_sbrk_buffer+8394128>) at thread.c:723 #9 0x00007f6eea5cb5a7 in start_thread () from /nix/store/fg4yq8i8wd08xg3fy58l6q73cjy8hjr2-glibc-2.27/lib/libpthread.so.0 #10 0x00007f6ee9c6622f in clone () from /nix/store/fg4yq8i8wd08xg3fy58l6q73cjy8hjr2-glibc-2.27/lib/libc.so.6 (gdb) set print frame-arguments all (gdb) info frame 3 Stack frame at 0x7f6eca4b87d0: rip = 0x614685 in exec_byte_code (bytecode.c:1403); saved rip = 0x5cdd32 called by frame at 0x7f6eca4b87d0, caller of frame at 0x7f6eca4b87d0 source language c. Arglist at 0x7f6eca4b87c0, args: bytestr=<optimized out>, vector=<optimized out>, maxdepth=<optimized out>, args_template=<optimized out>, nargs=nargs@entry=0, args=<optimized out>, args@entry=0x16eac38 <bss_sbrk_buffer+9926040> Locals at 0x7f6eca4b87c0, Previous frame's sp is 0x7f6eca4b87c8 Saved registers: rbx at 0x7f6eca4b8798, rbp at 0x7f6eca4b87c0, r12 at 0x7f6eca4b87a0, r13 at 0x7f6eca4b87a8, r14 at 0x7f6eca4b87b0, r15 at 0x7f6eca4b87b8 (gdb) info frame 4 Stack frame at 0x7f6eca4b87d0: rip = 0x5cdd32 in funcall_lambda (eval.c:3057); saved rip = 0x64680b tail call frame, caller of frame at 0x7f6eca4b87d0 source language c. Arglist at unknown address. Locals at unknown address, Previous frame's sp is 0x7f6eca4b87d0 ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-14 19:29 ` Gemini Lasswell @ 2018-10-15 2:37 ` Eli Zaretskii 0 siblings, 0 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-15 2:37 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014 > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org > Date: Sun, 14 Oct 2018 12:29:42 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > > How do we know there's a reference to that vector on thread 7's stack? > > Could it be that there is no reference at all? > > Yes it could be that the reference is getting optimized out. I asked > gdb for more detail on the stack frames for exec_byte_code and > funcall_lambda, and the arguments referring to the byte-code object and > its components do appear to be optimized out, see below. I also tried > adding 'volatile' to the declaration of the local variable 'fun' in > Ffuncall, and that made the bug go away. "Optimized out" is GDB's way of saying it's confused by the complex way a variable's location changes as the program counter advances. It doesn't mean the variable is lost, just that GDB lost its track. > Is there anything else I should be looking at before concluding that > this is the problem? And if it is, what is the best way to fix it? There's the question Andreas asked, we should look into that, I think. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-13 18:04 ` Eli Zaretskii 2018-10-14 19:29 ` Gemini Lasswell @ 2018-10-14 19:46 ` Andreas Schwab 2018-10-15 14:59 ` Eli Zaretskii 1 sibling, 1 reply; 34+ messages in thread From: Andreas Schwab @ 2018-10-14 19:46 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Gemini Lasswell, 33014 On Okt 13 2018, Eli Zaretskii <eliz@gnu.org> wrote: > So you are saying that the call to mark_stack inside mark_one_thread > doesn't do its job well enough? AFAIU, it's supposed to scan the > stack of each and every thread, and mark Lisp objects referenced from > those stacks. > > How do we know there's a reference to that vector on thread 7's stack? > Could it be that there is no reference at all? Do we actually mark the registers of the threads as gc roots? Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-14 19:46 ` Andreas Schwab @ 2018-10-15 14:59 ` Eli Zaretskii 2018-10-15 16:22 ` Gemini Lasswell 2018-10-16 18:46 ` Gemini Lasswell 0 siblings, 2 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-15 14:59 UTC (permalink / raw) To: Andreas Schwab; +Cc: gazally, 33014 > From: Andreas Schwab <schwab@linux-m68k.org> > Cc: Gemini Lasswell <gazally@runbox.com>, 33014@debbugs.gnu.org > Date: Sun, 14 Oct 2018 21:46:35 +0200 > > Do we actually mark the registers of the threads as gc roots? According to my reading of the code, we do. Each time a running thread is about to release the global lock, we call flush_stack_call_func, which is supposed to flush relevant registers to the stack of that thread. And mark_one_thread marks the stack of each thread, so it should be able to see the Lisp objects on that stack. In this case, the function whose bytecode seems to be GC'ed is the thread function itself. That function is also marked, as part of marking the thread object itself, although, of course, re-evaluating the function will redefine the function. But, if my reading of exec_byte_code is correct, the bytecode should be on the stack and in registers while we execute it, so even though the bytecode gets disconnected from the function, it is still reachable from the stack, and should have been marked... Could this be some bug in the implementation of __builtin_unwind_init etc., which causes it not to save some registers under some conditions? Gemini, what version of the compiler are you using? ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-15 14:59 ` Eli Zaretskii @ 2018-10-15 16:22 ` Gemini Lasswell 2018-10-15 16:41 ` Eli Zaretskii 2018-10-16 18:46 ` Gemini Lasswell 1 sibling, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-15 16:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014, Andreas Schwab Eli Zaretskii <eliz@gnu.org> writes: > Could this be some bug in the implementation of __builtin_unwind_init > etc., which causes it not to save some registers under some > conditions? Gemini, what version of the compiler are you using? 7.3.0 ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-15 16:22 ` Gemini Lasswell @ 2018-10-15 16:41 ` Eli Zaretskii 0 siblings, 0 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-15 16:41 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014, schwab > From: Gemini Lasswell <gazally@runbox.com> > Cc: Andreas Schwab <schwab@linux-m68k.org>, 33014@debbugs.gnu.org > Date: Mon, 15 Oct 2018 09:22:46 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Could this be some bug in the implementation of __builtin_unwind_init > > etc., which causes it not to save some registers under some > > conditions? Gemini, what version of the compiler are you using? > > 7.3.0 Then it's unlikely, I think. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-15 14:59 ` Eli Zaretskii 2018-10-15 16:22 ` Gemini Lasswell @ 2018-10-16 18:46 ` Gemini Lasswell 2018-10-16 19:25 ` Eli Zaretskii 2018-10-17 16:21 ` Eli Zaretskii 1 sibling, 2 replies; 34+ messages in thread From: Gemini Lasswell @ 2018-10-16 18:46 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014, Andreas Schwab Eli Zaretskii <eliz@gnu.org> writes: > In this case, the function whose bytecode seems to be GC'ed is the > thread function itself. That function is also marked, as part of > marking the thread object itself, although, of course, re-evaluating > the function will redefine the function. But, if my reading of > exec_byte_code is correct, the bytecode should be on the stack and in > registers while we execute it, so even though the bytecode gets > disconnected from the function, it is still reachable from the stack, > and should have been marked... My knowledge of what gcc does and how the code it generates works is superficial, but I don't see why an optimizer would find it necessary to save the following values: - The value of 'fun' in Ffuncall after it is used as an argument for funcall_lambda. - The value of 'fun' in funcall_lambda after it is used to calculate the arguments to exec_byte_code. - The value of 'vector' in exec_byte_code after the calculation of vectorp. These three are the only variables that I see in Thread 7 from which the garbage collector could find the constants vector which it's not finding. If gcc's optimizer puts them all in registers instead of on the stack because it knows it won't need them later, those registers will be overwritten with other values by recursive calls before flush_stack_call_func is called. Here's the backtrace of where Thread 7 is stopped while Thread 1 is running garbage collection, in which the three frames I'm talking about above are 10, 11 and 12: (gdb) thread apply 7 bt Thread 7 (Thread 0x7fecdacdd700 (LWP 5509)): #0 0x00007fecf771f592 in pthread_cond_wait@@GLIBC_2.3.2 () from /nix/store/fg4yq8i8wd08xg3fy58l6q73cjy8hjr2-glibc-2.27/lib/libpthread.so.0 #1 0x0000000000648389 in sys_cond_wait (cond=cond@entry=0x16e9c48 <bss_sbrk_buffer+9921960>, mutex=mutex@entry=0xd73440 <global_lock>) at systhread.c:163 #2 0x0000000000647747 in condition_wait_callback (arg=0x16e9c30 <bss_sbrk_buffer+9921936>) at thread.c:410 #3 0x00000000005a9608 in flush_stack_call_func (func=func@entry=0x647630 <condition_wait_callback>, arg=<optimized out>) at alloc.c:5021 #4 0x0000000000646e1d in Fcondition_wait (cond=<optimized out>) at thread.c:449 #5 0x00000000005cc49e in funcall_subr (subr=0xcdc1c0 <Scondition_wait>, numargs=numargs@entry=1, args=args@entry=0x7fecdacdc1c0) at eval.c:2931 #6 0x00000000005ca661 in Ffuncall (nargs=2, args=args@entry=0x7fecdacdc1b8) at eval.c:2856 #7 0x0000000000611e00 in exec_byte_code (bytestr=<optimized out>, vector=<optimized out>, maxdepth=<optimized out>, args_template=<optimized out>, nargs=nargs@entry=1, args=<optimized out>, args@entry=0x11bde68 <bss_sbrk_buffer+4499400>) at bytecode.c:632 #8 0x00000000005cdd32 in funcall_lambda (fun=XIL(0x7fecdacdc1b8), nargs=nargs@entry=1, arg_vector=0x11bde68 <bss_sbrk_buffer+4499400>, arg_vector@entry=0x7fecdacdc470) at eval.c:3057 #9 0x00000000005ca54b in Ffuncall (nargs=2, args=args@entry=0x7fecdacdc468) at eval.c:2870 #10 0x0000000000611e00 in exec_byte_code (bytestr=<optimized out>, vector=<optimized out>, maxdepth=<optimized out>, args_template=<optimized out>, nargs=nargs@entry=0, args=<optimized out>, args@entry=0x16eac38 <bss_sbrk_buffer+9926040>) at bytecode.c:632 #11 0x00000000005cdd32 in funcall_lambda (fun=XIL(0x7fecdacdc468), nargs=nargs@entry=0, arg_vector=0x16eac38 <bss_sbrk_buffer+9926040>, arg_vector@entry=0x1578c58 <bss_sbrk_buffer+8410552>) at eval.c:3057 #12 0x00000000005ca54b in Ffuncall (nargs=nargs@entry=1, args=args@entry=0x1578c50 <bss_sbrk_buffer+8410544>) at eval.c:2870 #13 0x000000000064680b in invoke_thread_function () at thread.c:684 #14 0x00000000005c964f in internal_condition_case (bfun=bfun@entry=0x6467d0 <invoke_thread_function>, handlers=<optimized out>, handlers@entry=XIL(0xc3c0), hfun=hfun@entry=0x6466f0 <record_thread_error>) at eval.c:1373 #15 0x00000000006471a1 in run_thread (state=0x1578c30 <bss_sbrk_buffer+8410512>) at thread.c:723 #16 0x00007fecf77195a7 in start_thread () from /nix/store/fg4yq8i8wd08xg3fy58l6q73cjy8hjr2-glibc-2.27/lib/libpthread.so.0 #17 0x00007fecf6db422f in clone () from /nix/store/fg4yq8i8wd08xg3fy58l6q73cjy8hjr2-glibc-2.27/lib/libc.so.6 gdb shows a value for fun in frame 11, but when I try to print XIL(0x7fecdacdc468) it complains about it being an invalid lisp object, and then the result of "info frame 11" shows some similar values, so I'm thinking gdb is confused: (gdb) thread apply 7 info frame 11 Thread 7 (Thread 0x7fecdacdd700 (LWP 5509)): Stack frame at 0x7fecdacdc7d0: rip = 0x5cdd32 in funcall_lambda (eval.c:3057); saved rip = 0x64680b tail call frame, caller of frame at 0x7fecdacdc7d0 source language c. Arglist at unknown address. Locals at unknown address, Previous frame's sp is 0x7fecdacdc7d0 I haven't figured out how to get gdb to print the Lisp backtrace of one thread while execution is stopped in a different one. But I expect Thread 7's Lisp backtrace looks like this: condition-wait thread-queue-get erb--benchmark-monitor-func ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-16 18:46 ` Gemini Lasswell @ 2018-10-16 19:25 ` Eli Zaretskii 2018-10-16 19:38 ` Eli Zaretskii 2018-10-19 0:22 ` Gemini Lasswell 2018-10-17 16:21 ` Eli Zaretskii 1 sibling, 2 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-16 19:25 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014, schwab > From: Gemini Lasswell <gazally@runbox.com> > Cc: Andreas Schwab <schwab@linux-m68k.org>, 33014@debbugs.gnu.org > Date: Tue, 16 Oct 2018 11:46:36 -0700 > > My knowledge of what gcc does and how the code it generates works is > superficial, but I don't see why an optimizer would find it necessary to > save the following values: > > - The value of 'fun' in Ffuncall after it is used as an argument for > funcall_lambda. > > - The value of 'fun' in funcall_lambda after it is used to calculate > the arguments to exec_byte_code. > > - The value of 'vector' in exec_byte_code after the calculation of > vectorp. There are calling frames as well. For GC to pay attention to a Lisp object, it is enough to have that object _somewhere_ on the stack. Anyway, are you saying that stack marking doesn't work in optimized code? We've been using this technique for the last 17 years without problems; why would the fact that we have more than one thread change that? The same arguments you submit are valid for a single-threaded Emacs, right? I think the chance of something like what you describe to happen here are small, and we shouldn't throw in the towel so quickly. I don't think we've exhausted all the other possibilities, not yet. > gdb shows a value for fun in frame 11, but when I try to print > XIL(0x7fecdacdc468) it complains about it being an invalid lisp object, > and then the result of "info frame 11" shows some similar values, > so I'm thinking gdb is confused: It's quite possible that GDB is not confused, and you've found some evidence of the problem. How did you try to print XIL(0x7fecdacdc468)? Maybe we should take a good look at this object. > I haven't figured out how to get gdb to print the Lisp backtrace of one > thread while execution is stopped in a different one. You can't, AFAIR. The code that helps us produce a Lisp backtrace doesn't work in that case. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-16 19:25 ` Eli Zaretskii @ 2018-10-16 19:38 ` Eli Zaretskii 2018-10-19 0:22 ` Gemini Lasswell 1 sibling, 0 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-16 19:38 UTC (permalink / raw) To: gazally; +Cc: 33014, schwab > Date: Tue, 16 Oct 2018 22:25:21 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 33014@debbugs.gnu.org, schwab@linux-m68k.org > > > - The value of 'fun' in Ffuncall after it is used as an argument for > > funcall_lambda. > > > > - The value of 'fun' in funcall_lambda after it is used to calculate > > the arguments to exec_byte_code. > > > > - The value of 'vector' in exec_byte_code after the calculation of > > vectorp. > > There are calling frames as well. For GC to pay attention to a Lisp > object, it is enough to have that object _somewhere_ on the stack. And btw, 'fun' is not the object we should be tracing in this case. We should be tracing the bytecode that is being run, either the entire vector or some of its elements. AFAIU, that is the bytecode of the thread function, which I think is the one called here: > #11 0x00000000005cdd32 in funcall_lambda (fun=XIL(0x7fecdacdc468), nargs=nargs@entry=0, arg_vector=0x16eac38 <bss_sbrk_buffer+9926040>, > arg_vector@entry=0x1578c58 <bss_sbrk_buffer+8410552>) at eval.c:3057 > #12 0x00000000005ca54b in Ffuncall (nargs=nargs@entry=1, args=args@entry=0x1578c50 <bss_sbrk_buffer+8410544>) at eval.c:2870 > #13 0x000000000064680b in invoke_thread_function () at thread.c:684 ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-16 19:25 ` Eli Zaretskii 2018-10-16 19:38 ` Eli Zaretskii @ 2018-10-19 0:22 ` Gemini Lasswell 2018-10-19 8:44 ` Eli Zaretskii 2018-10-19 19:32 ` Gemini Lasswell 1 sibling, 2 replies; 34+ messages in thread From: Gemini Lasswell @ 2018-10-19 0:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014, schwab Eli Zaretskii <eliz@gnu.org> writes: > Anyway, are you saying that stack marking doesn't work in optimized > code? We've been using this technique for the last 17 years without > problems; why would the fact that we have more than one thread change > that? The same arguments you submit are valid for a single-threaded > Emacs, right? Apparently so. I set up a single-threaded situation where I could redefine a function while exec_byte_code was running it, and got a segfault. I've gained some insights from debugging this version of the bug which I will put into a separate email. Here are steps which consistently reproduce it for me: Save the following code to the file 'repro.el', and then run emacs -Q (I'm using master built with -O2 in CFLAGS): ;;; -*- lexical-binding: t -*- (defvar my-var "ok") (defun my-loop-1 () (let ((val 0)) (while t (insert "Now in recursive edit\n") (recursive-edit) (insert (format "Leaving recursive edit: %s\n" my-var)) (let ((things '(a b c d e))) (cond ((= val 0) (message "foo: %s" (last things))) ((= val 1) (message "bar: %s" things)) ((= val 2) (message "baz: %s" (car things))) (t (message "bop: %s" (nth 2 things)))) (setq val (mod (1+ val) 3)))))) (defun my-loop () (interactive) (redraw-display) (my-loop-1)) (defun my-gc-1 () (garbage-collect)) (defun my-gc () (interactive) (my-gc-1)) (provide 'repro) Then, from emacs -Q: C-x C-f repro.el RET C-u M-x byte-compile-file RET repro.el RET C-x b RET M-x my-loop RET C-x b RET M-x eval-buffer RET C-x b RET M-x my-gc RET C-M-c Result: Thread 1 "emacs" received signal SIGSEGV, Segmentation fault. 0x00000000005bca1b in styled_format (nargs=2, args=0x7ffffffeffc8, message=<optimized out>) at editfns.c:3129 3129 unsigned char format_char = *format++; (gdb) bt #0 0x00000000005bca1b in styled_format (nargs=2, args=0x7ffffffeffc8, message=<optimized out>) at editfns.c:3129 #1 0x00000000005ca771 in Ffuncall (nargs=3, args=args@entry=0x7ffffffeffc0) at eval.c:2859 #2 0x0000000000611f00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=0, args=<optimized out>, args@entry=0x31bda38) at bytecode.c:632 #3 0x00000000005cde82 in funcall_lambda (fun=XIL(0x7ffffffeffc0), nargs=nargs@entry=0, arg_vector=0x31bda38, arg_vector@entry=0x7fffffff0240) at eval.c:3060 #4 0x00000000005ca65b in Ffuncall (nargs=1, args=args@entry=0x7fffffff0238) at eval.c:2873 #5 0x0000000000611f00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=0, args=<optimized out>, args@entry=0x31bdaf8) at bytecode.c:632 #6 0x00000000005cde82 in funcall_lambda (fun=XIL(0x7fffffff0238), nargs=nargs@entry=0, arg_vector=0x31bdaf8, arg_vector@entry=0x7fffffff0640) at eval.c:3060 #7 0x00000000005ca65b in Ffuncall (nargs=nargs@entry=1, args=args@entry=0x7fffffff0638) at eval.c:2873 #8 0x00000000005c6653 in Ffuncall_interactively (nargs=1, args=0x7fffffff0638) at callint.c:253 #9 0x00000000005ca771 in Ffuncall (nargs=nargs@entry=2, args=args@entry=0x7fffffff0630) at eval.c:2859 #10 0x00000000005cab2c in Fapply (nargs=nargs@entry=3, args=args@entry=0x7fffffff0630) at eval.c:2432 #11 0x00000000005c6de1 in Fcall_interactively (function=..., record_flag=..., keys=...) at callint.c:340 #12 0x00000000005cc5d7 in funcall_subr (subr=0xcd63c0 <Scall_interactively>, numargs=numargs@entry=3, args=args@entry=0x7fffffff07c0) at eval.c:2939 #13 0x00000000005ca771 in Ffuncall (nargs=4, args=args@entry=0x7fffffff07b8) at eval.c:2859 #14 0x0000000000611f00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=2, args=<optimized out>, args@entry=0x9c3cc8 <pure+808136>) at bytecode.c:632 #15 0x00000000005cde82 in funcall_lambda (fun=XIL(0x7fffffff07b8), nargs=nargs@entry=2, arg_vector=0x9c3cc8 <pure+808136>, arg_vector@entry=0x7fffffff0aa8) at eval.c:3060 #16 0x00000000005ca65b in Ffuncall (nargs=3, args=args@entry=0x7fffffff0aa0) at eval.c:2873 #17 0x0000000000611f00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=3, args=<optimized out>, args@entry=0x9c3978 <pure+807288>) at bytecode.c:632 #18 0x00000000005cde82 in funcall_lambda (fun=XIL(0x7fffffff0aa0), nargs=nargs@entry=3, arg_vector=0x9c3978 <pure+807288>, arg_vector@entry=0x7fffffff0e90) at eval.c:3060 #19 0x00000000005ca65b in Ffuncall (nargs=nargs@entry=4, args=args@entry=0x7fffffff0e88) at eval.c:2873 #20 0x00000000005c6653 in Ffuncall_interactively (nargs=4, args=0x7fffffff0e88) at callint.c:253 #21 0x00000000005ca771 in Ffuncall (nargs=nargs@entry=5, args=0x7fffffff0e80) at eval.c:2859 #22 0x00000000005caa3a in Fapply (nargs=nargs@entry=3, args=args@entry=0x7fffffff1030) at eval.c:2479 #23 0x00000000005c6de1 in Fcall_interactively (function=..., record_flag=..., keys=...) at callint.c:340 #24 0x00000000005cc5d7 in funcall_subr (subr=0xcd63c0 <Scall_interactively>, numargs=numargs@entry=3, args=args@entry=0x7fffffff11c0) at eval.c:2939 #25 0x00000000005ca771 in Ffuncall (nargs=4, args=args@entry=0x7fffffff11b8) at eval.c:2859 #26 0x0000000000611f00 in exec_byte_code (bytestr=..., vector=..., maxdepth=..., args_template=..., nargs=nargs@entry=1, args=<optimized out>, args@entry=0x9c3cc8 <pure+808136>) at bytecode.c:632 #27 0x00000000005cde82 in funcall_lambda (fun=XIL(0x7fffffff11b8), nargs=nargs@entry=1, arg_vector=0x9c3cc8 <pure+808136>, arg_vector@entry=0x7fffffff1478) at eval.c:3060 #28 0x00000000005ca65b in Ffuncall (nargs=nargs@entry=2, args=args@entry=0x7fffffff1470) at eval.c:2873 #29 0x00000000005ca83a in call1 (fn=..., fn@entry=XIL(0x3ff0), arg1=...) at eval.c:2710 #30 0x000000000054f597 in command_loop_1 () at keyboard.c:1451 #31 0x00000000005c975f in internal_condition_case (bfun=bfun@entry=0x54f080 <command_loop_1>, handlers=..., handlers@entry=XIL(0x53a0), hfun=hfun@entry=0x541d60 <cmd_error>) at eval.c:1373 #32 0x000000000053db88 in command_loop_2 (ignore=..., ignore@entry=XIL(0)) at keyboard.c:1079 #33 0x00000000005c9683 in internal_catch (tag=..., func=func@entry=0x53db60 <command_loop_2>, arg=..., arg@entry=XIL(0)) at eval.c:1136 #34 0x000000000053ddeb in command_loop () at keyboard.c:1058 #35 0x0000000000541864 in recursive_edit_1 () at keyboard.c:703 #36 0x0000000000541c23 in Frecursive_edit () at keyboard.c:774 #37 0x000000000041e727 in main (argc=<optimized out>, argv=<optimized out>) at emacs.c:1731 Lisp Backtrace: "format" (0xfffeffc8) "my-loop-1" (0xffff0240) "my-loop" (0xffff0640) "funcall-interactively" (0xffff0638) "call-interactively" (0xffff07c0) "command-execute" (0xffff0aa8) "execute-extended-command" (0xffff0e90) "funcall-interactively" (0xffff0e88) "call-interactively" (0xffff11c0) "command-execute" (0xffff1478) ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-19 0:22 ` Gemini Lasswell @ 2018-10-19 8:44 ` Eli Zaretskii 2018-10-19 20:05 ` Gemini Lasswell 2018-10-19 19:32 ` Gemini Lasswell 1 sibling, 1 reply; 34+ messages in thread From: Eli Zaretskii @ 2018-10-19 8:44 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014 > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org, schwab@linux-m68k.org > Date: Thu, 18 Oct 2018 17:22:36 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Anyway, are you saying that stack marking doesn't work in optimized > > code? We've been using this technique for the last 17 years without > > problems; why would the fact that we have more than one thread change > > that? The same arguments you submit are valid for a single-threaded > > Emacs, right? > > Apparently so. I set up a single-threaded situation where I could > redefine a function while exec_byte_code was running it, and got a > segfault. I've gained some insights from debugging this version of the > bug which I will put into a separate email. If this is the case, then I think we should protect the definition of a running function from GC, in some way, either by making sure it is referenced by some stack-based Lisp object, even in heavily optimized code (e.g., by using 'volatile' qualifiers); or by some other method that will ensure that definition is marked and not swept. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-19 8:44 ` Eli Zaretskii @ 2018-10-19 20:05 ` Gemini Lasswell 2018-10-20 6:41 ` Eli Zaretskii 0 siblings, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-19 20:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014 [-- Attachment #1: Type: text/plain, Size: 1166 bytes --] Eli Zaretskii <eliz@gnu.org> writes: >> > Anyway, are you saying that stack marking doesn't work in optimized >> > code? We've been using this technique for the last 17 years without >> > problems; why would the fact that we have more than one thread change >> > that? The same arguments you submit are valid for a single-threaded >> > Emacs, right? >> >> Apparently so. I set up a single-threaded situation where I could >> redefine a function while exec_byte_code was running it, and got a >> segfault. I've gained some insights from debugging this version of the >> bug which I will put into a separate email. > > If this is the case, then I think we should protect the definition of > a running function from GC, in some way, either by making sure it is > referenced by some stack-based Lisp object, even in heavily optimized > code (e.g., by using 'volatile' qualifiers); or by some other method > that will ensure that definition is marked and not swept. Maybe code optimizers have improved over the last 17 years? I have patched Emacs with a 'volatile' on the definition of 'fun' in Ffuncall, and so far haven't managed to reproduce the bug with it: [-- Attachment #2: 0001-src-eval.c-Ffuncall-Make-local-variable-fun-volatile.patch --] [-- Type: text/plain, Size: 801 bytes --] From a1fc2dfd392e0ba8754159d855da231a56ca275b Mon Sep 17 00:00:00 2001 From: Gemini Lasswell <gazally@runbox.com> Date: Sun, 14 Oct 2018 12:12:04 -0700 Subject: [PATCH] * src/eval.c (Ffuncall): Make local variable 'fun' volatile (bug#33014) --- src/eval.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/eval.c b/src/eval.c index 5e25caaa84..75b30f9c7d 100644 --- a/src/eval.c +++ b/src/eval.c @@ -2817,8 +2817,8 @@ Thus, (funcall \\='cons \\='x \\='y) returns (x . y). usage: (funcall FUNCTION &rest ARGUMENTS) */) (ptrdiff_t nargs, Lisp_Object *args) { - Lisp_Object fun, original_fun; - Lisp_Object funcar; + Lisp_Object volatile fun; + Lisp_Object original_fun, funcar; ptrdiff_t numargs = nargs - 1; Lisp_Object val; ptrdiff_t count; -- 2.16.4 [-- Attachment #3: Type: text/plain, Size: 354 bytes --] I'll go back now to working on my benchmarking project which I hope someday will make it easy to see if that 'volatile' causes measurable harm to performance. I'll also keep using 'eval-region' and 'eval-buffer' while I have threads running byte-compiled functions which get redefined by doing that, and report back here if I encounter this bug again. ^ permalink raw reply related [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-19 20:05 ` Gemini Lasswell @ 2018-10-20 6:41 ` Eli Zaretskii 2018-10-20 8:23 ` Andreas Schwab 2018-10-29 18:24 ` Gemini Lasswell 0 siblings, 2 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-20 6:41 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014 > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org > Date: Fri, 19 Oct 2018 13:05:19 -0700 > > > If this is the case, then I think we should protect the definition of > > a running function from GC, in some way, either by making sure it is > > referenced by some stack-based Lisp object, even in heavily optimized > > code (e.g., by using 'volatile' qualifiers); or by some other method > > that will ensure that definition is marked and not swept. > > Maybe code optimizers have improved over the last 17 years? I think a much more significant factor is that modern processors have many more registers to use. > I have patched Emacs with a 'volatile' on the definition of 'fun' in > Ffuncall, and so far haven't managed to reproduce the bug with it: Thanks. This needs a comment for why we do something strange like that, but otherwise, if no one has better ideas in a week's time, let's install this. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-20 6:41 ` Eli Zaretskii @ 2018-10-20 8:23 ` Andreas Schwab 2018-10-20 10:20 ` Eli Zaretskii 2018-10-29 18:24 ` Gemini Lasswell 1 sibling, 1 reply; 34+ messages in thread From: Andreas Schwab @ 2018-10-20 8:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Gemini Lasswell, 33014 On Okt 20 2018, Eli Zaretskii <eliz@gnu.org> wrote: >> Maybe code optimizers have improved over the last 17 years? > > I think a much more significant factor is that modern processors have > many more registers to use. I think an important factor is that they pass arguments in registers, so it is more likely that the original value of an argument is lost. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-20 8:23 ` Andreas Schwab @ 2018-10-20 10:20 ` Eli Zaretskii 2018-10-20 11:30 ` Andreas Schwab 0 siblings, 1 reply; 34+ messages in thread From: Eli Zaretskii @ 2018-10-20 10:20 UTC (permalink / raw) To: Andreas Schwab; +Cc: gazally, 33014 > From: Andreas Schwab <schwab@linux-m68k.org> > Cc: Gemini Lasswell <gazally@runbox.com>, 33014@debbugs.gnu.org > Date: Sat, 20 Oct 2018 10:23:37 +0200 > > On Okt 20 2018, Eli Zaretskii <eliz@gnu.org> wrote: > > >> Maybe code optimizers have improved over the last 17 years? > > > > I think a much more significant factor is that modern processors have > > many more registers to use. > > I think an important factor is that they pass arguments in registers, so > it is more likely that the original value of an argument is lost. Agreed. That's part of what I meant by "have many more registers". ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-20 10:20 ` Eli Zaretskii @ 2018-10-20 11:30 ` Andreas Schwab 0 siblings, 0 replies; 34+ messages in thread From: Andreas Schwab @ 2018-10-20 11:30 UTC (permalink / raw) To: Eli Zaretskii; +Cc: gazally, 33014 On Okt 20 2018, Eli Zaretskii <eliz@gnu.org> wrote: >> From: Andreas Schwab <schwab@linux-m68k.org> >> Cc: Gemini Lasswell <gazally@runbox.com>, 33014@debbugs.gnu.org >> Date: Sat, 20 Oct 2018 10:23:37 +0200 >> >> On Okt 20 2018, Eli Zaretskii <eliz@gnu.org> wrote: >> >> >> Maybe code optimizers have improved over the last 17 years? >> > >> > I think a much more significant factor is that modern processors have >> > many more registers to use. >> >> I think an important factor is that they pass arguments in registers, so >> it is more likely that the original value of an argument is lost. > > Agreed. That's part of what I meant by "have many more registers". You can pass by register even without more of them, and the likelihood that the original value is lost is even higher then. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-20 6:41 ` Eli Zaretskii 2018-10-20 8:23 ` Andreas Schwab @ 2018-10-29 18:24 ` Gemini Lasswell 2018-10-29 19:41 ` Eli Zaretskii 1 sibling, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-29 18:24 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014 Eli Zaretskii <eliz@gnu.org> writes: >> I have patched Emacs with a 'volatile' on the definition of 'fun' in >> Ffuncall, and so far haven't managed to reproduce the bug with it: > > Thanks. This needs a comment for why we do something strange like > that, but otherwise, if no one has better ideas in a week's time, > let's install this. Pushed to master, along with a new test which should fail if some future optimizing compiler removes the reference from the stack in spite of the 'volatile'. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-29 18:24 ` Gemini Lasswell @ 2018-10-29 19:41 ` Eli Zaretskii 0 siblings, 0 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-29 19:41 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014 > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org > Date: Mon, 29 Oct 2018 11:24:10 -0700 > > Pushed to master, along with a new test which should fail if some future > optimizing compiler removes the reference from the stack in spite of the > 'volatile'. Thanks. Any reason not to cherry-pick this to emacs-26? ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-19 0:22 ` Gemini Lasswell 2018-10-19 8:44 ` Eli Zaretskii @ 2018-10-19 19:32 ` Gemini Lasswell 1 sibling, 0 replies; 34+ messages in thread From: Gemini Lasswell @ 2018-10-19 19:32 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014, schwab [-- Attachment #1: Type: text/plain, Size: 8770 bytes --] Gemini Lasswell <gazally@runbox.com> writes: > I set up a single-threaded situation where I could redefine a function > while exec_byte_code was running it, and got a segfault. I've gained > some insights from debugging this version of the bug which I will put > into a separate email. Here's a gdb transcript going through the single-threaded version of this bug. In this transcript I use a file 'repro.el' which I've attached to the end of this message, and is the same as the one in my last message. Start gdb with a breakpoint at Fredraw_display: $ gdb --args ./emacs -Q ... (gdb) b Fredraw_display (gdb) r In Emacs, find the file repro.el and load it with byte-compile-file, then go back to *scratch* and run my-loop: C-x C-f repro.el RET C-u M-x byte-compile-file RET repro.el RET C-x b RET M-x my-loop RET This gets me to the gdb prompt, at a point in execution where the next function called will be my-loop-1, so I set a breakpoint in funcall_lambda, where I can see the bytecode object for my-loop-1 (I edited out the bytestring): Thread 1 "emacs" hit Breakpoint 3, Fredraw_display () at dispnew.c:3027 3027 { (gdb) br funcall_lambda Breakpoint 4 at 0x5cdb00: file eval.c, line 3016. (gdb) c Continuing. Thread 1 "emacs" hit Breakpoint 4, funcall_lambda (fun=XIL(0x31c0235), nargs=nargs@entry=0, arg_vector=arg_vector@entry=0x7fffffff01c0) at eval.c:3016 3016 { (gdb) clear Deleted breakpoint 4 (gdb) p fun $1 = XIL(0x1630fc5) (gdb) pr #[0 "..." [my-var 0 "Now in recursive edit " recursive-edit format "Leaving recursive edit: %s " (a b c d e) message "foo: %s" last 1 "bar: %s" 2 "baz: %s" "bop: %s" mod 3] 6] Then I skip ahead into exec-byte-code: (gdb) br exec_byte_code Breakpoint 5 at 0x611bb0: file bytecode.c, line 342. (gdb) c Continuing. Thread 1 "emacs" hit Breakpoint 5, exec_byte_code (bytestr=XIL(0x3571d24), vector=XIL(0x31c0195), maxdepth=make_number(4), args_template=args_template@entry=XIL(0), nargs=nargs@entry=0, args=args@entry=0x0) at bytecode.c:342 342 { Here's what's in the register $rbp, and the constants vector: (gdb) clear Deleted breakpoint 5 (gdb) p $rbp $2 = (void *) 0xb0201 (gdb) pr #<INVALID_LISP_OBJECT 0x000b0201> (gdb) p vector $3 = XIL(0x1630f35) (gdb) pr [my-var 0 "Now in recursive edit " recursive-edit format "Leaving recursive edit: %s " (a b c d e) message "foo: %s" last 1 "bar: %s" 2 "baz: %s" "bop: %s" mod 3] Skip ahead, to get to where exec_byte_code has a value for vectorp: (gdb) n 12 366 USE_SAFE_ALLOCA; (gdb) p vectorp $4 = (Lisp_Object *) 0x1630f38 <bss_sbrk_buffer+9164248> (gdb) p *vectorp $5 = XIL(0x2327d80) (gdb) pr my-var (gdb) break mark_vectorlike if ptr->contents == $4 Breakpoint 6 at 0x5ad400: file alloc.c, line 6036. (gdb) c Continuing. The idea is to break when garbage collection finds the constants vector. (I first tried setting a conditional breakpoint in mark_object, which made garbage collection either hang or take more time than I had patience for.) In Emacs type C-x b RET. This causes a gc and a breakpoint hit: Thread 1 "emacs" hit Breakpoint 6, mark_vectorlike (ptr=0x31c0190) at alloc.c:6036 6036 eassert (!VECTOR_MARKED_P (ptr)); (gdb) bt 20 #0 mark_vectorlike (ptr=0x1630f30 <bss_sbrk_buffer+9164240>) at alloc.c:6036 #1 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430 #2 0x00000000005ad45e in mark_vectorlike ( ptr=0x1611fd0 <bss_sbrk_buffer+9037424>) at alloc.c:6046 #3 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430 #4 0x00000000005acdf4 in mark_object (arg=...) at alloc.c:6477 #5 0x00000000005acae4 in mark_object (arg=...) at alloc.c:6434 #6 0x00000000005ad45e in mark_vectorlike ( ptr=0x15a8e00 <bss_sbrk_buffer+8606880>) at alloc.c:6046 #7 0x00000000005ad45e in mark_vectorlike ( ptr=0x15a9c30 <bss_sbrk_buffer+8610512>) at alloc.c:6046 #8 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430 #9 0x00000000005ad45e in mark_vectorlike ( ptr=0x15a7c30 <bss_sbrk_buffer+8602320>) at alloc.c:6046 #10 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430 #11 0x00000000005ad45e in mark_vectorlike ( ptr=0x15a6e80 <bss_sbrk_buffer+8598816>) at alloc.c:6046 #12 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430 #13 0x00000000005acdf4 in mark_object (arg=...) at alloc.c:6477 #14 0x00000000005acaa5 in mark_object (arg=...) at alloc.c:6431 #15 0x00000000005ad45e in mark_vectorlike ( ptr=0x15fbed0 <bss_sbrk_buffer+8947056>) at alloc.c:6046 #16 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430 #17 0x00000000005ad45e in mark_vectorlike ( ptr=0x15fbf50 <bss_sbrk_buffer+8947184>) at alloc.c:6046 #18 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430 #19 0x00000000005ad45e in mark_vectorlike ( ptr=0x15fcc80 <bss_sbrk_buffer+8950560>) at alloc.c:6046 (More stack frames follow...) Lisp Backtrace: "Automatic GC" (0x0) "eldoc-pre-command-refresh-echo-area" (0xfffefbb0) "recursive-edit" (0xfffeffd8) "my-loop-1" (0xffff0250) "my-loop" (0xffff0650) "funcall-interactively" (0xffff0648) "call-interactively" (0xffff07d0) "command-execute" (0xffff0ab8) "execute-extended-command" (0xffff0ea0) "funcall-interactively" (0xffff0e98) "call-interactively" (0xffff11d0) "command-execute" (0xffff1488) There are 279 frames in the backtrace, and mark_stack and mark_memory aren't there. So I'm guessing the constants vector is getting found via the function definition of 'my-loop-1'. Keep going: (gdb) c Continuing. Now in Emacs do this: M-x eval-buffer RET C-x b RET M-x my-gc RET Execution does not stop at the breakpoint. In Emacs type C-M-c. Result: Thread 1 "emacs" received signal SIGSEGV, Segmentation fault. 0x00000000005bca1b in styled_format (nargs=2, args=0x7ffffffeffd8, message=<optimized out>) at editfns.c:3129 3129 unsigned char format_char = *format++; What's happened to the constants vector and its contents? (gdb) p $3 $6 = XIL(0x1630f35) (gdb) pr #<INVALID_LISP_OBJECT 0x01630f35> (gdb) p *$4 $7 = XIL(0x2327d80) (gdb) pr my-var (gdb) p *($4+5) $8 = XIL(0x359a6f4) (gdb) pr #<INVALID_LISP_OBJECT 0x0359a6f4> (gdb) p *($4+4) $9 = XIL(0x6390) (gdb) pr format Looks like the constants vector was freed, and its contents haven't been overwritten (yet) but the format string has been freed leading to the crash in styled_format. While I was developing this method of reproducing this bug, I went through this exercise without lexical-binding set in repro.el. In that version, the register $rbp when exec_byte_code is called contains the bytecode Lisp_Object (instead of the non-Lisp-object value it contains in the transcript above), and the first thing exec_byte_code does is save it on the stack (presumably because the System V AMD64 ABI calling convention says that called functions which use $rbp should save and restore it). Here's the beginning of the disassembly of exec_byte_code from "objdump -S bytecode.o": 0000000000000020 <exec_byte_code>: executing BYTESTR. */ Lisp_Object exec_byte_code (Lisp_Object bytestr, Lisp_Object vector, Lisp_Object maxdepth, Lisp_Object args_template, ptrdiff_t nargs, Lisp_Object *args) { 20: 55 push %rbp 21: 48 89 e5 mov %rsp,%rbp 24: 41 57 push %r15 26: 41 56 push %r14 28: 41 55 push %r13 2a: 41 54 push %r12 2c: 49 89 ce mov %rcx,%r14 2f: 53 push %rbx So in the non-lexical-binding case the bytecode Lisp_Object is written to the stack by the first instruction in exec_byte_code, and then during the execution of 'my-gc' the breakpoint in mark_vectorlike stops at a point with a much shorter backtrace which includes mark_stack and mark_memory, and mark_memory's pp is pointing to the location on the stack where $rbp was written. The bytecode object and constants vector are consequently not freed, and no segfault happens. I don't follow everything going on in the disassembly of funcall_lambda, but I did figure out (by comparison with a debug session in the multithreaded situation) that the different values in $rbp when funcall_lambda calls exec_byte_code depend on the different code paths following the test of whether the first element of the bytecode object vector (the "args template" as funcall_lambda's comment calls it) is an integer, which in turn depends on whether my-loop-1 was compiled with lexical-binding on. Here is 'repro.el': [-- Attachment #2: repro.el --] [-- Type: text/plain, Size: 685 bytes --] ;;; -*- lexical-binding: t -*- (defvar my-var "ok") (defun my-loop-1 () (let ((val 0)) (while t (insert "Now in recursive edit\n") (recursive-edit) (insert (format "Leaving recursive edit: %s\n" my-var)) (let ((things '(a b c d e))) (cond ; ((= val 0) (message "foo: %s" (last things))) ((= val 1) (message "bar: %s" things)) ((= val 2) (message "baz: %s" (car things))) (t (message "bop: %s" (nth 2 things)))) (setq val (mod (1+ val) 3)))))) (defun my-loop () (interactive) (redraw-display) (my-loop-1)) (defun my-gc-1 () (garbage-collect)) (defun my-gc () (interactive) (my-gc-1)) (provide 'repro) ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-16 18:46 ` Gemini Lasswell 2018-10-16 19:25 ` Eli Zaretskii @ 2018-10-17 16:21 ` Eli Zaretskii 2018-10-18 1:07 ` Gemini Lasswell 1 sibling, 1 reply; 34+ messages in thread From: Eli Zaretskii @ 2018-10-17 16:21 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014, schwab > From: Gemini Lasswell <gazally@runbox.com> > Cc: Andreas Schwab <schwab@linux-m68k.org>, 33014@debbugs.gnu.org > Date: Tue, 16 Oct 2018 11:46:36 -0700 > > My knowledge of what gcc does and how the code it generates works is > superficial, but I don't see why an optimizer would find it necessary to > save the following values: > > - The value of 'fun' in Ffuncall after it is used as an argument for > funcall_lambda. > > - The value of 'fun' in funcall_lambda after it is used to calculate > the arguments to exec_byte_code. > > - The value of 'vector' in exec_byte_code after the calculation of > vectorp. After thinking about this a bit, I don't really agree with the last one: the compiler could indeed stop tracking 'vector', but not XVECTOR (vector)->contents, and we are interested in the latter. One other thought is that, if worse comes to worst, we may consider disallowing redefinition of a function that is currently being executed (in another thread). However, I'm still not convinced we are there. Can we establish which element(s) of the bytecode vector are GC'ed in this scenario? ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-17 16:21 ` Eli Zaretskii @ 2018-10-18 1:07 ` Gemini Lasswell 2018-10-18 17:04 ` Eli Zaretskii 0 siblings, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-18 1:07 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014, schwab Eli Zaretskii <eliz@gnu.org> writes: > After thinking about this a bit, I don't really agree with the last > one: the compiler could indeed stop tracking 'vector', but not > XVECTOR (vector)->contents, and we are interested in the latter. If the compiler stops tracking 'vector', and the garbage collector frees it, doesn't that cause XVECTOR (vector)->contents to be overwritten? In the debugging session in my second message in this thread I had a hardware watchpoint on what vectorp was pointing at and it went off in setup_on_free_list. > However, I'm still not convinced we are there. Can we establish which > element(s) of the bytecode vector are GC'ed in this scenario? I'll see if I can figure that out. Is there an easy way to print the function binding of a Lisp symbol from gdb? ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-18 1:07 ` Gemini Lasswell @ 2018-10-18 17:04 ` Eli Zaretskii 2018-10-19 0:39 ` Gemini Lasswell 2018-10-29 18:56 ` Stefan Monnier 0 siblings, 2 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-18 17:04 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014, schwab > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org, schwab@linux-m68k.org > Date: Wed, 17 Oct 2018 18:07:39 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > > After thinking about this a bit, I don't really agree with the last > > one: the compiler could indeed stop tracking 'vector', but not > > XVECTOR (vector)->contents, and we are interested in the latter. > > If the compiler stops tracking 'vector', and the garbage collector frees > it, doesn't that cause XVECTOR (vector)->contents to be overwritten? Hmmm... could be. > > However, I'm still not convinced we are there. Can we establish which > > element(s) of the bytecode vector are GC'ed in this scenario? > > I'll see if I can figure that out. > > Is there an easy way to print the function binding of a Lisp symbol from > gdb? Not sure what you mean by "the function binding" in this context. I hope something like the following will do: (gdb) p fun (gdb) xpr Let me know if this isn't what you meant. Thanks. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-18 17:04 ` Eli Zaretskii @ 2018-10-19 0:39 ` Gemini Lasswell 2018-10-19 8:38 ` Eli Zaretskii 2018-10-29 18:56 ` Stefan Monnier 1 sibling, 1 reply; 34+ messages in thread From: Gemini Lasswell @ 2018-10-19 0:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 33014, schwab Eli Zaretskii <eliz@gnu.org> writes: > (gdb) p fun > (gdb) xpr > > Let me know if this isn't what you meant. I meant something like 'pv', as in: (gdb) pv emacs-version "27.0.50" but which I could use to find out what the bytecode object for erb--benchmark-monitor-func is. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-19 0:39 ` Gemini Lasswell @ 2018-10-19 8:38 ` Eli Zaretskii 0 siblings, 0 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-19 8:38 UTC (permalink / raw) To: Gemini Lasswell; +Cc: 33014, schwab > From: Gemini Lasswell <gazally@runbox.com> > Cc: 33014@debbugs.gnu.org, schwab@linux-m68k.org > Date: Thu, 18 Oct 2018 17:39:54 -0700 > > Eli Zaretskii <eliz@gnu.org> writes: > > > (gdb) p fun > > (gdb) xpr > > > > Let me know if this isn't what you meant. > > I meant something like 'pv', as in: > > (gdb) pv emacs-version > "27.0.50" > > but which I could use to find out what the bytecode object for > erb--benchmark-monitor-func is. But a function doesn't have to be byte-compiled, in which case there's no bytecode. Look at the implementation of funcall, and you will see how Emacs deals with this. It seemed to me that xpr reflects that, in that it shows you what object to look at for a given function symbol. If you are sure the function is already compiled, then funcall_lambda will show you how it invokes exec_byte_code for such a function, and you will see there how to access the bytecode of such a function. HTH P.S. Patches to .gdbinit to provide such a functionality in a new command, called, say, "xfunc", will be most welcome, of course. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-18 17:04 ` Eli Zaretskii 2018-10-19 0:39 ` Gemini Lasswell @ 2018-10-29 18:56 ` Stefan Monnier 1 sibling, 0 replies; 34+ messages in thread From: Stefan Monnier @ 2018-10-29 18:56 UTC (permalink / raw) To: Gemini Lasswell; +Cc: schwab, 33014 >> > After thinking about this a bit, I don't really agree with the last >> > one: the compiler could indeed stop tracking 'vector', but not >> > XVECTOR (vector)->contents, and we are interested in the latter. >> If the compiler stops tracking 'vector', and the garbage collector frees >> it, doesn't that cause XVECTOR (vector)->contents to be overwritten? > Hmmm... could be. Indeed, the conservative GC doesn't try to handle "pointers into the middle of objects", so a pointer to `XVECTOR (vector)->contents` won't be sufficient to keep `vector` alive. > From: Gemini Lasswell <gazally@runbox.com> > Date: Sun, 14 Oct 2018 12:12:04 -0700 > Subject: [PATCH] * src/eval.c (Ffuncall): Make local variable 'fun' volatile > (bug#33014) Shouldn't we do that in exec_byte_code instead (it probably doesn't matter that much in the end, but I think conceptually that would be the more correct place)? E.g. if you change your test to (defun eval-tests-33014-redefine () "Remove the Lisp reference to the byte-compiled object." (aset (symbol-function #'eval-tests-33014-func) 1 nil) (aset (symbol-function #'eval-tests-33014-func) 2 nil)) you won't get a crash but only because these `aset` will fail (bytecode objects are luckily read-only). Moving the volatile thingies to exec_byte_code should let our code work correctly against the above test even if we changed aset to allow modifying bytecode objects. Stefan ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-11 5:30 bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function Gemini Lasswell 2018-10-12 8:12 ` Eli Zaretskii @ 2018-10-31 4:49 ` Paul Eggert 2018-10-31 15:33 ` Eli Zaretskii 2018-11-01 23:15 ` Gemini Lasswell 1 sibling, 2 replies; 34+ messages in thread From: Paul Eggert @ 2018-10-31 4:49 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Gemini Lasswell, 33014, Andreas Schwab, Stefan Monnier [-- Attachment #1: Type: text/plain, Size: 923 bytes --] > Any reason not to cherry-pick this to emacs-26? No, once we fix it up. Although adding 'volatile' happened to work for Gemini's compiler, it won't suffice in general as the C standard does not require volatile variables to survive their last access (which is what the patch was assuming). Furthermore, Fbyte_code bypasses that patch, so the bug could still occur even if 'volatile' cured the bug in the more-common code path. A simple way to ensure that the constant vector survives GC is to have exec_byte_code put the vector into a GC-visible slot. As it happens, there's a spare slot that we can appropriate, so this won't cost us stack (or heap) space. I installed the first attached patch into master to do that, and backported the patch series into emacs-26 via the last two attached patches. Thanks, Gemini, for the good work in debugging this problem and writing that test case. GC bugs can be nasty. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Improve-fix-for-Bug-33014.patch --] [-- Type: text/x-patch; name="0001-Improve-fix-for-Bug-33014.patch", Size: 2262 bytes --] From cf486a7a920d3d95fa9aa98d7b03ebc61b17518a Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@cs.ucla.edu> Date: Tue, 30 Oct 2018 20:57:46 -0700 Subject: [PATCH] Improve fix for Bug#33014 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Although the previously-applied fix worked for its platform, it doesn’t suffice in general. * src/bytecode.c (exec_byte_code): Save VECTOR into stack slot so that it survives GC. The stack slot was otherwise unused, so this doesn’t cost us memory, only a store insn. * src/eval.c (Ffuncall): Do not make FUN volatile, reverting 2018-10-14T19:12:04Z!gazally@runbox.com. Adding ‘volatile’ does not suffice, since storage for a volatile local can be reclaimed after its last access (e.g., by tail recursion elimination), which would make VECTOR invisible to GC. --- src/bytecode.c | 1 + src/eval.c | 7 ++----- 2 files changed, 3 insertions(+), 5 deletions(-) diff --git a/src/bytecode.c b/src/bytecode.c index 17457fc574..40389e08f0 100644 --- a/src/bytecode.c +++ b/src/bytecode.c @@ -369,6 +369,7 @@ exec_byte_code (Lisp_Object bytestr, Lisp_Object vector, Lisp_Object maxdepth, ptrdiff_t item_bytes = stack_items * word_size; Lisp_Object *stack_base = ptr_bounds_clip (alloc, item_bytes); Lisp_Object *top = stack_base; + *top = vector; /* Ensure VECTOR survives GC (Bug#33014). */ Lisp_Object *stack_lim = stack_base + stack_items; unsigned char *bytestr_data = alloc; bytestr_data = ptr_bounds_clip (bytestr_data + item_bytes, bytestr_length); diff --git a/src/eval.c b/src/eval.c index 32cfda24d8..a51d0c9083 100644 --- a/src/eval.c +++ b/src/eval.c @@ -2820,11 +2820,8 @@ Thus, (funcall \\='cons \\='x \\='y) returns (x . y). usage: (funcall FUNCTION &rest ARGUMENTS) */) (ptrdiff_t nargs, Lisp_Object *args) { - /* Use 'volatile' here to cause optimizing compilers to keep a - reference on the stack to the function's bytecode object. See - Bug#33014. */ - Lisp_Object volatile fun; - Lisp_Object original_fun, funcar; + Lisp_Object fun, original_fun; + Lisp_Object funcar; ptrdiff_t numargs = nargs - 1; Lisp_Object val; ptrdiff_t count; -- 2.17.1 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #3: 0001-Refer-to-bytecode-constant-vectors-Bug-33014.patch --] [-- Type: text/x-patch; name="0001-Refer-to-bytecode-constant-vectors-Bug-33014.patch", Size: 1200 bytes --] From 1ad2903a48b682985a2bd0709ec05f67a1351a8e Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@cs.ucla.edu> Date: Tue, 30 Oct 2018 21:14:10 -0700 Subject: [PATCH 1/2] Refer to bytecode constant vectors (Bug#33014) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Backport from master. * src/bytecode.c (exec_byte_code): Save VECTOR into stack slot so that it survives GC. The stack slot was otherwise unused, so this doesn’t cost us memory, only a store insn. --- src/bytecode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/bytecode.c b/src/bytecode.c index e51f9095b3..538cd4f3ca 100644 --- a/src/bytecode.c +++ b/src/bytecode.c @@ -367,6 +367,7 @@ exec_byte_code (Lisp_Object bytestr, Lisp_Object vector, Lisp_Object maxdepth, SAFE_ALLOCA_LISP_EXTRA (stack_base, stack_items, bytestr_length); Lisp_Object *stack_lim = stack_base + stack_items; Lisp_Object *top = stack_base; + *top = vector; /* Ensure VECTOR survives GC (Bug#33014). */ memcpy (stack_lim, SDATA (bytestr), bytestr_length); void *void_stack_lim = stack_lim; unsigned char const *bytestr_data = void_stack_lim; -- 2.17.1 [-- Attachment #4: 0002-Add-regression-test-for-Bug-33014.patch --] [-- Type: text/x-patch, Size: 2169 bytes --] From c3cf85b1c186e13c2d588aa35ffa57981ca481d7 Mon Sep 17 00:00:00 2001 From: Gemini Lasswell <gazally@runbox.com> Date: Tue, 30 Oct 2018 21:15:51 -0700 Subject: [PATCH 2/2] Add regression test for Bug#33014 Backport from master. * test/src/eval-tests.el: (eval-tests-byte-code-being-evaluated-is-protected-from-gc): New test. (eval-tests-33014-var): New variable. (eval-tests-33014-func, eval-tests-33014-redefine): New functions. --- test/src/eval-tests.el | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/test/src/eval-tests.el b/test/src/eval-tests.el index e68fd13611..eeb98b0994 100644 --- a/test/src/eval-tests.el +++ b/test/src/eval-tests.el @@ -99,4 +99,34 @@ eval-tests--exceed-specbind-limit (signal-hook-function #'ignore)) (should-error (eval-tests--exceed-specbind-limit)))) +(ert-deftest eval-tests-byte-code-being-evaluated-is-protected-from-gc () + "Regression test for Bug#33014. +Check that byte-compiled objects being executed by exec-byte-code +are found on the stack and therefore not garbage collected." + (should (string= (eval-tests-33014-func) + "before after: ok foo: (e) bar: (a b c d e) baz: a bop: c"))) + +(defvar eval-tests-33014-var "ok") +(defun eval-tests-33014-func () + "A function which has a non-trivial constants vector when byte-compiled." + (let ((result "before ")) + (eval-tests-33014-redefine) + (garbage-collect) + (setq result (concat result (format "after: %s" eval-tests-33014-var))) + (let ((vals '(0 1 2 3)) + (things '(a b c d e))) + (dolist (val vals) + (setq result + (concat result " " + (cond + ((= val 0) (format "foo: %s" (last things))) + ((= val 1) (format "bar: %s" things)) + ((= val 2) (format "baz: %s" (car things))) + (t (format "bop: %s" (nth 2 things)))))))) + result)) + +(defun eval-tests-33014-redefine () + "Remove the Lisp reference to the byte-compiled object." + (setf (symbol-function #'eval-tests-33014-func) nil)) + ;;; eval-tests.el ends here -- 2.17.1 ^ permalink raw reply related [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-31 4:49 ` Paul Eggert @ 2018-10-31 15:33 ` Eli Zaretskii 2018-11-01 23:15 ` Gemini Lasswell 1 sibling, 0 replies; 34+ messages in thread From: Eli Zaretskii @ 2018-10-31 15:33 UTC (permalink / raw) To: Paul Eggert; +Cc: gazally, 33014, schwab, monnier > Cc: Gemini Lasswell <gazally@runbox.com>, 33014@debbugs.gnu.org, > Stefan Monnier <monnier@iro.umontreal.ca>, > Andreas Schwab <schwab@linux-m68k.org> > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Tue, 30 Oct 2018 21:49:46 -0700 > > > Any reason not to cherry-pick this to emacs-26? > > No, once we fix it up. Although adding 'volatile' happened to work for Gemini's > compiler, it won't suffice in general as the C standard does not require > volatile variables to survive their last access (which is what the patch was > assuming). Furthermore, Fbyte_code bypasses that patch, so the bug could still > occur even if 'volatile' cured the bug in the more-common code path. > > A simple way to ensure that the constant vector survives GC is to have > exec_byte_code put the vector into a GC-visible slot. As it happens, there's a > spare slot that we can appropriate, so this won't cost us stack (or heap) space. > I installed the first attached patch into master to do that, and backported the > patch series into emacs-26 via the last two attached patches. Thanks. ^ permalink raw reply [flat|nested] 34+ messages in thread
* bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function 2018-10-31 4:49 ` Paul Eggert 2018-10-31 15:33 ` Eli Zaretskii @ 2018-11-01 23:15 ` Gemini Lasswell 1 sibling, 0 replies; 34+ messages in thread From: Gemini Lasswell @ 2018-11-01 23:15 UTC (permalink / raw) To: Paul Eggert; +Cc: Andreas Schwab, 33014, Stefan Monnier Paul Eggert <eggert@cs.ucla.edu> writes: > A simple way to ensure that the constant vector survives GC is to have > exec_byte_code put the vector into a GC-visible slot. As it happens, > there's a spare slot that we can appropriate, so this won't cost us > stack (or heap) space. I installed the first attached patch into > master to do that, and backported the patch series into emacs-26 via > the last two attached patches. Thanks. I ran through all my methods of reproducing this bug on both emacs-26 and master, and saw no more problems. ^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2018-11-01 23:15 UTC | newest] Thread overview: 34+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-10-11 5:30 bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function Gemini Lasswell 2018-10-12 8:12 ` Eli Zaretskii 2018-10-12 20:02 ` Gemini Lasswell 2018-10-13 6:23 ` Eli Zaretskii 2018-10-13 17:17 ` Gemini Lasswell 2018-10-13 18:04 ` Eli Zaretskii 2018-10-14 19:29 ` Gemini Lasswell 2018-10-15 2:37 ` Eli Zaretskii 2018-10-14 19:46 ` Andreas Schwab 2018-10-15 14:59 ` Eli Zaretskii 2018-10-15 16:22 ` Gemini Lasswell 2018-10-15 16:41 ` Eli Zaretskii 2018-10-16 18:46 ` Gemini Lasswell 2018-10-16 19:25 ` Eli Zaretskii 2018-10-16 19:38 ` Eli Zaretskii 2018-10-19 0:22 ` Gemini Lasswell 2018-10-19 8:44 ` Eli Zaretskii 2018-10-19 20:05 ` Gemini Lasswell 2018-10-20 6:41 ` Eli Zaretskii 2018-10-20 8:23 ` Andreas Schwab 2018-10-20 10:20 ` Eli Zaretskii 2018-10-20 11:30 ` Andreas Schwab 2018-10-29 18:24 ` Gemini Lasswell 2018-10-29 19:41 ` Eli Zaretskii 2018-10-19 19:32 ` Gemini Lasswell 2018-10-17 16:21 ` Eli Zaretskii 2018-10-18 1:07 ` Gemini Lasswell 2018-10-18 17:04 ` Eli Zaretskii 2018-10-19 0:39 ` Gemini Lasswell 2018-10-19 8:38 ` Eli Zaretskii 2018-10-29 18:56 ` Stefan Monnier 2018-10-31 4:49 ` Paul Eggert 2018-10-31 15:33 ` Eli Zaretskii 2018-11-01 23:15 ` Gemini Lasswell
Code repositories for project(s) associated with this external index https://git.savannah.gnu.org/cgit/emacs.git https://git.savannah.gnu.org/cgit/emacs/org-mode.git This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.