* bug#65726: 29.1.50; Crash in regexp engine @ 2023-09-04 7:46 martin rudalics 2023-09-04 8:44 ` Mattias Engdegård ` (2 more replies) 0 siblings, 3 replies; 36+ messages in thread From: martin rudalics @ 2023-09-04 7:46 UTC (permalink / raw) To: 65726 [-- Attachment #1: Type: text/plain, Size: 4687 bytes --] With emacs -Q load the attached file elinfo.el. Now type C-h S split-window RET C-x o C-c C-g This crashes Emacs here with Thread 1 "emacs" received signal SIGSEGV, Segmentation fault. 0x000000000068810a in skip_noops (p=<error reading variable: Cannot access memory at address 0x7fffff66fff8>, pend=<error reading variable: Cannot access memory at address 0x7fffff66fff0>) at ../../src/regex-emacs.c:3556 3556 { and an infinite backtrace starting with Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fffff66fff8: #0 0x000000000068810a in skip_noops (p=#1 0x0000000000688823 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3665 #2 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838 #3 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fceeba "\004\020") at ../../src/regex-emacs.c:3839 #4 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee84 "\002\001@\004\020") at ../../src/regex-emacs.c:3839 #5 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838 ... The same scenario worked well with Emacs 22 through 28. martin In GNU Emacs 29.1.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.5, cairo version 1.16.0) of 2023-09-03 built on restno Repository revision: f1e4cbe72aa4da9351cbbcd209d9233c68dd9fbb Repository branch: emacs-29 Windowing system distributor 'The X.Org Foundation', version 11.0.12004000 System Description: Debian GNU/Linux 10 (buster) Configured using: 'configure --with-gif=ifavailable --with-tiff=ifavailable --with-gnutls=no --without-pop --enable-gcc-warnings=warn-only --enable-checking=yes,glyphs --enable-check-lisp-object-type=yes 'CFLAGS=-O0 -g3 -no-pie -Wno-missing-braces'' Configured features: CAIRO DBUS FREETYPE GIF GLIB GSETTINGS HARFBUZZ JPEG LIBSELINUX MODULES NOTIFY INOTIFY PDUMPER PNG SECCOMP SOUND THREADS TOOLKIT_SCROLL_BARS X11 XDBE XIM XINPUT2 XPM GTK3 ZLIB Important settings: value of $LANG: de_AT.utf8 value of $XMODIFIERS: @im=ibus locale-coding-system: utf-8-unix Major mode: Lisp Interaction Minor modes in effect: tooltip-mode: t global-eldoc-mode: t eldoc-mode: t show-paren-mode: t electric-indent-mode: t mouse-wheel-mode: t tool-bar-mode: t menu-bar-mode: t file-name-shadow-mode: t global-font-lock-mode: t font-lock-mode: t blink-cursor-mode: t line-number-mode: t indent-tabs-mode: t transient-mark-mode: t auto-composition-mode: t auto-encryption-mode: t auto-compression-mode: t Load-path shadows: None found. Features: (shadow sort mail-extr emacsbug message mailcap yank-media puny dired dired-loaddefs rfc822 mml mml-sec password-cache epa derived epg rfc6068 epg-config gnus-util text-property-search time-date subr-x mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader cl-loaddefs cl-lib sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils elinfo texinfo texinfo-loaddefs info rmc iso-transl tooltip cconv eldoc paren electric uniquify ediff-hook vc-hooks lisp-float-type elisp-mode mwheel term/x-win x-win term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe tabulated-list replace newcomment text-mode lisp-mode prog-mode register page tab-bar menu-bar rfn-eshadow isearch easymenu timer select scroll-bar mouse jit-lock font-lock syntax font-core term/tty-colors frame minibuffer nadvice seq simple cl-generic indonesian philippine cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european ethiopic indian cyrillic chinese composite emoji-zwj charscript charprop case-table epa-hook jka-cmpr-hook help abbrev obarray oclosure cl-preloaded button loaddefs theme-loaddefs faces cus-face macroexp files window text-properties overlay sha1 md5 base64 format env code-pages mule custom widget keymap hashtable-print-readable backquote threads dbusbind inotify dynamic-setting system-font-setting font-render-setting cairo move-toolbar gtk x-toolkit xinput2 x multi-tty make-network-process emacs) Memory information: ((conses 16 45758 8352) (symbols 48 5682 0) (strings 32 15622 2110) (string-bytes 1 431024) (vectors 16 10081) (vector-slots 8 157669 13073) (floats 8 26 23) (intervals 56 218 0) (buffers 976 10)) [-- Attachment #2: elinfo.el --] [-- Type: text/x-emacs-lisp, Size: 51929 bytes --] ;;; elinfo.el --- Elisp Info and Texinfo support routines ;; Copyright (C) 2009 Martin Rudalics ;; Time-stamp: "2021-06-09 18:45:29 martin" ;; Author: Martin Rudalics <rudalics@gmx.at> ;; This program is free software: you can redistribute it and/or modify ;; it under the terms of the GNU General Public License as published by ;; the Free Software Foundation, either version 3 of the License, or ;; (at your option) any later version. ;; This program is distributed in the hope that it will be useful, ;; but WITHOUT ANY WARRANTY; without even the implied warranty of ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;; GNU General Public License for more details. ;; You should have received a copy of the GNU General Public License ;; along with this program. If not, see <http://www.gnu.org/licenses/>. ;;; Commentary: ;; This file provides functions for checking the validity of Emacs' Info ;; files, navigating between Emacs' Info files and corresponding Texinfo ;; buffers, and for creating Texinfo entries from documentation strings. (require 'info) (require 'texinfo) (defgroup elinfo nil "Elinfo." :version "23.2" :group 'lisp) ;; _____________________________________________________________________________ ;; ;;; Utility Functions ;; _____________________________________________________________________________ ;; ;; Move to and return start of first non-whitespace char after `point'. (defun elinfo--move-to-next () (forward-comment (buffer-size)) (point)) ;; Move to end of and return next name. (defun elinfo--get-next () (let ((from (elinfo--move-to-next))) (skip-chars-forward "^ \t\n") (buffer-substring-no-properties from (point)))) ;; Like `elinfo--get-next' but don't leave current line. (defun elinfo--get-next-arg () (save-restriction (narrow-to-region (point) (line-end-position)) (elinfo--get-next))) ;; _____________________________________________________________________________ ;; ;;; Checking and Elinfo Log Utilities ;; _____________________________________________________________________________ ;; (defcustom elinfo-log-warnings nil "Non-nil means log warnings in Elinfo Log buffer." :group 'elinfo :type 'boolean :version "23.2") (defcustom elinfo-show-definition t "Non-nil means show definitions when browsing Elinfo Logs." :group 'elinfo :type 'boolean :version "23.2") (defcustom elinfo-balance-windows t "Non-nil means balance windows when browsing Elinfo Logs." :group 'elinfo :type 'boolean :version "23.2") (defcustom elinfo-reuse-windows t "Non-nil means reuse windows when browsing Elinfo Logs." :group 'elinfo :type 'boolean :version "23.2") (defcustom elinfo-log-location-face 'underline "Face name for highlighting locations in Elinfo Logs." :group 'elinfo :type 'face :version "23.2") (defcustom elinfo-log-def-face font-lock-constant-face "Face name for highlighting things in Elinfo Logs." :group 'elinfo :type 'face :version "23.2") ;; Window navigation. (defvar elinfo-source-window nil "The window where the Texinfo source buffer was displayed.") (defvar elinfo-def-window nil "The window where the definition of some symbol was displayed.") (defun elinfo-show-definition (symbol &optional type) "Show definition of SYMBOL in a suitable window. For compatibility with `find-definition-noselect' TYPE must be omitted or nil for functions and `defvar' for variables." ;; `find-definition-noselect' is autoloaded in find-func.el and should ;; return a cons whose car is the buffer containing the definition ;; while its cdr contains the position within the buffer. (let ((buffer-position-pair (save-excursion (find-definition-noselect symbol type)))) (when buffer-position-pair (let ((buffer (car buffer-position-pair)) (position (cdr buffer-position-pair))) (if (and elinfo-reuse-windows (window-live-p elinfo-def-window)) (set-window-buffer elinfo-def-window buffer) (setq elinfo-def-window (display-buffer buffer))) ;; Record window/buffer relationship. (set-window-point elinfo-def-window position) (with-selected-window elinfo-def-window ;; Put the definition at the top of the window. (recenter 1)))))) (defun elinfo-show-source (&optional log-pos) "Show Texinfo source corresponding to Elinfo Log entry at LOG-POS. LOG-POS nil or omitted means consider the entry located at `point'. This function may also display a corresponding definition in case `elinfo-show-definition' is non-nil and balance the respective windows provided `elinfo-balance-windows' is non-nil." (save-excursion (let* ((log-pos (or log-pos (point))) (file-name (get-text-property log-pos 'elinfo-file-name)) (position (get-text-property log-pos 'elinfo-position)) (buffer (find-file-noselect file-name)) balance) ;; Display Texinfo source. (if (and elinfo-reuse-windows (window-live-p elinfo-source-window)) (set-window-buffer elinfo-source-window buffer) (setq elinfo-source-window (display-buffer buffer))) ;; Set window/buffer relationship. (set-window-point elinfo-source-window position) (with-selected-window elinfo-source-window ;; We could make the "1" customizable. (recenter 1)) ;; Display definition if possible. (cond ((not elinfo-show-definition)) ((memq (get-text-property log-pos 'elinfo-type) '(command no-command arg-mismatch)) (elinfo-show-definition (intern (get-text-property log-pos 'elinfo-name))) (setq balance t)) ((memq (get-text-property log-pos 'elinfo-type) '(option no-option)) (elinfo-show-definition (intern (get-text-property log-pos 'elinfo-name)) 'defvar) (setq balance t))) ;; Balance windows if possible. (when (and elinfo-balance-windows balance (eq (window-frame elinfo-source-window) (window-frame elinfo-def-window))) ;; `balance-windows' should balance siblings of the ;; source-window only. (balance-windows elinfo-source-window))))) ;; The next-/previous-error functions have been copied from compile.el ;; and slightly adapted for our needs. For UI and keybindings stick to ;; those used by compile and related modes. Also retain the -error ;; postfix for all entries and don't modify the doc strings to make this ;; look familiar. (defun elinfo-next-error (n &optional other-file) "Show Texinfo source for next entry in Elinfo Log. Prefix arg N says how many entries to move forwards \(backwards, if negative). Optional arg OTHER-FILE, if non-nil, means find entry for a file that is different from the current one. In that case, N specifies the number of files whose entries shall be skipped. This function may also display a corresponding definition in case `elinfo-show-definition' is non-nil and balance the respective windows provided `elinfo-balance-windows' is non-nil." (interactive "p") (if other-file (let ((last-file (get-text-property (point) 'elinfo-file-name)) (current-file "")) (if (> n 0) (while (and (not (eobp)) (setq current-file (get-text-property (point) 'elinfo-file-name)) (or (string-equal current-file last-file) (and (setq last-file current-file) (> (setq n (1- n)) 0)))) (forward-line)) (while (and (not (bobp)) (setq current-file (get-text-property (point) 'elinfo-file-name)) (or (string-equal current-file last-file) (and (setq last-file current-file) (< (setq n (1+ n)) 0)))) (forward-line -1)))) (forward-line n)) (elinfo-show-source)) (defun elinfo-next-error-function (n &optional reset) "Move to next entry in Elinfo Log and show corresponding Texinfo source. Optional argument RESET non-nil means go to first Elinfo Log entry. See also `elinfo-next-error'. This function provides the value of `next-error-function' in Elinfo Log buffers." (interactive "p") (when reset (goto-char (point-min))) (elinfo-next-error n)) (defun elinfo-previous-error (n) "Show Texinfo source corresponding to previous entry in Elinfo Log. Prefix arg N says how many entries to move backwards \(forwards, if negative). See also `elinfo-next-error'." (interactive "p") (elinfo-next-error (- n))) (defun elinfo-next-file (n) "Show Texinfo source of next Elinfo Log entry naming another file. Prefix arg N says how many files to move forwards \(or backwards, if negative). See also `elinfo-next-error'." (interactive "p") (elinfo-next-error n t)) (defun elinfo-previous-file (n) "Show Texinfo source of previous Elinfo Log entry naming another file. Prefix arg N says how many files to move backwards \(or forwards, if negative). See also `elinfo-next-error'." (interactive "p") (elinfo-next-file (- n))) (defun elinfo-goto-error (&optional event) "Show Texinfo source for Elinfo Log entry at `point'. See also `elinfo-next-error'." (interactive (list last-input-event)) (when event (posn-set-point (event-end event))) (next-error-internal)) (defvar elinfo-log-mode-map (let ((map (make-sparse-keymap))) (define-key map [mouse-2] 'elinfo-goto-error) (define-key map [follow-link] 'mouse-face) (define-key map "\C-c\C-c" 'elinfo-goto-error) ;; Elinfo goto compatibility. (define-key map "\C-c\C-g" 'elinfo-goto-error) (define-key map "\C-m" 'elinfo-goto-error) (define-key map "\M-n" 'elinfo-next-error) (define-key map "\M-p" 'elinfo-previous-error) (define-key map "\M-{" 'elinfo-previous-file) (define-key map "\M-}" 'elinfo-next-file) (define-key map "\t" 'elinfo-next-error) (define-key map [backtab] 'elinfo-previous-error) (define-key map "q" 'quit-window) (define-key map " " 'scroll-up) (define-key map "\^?" 'scroll-down) (define-key map "\C-c\C-f" 'next-error-follow-minor-mode) map) "Keymap for Elinfo Logs.") (defun elinfo-log-mode () "Major mode for browsing Elinfo Logs. \\{elinfo-log-mode-map}" (interactive) (kill-all-local-variables) (use-local-map elinfo-log-mode-map) (setq major-mode 'elinfo-log-mode) (setq mode-name "Elinfo") (setq next-error-function 'elinfo-next-error-function) (setq buffer-read-only t)) (defvar elinfo-log-button-map (let ((map (make-sparse-keymap))) (define-key map [mouse-2] 'elinfo-goto-error) (define-key map [follow-link] 'mouse-face) (define-key map "\C-m" 'elinfo-goto-error) map) "Keymap for Elinfo Log buttons.") (fset 'elinfo-log-button-map elinfo-log-button-map) ;; Two utility functions for getting properties of functions and ;; variables. These are used for checking properties described by ;; Texinfo text. (defun elinfo--fun-properties (function) "Return properties of FUNCTION. Return value is a list of three elements: The first element, when non-nil, tells that FUNCTION is a command. The second element, when non-nil, specifies the function FUNCTION is aliased to. The third element specifies the argument list of FUNCTION (and is t when that argument list is not available)." (let (def command alias arglist doc) (setq def (if (symbolp function) (symbol-function function) function)) (setq command (commandp def)) (when (symbolp def) (while (and (fboundp def) (symbolp (symbol-function def))) (setq def (indirect-function def))) (setq alias def)) (setq arglist (if (and (setq doc (documentation function)) (string-match "\n\n(fn[ \]*\\(.*\\))[ \t\n]*\\'" doc)) (mapcar 'downcase (split-string (match-string 1 doc))) ;; Handle symbols aliased to other symbols. (when (and (symbolp def) (fboundp def)) (setq def (indirect-function def))) ;; If definition is a macro, find the function inside it. (when (eq (car-safe def) 'macro) (setq def (cdr def))) (cond ((byte-code-function-p def) (aref def 0)) ((eq (car-safe def) 'lambda) (nth 1 def)) (t t)))) (list command alias (if (listp arglist) arglist t)))) (defun elinfo--is-option (variable) "Return t if VARIABLE is an option." (let* ((file-name (find-lisp-object-file-name variable 'defvar)) (alias (condition-case nil (indirect-variable variable) (error variable))) (doc (condition-case nil (or (documentation-property variable 'variable-documentation) (documentation-property alias 'variable-documentation)) (error nil))) (option (or (custom-variable-p variable) (and (stringp doc) (string-equal "*" (substring-no-properties doc 0 1)))))) option)) ;; Some counters. (defvar elinfo-files-count nil) (defvar elinfo-errors-count 0) (defvar elinfo-warnings-count 0) (defun elinfo-clear-log () "Clear Elinfo Log buffer and reset error counts." ;; Maybe we should make clearing customizable. (with-current-buffer (get-buffer-create "*Elinfo-Log*") (let ((inhibit-read-only t)) (delete-region (point-min) (point-max)))) ;; We could insert some header, but who'd care anyway? (setq elinfo-errors-count 0) (setq elinfo-warnings-count 0)) (defun elinfo-display-log (&optional file-name) "Display Elinfo Log. If there were no errors just write a message. Optional argument FILE-NAME must name the file whose contents were checked." (if (and (zerop elinfo-errors-count) (zerop elinfo-warnings-count)) (let ((text (concat "No errors found" (cond (file-name (format " in file %s" file-name)) (elinfo-files-count (format " in %s checked file(s)" elinfo-files-count)))))) (message text)) (let* ((buffer (get-buffer "*Elinfo-Log*")) (window (when buffer (display-buffer buffer))) (text (concat (format "%s error(s)" elinfo-errors-count) (when elinfo-log-warnings (format " and %s warning(s)" elinfo-warnings-count)) (cond (file-name (format " in file %s" file-name)) (elinfo-files-count (format " in %s checked file(s)" elinfo-files-count)))))) ;; We really should have an option which lets users select ;; Log/Help windows. (when window (set-window-point window (point-min))) (message text)))) (defun elinfo-log-string (type name file-name position string &optional warning) "Log STRING in Elinfo Log buffer. STRING is propertized with TYPE \(the type of the error), NAME \(usually the name of some function or variable), FILE-NAME \(the name of a file), POSITION \(a position valid in a buffer visiting FILE-NAME). Optional argument WARNING non-nil means no logging occurs unless `elinfo-log-warnings' is non-nil." (when (or (not warning) elinfo-log-warnings) (if warning (setq elinfo-warnings-count (1+ elinfo-warnings-count)) (setq elinfo-errors-count (1+ elinfo-errors-count))) (let ((line (count-lines (point-min) position)) (column (save-excursion (goto-char position) (current-column)))) (with-current-buffer (get-buffer-create "*Elinfo-Log*") (unless (eq major-mode 'elinfo-log-mode) (elinfo-log-mode)) (let ((inhibit-read-only t) from to) (goto-char (setq from (point-max))) (insert ;; Propertized information. (propertize (concat (propertize (concat (when file-name (concat (file-name-nondirectory file-name) ":")) (format "%s:%s" line column)) 'face elinfo-log-location-face) " " string) 'elinfo-type type 'elinfo-name name 'elinfo-file-name file-name 'elinfo-position position 'mouse-face 'highlight 'follow-link 'elinfo-follow-link) ;; STRING itself followed by a newline. "\n")))))) (defsubst elinfo--symbol-name (arg) "If ARG is a symbol return its name, else return ARG." (if (symbolp arg) (symbol-name arg) arg)) (defun elinfo-check (file-name) "Check current buffer. FILE-NAME must specify the name of the file whose contents are in the current buffer." (interactive) (save-excursion (save-restriction (widen) ;; Make sure `case-fold-search' is nil to catch the "C-" and "M-" ;; entries only and not the C-mode "c-..." stuff. (let (case-fold-search texi-name texi-pd-name texi-arg-name texi-command texi-option position properties arg-list arg-name is-option) (goto-char (point-min)) ;; We probably won't ever bother to optimize the regexp below. (while (re-search-forward ;; REGEXP MNEMNONICS: 1..defun/deffn, 4..defvar/defopt, ;; 8..text we ignore, 9..C-/M- prefixed string in @code, ;; 10..SPC/RET/etc. in @code, 11..SPC/RET/etc. in @kbd, ;; 13..non-nil. "^@\\(?:\\(\\(defunx?\\|defmac\\|defspec\\)\\|\\(deffn Command\\)\\)\ \\|\\(\\(defvarx?\\)\\|\\(defoptx?\\)\\)\ \\|\\(small\\)?\\(example\\|lisp\\|verbatim\\|ignore\\)\\)\ \\|@code{\\([CM]-[^}]+\\)\\|@\\(?:\\(code\\)\ \\|\\(kbd\\)\\){\\(SPC\\|RET\\|LFD\\|TAB\\|BS\\|ESC\\|DEL\\|SHIFT\\|CTRL\\|META\\|F[0-9]+\\)}\ \\|\\(non-nil\\)" nil t) (cond ((match-beginning 1) (setq texi-command (match-beginning 3)) (setq position (elinfo--move-to-next)) (setq texi-name (elinfo--get-next)) (setq texi-pd-name (propertize texi-name 'face elinfo-log-def-face)) (if (not (fboundp (intern texi-name))) (progn (elinfo-log-string 'fun-unbound texi-name file-name position (concat "Unbound function `" texi-pd-name "'") t) (forward-line)) (setq properties (elinfo--fun-properties (intern texi-name))) (if (and (eq (nth 1 properties) 'ignore) (not (string-equal texi-name "ignore"))) ;; When the function is remapped to `ignore' we can't ;; do anything reasonable on the current platform. (progn (elinfo-log-string 'remapped-to-ignore texi-name file-name position (concat "`" texi-pd-name "' remapped to `ignore' on this platform") t) (forward-line)) ;; Check command property. (cond ((and texi-command (not (nth 0 properties))) (elinfo-log-string 'command texi-name file-name position (concat "`" texi-pd-name "' is not a command"))) ((and (not texi-command) (nth 0 properties)) (elinfo-log-string 'no-command texi-name file-name position (concat "`" texi-pd-name "' is a command")))) ;; Check arguments. (setq arg-list (nth 2 properties)) (cond ((not arg-list) (unless (string-equal (elinfo--get-next-arg) "") ;; The function doesn't have arguments. (elinfo-log-string 'arg-mismatch texi-name file-name position (concat "`" texi-pd-name "' has no arguments")))) ((not (listp arg-list)) ;; Usually this is an "Arglist not available ..." problem. (elinfo-log-string 'arg-not-available texi-name file-name position (concat "Arguments of `" texi-pd-name "' not available") t)) (t (catch 'bug (dolist (arg arg-list) (setq arg-name (elinfo--symbol-name arg)) (setq texi-arg-name (elinfo--get-next-arg)) (cond ((string-equal arg-name texi-arg-name)) ; :-) ((string-equal texi-arg-name "") ;; The function has more arguments than described. (or (and (string-match "&optional\\|&rest" arg-name) ;; Maybe the manual just wants us to ;; ignore the arguments. (let ((last-name (car (last arg-list)))) (string-match "ignored?" (elinfo--symbol-name last-name)))) (elinfo-log-string 'arg-mismatch texi-name file-name position (concat "`" texi-pd-name "' lacks description of " (cond ((string-equal arg-name "&optional") "optional argument(s).") ((string-equal arg-name "%rest") "rest of arguments.") (t (concat "argument `" arg-name "'")))))) (throw 'bug t)) ((or (string-match "&optional\\|&rest" arg-name) (string-match "&optional\\|&rest" texi-arg-name)) (or (and (string-match "&optional\\|&rest" texi-arg-name) ;; Maybe we don't want to know all ;; arguments? Let's get heuristical ;; once more. (let ((next-name (elinfo--get-next-arg))) (string-match "elements\\|arguments\\|args" next-name 0))) ;; The following idiom is popular within ;; macros and special forms. (and (string-match "&rest" arg-name) (or (string-match "@dots{}$" texi-arg-name) (re-search-forward "@dots{}$" (line-end-position) t))) (elinfo-log-string 'arg-mismatch texi-name file-name position (concat "`" texi-pd-name "' has mismatching optional or rest of arguments"))) (throw 'bug t)) (t (elinfo-log-string 'arg-mismatch texi-name file-name position (concat "'" texi-pd-name "' uses `" texi-arg-name "' for argument `" arg-name "'") t)) (throw 'bug t))))))))) ((match-beginning 4) (setq texi-option (match-beginning 6)) (setq position (elinfo--move-to-next)) (setq texi-name (elinfo--get-next)) (setq texi-pd-name (propertize texi-name 'face elinfo-log-def-face)) (if (not (boundp (intern texi-name))) (progn (elinfo-log-string 'var-unbound texi-name file-name position (concat "Unbound variable `" texi-pd-name "'") t) (forward-line)) ;; Check option property. (setq is-option (elinfo--is-option (intern texi-name))) (cond ((and texi-option (not is-option)) (elinfo-log-string 'no-option texi-name file-name position (concat "`" texi-pd-name "' is not an option"))) ((and (not texi-option) is-option) (elinfo-log-string 'option texi-name file-name position (concat "`" texi-pd-name "' is an option")))))) ((match-beginning 8) (setq position (line-beginning-position)) ;; An example-verbatim-lisp group, skip it. (unless (re-search-forward (concat "^@end " (match-string-no-properties 7) (match-string-no-properties 8)) nil t) (elinfo-log-string 'no-end nil file-name position (concat "Unterminated " (match-string-no-properties 7) (match-string-no-properties 8))))) ((match-beginning 9) (elinfo-log-string 'code-kbd nil file-name (match-beginning 9) (concat "Use @kbd for @code{" (replace-regexp-in-string "\n" " " (match-string-no-properties 9)) "}"))) ((match-beginning 10) (elinfo-log-string 'code-key nil file-name (match-beginning 10) (concat "Use @key for @code{" (replace-regexp-in-string "\n" " " (match-string-no-properties 12)) "}"))) ((match-beginning 11) (elinfo-log-string 'kbd-key nil file-name (match-beginning 11) (concat "Use @key for @kbd{" (replace-regexp-in-string "\n" " " (match-string-no-properties 12)) "}"))) ((match-beginning 13) ;; Currently we don't cater for @code{nil} miswritten as nil ;; and probably never will. (elinfo-log-string 'kbd-key nil file-name (match-beginning 13) (concat "Use non-@code{nil} instead of non-nil"))) ;; Additional cases added here. )))))) (defun elinfo-check-buffer () "Check validity of current Texinfo buffer. The buffer must be visiting a file." (interactive) (elinfo-clear-log) (cond ((not buffer-file-name) (error "The current buffer does not visit a file")) ((and (buffer-modified-p) (y-or-n-p (format "Save buffer %s first? " (buffer-name)))) (save-buffer))) (setq elinfo-files-count nil) (goto-char (point-min)) (elinfo-check buffer-file-name) (elinfo-display-log buffer-file-name)) (defun elinfo-check-file (file-name) "Check validity of Texinfo file FILE-NAME." (interactive "fCheck Texinfo file ") (elinfo-clear-log) (setq elinfo-files-count nil) (with-temp-buffer (insert-file-contents file-name) (elinfo-check file-name)) (elinfo-display-log file-name)) (defun elinfo-check-directory (directory) "Check validity of Texinfo files in DIRECTORY." (interactive "DCheck all texinfo files in ") (let ((files (directory-files directory 'full))) (elinfo-clear-log) (setq elinfo-files-count 0) (dolist (file files) (when (and (string-equal (file-name-extension file) "texi") (not (backup-file-name-p file))) (with-temp-buffer (insert-file-contents file) (elinfo-check file) (setq elinfo-files-count (1+ elinfo-files-count)) (let (message-log-max) (message "%s files checked" elinfo-files-count))))) (elinfo-display-log))) ;; _____________________________________________________________________________ ;; ;;; Navigation Utilities ;; _____________________________________________________________________________ ;; (defcustom elinfo-names-alist '(("elisp" . "lispref") ("emacs" . "emacs")) "Alist of (info-file-name texi-subdirectory) associations. If elinfo can't find an association it will look in the \"misc\" subdirectory." :type '(repeat (group (string :tag "info-file" :format " %t: %v" :size 15) (string :tag "texi-subdir" :format " %t: %v\n" :size 15))) :group 'elinfo) (defcustom elinfo-texi-directory (concat source-directory "doc/") "Directory where doc files reside." :type 'file :group 'elinfo) (defun elinfo--default-tag () "Return cons holding start and name of tag near `point'." (let (from to bound) (when (or (progn ;; Look at text around `point'. (save-excursion (skip-chars-backward "a-z0-9") (setq from (point))) (save-excursion (skip-chars-forward "a-z0-9") (setq to (point))) (> to from)) ;; Look between `line-beginning-position' and `point'. (save-excursion (and (setq bound (line-beginning-position)) (skip-chars-backward "^a-z0-9" bound) (> (setq to (point)) bound) (skip-chars-backward "a-z0-9") (setq from (point)))) ;; Look between `point' and `line-end-position'. (save-excursion (and (setq bound (line-end-position)) (skip-chars-forward "^a-z0-9" bound) (< (setq from (point)) bound) (skip-chars-forward "a-z0-9") (setq to (point))))) (cons from (buffer-substring-no-properties from to))))) (defconst elinfo-at-regexp "\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)" "Regexp supposed to match Texinfo constructs that separate items.") ;; Two globals, initialized in `elinfo-goto-texi' and updated by ;; `elinfo--update-regexp'. (defvar elinfo-from nil) (defvar elinfo-regexp nil) (defun elinfo--make-regexp (from to) "Return buffer text from FROM to TO as regexp." (replace-regexp-in-string "[ \t]+" "[ \t\n]+" (regexp-quote (downcase (buffer-substring-no-properties from to))) nil t)) (defun elinfo--make-regexp-2 (from to) "Return buffer text from FROM to TO as Texinfoish regexp." (replace-regexp-in-string "@\\([@{}]\\)" (lambda (string) (match-string-no-properties 1 string)) (elinfo--make-regexp from to))) (defun elinfo--update-regexp (&optional rest) (setq elinfo-regexp (concat elinfo-regexp (elinfo--make-regexp elinfo-from (match-beginning 0)) (unless (= elinfo-from (match-beginning 0)) elinfo-at-regexp) (when rest (concat rest elinfo-at-regexp))))) (defun elinfo--update-regexp-810 (&optional rest) (setq elinfo-regexp (concat elinfo-regexp (elinfo--make-regexp elinfo-from (match-beginning 0)) (unless (= elinfo-from (match-beginning 0)) elinfo-at-regexp) (when rest (concat rest elinfo-at-regexp))))) (defconst elinfo-info-regexp "\\(-- \\(?:Function\\|Command\\|Variable\\|Macro\\|Special Form\\|User Option\\|Syntax class\\): \\)\ \\|\\(?:\\(\\*Note\\)\\|\\(::\\)\\|\\*\\(Warning:\\)\\*\\)[ ]*\\|[‘]*<\\([^>]+\\)>[’]*\ \\|\\(\\(?:La\\)?TeX\\|([CR])\\|Euro\\)\\|\\(‘\\([^’]+\\)’\\)\\|\\([‘]*\\([A-Z0-9*-]+\\)[’]*\\)\ \\|\\([ \t\n]+\\)\\|\\(['\"_]+\\)\\|\\(-+\\)\\|\\(\\.[.]+\\)\ \\|\\(\\(?:error\\)?[-.!=>|*#][-.!=>|*#]+\\)\\|\\([{}]\\)" "The basic regexp used by `elinfo-goto-texi'.") ;; Used for testing only so REMOVE it. (defvar elinfo-test-texi-window nil) (defun elinfo-find-tag-other-window (tagname &optional next-p regexp-p) "Our version of `find-tag-other-window'." (let* ((window-point (window-point)) (tagbuf (find-tag-noselect tagname next-p regexp-p)) (tagpoint (progn (set-buffer tagbuf) (point)))) (set-window-point (prog1 (selected-window) (select-window (or (get-buffer-window tagbuf) (let ((window (split-window nil nil 'left))) (set-window-buffer window tagbuf) window))) (set-window-point (selected-window) tagpoint)) window-point))) (defun elinfo-windows-synch (&optional source target) "Synch `window-point' of window TARGET with that of window SOURCE." (let* ((source (or source (selected-window))) (target (or target (window-in-direction 'left source) (window-in-direction 'right source))) (line (with-current-buffer (window-buffer source) (count-screen-lines (window-start source) (window-point source)))) ;; Experimentally OK ... (source-head (/ (+ (window-header-line-height source) (window-tab-line-height source)) (frame-char-height (window-frame source)))) (target-head (/ (+ (window-header-line-height target) (window-tab-line-height target)) (frame-char-height (window-frame target))))) (setq line (+ line (- source-head target-head))) (with-selected-window target (recenter (1- line))))) (defun elinfo-goto-texi () "In `Info-mode' display corresponding Texinfo text in other window. If invoking this command doesn't get you to the desired position in the Texinfo buffer, you might try to invoke it from a position one line above or below the current one in the Info buffer." (interactive) (unless (eq major-mode 'Info-mode) (error "This function can be invoked in Info-mode only")) (let* ((info-window (selected-window)) (window-point (window-point)) (info-file-name (when (boundp 'Info-current-file) (file-name-nondirectory Info-current-file))) (tag (when (boundp 'Info-current-node) Info-current-node)) (texi-subdir (when info-file-name (let ((cell (assoc info-file-name elinfo-names-alist))) (if cell (cdr cell) "misc")))) (at (point)) (offset 0) texi-dir texi-file tags-file-name default-tag anchor repeat from to case-fold-search tagbuf) (cond ((not texi-subdir) (error "Can't identify texinfo subdirectory")) ((file-exists-p (setq texi-dir (concat elinfo-texi-directory texi-subdir "/")))) (t (error "Can't find texinfo subdirectory"))) ;; Sigh ... (save-excursion ;; Move away from empty lines. (save-excursion (beginning-of-line) (when (looking-at "[ \t]*$") (let ((prev-at (progn (skip-chars-backward " \t\n") (unless (bobp) (point)))) (next-at (progn (skip-chars-forward " \t\n") (unless (eobp) (point))))) (setq at (cond ((and prev-at next-at) (if (> (- at prev-at) (- next-at at)) next-at prev-at)) (prev-at prev-at) (next-at next-at)))))) (goto-char at) (setq from (progn (beginning-of-line) (skip-chars-forward " \t") (point))) (setq to (progn (end-of-line) (skip-chars-backward " \t") (point))) ;; Try to handle an xref that spans the beginning of this line. ;; This will put us after the reference in the texi buffer, but ;; that's still better than getting to the head of the section. (when (save-excursion (goto-char from) (and (re-search-backward "\\*[Nn]ote" (line-beginning-position 0) t) (not (re-search-forward "[.,]\\|::" from t)))) (goto-char from) (re-search-forward "::\\|\\([.,])?\\)" to t) (when (match-beginning 1) ;; Found a closing `.' or `,' so relocate `from'. (setq from (point)) (setq at (max from at)))) (goto-char at) (when (setq default-tag (elinfo--default-tag)) (setq offset (- at (car default-tag))) (setq anchor (cdr default-tag)) (save-excursion (goto-char from) (catch 'done (let ((case-fold-search t)) (while (re-search-forward anchor nil t) (if (= (match-beginning 0) (car default-tag)) (throw 'done t) (setq repeat (if repeat (1+ repeat) 2)))))))) (goto-char (setq elinfo-from from)) (setq elinfo-regexp nil) ;; Scan line. (while (re-search-forward elinfo-info-regexp to t) (cond ((match-beginning 1)) ; -- Function: ... ((match-beginning 2) ; *Note (elinfo--update-regexp) (let ((from-1 (point)) to-1) (if (re-search-forward ":" to t) (progn (setq to-1 (match-beginning 0)) (if (looking-at-p ":") ;; :: is handled by main loop, rescan. (goto-char from-1) ;; A complex xref, handle it here. (setq elinfo-regexp (concat elinfo-regexp ; Force a `{' here? "[a-zA-Z, \t\n]*" (elinfo--make-regexp from-1 to-1) "[a-zA-Z, \t\n]*}")) ;; Quit loop. (setq from to) (goto-char to))) (setq elinfo-regexp (concat elinfo-regexp ; Force a `{' here? "[a-zA-Z, \t\n]*" (elinfo--make-regexp from-1 to) "[a-zA-Z, \t\n]*}")) ;; Quit loop. (setq from to) (goto-char to)))) ((match-beginning 3) ; :: (elinfo--update-regexp)) ((match-beginning 4) ; *Warning:* (elinfo--update-regexp)) ((match-beginning 5) ; `<KEY>' (elinfo--update-regexp (elinfo--make-regexp (match-beginning 5) (match-end 5)))) ((match-beginning 6) (elinfo--update-regexp (concat "@?" (match-string-no-properties 6) "\\(?:[{}]+\\)?"))) ((match-beginning 7) ; Quoted string (elinfo--update-regexp-810 (elinfo--make-regexp (match-beginning 8) (match-end 8)))) ((match-beginning 9) ; Upper-case string. (elinfo--update-regexp-810 (elinfo--make-regexp (match-beginning 10) (match-end 10)))) ((match-beginning 11) ; Whitespace (elinfo--update-regexp "[ \t\n]+")) ((match-beginning 12) ; `'" (elinfo--update-regexp)) ((match-beginning 13) ; -- (elinfo--update-regexp "[-]+")) ((match-beginning 14) ; .. (elinfo--update-regexp "\\(?:\\.[.]+\\|@dots{}\\)")) ((match-beginning 15) ; => (elinfo--update-regexp (concat "\\(?:@[a-z]+{}\\)\\|" (elinfo--make-regexp (match-beginning 15) (match-end 15))))) ((match-beginning 16) ; { or } (elinfo--update-regexp (concat "@?" (match-string-no-properties 16) elinfo-at-regexp))) ;; Future additions go here. ) (setq elinfo-from (point))) (setq elinfo-regexp (concat elinfo-regexp (elinfo--make-regexp elinfo-from to)))) (if (not (file-exists-p (setq tags-file-name (concat texi-dir "TAGS")))) (error "No tags file found") (setq tagbuf (find-tag-noselect (concat "^@node " tag "[\x7f,]") nil t)) (elinfo-find-tag-other-window (concat "^@node " tag "[\x7f,]") nil t) ;; Used for testing only so REMOVE it. (setq elinfo-test-texi-window (selected-window)) (let ((case-fold-search t)) (when elinfo-regexp (with-current-buffer (get-buffer-create "*test*") (delete-region (point-min) (point-max)) (insert elinfo-regexp)) (let ((at (point))) (cond ((and (re-search-forward (concat "\\(" elinfo-regexp "\\)\\|^node") nil t) (match-beginning 1)) (goto-char (match-beginning 1)) (save-restriction (widen) (narrow-to-region (match-beginning 1) (match-end 1)) (when (and anchor (re-search-forward anchor nil t repeat)) (goto-char (+ (match-beginning 0) offset))))) (t (goto-char at))))))) (set-window-point info-window window-point) (elinfo-windows-synch info-window (get-buffer-window tagbuf)) ;; The following might get removed. (select-window info-window) )) (defconst elinfo-texi-glyph-regexp (regexp-opt '("dots" "enddots" "bullet" "minus" "result" "expansion" "print" "error" "equiv" "point" "copyright" "registeredsymbol" "euro" "pounds" "LaTeX" "TeX")) "Regular expression for glyphs in texi buffers.") (defconst elinfo-texi-regexp (concat "@def\\(un\\|fn Command\\|var\\|macro\\|spec\\|opt\\)[ \t]+\ \\|\\(?:@p?xref{\\([^}]+\\)\\(?:}\\|$\\)\\)\\|@\\(?:strong\\|emph\\){\\([^}]+\\)}\ \\|\\(@" elinfo-texi-glyph-regexp "{}\\)\\|\\(?:@dfn{\\([^}]+\\)}\\)\ \\|\\(@[a-z]+{\\)\\|\\(}\\)\\|\\([ \t]+\\)\\|\\([`'\"][`'\"]+\\)") "The basic regexp used by `elinfo-goto-info'.") (defun elinfo-goto-info () "In `texinfo-mode' display corresponding Info node in other window. If invoking this command doesn't get you to the desired position in the Info buffer, you might try to invoke it from a position one line above or below the current one in the Texinfo buffer." (interactive) (unless (eq major-mode 'texinfo-mode) (error "This function can be invoked in texinfo-mode only")) (let* ((texi-window (selected-window)) (window-point (window-point)) (info-node-3 (save-excursion (when (re-search-backward "^@node \\([-a-zA-Z ]+\\)" nil t) (match-string-no-properties 1)))) (info-node-2 (save-excursion (goto-char (point-min)) (when (re-search-forward "^@node \\([-a-zA-Z ]+\\)" nil t) (match-string-no-properties 1)))) (info-node-1 ;; If we can't find an association, `info-other-window' will ;; start at the top node. (car (rassoc (file-name-nondirectory (directory-file-name (file-name-directory buffer-file-name))) elinfo-names-alist))) (offset 0) regexp default-tag anchor repeat case-fold-search) (save-excursion ;; Use brute force. (while (save-excursion (beginning-of-line) (or (looking-at "[ \t]*\n\\|@anchor{") (and (looking-at "@[a-z]+\\(?: \\|$\\)") (not (looking-at "@def\\(un\\|fn Command\\|var\\|macro\\|spec\\|opt\\)"))))) (forward-line)) (let ((at (point)) (from (progn (beginning-of-line) (skip-chars-forward " \t") (point))) (to (progn (end-of-line) (skip-chars-backward " \t") (point)))) ;; Get out of any reference at BOL, we can't trace it anyway. (condition-case nil (and (progn (goto-char from) (up-list -1) (looking-at "{")) (looking-back "xref" (line-beginning-position)) (progn (forward-sexp) (setq from (point)) (setq at (max at from)))) (error nil)) (goto-char (max at from)) (when (setq default-tag (elinfo--default-tag)) (setq offset (- at (car default-tag))) (setq anchor (cdr default-tag)) (save-excursion (goto-char from) (catch 'done (let ((case-fold-search t)) (while (re-search-forward anchor nil t) (if (= (match-beginning 0) (car default-tag)) (throw 'done t) (setq repeat (if repeat (1+ repeat) 2)))))))) (goto-char from) ;; Scan line. (while (re-search-forward elinfo-texi-regexp to t) (cond ((match-beginning 1)) ((match-beginning 2) ; @p?xref (or (and (save-excursion (goto-char (match-end 2)) (and (save-match-data ;; A comma within the braces stands for a ;; complex reference. (re-search-backward "," (match-beginning 2) t)) (skip-chars-forward " \t\n,") (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "\\*note[ \t\n]*" (elinfo--make-regexp-2 (point) (match-end 2)))))) ;; Leave main loop. (setq from to) (goto-char to)) (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "\\*note[ \t\n]*" (elinfo--make-regexp-2 (match-beginning 2) (match-end 2)) "::[ \t\n]*")))) ((match-beginning 3) ; @strong or @emph (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "[*_]" (elinfo--make-regexp-2 (match-beginning 3) (match-end 3)) "[*_][ \t\n]*"))) ((match-beginning 4) ; Some sort of glyph (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "\\(?:\\(?:error\\)?[-.!=>|*][-.!=>|*]+\\|\\(?:La\\)?TeX\\|([CR])\\|Euro\\|#\\)[ \t\n]*"))) ((match-beginning 5) ; @dfn (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "\"" (elinfo--make-regexp-2 (match-beginning 5) (match-end 5)) "\""))) ((match-beginning 6) ; @...{ (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "‘?"))) ((match-beginning 7) ; } (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "’?"))) ((match-beginning 8) ; whitespace (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "[ \t\n]*"))) ((match-beginning 9) ; quotes like ``...'' (setq regexp (concat regexp (elinfo--make-regexp-2 from (match-beginning 0)) "[‘’\"]+"))) ) ; Future additions go here. (setq from (point))) (setq regexp (concat regexp (elinfo--make-regexp-2 from to))))) (info-other-window info-node-1) (Info-goto-node info-node-2) (when info-node-3 (Info-goto-node info-node-3)) (let ((case-fold-search t)) (when regexp (with-current-buffer (get-buffer-create "*test*") (delete-region (point-min) (point-max)) (insert regexp)) (let ((at (point))) (cond ;; We could add name after File: below. ((and (re-search-forward (concat "\\(" regexp "\\)\\|^File: ") nil t) (match-beginning 1)) (goto-char (match-beginning 1)) (save-restriction (widen) (narrow-to-region (match-beginning 1) (match-end 1)) (when (and anchor (re-search-forward anchor nil t repeat)) (goto-char (+ (match-beginning 0) offset))))) (t (goto-char at)))))) (set-window-point texi-window window-point) (elinfo-windows-synch texi-window (get-buffer-window "*info*")) (select-window texi-window) ; remove this evtly )) ;; Unless people object ... (define-key Info-mode-map "\C-c\C-g" 'elinfo-goto-texi) (define-key texinfo-mode-map "\C-c\C-g" 'elinfo-goto-info) (define-key texinfo-mode-map "\M-o" 'elinfo-goto-info) ;; _____________________________________________________________________________ ;; ;;; Grabbing Documentation Dtrings ;; _____________________________________________________________________________ ;; (defun elinfo-get-definition () "Add doc string of object at `point' to kill ring. The doc string is suitable for insertion into Elisp Texinfo buffers." (interactive) (save-excursion (let ((at (point)) limit from to type name arg-list init-value doc open-paren-at) (cond ((eq major-mode 'emacs-lisp-mode) (beginning-of-defun) (when (looking-at "(\\(?:\\(def\\(?:un\\|subst\\)\\)\\|\\(defvar\\)\\|\\(defcustom\\)\\|\\(defmacro\\)\\)[ \t]+") (setq type (cond ((match-beginning 1) 'function) ((match-beginning 2) 'variable) ((match-beginning 3) 'option) ((match-beginning 4) 'macro))) (goto-char (match-end 0)) (setq name (elinfo--get-next)) (cond ((memq type '(function command macro)) (skip-chars-forward " \t\n(") (while (not (looking-at ")")) (setq arg-list (cons (buffer-substring-no-properties (point) (progn (skip-chars-forward "^ \t\n)") (point))) arg-list)) (skip-chars-forward " \t\n")) (setq arg-list (nreverse arg-list))) ((memq type '(variable option)) (skip-chars-forward " \t\n") (setq init-value (buffer-substring-no-properties (point) (progn (forward-sexp) (point)))))) (skip-chars-forward " \t\n)") ;; We must be before the doc-string now. (setq doc (buffer-substring-no-properties (progn (skip-chars-forward " \t\n") (1+ (point))) (progn (forward-sexp) (1- (point))))) (when (eq type 'function) ;; Check for interactive property. (forward-comment (buffer-size)) (when (looking-at "(interactive") (setq type 'command))))) ((eq major-mode 'c-mode) (beginning-of-defun) (cond ((looking-at "void[ \t\n]*syms_of") (setq limit (point)) (goto-char at) (when (or (save-excursion (beginning-of-line) (looking-at "[ \t]*DEFVAR_\\(?:LISP\\|BOOL\\|INT\\|PER_BUFFER\\)[ \t]+(\"\\([^\" \t\n]+\\)")) (re-search-backward "DEFVAR_\\(?:LISP\\|BOOL\\|INT\\|PER_BUFFER\\)[ \t]+(\"\\([^\" \t\n]+\\)" limit t)) (setq type 'variable) (setq name (match-string-no-properties 1)))) ;; The following must be decided. ;; (when (eq type 'variable) (custom-variable-p (intern name))) ;; (setq type 'option)) ((looking-at "\\(?:DEFUN\\)[ \t]+(\"\\([^\" \t\n]+\\)") (setq type 'function) (setq name (match-string-no-properties 1)))) (cond ((eq type 'function) (let ((index 0) from to min-args) (re-search-forward ",[ \t]*" nil nil 3) ;; Must be before MIN now. (setq min-args (string-to-number (buffer-substring-no-properties (point) (progn (skip-chars-forward "0-9") (point))))) (re-search-forward ",[ \t\n]*" nil t 2) (unless (looking-at "0") ;; Command. (setq type 'command)) (re-search-forward "^[ \t]*doc:[ \]*\\(/\\*[ \t]*\\)" nil t) (setq from (match-end 0)) (goto-char (match-beginning 1)) (forward-comment 1) (save-excursion (skip-chars-backward "*/ \t") (setq to (point))) (setq doc (buffer-substring-no-properties from to)) (unless arg-list (skip-chars-forward ") \t\n(") (unless (looking-at "void") (while (not (looking-at "[ \t\n]*)")) (when (looking-at "\\(?:register[ \t]+Lisp_Object[ \t]*\\)\\|\\(?:Lisp_Object[ \t]*\\)") (goto-char (match-end 0))) (skip-chars-forward " \t\n,") (cond ((< index min-args) (setq index (1+ index))) ((= index min-args) (setq arg-list (cons "&optional" arg-list)) (setq index (1+ index)))) (when (looking-at "Lisp_Object[ \t]*") (goto-char (match-end 0))) (setq arg-list (cons (buffer-substring-no-properties (point) (progn (skip-chars-forward "^, \t\n)") (point))) arg-list))) (setq arg-list (nreverse arg-list)))))) ((eq type 'variable) (re-search-forward "^[ \t]*doc:[ \]*\\(/\\*[ \t]*\\)" nil t) (setq from (match-end 0)) (goto-char (match-beginning 1)) (forward-comment 1) (save-excursion (skip-chars-backward "*/ \t") (setq to (point))) ;; Remove leading asterisk. (goto-char from) (when (looking-at "\\*") (setq from (1+ from))) (setq doc (buffer-substring-no-properties from to)))))) (when doc (with-current-buffer (get-buffer-create "*Elinfo-Temp*") (delete-region (point-min) (point-max)) (insert doc) ;; Handle usage information. (when (memq type '(function command)) (when (re-search-backward "^usage:[ \t]*(" from t) (setq from (point)) (goto-char (match-end 0)) (save-excursion (goto-char (match-beginning 0)) (skip-chars-backward " \t\n") (setq to (point))) (skip-chars-forward "^ \t\n") (skip-chars-forward " \t\n") (setq arg-list nil) (while (not (or (looking-at ")") (eobp))) (setq arg-list (cons (downcase (buffer-substring-no-properties (point) (progn (skip-chars-forward "^ \t\n)") (point)))) arg-list)) (skip-chars-forward " \t\n")) (setq arg-list (nreverse arg-list)) (delete-region (point) (point-max)))) (when arg-list ;; Handle args within doc. This case could be merged with ;; the remaining ones but hardly seems worth the effort. (goto-char (point-min)) (let ((regexp (mapconcat 'regexp-quote (mapcar 'upcase arg-list) "\\|")) (to (make-marker)) case-fold-search) (while (re-search-forward regexp nil t) (let ((string (downcase (match-string-no-properties 0))) (from (match-beginning 0))) (set-marker to (match-end 0)) (catch 'found (dolist (arg arg-list) ;; We assume that all args are downcase. (when (string-equal string arg) (replace-match (concat "@var{" string "}") t) (throw 'found t)))))) (set-marker to nil))) ;; Look for quotes, double-quotes, nil and other funny things. (goto-char (point-min)) (let (quoted) (while (re-search-forward "\\(`\\)\\(?:[^ \t\n']+\\)\\('\\)\ \\|\\(?:^\\|[ \t\n(*`]\\)\\(\\(?:[Nn]on-\\)?\\(nil\\)\\|t\\|\\(\\(?:[CM]-\\)+\\(?:[a-zA-Z]+\\|[^ a-zA-Z\t\n]+\\)\\)\ \\|\\(SPC\\|RET\\|LFD\\|TAB\\|BS\\|ESC\\|DEL\\|SHIFT\\|CTRL\\|META\\|F[0-9]+\\)\\)\\(?:$\\|[ \t\n,.;:?!')]\\)\ \\|\\(\\\\?\"\\)\\|\\(?:\\(\\\\\\\\[[{]\\)[^]}]+\\([]}]\\)\\|\\(<[^>]+>\\)\\)\ \\|\\\\\\([([{]\\)\\|\\(\\\\$\\)" nil t) (cond ((match-beginning 1) ;; `...' (replace-match "@code{" nil nil nil 1) (replace-match "}" nil nil nil 2)) ((match-beginning 3) ;; `nil', `non-nil' and `t'. (cond ((match-beginning 4) (replace-match (concat "@code{nil}") t nil nil 4)) ((match-beginning 5) (replace-match (concat "@kbd{" (match-string-no-properties 5) "}") t nil nil 5)) ((match-beginning 6) (replace-match (concat "@key{" (match-string-no-properties 6) "}") t nil nil 6)) (t (replace-match (concat "@code{" (match-string-no-properties 3) "}") t nil nil 3)))) ((match-beginning 7) ;; \" (replace-match (if quoted "''" "``")) (setq quoted (not quoted ))) ((match-beginning 8) ;; \\[...], \\{...} (replace-match "@code{" nil t nil 8) (replace-match "}" nil t nil 9)) ((match-beginning 10) ;; \\<...> (replace-match "")) ((match-beginning 11) ;; An escaped paren (this must follow case 2). (replace-match (match-string-no-properties 11))) ((match-beginning 12) ;; A backslash at EOL. (replace-match "")) ))) ;; Fill. (let ((fill-column 70)) (fill-region (point-min) (point-max))) ;; Handle key-bindings, manual references and URLs, maybe. (setq doc (buffer-substring-no-properties (point-min) (point-max))))) (let (at-string arg-list-string) (cond ((eq type 'command) (setq at-string (cons "@deffn Command" "@end deffn"))) ((eq type 'function) (setq at-string (cons "@defun" "@end defun"))) ((eq type 'variable) (setq at-string (cons "@defvar" "@end defvar"))) ((eq type 'option) (setq at-string (cons "@defopt" "@end defopt"))) ((eq type 'macro) (setq at-string (cons "@defmac" "@end defmac")))) (dolist (arg arg-list) (setq arg-list-string (concat arg-list-string " " arg))) ;; Hardly anyone wants to see this so REMOVE it. (with-selected-window (display-buffer (get-buffer-create "*Elinfo-show")) (delete-region (point-min) (point-max)) (insert (concat (car at-string) " " name arg-list-string "\n" doc "\n" (cdr at-string) "\n\n"))) (kill-new (concat (car at-string) " " name arg-list-string "\n" doc "\n" (cdr at-string) "\n\n")))))) ;; TODO: define-mode and define-derived-mode are not handled. ;; TODO: Special forms are written in C and have UNEVALLED set. (provide 'elinfo) ;; Below this line are monsters. ;; Used for testing only, so REMOVE it. (defun elinfo-test () (interactive) (with-current-buffer (get-buffer-create "*elinfo-test*") (delete-region (point-min) (point-max))) (let ((old-point (point)) new-point node split-height-threshold split-width-threshold) (while t (Info-forward-node nil nil t) (forward-line (random 43)) (setq old-point (point)) (setq node Info-current-node) (elinfo-goto-texi) (with-selected-window elinfo-test-texi-window (elinfo-goto-info) (sit-for 0.1)) (setq new-point (point)) (with-current-buffer (get-buffer-create "*elinfo-test*") (goto-char (point-max)) (insert (format "node: %s old: %s new: %s diff: %s \n" node old-point new-point (- old-point new-point)))) (goto-char old-point)))) ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 7:46 bug#65726: 29.1.50; Crash in regexp engine martin rudalics @ 2023-09-04 8:44 ` Mattias Engdegård 2023-09-04 12:12 ` Eli Zaretskii 2023-09-04 14:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-11 8:10 ` Mattias Engdegård 2 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-04 8:44 UTC (permalink / raw) To: 65726; +Cc: martin rudalics Can't reproduce here (macOS). Can you give us a deeper backtrace? Try finding out the Lisp context: what function is called and with what arguments? What is the regexp, exactly? What text does it attempt to match against? ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 8:44 ` Mattias Engdegård @ 2023-09-04 12:12 ` Eli Zaretskii 2023-09-04 13:18 ` Mattias Engdegård 0 siblings, 1 reply; 36+ messages in thread From: Eli Zaretskii @ 2023-09-04 12:12 UTC (permalink / raw) To: Mattias Engdegård; +Cc: rudalics, 65726 > Cc: martin rudalics <rudalics@gmx.at> > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Mon, 4 Sep 2023 10:44:06 +0200 > > Can't reproduce here (macOS). Can you give us a deeper backtrace? I can reproduce on MS-Windows. It's an infinite recursion. > Try finding out the Lisp context: what function is called and with what arguments? What is the regexp, exactly? What text does it attempt to match against? Here's the end of the call stack: #104528 0x0120bbc4 in search_buffer_re (string=XIL(0x8000000007e5e308), pos=62669, pos_byte=62669, lim=300101, lim_byte=300103, n=1, trt=XIL(0xa00000000631c92c), inverse_trt=XIL(0xa000000006231ed4), posix=false) at search.c:1265 #104529 0x0120cae5 in search_buffer (string=XIL(0x8000000007e5e308), pos=62669, pos_byte=62669, lim=300101, lim_byte=300103, n=1, RE=1, trt=XIL(0xa00000000631c92c), inverse_trt=XIL(0xa000000006231ed4), posix=false) at search.c:1527 #104530 0x0120b404 in search_command (string=XIL(0x8000000007e5e308), bound=XIL(0), noerror=XIL(0x30), count=XIL(0), direction=1, RE=1, posix=false) at search.c:1069 #104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308), bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294 #104532 0x0127860e in eval_sub (form=XIL(0xc000000007dbef40)) at eval.c:2508 #104533 0x0126f601 in Fand (args=XIL(0xc000000007dbe810)) at eval.c:370 #104534 0x012780e0 in eval_sub (form=XIL(0xc000000007dbef30)) at eval.c:2449 #104535 0x0126f847 in Fcond (args=XIL(0xc000000007e16d00)) at eval.c:412 #104536 0x012780e0 in eval_sub (form=XIL(0xc000000007e16d70)) at eval.c:2449 #104537 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436 #104538 0x01272b16 in Flet (args=XIL(0xc000000007e16d90)) at eval.c:1026 #104539 0x012780e0 in eval_sub (form=XIL(0xc000000007e16da0)) at eval.c:2449 #104540 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436 #104541 0x012780e0 in eval_sub (form=XIL(0xc000000007e17740)) at eval.c:2449 #104542 0x0126f74a in Fif (args=XIL(0xc000000007e16dd0)) at eval.c:391 #104543 0x012780e0 in eval_sub (form=XIL(0xc000000007e16dc0)) at eval.c:2449 #104544 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436 #104545 0x01272b16 in Flet (args=XIL(0xc000000007e16e00)) at eval.c:1026 #104546 0x012780e0 in eval_sub (form=XIL(0xc000000007e16e10)) at eval.c:2449 #104547 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436 #104548 0x0126f7ae in Fif (args=XIL(0xc000000007e16e30)) at eval.c:392 #104549 0x012780e0 in eval_sub (form=XIL(0xc000000007e16e20)) at eval.c:2449 #104550 0x0126fa9d in Fprogn (body=XIL(0xc000000007dbeb50)) at eval.c:436 #104551 0x0127247c in FletX (args=XIL(0xc000000007e16fc0)) at eval.c:958 #104552 0x012780e0 in eval_sub (form=XIL(0xc000000007e16fd0)) at eval.c:2449 #104553 0x0126fa9d in Fprogn (body=XIL(0)) at eval.c:436 #104554 0x0127b95e in funcall_lambda (fun=XIL(0xc000000007dbe750), nargs=0, arg_vector=0x82ecb0) at eval.c:3233 #104555 0x0127a4c9 in funcall_general (fun=XIL(0xc000000007dbe750), numargs=0, args=0x82ecb0) at eval.c:2957 #104556 0x0127a6b7 in Ffuncall (nargs=1, args=0x82eca8) at eval.c:2995 #104557 0x012693e3 in Ffuncall_interactively (nargs=1, args=0x82eca8) at callint.c:250 #104558 0x0127ad17 in funcall_subr (subr=0x1881580 <Sfuncall_interactively>, numargs=1, args=0x82eca8) at eval.c:3059 #104559 0x0127a2cb in funcall_general (fun=XIL(0xa000000001881580), numargs=1, args=0x82eca8) at eval.c:2941 #104560 0x0127a6b7 in Ffuncall (nargs=2, args=0x82eca0) at eval.c:2995 #104561 0x012790cb in Fapply (nargs=3, args=0x82eca0) at eval.c:2619 #104562 0x01269ae9 in Fcall_interactively (function=XIL(0x60de7c0), record_flag=XIL(0), keys=XIL(0xa000000007970198)) at callint.c:342 #104563 0x0127a915 in funcall_subr (subr=0x18815c0 <Scall_interactively>, numargs=3, args=0x6c50078) at eval.c:3038 #104564 0x012ed316 in exec_byte_code (fun=XIL(0xa00000000615797c), args_template=1025, nargs=1, args=0x82f590) at bytecode.c:809 #104565 0x0127ae66 in fetch_and_exec_byte_code (fun=XIL(0xa00000000615797c), args_template=1025, nargs=1, args=0x82f588) at eval.c:3081 #104566 0x0127b3c5 in funcall_lambda (fun=XIL(0xa00000000615797c), nargs=1, arg_vector=0x82f588) at eval.c:3153 #104567 0x0127a332 in funcall_general (fun=XIL(0xa00000000615797c), numargs=1, args=0x82f588) at eval.c:2945 #104568 0x0127a6b7 in Ffuncall (nargs=2, args=0x82f580) at eval.c:2995 #104569 0x0116d992 in call1 (fn=XIL(0x4590), arg1=XIL(0x60de7c0)) at lisp.h:3248 #104570 0x01171cb7 in command_loop_1 () at keyboard.c:1503 #104571 0x01274470 in internal_condition_case ( bfun=0x11710ec <command_loop_1>, handlers=XIL(0x90), hfun=0x11700ba <cmd_error>) at eval.c:1474 #104572 0x01170b59 in command_loop_2 (handlers=XIL(0x90)) at keyboard.c:1133 #104573 0x012732f7 in internal_catch (tag=XIL(0x103b0), func=0x1170b22 <command_loop_2>, arg=XIL(0x90)) at eval.c:1197 #104574 0x01170ac4 in command_loop () at keyboard.c:1111 #104575 0x0116fb1a in recursive_edit_1 () at keyboard.c:720 #104576 0x0116fdb8 in Frecursive_edit () at keyboard.c:803 #104577 0x0116ab60 in main (argc=2, argv=0xa428e0) at emacs.c:2521 Lisp Backtrace: "re-search-forward" (0x82d6a0) "and" (0x82d8a0) "cond" (0x82da60) "let" (0x82dcf0) "progn" (0x82deb0) "if" (0x82e070) "let" (0x82e300) "if" (0x82e500) "let*" (0x82e770) "elinfo-goto-texi" (0x82ecb0) "funcall-interactively" (0x82eca8) "call-interactively" (0x6c50078) "command-execute" (0x82f588) (gdb) In frame #104531, we have: #104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308), bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294 and regexp is this monstrocity: "\\(split\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)-\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)window\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)&optional\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)window\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)size\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)side\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)[ \t\n]+\\(?:\\(?:@[a-z]+{\\|[}`'\"_:]*\\)*[ \t\n]*\\)pixelwise\\)\\|^node" Feel free to ask more questions, as I can reproduce this at will. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 12:12 ` Eli Zaretskii @ 2023-09-04 13:18 ` Mattias Engdegård 2023-09-04 13:26 ` Mattias Engdegård 2023-09-04 13:28 ` Eli Zaretskii 0 siblings, 2 replies; 36+ messages in thread From: Mattias Engdegård @ 2023-09-04 13:18 UTC (permalink / raw) To: Eli Zaretskii; +Cc: martin rudalics, 65726, Stefan Monnier 4 sep. 2023 kl. 14.12 skrev Eli Zaretskii <eliz@gnu.org>: > I can reproduce on MS-Windows. It's an infinite recursion. Hmmm, strange that I'm unable to then. The recursion seems to be in code added in 5a864f23eb. Does the problem disappear if that commit is unmade? I shall give this another go but meanwhile maybe Stefan could look at it, too? Not that I am implything there is anything wrong with his code. Heavens, no. > In frame #104531, we have: > > #104531 0x0120e35a in Fre_search_forward (regexp=XIL(0x8000000007e5e308), > bound=XIL(0), noerror=XIL(0x30), count=XIL(0)) at search.c:2294 That should be the re-search-forward call in elinfo.el:973. > and regexp is this monstrocity: Thank you, this one is apparently generated dynamically. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 13:18 ` Mattias Engdegård @ 2023-09-04 13:26 ` Mattias Engdegård 2023-09-04 13:28 ` Eli Zaretskii 1 sibling, 0 replies; 36+ messages in thread From: Mattias Engdegård @ 2023-09-04 13:26 UTC (permalink / raw) To: Eli Zaretskii; +Cc: martin rudalics, 65726, Stefan Monnier > Hmmm, strange that I'm unable to then. Now I can reproduce the crash (I didn't have any TAGS files). ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 13:18 ` Mattias Engdegård 2023-09-04 13:26 ` Mattias Engdegård @ 2023-09-04 13:28 ` Eli Zaretskii 2023-09-04 15:47 ` Mattias Engdegård 1 sibling, 1 reply; 36+ messages in thread From: Eli Zaretskii @ 2023-09-04 13:28 UTC (permalink / raw) To: Mattias Engdegård; +Cc: rudalics, 65726, monnier > From: Mattias Engdegård <mattias.engdegard@gmail.com> > Date: Mon, 4 Sep 2023 15:18:01 +0200 > Cc: 65726@debbugs.gnu.org, > martin rudalics <rudalics@gmx.at>, > Stefan Monnier <monnier@iro.umontreal.ca> > > 4 sep. 2023 kl. 14.12 skrev Eli Zaretskii <eliz@gnu.org>: > > > I can reproduce on MS-Windows. It's an infinite recursion. > > Hmmm, strange that I'm unable to then. > > The recursion seems to be in code added in 5a864f23eb. Does the problem disappear if that commit is unmade? Yes, if I revert that change, the problem goes away. > I shall give this another go but meanwhile maybe Stefan could look at it, too? > Not that I am implything there is anything wrong with his code. Heavens, no. Thanks. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 13:28 ` Eli Zaretskii @ 2023-09-04 15:47 ` Mattias Engdegård 2023-09-05 12:23 ` Mattias Engdegård 0 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-04 15:47 UTC (permalink / raw) To: Eli Zaretskii; +Cc: rudalics, 65726, monnier Minimised reproduction recipe: (string-match (rx (* "a") (* (* "b"))) "a") (Don't worry about the nested stars; the original was more like (* (| A (* B))).) The generated bytecode is: 0 on-failure-jump-smart to 9 3 exact "a" 6 jump to 0 9 on-failure-jump-loop to 24 12 on-failure-jump-smart to 21 15 exact "b" 18 jump to 12 21 jump to 9 24 succeed which suggests a mutual_exclusive_p recursion loop: mep(3,9) -> mep(3,12) -> mep(3,21) = {skip_noop} = mep(3,9) -> mep(3,12) ... Maybe there is something clever we could do here but it's possible that we just can't avoid keeping track of visited nodes. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 15:47 ` Mattias Engdegård @ 2023-09-05 12:23 ` Mattias Engdegård 2023-09-05 13:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-05 12:23 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726 Thank you for fixing it, but now regex-emacs-tests fail with a regexp stack overflow. It seems that by fixing this bug, we've unfixed the one that the broken code was supposed to fix. And we should probably include a regression test for this bug as well. The previously mentioned (string-match (rx (* "a") (* (* "b"))) "a") might suffice, but extending it to (string-match (rx (* "a") (* (or "c" (* "b")))) "a") is safer in case the regexp compiler decides to simplify (* (* X)) -> (* X). ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-05 12:23 ` Mattias Engdegård @ 2023-09-05 13:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-05 13:50 ` Mattias Engdegård 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-05 13:08 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 > Thank you for fixing it, but now regex-emacs-tests fail with a regexp stack > overflow. Really? I thought I had specifically run that test. [ ...recompiling + rerunning the test... ] Hmm... you're right. Crap. > It seems that by fixing this bug, we've unfixed the one that the > broken code was supposed to fix. Well, that one was less of a bug (more of a missing optimization), so it's no reason to revert the change, but yes, not good. Let's see if I can finagle something. > And we should probably include a regression test for this bug as well. > The previously mentioned > > (string-match (rx (* "a") (* (* "b"))) "a") > > might suffice, but extending it to > > (string-match (rx (* "a") (* (or "c" (* "b")))) "a") > > is safer in case the regexp compiler decides to simplify (* (* X)) -> (* X). Feel free. To me, it feels too much like testing the presence/absence of a very specific bug (i.e. too unlikely that we'll reintroduce that exact bug. It's not like it was a weird case I didn't think of). Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-05 13:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-05 13:50 ` Mattias Engdegård 2023-09-05 15:33 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-05 13:50 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726 5 sep. 2023 kl. 15.08 skrev Stefan Monnier <monnier@iro.umontreal.ca>: > Let's see if I can finagle something. I liked your previous optimisation, so if you could finagle it back without breaking anything that would be pukka. > Feel free. To me, it feels too much like testing the presence/absence > of a very specific bug (i.e. too unlikely that we'll reintroduce that > exact bug. It's not like it was a weird case I didn't think of). Alright, pushed to emacs-29. It's a cheap test and offers at least some kind of protection in the event that someone undoes your handiwork. Such things do happen after all. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-05 13:50 ` Mattias Engdegård @ 2023-09-05 15:33 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-06 12:03 ` Mattias Engdegård 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-05 15:33 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 >> Feel free. To me, it feels too much like testing the presence/absence >> of a very specific bug (i.e. too unlikely that we'll reintroduce that >> exact bug. It's not like it was a weird case I didn't think of). > Alright, pushed to emacs-29. It's a cheap test and offers at least some kind > of protection in the event that someone undoes your handiwork. Such things > do happen after all. I marked the backtracking test as failing because the "real" fix wil likely be too risky to put onto `emacs-29` (at least for now). >> Let's see if I can finagle something. > I liked your previous optimisation, so if you could finagle it back without > breaking anything that would be pukka. WDYT of the approach below? You can ignore the `mutually_exclusive_(charset|exactn)` thingies which just move code around (it's necessary for the patch). It's a bit ugly because the actual assumptions on which it relies aren't spelled out explicitly (and even less assert-checked at run time). The basic idea is that `done` points to the beginning of the "current loop": as long as we move forward we presume we may still be in the current loop, and when we find a jump backward it's either a jump to the beginning of a new (nested/further) loop (so we move `done`) or a jump to the same loop or to some earlier loop (which we presumably handle elsewhere). I also tried a safer option where we do: return ((p2 < done ? false : p2 == done ? true : mutually_exclusive_aux (bufp, p1, p2, p2 > p2_orig ? done : p2)) && (p2_other < done ? false : p2_other == done ? true : mutually_exclusive_aux (bufp, p1, p2_other, p2_other > p2_orig ? done : p2_other))); i.e. we only consider `done` itself as "already checked", which should definitely be safe. It also seems to be sufficient to handle things like: (should (looking-at "x*\\(=+?\\|h\\)+")) Adding an `assert (p2 >= done)` confirms that this does happen, so whether we return true or false when `p2 < done` does matter, so I guess we should go with the safer option unless we can really convince ourselves that the more optimistic option is also correct. Then again, maybe we should bite the bullet and actually keep track of the positions already visited, so we don't need any "fancy" argument. Stefan diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 7e75f0ac597..0d3cf39d619 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3643,19 +3643,125 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte, return not; } +/* Case where `p2` points to an `exactn`. */ +static bool +mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + bool multibyte = RE_MULTIBYTE_P (bufp); + int c + = (re_opcode_t) *p2 == endline ? '\n' + : RE_STRING_CHAR (p2 + 2, multibyte); + + if ((re_opcode_t) *p1 == exactn) + { + if (c != RE_STRING_CHAR (p1 + 2, multibyte)) + { + DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); + return true; + } + } + + else if ((re_opcode_t) *p1 == charset + || (re_opcode_t) *p1 == charset_not) + { + if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), + Qnil)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == anychar + && c == '\n') + { + DEBUG_PRINT (" . != \\n => fast loop.\n"); + return true; + } + return false; +} + +/* Case where `p2` points to an `charset`. */ +static bool +mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + /* It is hard to list up all the character in charset + P2 if it includes multibyte character. Give up in + such case. */ + if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) + { + /* Now, we are sure that P2 has no range table. + So, for the size of bitmap in P2, 'p2[1]' is + enough. But P1 may have range table, so the + size of bitmap table of P1 is extracted by + using macro 'CHARSET_BITMAP_SIZE'. + + In a multibyte case, we know that all the character + listed in P2 is ASCII. In a unibyte case, P1 has only a + bitmap table. So, in both cases, it is enough to test + only the bitmap table of P1. */ + + if ((re_opcode_t) *p1 == charset) + { + int idx; + /* We win if the charset inside the loop + has no overlap with the one after the loop. */ + for (idx = 0; + (idx < (int) p2[1] + && idx < CHARSET_BITMAP_SIZE (p1)); + idx++) + if ((p2[2 + idx] & p1[2 + idx]) != 0) + break; + + if (idx == p2[1] + || idx == CHARSET_BITMAP_SIZE (p1)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == charset_not) + { + int idx; + /* We win if the charset_not inside the loop lists + every character listed in the charset after. */ + for (idx = 0; idx < (int) p2[1]; idx++) + if (! (p2[2 + idx] == 0 + || (idx < CHARSET_BITMAP_SIZE (p1) + && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) + break; + + if (idx == p2[1]) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + } + return false; +} + /* True if "p1 matches something" implies "p2 fails". */ +/* Avoiding inf-loops: + We're trying to follow all paths reachable from `p2`, but since some + loops can match the empty string, this can loop back to `p2`. + To avoid inf-looping, we rely on a lexicographic ordering using `done` + and `p2`: at every recursion, either `done` is larger, or it + is unchanged and `p2` is larger. */ static bool -mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2) +mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2, re_char *done) { re_opcode_t op2; - bool multibyte = RE_MULTIBYTE_P (bufp); unsigned char *pend = bufp->buffer + bufp->used; re_char *p2_orig = p2; eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); + eassert (p2 >= done); + /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, look at what begins its body, since we will have to @@ -3684,98 +3790,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, case endline: case exactn: - { - int c - = (re_opcode_t) *p2 == endline ? '\n' - : RE_STRING_CHAR (p2 + 2, multibyte); - - if ((re_opcode_t) *p1 == exactn) - { - if (c != RE_STRING_CHAR (p1 + 2, multibyte)) - { - DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); - return true; - } - } - - else if ((re_opcode_t) *p1 == charset - || (re_opcode_t) *p1 == charset_not) - { - if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), - Qnil)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == anychar - && c == '\n') - { - DEBUG_PRINT (" . != \\n => fast loop.\n"); - return true; - } - } - break; + return mutually_exclusive_exactn (bufp, p1, p2); case charset: { if ((re_opcode_t) *p1 == exactn) - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); - - /* It is hard to list up all the character in charset - P2 if it includes multibyte character. Give up in - such case. */ - else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) - { - /* Now, we are sure that P2 has no range table. - So, for the size of bitmap in P2, 'p2[1]' is - enough. But P1 may have range table, so the - size of bitmap table of P1 is extracted by - using macro 'CHARSET_BITMAP_SIZE'. - - In a multibyte case, we know that all the character - listed in P2 is ASCII. In a unibyte case, P1 has only a - bitmap table. So, in both cases, it is enough to test - only the bitmap table of P1. */ - - if ((re_opcode_t) *p1 == charset) - { - int idx; - /* We win if the charset inside the loop - has no overlap with the one after the loop. */ - for (idx = 0; - (idx < (int) p2[1] - && idx < CHARSET_BITMAP_SIZE (p1)); - idx++) - if ((p2[2 + idx] & p1[2 + idx]) != 0) - break; - - if (idx == p2[1] - || idx == CHARSET_BITMAP_SIZE (p1)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == charset_not) - { - int idx; - /* We win if the charset_not inside the loop lists - every character listed in the charset after. */ - for (idx = 0; idx < (int) p2[1]; idx++) - if (! (p2[2 + idx] == 0 - || (idx < CHARSET_BITMAP_SIZE (p1) - && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) - break; - - if (idx == p2[1]) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - } + return mutually_exclusive_exactn (bufp, p2, p1); + else + return mutually_exclusive_charset (bufp, p1, p2); } break; @@ -3783,9 +3805,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, switch (*p1) { case exactn: + return mutually_exclusive_exactn (bufp, p2, p1); case charset: - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); + return mutually_exclusive_charset (bufp, p2, p1); case charset_not: /* When we have two charset_not, it's very unlikely that they don't overlap. The union of the two sets of excluded @@ -3830,11 +3852,18 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, int mcnt; p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); - /* Don't just test `mcnt > 0` because non-greedy loops have - their test at the end with an unconditional jump at the start. */ - if (p2 + mcnt > p2_orig) /* Ensure forward progress. */ - return (mutually_exclusive_p (bufp, p1, p2) - && mutually_exclusive_p (bufp, p1, p2 + mcnt)); + re_char *p2_other = p2 + mcnt; + /* Presumably, any position up to `done` should be a position we + already considered elsewhere (because our state machines are not + arbitrary: they consists of syntaxically nested loops with limited + control flow), so we can consider those back jumps as mutually + exclusive here. */ + return ((p2 <= done + || mutually_exclusive_aux (bufp, p1, p2, + p2 > p2_orig ? done : p2)) + && (p2_other <= done + || mutually_exclusive_aux (bufp, p1, p2_other, + p2_other > p2_orig ? done : p2_other))); break; } @@ -3846,6 +3875,12 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, return false; } +static bool +mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + return mutually_exclusive_aux (bufp, p1, p2, min (p2, p1)); +} \f /* Matching routines. */ ^ permalink raw reply related [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-05 15:33 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-06 12:03 ` Mattias Engdegård 2023-09-09 15:55 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-06 12:03 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726 5 sep. 2023 kl. 17.33 skrev Stefan Monnier <monnier@iro.umontreal.ca>: > You can ignore the `mutually_exclusive_(charset|exactn)` thingies which > just move code around (it's necessary for the patch). Actually that's a good refactoring to start with regardless of the path we follow later. Makes things more readable if nothing else. > The basic idea is that `done` points to the beginning of the "current > loop": as long as we move forward we presume we may still be in the > current loop, and when we find a jump backward it's either a jump to the > beginning of a new (nested/further) loop (so we move `done`) or a jump > to the same loop or to some earlier loop (which we presumably handle > elsewhere). So how can we describe the exact semantics of `done` in mutually_exclusive_aux(p1,p2,done)? Something like: Let me(p1,p2) = true iff (p1 matches) implies (p2 fails). Then me_aux(p1,p2,done) = me(p1,p2) under the assumption that me(p1,x) holds for all x<done where x is reachable from p2. ? > Adding an `assert (p2 >= done)` confirms that this does happen, Can you give an example of such a regexp? I ran `make check` with your patch applied and checking enabled but got no assertion failure. > so > whether we return true or false when `p2 < done` does matter, so I guess > we should go with the safer option unless we can really convince > ourselves that the more optimistic option is also correct. > > Then again, maybe we should bite the bullet and actually keep track of > the positions already visited, so we don't need any "fancy" argument. There are a couple of ways of doing that but it might not be that bad. It would also be more robust in face of future changes that may not obey the current invariants. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-06 12:03 ` Mattias Engdegård @ 2023-09-09 15:55 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-09 16:34 ` Mattias Engdegård 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-09 15:55 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 >> Adding an `assert (p2 >= done)` confirms that this does happen, > > Can you give an example of such a regexp? I ran `make check` with your patch > applied and checking enabled but got no assertion failure. My patch didn't include the assertion :-) The patch below does and hits an assertion failure during the dump: Loading format... Loading bindings... WOW1! Error: error ("WOW1!") mapbacktrace(#[1028 "\1\4\203\24\0\301\302!\210\300\4!\210\301\303!\210\202\35\0\301\304!\210\3\3B\262\1\211\2035\0\300\1@!\210\211A\211\262\2\2035\0\301\305!\210\202!\0\301\306!\207" [prin1 princ " " "(" " (" " " ")\n"] 7 "\n\n(fn EVALD FUNC ARGS FLAGS)"]) debug-early-backtrace() debug-early(error (error "WOW1!")) string-match("\\`[^ ]+\\( [^ ]+\\)*\\'" "h r" nil t) key-valid-p("h r") keymap--check("h r") keymap-set((keymap (27 keymap (119 . eww-search-words)) (111 . occur)) "h r" highlight-regexp) define-keymap("o" occur "M-w" eww-search-words "h r" highlight-regexp "h p" highlight-phrase "h l" highlight-lines-matching-regexp "h ." highlight-symbol-at-point "h u" unhighlight-regexp "h f" hi-lock-find-patterns "h w" hi-lock-write-interactive-patterns) (defvar search-map (define-keymap "o" 'occur "M-w" 'eww-search-words "h r" 'highlight-regexp "h p" 'highlight-phrase "h l" 'highlight-lines-matching-regexp "h ." 'highlight-symbol-at-point "h u" 'unhighlight-regexp "h f" 'hi-lock-find-patterns "h w" 'hi-lock-write-interactive-patterns) ("/home/monnier/src/emacs/main/lisp/bindings.elc" . 40677)) load("bindings") load("loadup.el") -- Stefan diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 394ba22e9b0..b47cdfbfef8 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3643,19 +3643,125 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte, return not; } +/* Case where `p2` points to an `exactn`. */ +static bool +mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + bool multibyte = RE_MULTIBYTE_P (bufp); + int c + = (re_opcode_t) *p2 == endline ? '\n' + : RE_STRING_CHAR (p2 + 2, multibyte); + + if ((re_opcode_t) *p1 == exactn) + { + if (c != RE_STRING_CHAR (p1 + 2, multibyte)) + { + DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); + return true; + } + } + + else if ((re_opcode_t) *p1 == charset + || (re_opcode_t) *p1 == charset_not) + { + if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), + Qnil)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == anychar + && c == '\n') + { + DEBUG_PRINT (" . != \\n => fast loop.\n"); + return true; + } + return false; +} + +/* Case where `p2` points to an `charset`. */ +static bool +mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + /* It is hard to list up all the character in charset + P2 if it includes multibyte character. Give up in + such case. */ + if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) + { + /* Now, we are sure that P2 has no range table. + So, for the size of bitmap in P2, 'p2[1]' is + enough. But P1 may have range table, so the + size of bitmap table of P1 is extracted by + using macro 'CHARSET_BITMAP_SIZE'. + + In a multibyte case, we know that all the character + listed in P2 is ASCII. In a unibyte case, P1 has only a + bitmap table. So, in both cases, it is enough to test + only the bitmap table of P1. */ + + if ((re_opcode_t) *p1 == charset) + { + int idx; + /* We win if the charset inside the loop + has no overlap with the one after the loop. */ + for (idx = 0; + (idx < (int) p2[1] + && idx < CHARSET_BITMAP_SIZE (p1)); + idx++) + if ((p2[2 + idx] & p1[2 + idx]) != 0) + break; + + if (idx == p2[1] + || idx == CHARSET_BITMAP_SIZE (p1)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == charset_not) + { + int idx; + /* We win if the charset_not inside the loop lists + every character listed in the charset after. */ + for (idx = 0; idx < (int) p2[1]; idx++) + if (! (p2[2 + idx] == 0 + || (idx < CHARSET_BITMAP_SIZE (p1) + && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) + break; + + if (idx == p2[1]) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + } + return false; +} + /* True if "p1 matches something" implies "p2 fails". */ +/* Avoiding inf-loops: + We're trying to follow all paths reachable from `p2`, but since some + loops can match the empty string, this can loop back to `p2`. + To avoid inf-looping, we rely on a lexicographic ordering using `done` + and `p2`: at every recursion, either `done` is larger, or it + is unchanged and `p2` is larger. */ static bool -mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2) +mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2, re_char *done) { re_opcode_t op2; - bool multibyte = RE_MULTIBYTE_P (bufp); unsigned char *pend = bufp->buffer + bufp->used; re_char *p2_orig = p2; eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); + eassert (p2 >= done); + /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, look at what begins its body, since we will have to @@ -3684,98 +3790,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, case endline: case exactn: - { - int c - = (re_opcode_t) *p2 == endline ? '\n' - : RE_STRING_CHAR (p2 + 2, multibyte); - - if ((re_opcode_t) *p1 == exactn) - { - if (c != RE_STRING_CHAR (p1 + 2, multibyte)) - { - DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); - return true; - } - } - - else if ((re_opcode_t) *p1 == charset - || (re_opcode_t) *p1 == charset_not) - { - if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), - Qnil)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == anychar - && c == '\n') - { - DEBUG_PRINT (" . != \\n => fast loop.\n"); - return true; - } - } - break; + return mutually_exclusive_exactn (bufp, p1, p2); case charset: { if ((re_opcode_t) *p1 == exactn) - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); - - /* It is hard to list up all the character in charset - P2 if it includes multibyte character. Give up in - such case. */ - else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) - { - /* Now, we are sure that P2 has no range table. - So, for the size of bitmap in P2, 'p2[1]' is - enough. But P1 may have range table, so the - size of bitmap table of P1 is extracted by - using macro 'CHARSET_BITMAP_SIZE'. - - In a multibyte case, we know that all the character - listed in P2 is ASCII. In a unibyte case, P1 has only a - bitmap table. So, in both cases, it is enough to test - only the bitmap table of P1. */ - - if ((re_opcode_t) *p1 == charset) - { - int idx; - /* We win if the charset inside the loop - has no overlap with the one after the loop. */ - for (idx = 0; - (idx < (int) p2[1] - && idx < CHARSET_BITMAP_SIZE (p1)); - idx++) - if ((p2[2 + idx] & p1[2 + idx]) != 0) - break; - - if (idx == p2[1] - || idx == CHARSET_BITMAP_SIZE (p1)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == charset_not) - { - int idx; - /* We win if the charset_not inside the loop lists - every character listed in the charset after. */ - for (idx = 0; idx < (int) p2[1]; idx++) - if (! (p2[2 + idx] == 0 - || (idx < CHARSET_BITMAP_SIZE (p1) - && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) - break; - - if (idx == p2[1]) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - } + return mutually_exclusive_exactn (bufp, p2, p1); + else + return mutually_exclusive_charset (bufp, p1, p2); } break; @@ -3783,9 +3805,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, switch (*p1) { case exactn: + return mutually_exclusive_exactn (bufp, p2, p1); case charset: - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); + return mutually_exclusive_charset (bufp, p2, p1); case charset_not: /* When we have two charset_not, it's very unlikely that they don't overlap. The union of the two sets of excluded @@ -3830,11 +3852,25 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, int mcnt; p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); - /* Don't just test `mcnt > 0` because non-greedy loops have - their test at the end with an unconditional jump at the start. */ - if (p2 > p2_orig && mcnt >= 0) /* Ensure forward progress. */ - return (mutually_exclusive_p (bufp, p1, p2) - && mutually_exclusive_p (bufp, p1, p2 + mcnt)); + re_char *p2_other = p2 + mcnt; + /* Presumably, any position up to `done` should be a position we + already considered elsewhere (because our state machines are not + arbitrary: they consists of syntaxically nested loops with limited + control flow), so we can consider those back jumps as mutually + exclusive here. */ + if (p2 < done) + error ("WOW1!"); + if (p2_other < done) + error ("WOW2!"); + return ((p2 < done ? true + : p2 == done ? true + : mutually_exclusive_aux (bufp, p1, p2, + p2 > p2_orig ? done : p2)) + && (p2_other < done ? true + : p2_other == done ? true + : mutually_exclusive_aux (bufp, p1, p2_other, + p2_other > p2_orig + ? done : p2_other))); break; } @@ -3846,6 +3882,12 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, return false; } +static bool +mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + return mutually_exclusive_aux (bufp, p1, p2, min (p2, p1)); +} \f /* Matching routines. */ ^ permalink raw reply related [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-09 15:55 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-09 16:34 ` Mattias Engdegård 2023-09-14 14:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-09 16:34 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726 9 sep. 2023 kl. 17.55 skrev Stefan Monnier <monnier@iro.umontreal.ca>: > My patch didn't include the assertion :-) Oh, I thought you meant the eassert (p2 >= done); added at the start of mutually_exclusive_aux. Sorry about the confusion. Anyway, disassembling the regexp makes it clear why: Disassembly of regexp "\\`[^ ]+\\( [^ ]+\\)*\\'" 0 begbuf 1 charset-not [ ] 8 on-failure-jump-smart to 21 11 charset-not [ ] 18 jump to 8 21 on-failure-jump to 54 24 start-memory group 1 26 exact " " 29 charset-not [ ] 36 on-failure-jump-smart to 49 39 charset-not [ ] 46 jump to 36 49 stop-memory group 1 51 jump to 21 54 endbuf 55 succeed which unless I'm mistaken we can condense to: Disassembly of regexp "\\(?:ba*\\)*" 0 on-failure-jump to 18 3 exact "b" 6 on-failure-jump-smart to 15 9 exact "a" 12 jump to 6 15 jump to 0 18 succeed so yes, we may need to remember where we've been. (At this point someone will inevitably point out a helpful invariant that is obvious in hindsight. This is just my cunning attempt at making that happen.) ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-09 16:34 ` Mattias Engdegård @ 2023-09-14 14:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-15 20:03 ` Mattias Engdegård 2023-09-16 3:45 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 2 replies; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-14 14:41 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 [-- Attachment #1: Type: text/plain, Size: 732 bytes --] > so yes, we may need to remember where we've been. (At this point > someone will inevitably point out a helpful invariant that is obvious > in hindsight. This is just my cunning attempt at making that happen.) No helpful invariant that makes it trivial, but if we keep pushing the same idea that we rely on the assumption that we just have a syntactically nested loop nest, then we can handle that as in the patch below. I.e. keep the idea I proposed of keeping track of a beg..end region that's already been handled. But now we really do have to maintain both ends (before, I only had `done` which kept track of the end of the region), and just "restart" when we jump back to a point before the "done" region. Stefan [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: regexp.c.patch --] [-- Type: text/x-diff, Size: 9710 bytes --] diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 394ba22e9b0..a6b7faadf86 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3643,19 +3643,128 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte, return not; } +/* Case where `p2` points to an `exactn`. */ +static bool +mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + bool multibyte = RE_MULTIBYTE_P (bufp); + int c + = (re_opcode_t) *p2 == endline ? '\n' + : RE_STRING_CHAR (p2 + 2, multibyte); + + if ((re_opcode_t) *p1 == exactn) + { + if (c != RE_STRING_CHAR (p1 + 2, multibyte)) + { + DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); + return true; + } + } + + else if ((re_opcode_t) *p1 == charset + || (re_opcode_t) *p1 == charset_not) + { + if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), + Qnil)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == anychar + && c == '\n') + { + DEBUG_PRINT (" . != \\n => fast loop.\n"); + return true; + } + return false; +} + +/* Case where `p2` points to an `charset`. */ +static bool +mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + /* It is hard to list up all the character in charset + P2 if it includes multibyte character. Give up in + such case. */ + if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) + { + /* Now, we are sure that P2 has no range table. + So, for the size of bitmap in P2, 'p2[1]' is + enough. But P1 may have range table, so the + size of bitmap table of P1 is extracted by + using macro 'CHARSET_BITMAP_SIZE'. + + In a multibyte case, we know that all the character + listed in P2 is ASCII. In a unibyte case, P1 has only a + bitmap table. So, in both cases, it is enough to test + only the bitmap table of P1. */ + + if ((re_opcode_t) *p1 == charset) + { + int idx; + /* We win if the charset inside the loop + has no overlap with the one after the loop. */ + for (idx = 0; + (idx < (int) p2[1] + && idx < CHARSET_BITMAP_SIZE (p1)); + idx++) + if ((p2[2 + idx] & p1[2 + idx]) != 0) + break; + + if (idx == p2[1] + || idx == CHARSET_BITMAP_SIZE (p1)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == charset_not) + { + int idx; + /* We win if the charset_not inside the loop lists + every character listed in the charset after. */ + for (idx = 0; idx < (int) p2[1]; idx++) + if (! (p2[2 + idx] == 0 + || (idx < CHARSET_BITMAP_SIZE (p1) + && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) + break; + + if (idx == p2[1]) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + } + return false; +} + /* True if "p1 matches something" implies "p2 fails". */ +/* Avoiding inf-loops: + We're trying to follow all paths reachable from `p2`, but since some + loops can match the empty string, this can loop back to `p2`. + To avoid inf-looping, we keep track of points that have been considered + "already". Instead of keeping a list of such points, we assume + that the code is a "sequential loop nest" and only keep track of + `done_beg` and `done_end` which delimit a chunk of bytecode we consider + as already considered. */ static bool -mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2) +mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2, re_char *done_beg, re_char *done_end) { re_opcode_t op2; - bool multibyte = RE_MULTIBYTE_P (bufp); unsigned char *pend = bufp->buffer + bufp->used; re_char *p2_orig = p2; eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); + eassert (done_beg <= done_end); + eassert (done_end <= p2); + /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, look at what begins its body, since we will have to @@ -3684,98 +3793,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, case endline: case exactn: - { - int c - = (re_opcode_t) *p2 == endline ? '\n' - : RE_STRING_CHAR (p2 + 2, multibyte); - - if ((re_opcode_t) *p1 == exactn) - { - if (c != RE_STRING_CHAR (p1 + 2, multibyte)) - { - DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); - return true; - } - } - - else if ((re_opcode_t) *p1 == charset - || (re_opcode_t) *p1 == charset_not) - { - if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), - Qnil)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == anychar - && c == '\n') - { - DEBUG_PRINT (" . != \\n => fast loop.\n"); - return true; - } - } - break; + return mutually_exclusive_exactn (bufp, p1, p2); case charset: { if ((re_opcode_t) *p1 == exactn) - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); - - /* It is hard to list up all the character in charset - P2 if it includes multibyte character. Give up in - such case. */ - else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) - { - /* Now, we are sure that P2 has no range table. - So, for the size of bitmap in P2, 'p2[1]' is - enough. But P1 may have range table, so the - size of bitmap table of P1 is extracted by - using macro 'CHARSET_BITMAP_SIZE'. - - In a multibyte case, we know that all the character - listed in P2 is ASCII. In a unibyte case, P1 has only a - bitmap table. So, in both cases, it is enough to test - only the bitmap table of P1. */ - - if ((re_opcode_t) *p1 == charset) - { - int idx; - /* We win if the charset inside the loop - has no overlap with the one after the loop. */ - for (idx = 0; - (idx < (int) p2[1] - && idx < CHARSET_BITMAP_SIZE (p1)); - idx++) - if ((p2[2 + idx] & p1[2 + idx]) != 0) - break; - - if (idx == p2[1] - || idx == CHARSET_BITMAP_SIZE (p1)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == charset_not) - { - int idx; - /* We win if the charset_not inside the loop lists - every character listed in the charset after. */ - for (idx = 0; idx < (int) p2[1]; idx++) - if (! (p2[2 + idx] == 0 - || (idx < CHARSET_BITMAP_SIZE (p1) - && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) - break; - - if (idx == p2[1]) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - } + return mutually_exclusive_exactn (bufp, p2, p1); + else + return mutually_exclusive_charset (bufp, p1, p2); } break; @@ -3783,9 +3808,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, switch (*p1) { case exactn: + return mutually_exclusive_exactn (bufp, p2, p1); case charset: - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); + return mutually_exclusive_charset (bufp, p2, p1); case charset_not: /* When we have two charset_not, it's very unlikely that they don't overlap. The union of the two sets of excluded @@ -3830,12 +3855,30 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, int mcnt; p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); - /* Don't just test `mcnt > 0` because non-greedy loops have - their test at the end with an unconditional jump at the start. */ - if (p2 > p2_orig && mcnt >= 0) /* Ensure forward progress. */ - return (mutually_exclusive_p (bufp, p1, p2) - && mutually_exclusive_p (bufp, p1, p2 + mcnt)); - break; + re_char *p2_other = p2 + mcnt; + + /* Presumably, any position in `done_beg..end` should be a position we + already considered elsewhere (because our state machines are not + arbitrary: they consists of syntaxically nested loops with limited + control flow), so we can consider those back jumps as mutually + exclusive here. + This shows in the code below when `p3 > p2_orig`, i.e. when we jump + forward: in that case we bump `done_end` up to `p3` under + the assumption that any other reachable position + between `done_end` and `p3` will be checked by the *other* + call to RECURSE. + The recursion should terminate because of a lexical ordering between + `done_beg`, `done_end`, and `p2`: + At each recursion, either `done_beg` gets smaller, + or `done_beg` is unchanged and `done_end` gets larger + or they're both unchanged and `p2` gets larger. */ +#define RECURSE(p3) \ + ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \ + : (p3) <= done_end ? true \ + : mutually_exclusive_aux (bufp, p1, p3, done_beg, \ + (p3) > p2_orig ? done_end : (p3))) + + return RECURSE (p2) && RECURSE (p2_other); } default: @@ -3846,6 +3889,13 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, return false; } +static bool +mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + re_char *done = min (p2, p1); + return mutually_exclusive_aux (bufp, p1, p2, done, done); +} \f /* Matching routines. */ ^ permalink raw reply related [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-14 14:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-15 20:03 ` Mattias Engdegård 2023-09-15 22:20 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-16 3:45 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 1 sibling, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-15 20:03 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726 14 sep. 2023 kl. 16.41 skrev Stefan Monnier <monnier@iro.umontreal.ca>: > I.e. keep the idea I proposed of keeping track of a beg..end region > that's already been handled. But now we really do have to maintain both > ends (before, I only had `done` which kept track of the end of the > region), and just "restart" when we jump back to a point before the > "done" region. Thank you, this looks workable and should enable some further optimisations. Another option would be to add the usual cycle detection, but this is probably faster. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-15 20:03 ` Mattias Engdegård @ 2023-09-15 22:20 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 0 replies; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-15 22:20 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 >> I.e. keep the idea I proposed of keeping track of a beg..end region >> that's already been handled. But now we really do have to maintain both >> ends (before, I only had `done` which kept track of the end of the >> region), and just "restart" when we jump back to a point before the >> "done" region. > Thank you, this looks workable and should enable some further > optimisations. I just pushed this to `master`. > Another option would be to add the usual cycle detection, That wouldn't be fun, tho, would it? > but this is probably faster. And more thrilling. Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-14 14:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-15 20:03 ` Mattias Engdegård @ 2023-09-16 3:45 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-16 10:49 ` Mattias Engdegård 1 sibling, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-16 3:45 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 >> so yes, we may need to remember where we've been. (At this point >> someone will inevitably point out a helpful invariant that is obvious >> in hindsight. This is just my cunning attempt at making that happen.) I think I have a "helpful invariant", finally: Because of how we build our code, when we're at an `on_failure_jump` the two branches can either both go forward (typically for a "|" or a "?"), or *one* of them goes backward the other forward (for loops), where the one that goes backward (i.e. `p2 <= p2_orig`) is the edge (call it `p2_loop`) that goes back to the beginning of the loop and the other (call it `p2_exit`) is the one that exits the loop. Now, because our loops are nested with proper "structured programming", there can't be any jump from within the loop to outside the loop except for the current jump. And there can't be any jump from outside the loop to inside the loop except by entering via `p2_loop`. Since we have two recursive calls to `mutually_exclusive_p` (one for `p2_exit` and one for `p2_loop`) and each one only needs to check those positions not checked by the other, we can say that `p2_loop` only needs to check the positions within the loop (i.e. between `p2_loop` and `p2_exit`) and can presume that *all* other positions are checked by the other recursive call (the one that starts at `p2_exit`). So I think a single arg `done_end` (set, like the current `done_end`, to `p2_loop` when recursing into `p2_loop`) is indeed sufficient: there's no way to go from `p2_loop` to before `p2_loop` without first going to `p2_exit` (which is already checked by the other call). Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-16 3:45 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-16 10:49 ` Mattias Engdegård 2023-09-16 15:48 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-16 10:49 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726 16 sep. 2023 kl. 05.45 skrev Stefan Monnier <monnier@iro.umontreal.ca>: >>> (At this point >>> someone will inevitably point out a helpful invariant that is obvious >>> in hindsight. This is just my cunning attempt at making that happen.) > > I think I have a "helpful invariant", finally: Everything is proceeding as I have foreseen. > Because of how we build our code, when we're at an `on_failure_jump` the > two branches can either both go forward (typically for a "|" or a "?"), > or *one* of them goes backward the other forward (for loops), where the > one that goes backward (i.e. `p2 <= p2_orig`) is the edge (call it > `p2_loop`) that goes back to the beginning of the loop and the other > (call it `p2_exit`) is the one that exits the loop. > > Now, because our loops are nested with proper "structured programming", > there can't be any jump from within the loop to outside the loop except > for the current jump. And there can't be any jump from outside the loop > to inside the loop except by entering via `p2_loop`. > > Since we have two recursive calls to `mutually_exclusive_p` (one for > `p2_exit` and one for `p2_loop`) and each one only needs to check those > positions not checked by the other, we can say that `p2_loop` only needs > to check the positions within the loop (i.e. between `p2_loop` and > `p2_exit`) and can presume that *all* other positions are checked by the > other recursive call (the one that starts at `p2_exit`). > > So I think a single arg `done_end` (set, like the current `done_end`, to > `p2_loop` when recursing into `p2_loop`) is indeed sufficient: there's > no way to go from `p2_loop` to before `p2_loop` without first going to > `p2_exit` (which is already checked by the other call). I think you are right, but wouldn't mind seeing it confirmed empirically. Say, by cross-checking using an an alternative (slower) implementation that directly checks whether a node has been visited before. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-16 10:49 ` Mattias Engdegård @ 2023-09-16 15:48 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-18 2:14 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-16 15:48 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 > I think you are right, but wouldn't mind seeing it confirmed > empirically. Say, by cross-checking using an an alternative (slower) > implementation that directly checks whether a node has been visited before. I'm reworking my code so it takes 2 arguments: the loop-entry and the loop-exit. Any jump outside of those bounds should never happen so we can assert it to check that our assumptions are right (and we can return false when we don't compile assertions, so it's never over-optimistic). Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-16 15:48 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-18 2:14 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-18 3:59 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-18 2:14 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 [-- Attachment #1: Type: text/plain, Size: 2591 bytes --] > I'm reworking my code so it takes 2 arguments: the loop-entry and the > loop-exit. Any jump outside of those bounds should never happen so we > can assert it to check that our assumptions are right (and we can > return false when we don't compile assertions, so it's never over-optimistic). Hmm... not quite working yet. With the patch below I get: % make test/src/regex-emacs-tests [...] passed 30/34 regexp-multibyte-unibyte (0.000264 sec) 0: /on_failure_jump_smart to 9 3: /exactn/1/x 6: /jump to 0 9: /start_memory/1 11: /on_failure_jump to 26 14: /on_failure_jump_smart to 23 17: /exactn/1/= 20: /jump to 14 23: /jump to 29 26: /exactn/1/h 29: /stop_memory/1 31: /on_failure_jump_loop to 37 34: /jump to 9 37: /succeed 38: end of pattern. 38 bytes used/128 bytes allocated. re_nsub: 1 regs_alloc: 1 can_be_null: 0 p1==3, p2==34, loop-entry==14, loop-exit==26 Test regexp-tests-backtrack-optimization backtrace: looking-at("x*\\(=*\\|h\\)+") [...] The problem is that my code sees the `jump 9` (near the end) followed by `start_memory` and `on_failure_jump to 26` as a a backward jump that defines a loop whose body starts at 14 and whose exit point is at 26, but that 14..26 is not a loop, it's a `|` and those don't obey the assumption I made about the exit point (the 2 destinations of such an `on_failure_jump` correspond to the first and the second patterns of the `|`, i.e. entry&middle rather than entry&exit, so there *can* be jumps straight out from the first point without going through the second point 🙁). BTW, here's an indented version of the code: 0: /on_failure_jump_smart to 9 { 3: /exactn/1/x 6: } /jump to 0 9: { /start_memory/1 11: /on_failure_jump to 26 14: /on_failure_jump_smart to 23 { 17: /exactn/1/= 20: } /jump to 14 23: /jump to 29 26: /exactn/1/h 29: /stop_memory/1 31: /on_failure_jump_loop to 37 34: } /jump to 9 37: /succeed 38: end of pattern. Another problem we see here is that the main (second) loop spans 9..37 and is controlled by the `/on_failure_jump_loop to 37` but the two destination of the OFJ are not 9 and 37 but 34 and 37 🙂. So maybe we should `skip_noops` before looking at the destinations to decide if it's a loop? Stefan [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: regexp.c.patch --] [-- Type: text/x-diff, Size: 6702 bytes --] diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 55d0a6e8df8..c54f9f81890 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3068,8 +3068,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte) continue; case succeed_n: - /* If N == 0, it should be an on_failure_jump_loop instead. */ - DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0)); + /* If N == 0, it should be an on_failure_jump_loop instead. + `j` can be negative because `EXTRACT_NUMBER` extracts a + signed number whereas `succeed_n` treats it as unsigned. */ + DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0)); p += 4; /* We only care about one iteration of the loop, so we don't need to consider the case where this behaves like an @@ -3743,20 +3745,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, } /* True if "p1 matches something" implies "p2 fails". */ -/* Avoiding inf-loops: - We're trying to follow all paths reachable from `p2`, but since some +/* We're trying to follow all paths reachable from `p2`, but since some loops can match the empty string, this can loop back to `p2`. - To avoid inf-looping, we keep track of points that have been considered - "already". Instead of keeping a list of such points, `done_beg` and - `done_end` delimit a chunk of bytecode we already considered. - To guarantee termination, a lexical ordering between `done_*` and `p2` - should be obeyed: - At each recursion, either `done_beg` gets smaller, - or `done_beg` is unchanged and `done_end` gets larger - or they're both unchanged and `p2` gets larger. */ + To avoid inf-looping, we take advantage of the fact that + the bytecode we generate is made of syntactically nested loops, more + specifically, every loop has a single entry point and single exit point. + The function takes 2 more arguments (`loop_entry` and `loop_exit`). + `loop_entry` points to the sole entry point of the current loop and + `loop_exit` points to its sole exit point (when non-NULL). + The function can assume that `loop_exit` is "mutually exclusive". + Termination of recursive calls is ensured because either `loop_entry` + is larger, or it's equal but `p2` is larger. */ static bool mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2, re_char *done_beg, re_char *done_end) + re_char *p2, re_char *loop_entry, re_char *loop_exit) { re_opcode_t op2; unsigned char *pend = bufp->buffer + bufp->used; @@ -3765,8 +3767,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); - eassert (done_beg <= done_end); - eassert (done_end <= p2); + if (p2 == loop_exit) + /* Presumably already checked elsewhere. */ + return true; + eassert (loop_entry && p2 >= loop_entry); + /* eassert (!loop_exit || p2 < loop_exit); */ + if (p2 < loop_entry) + return false; + if (loop_exit && p2 > loop_exit) + { + void print_compiled_pattern (struct re_pattern_buffer *bufp); + print_compiled_pattern (bufp); + fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n", + p1 - bufp->buffer, p2 - bufp->buffer, + loop_entry - bufp->buffer, loop_exit - bufp->buffer); + error ("WOW1!"); + } + /* return false; */ /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, @@ -3860,27 +3877,30 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, EXTRACT_NUMBER_AND_INCR (mcnt, p2); re_char *p2_other = p2 + mcnt; - /* When we jump backward we bump `done_end` up to `p3` under - the assumption that any other position between `done_end` - and `p3` is either: - - checked by the other call to RECURSE. - - not reachable from here (e.g. for positions before the - `on_failure_jump`), or at least not without first - jumping before `done_beg`. - This should hold because our state machines are not arbitrary: - they consists of syntaxically nested loops with limited - control flow. - FIXME: This can fail (i.e. return true when it shouldn't) - if we start generating bytecode with a different shape, - so maybe we should bite the bullet and replace done_beg/end - with an actual list of positions we've already processed. */ -#define RECURSE(p3) \ - ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \ - : (p3) <= done_end ? true \ - : mutually_exclusive_aux (bufp, p1, p3, done_beg, \ - (p3) > p2_orig ? done_end : (p3))) - - return RECURSE (p2) && RECURSE (p2_other); + /* If we jump backward to the entry point of the current loop + it means it's a zero-length cycle through that loop, so + this cycle itself does not break mutual-exclusion. + If we jump backward to a new loop nested within the current one + then the `p2_other` points to the exit of that inner loop + and the other RECURSE will check what happens when we leave + the loop so we can focus on checking just the loop body, + so we pass new values for `loop-entry` and `loop_exit`. + In all other cases, we just recurse "normally": if it's before + `loop_entry` it's a bug that will be caught by checks at + the entrace of the function and otherwise it's a forward jump + which doesn't need any special treatment. + FIXME: This is failsafe (can't return true when it shouldn't) + but it could be too conservative if we start generating bytecode + with a different shape, so maybe we should bite the bullet and + replace done_beg/end with an actual list of positions we've + already processed. */ +#define RECURSE(p3, p3_other) \ + ((p3) == loop_entry ? true \ + : loop_entry < (p3) && (p3) < p2_orig ? \ + mutually_exclusive_aux (bufp, p1, p3, p3, p3_other) \ + : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit)) + + return RECURSE (p2, p2_other) && RECURSE (p2_other, p2); } default: @@ -3895,7 +3915,7 @@ #define RECURSE(p3) \ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, re_char *p2) { - return mutually_exclusive_aux (bufp, p1, p2, p2, p2); + return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL); } \f /* Matching routines. */ ^ permalink raw reply related [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-18 2:14 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-18 3:59 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-18 12:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-18 3:59 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 > So maybe we should `skip_noops` before looking at the destinations to > decide if it's a loop? Nah, just introduces other problems :-( BTW maybe we should replace 31: /on_failure_jump_loop to 37 34: /jump to 9 with 31: /on_failure_dont_jump_loop to 9 It's quite funny actually: with the current bytecodes, every iteration of a backtracking greedy loop runs `jump + on_failure_jump`, whereas for a non-greedy loop every iteration runs just a single `on_failure_jump`. Yet those bytecodes predate the introduction of non-greedy loops :-) [ Of course, there are reasons for it, beside historical ones: it's harder to rewrite a loop that ends with `on_failure_dont_jump_smart` into one using `on_failure_keep_string_jump` :-( ] Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-18 3:59 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-18 12:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-21 17:23 ` Mattias Engdegård 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-18 12:32 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726 [-- Attachment #1: Type: text/plain, Size: 432 bytes --] > BTW maybe we should replace > > 31: /on_failure_jump_loop to 37 > 34: /jump to 9 > > with > > 31: /on_failure_dont_jump_loop to 9 In the mean time the patch below recognizes the above pattern, which gives me code which is safe (it should both always terminate and should never apply the optimization if it's not valid) yet does find the optimization in all the cases where I expected it. Stefan [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: regexp.c.patch --] [-- Type: text/x-diff, Size: 8140 bytes --] diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 55d0a6e8df8..e45d0f19dbf 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -1923,12 +1923,22 @@ regex_compile (re_char *pattern, ptrdiff_t size, if (!zero_times_ok && simple) { /* Since simple * loops can be made faster by using - on_failure_keep_string_jump, we turn simple P+ - into PP* if P is simple. */ + on_failure_keep_string_jump, we turn P+ into PP* + if P is simple. + We can't use `top: <BODY>; OFJS exit; J top; exit:` + Because the OFJS needs to be at the beginning + since we want to be able to replace + `top: OFJS exit; <BODY>; J top; exit` + with + `top: OFKSJ exit; loop: <BODY>; J loop; exit`, + i.e. a single OFJ at the beginning of the loop + rather than once per iteration. */ unsigned char *p1, *p2; startoffset = b - laststart; GET_BUFFER_SPACE (startoffset); p1 = b; p2 = laststart; + /* This presumes that the code skipped + by `skip_one_char` is position-independent. */ while (p2 < p1) *b++ = *p2++; zero_times_ok = 1; @@ -3068,8 +3078,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte) continue; case succeed_n: - /* If N == 0, it should be an on_failure_jump_loop instead. */ - DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0)); + /* If N == 0, it should be an on_failure_jump_loop instead. + `j` can be negative because `EXTRACT_NUMBER` extracts a + signed number whereas `succeed_n` treats it as unsigned. */ + DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0)); p += 4; /* We only care about one iteration of the loop, so we don't need to consider the case where this behaves like an @@ -3743,20 +3755,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, } /* True if "p1 matches something" implies "p2 fails". */ -/* Avoiding inf-loops: - We're trying to follow all paths reachable from `p2`, but since some +/* We're trying to follow all paths reachable from `p2`, but since some loops can match the empty string, this can loop back to `p2`. - To avoid inf-looping, we keep track of points that have been considered - "already". Instead of keeping a list of such points, `done_beg` and - `done_end` delimit a chunk of bytecode we already considered. - To guarantee termination, a lexical ordering between `done_*` and `p2` - should be obeyed: - At each recursion, either `done_beg` gets smaller, - or `done_beg` is unchanged and `done_end` gets larger - or they're both unchanged and `p2` gets larger. */ + To avoid inf-looping, we take advantage of the fact that + the bytecode we generate is made of syntactically nested loops, more + specifically, every loop has a single entry point and single exit point. + The function takes 2 more arguments (`loop_entry` and `loop_exit`). + `loop_entry` points to the sole entry point of the current loop and + `loop_exit` points to its sole exit point (when non-NULL). + The function can assume that `loop_exit` is "mutually exclusive". + Termination of recursive calls is ensured because either `loop_entry` + is larger, or it's equal but `p2` is larger. */ static bool mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2, re_char *done_beg, re_char *done_end) + re_char *p2, re_char *loop_entry, re_char *loop_exit) { re_opcode_t op2; unsigned char *pend = bufp->buffer + bufp->used; @@ -3765,8 +3777,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); - eassert (done_beg <= done_end); - eassert (done_end <= p2); + if (p2 == loop_exit) + /* Presumably already checked elsewhere. */ + return true; + eassert (loop_entry && p2 >= loop_entry); + /* eassert (!loop_exit || p2 < loop_exit); */ + if (p2 < loop_entry) + return false; + if (loop_exit && p2 > loop_exit) + { + void print_compiled_pattern (struct re_pattern_buffer *bufp); + print_compiled_pattern (bufp); + fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n", + p1 - bufp->buffer, p2 - bufp->buffer, + loop_entry - bufp->buffer, loop_exit - bufp->buffer); + error ("WOW1!"); + } + /* return false; */ /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, @@ -3859,28 +3886,40 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); re_char *p2_other = p2 + mcnt; + /* For `+` loops, we often have an `on_failure_jump` that skips forward + over a subsequent `jump` for lack of an `on_failure_dont_jump` + kind of thing. Recognize this pattern since that subsequent + `jump` is the one that jumps to the loop-entry. */ + if ((re_opcode_t) p2[0] == jump && mcnt == 3) + { + EXTRACT_NUMBER (mcnt, p2 + 1); + p2 += mcnt + 3; + } - /* When we jump backward we bump `done_end` up to `p3` under - the assumption that any other position between `done_end` - and `p3` is either: - - checked by the other call to RECURSE. - - not reachable from here (e.g. for positions before the - `on_failure_jump`), or at least not without first - jumping before `done_beg`. - This should hold because our state machines are not arbitrary: - they consists of syntaxically nested loops with limited - control flow. - FIXME: This can fail (i.e. return true when it shouldn't) - if we start generating bytecode with a different shape, - so maybe we should bite the bullet and replace done_beg/end - with an actual list of positions we've already processed. */ -#define RECURSE(p3) \ - ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \ - : (p3) <= done_end ? true \ - : mutually_exclusive_aux (bufp, p1, p3, done_beg, \ - (p3) > p2_orig ? done_end : (p3))) - - return RECURSE (p2) && RECURSE (p2_other); + /* If we jump backward to the entry point of the current loop + it means it's a zero-length cycle through that loop, so + this cycle itself does not break mutual-exclusion. + If we jump backward to a new loop nested within the current one + then the `p2_other` points to the exit of that inner loop + and the other RECURSE will check what happens when we leave + the loop so we can focus on checking just the loop body, + so we pass new values for `loop-entry` and `loop_exit`. + In all other cases, we just recurse "normally": if it's before + `loop_entry` it's a bug that will be caught by checks at + the entrace of the function and otherwise it's a forward jump + which doesn't need any special treatment. + FIXME: This is failsafe (can't return true when it shouldn't) + but it could be too conservative if we start generating bytecode + with a different shape, so maybe we should bite the bullet and + replace done_beg/end with an actual list of positions we've + already processed. */ +#define RECURSE(p3, p3_other) \ + ((p3) == loop_entry ? true \ + : loop_entry < (p3) && (p3) < p2_orig && (p3_other) > p2_orig ? \ + mutually_exclusive_aux (bufp, p1, p3, p3, p3_other) \ + : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit)) + + return RECURSE (p2, p2_other) && RECURSE (p2_other, p2); } default: @@ -3895,7 +3934,7 @@ #define RECURSE(p3) \ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, re_char *p2) { - return mutually_exclusive_aux (bufp, p1, p2, p2, p2); + return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL); } \f /* Matching routines. */ ^ permalink raw reply related [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-18 12:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-21 17:23 ` Mattias Engdegård 2023-09-21 18:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-21 17:23 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726 18 sep. 2023 kl. 14.32 skrev Stefan Monnier <monnier@iro.umontreal.ca>: > In the mean time the patch below recognizes the above pattern, which > gives me code which is safe (it should both always terminate and should > never apply the optimization if it's not valid) yet does find the > optimization in all the cases where I expected it. Yes, maybe this would work but the ad-hocness of it all means that I very much could have missed something (and as you comment, that it's brittle). > case succeed_n: > - /* If N == 0, it should be an on_failure_jump_loop instead. */ > - DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0)); > + /* If N == 0, it should be an on_failure_jump_loop instead. > + `j` can be negative because `EXTRACT_NUMBER` extracts a > + signed number whereas `succeed_n` treats it as unsigned. */ > + DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0)); So this is what prevented me from running with counts above 2**15 and debugging enabled for all these years... > + if (loop_exit && p2 > loop_exit) > + { > + void print_compiled_pattern (struct re_pattern_buffer *bufp); > + print_compiled_pattern (bufp); > + fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n", > + p1 - bufp->buffer, p2 - bufp->buffer, > + loop_entry - bufp->buffer, loop_exit - bufp->buffer); > + error ("WOW1!"); #ifdef REGEX_EMACS_DEBUG ? > + } > + /* return false; */ ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-21 17:23 ` Mattias Engdegård @ 2023-09-21 18:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-23 11:56 ` Mattias Engdegård 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-21 18:08 UTC (permalink / raw) To: Mattias Engdegård; +Cc: martin rudalics, Eli Zaretskii, 65726-done > So this is what prevented me from running with counts above 2**15 and > debugging enabled for all these years... Looks like it. In any case, I pushed a variant of the last patch to `master` and I think the case is now closed. We may want to revisit it, presumably by really keeping track of the all visited nodes, but this seems to work well for now and I'm confident that it is safe (regardless of the shape of the bytecode, it should always terminate and it should never apply the optimization if it's incorrect, and with the shape of the bytecode we currently produce it does correctly apply the optimization in the cases I checked). Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-21 18:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-23 11:56 ` Mattias Engdegård 0 siblings, 0 replies; 36+ messages in thread From: Mattias Engdegård @ 2023-09-23 11:56 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, 65726-done 21 sep. 2023 kl. 20.08 skrev Stefan Monnier <monnier@iro.umontreal.ca>: > In any case, I pushed a variant of the last patch to `master` and > I think the case is now closed. We may want to revisit it, presumably > by really keeping track of the all visited nodes, but this seems to work > well for now and I'm confident that it is safe (regardless of the shape > of the bytecode, it should always terminate and it should never apply > the optimization if it's incorrect, and with the shape of the bytecode > we currently produce it does correctly apply the optimization in the cases > I checked). Good, thank you. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 7:46 bug#65726: 29.1.50; Crash in regexp engine martin rudalics 2023-09-04 8:44 ` Mattias Engdegård @ 2023-09-04 14:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-04 15:57 ` Eli Zaretskii 2023-09-05 7:14 ` martin rudalics 2023-09-11 8:10 ` Mattias Engdegård 2 siblings, 2 replies; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-04 14:32 UTC (permalink / raw) To: martin rudalics; +Cc: 65726 > Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fffff66fff8: > #0 0x000000000068810a in skip_noops (p=#1 0x0000000000688823 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3665 > #2 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838 > #3 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fceeba "\004\020") at ../../src/regex-emacs.c:3839 > #4 0x0000000000688e3c in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee84 "\002\001@\004\020") at ../../src/regex-emacs.c:3839 > #5 0x0000000000688e19 in mutually_exclusive_p (bufp=0xec9c30 <searchbufs+752>, p1=0x1fcee74 "\004\005", p2=0x1fcee81 "\016\063") at ../../src/regex-emacs.c:3838 > ... Hmm... the line numbers strongly suggests the inf-recursion happens via the calls: case on_failure_jump: { int mcnt; p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); /* Don't just test `mcnt > 0` because non-greedy loops have their test at the end with an unconditional jump at the start. */ if (p2 + mcnt > p2_orig) /* Ensure forward progress. */ return (mutually_exclusive_p (bufp, p1, p2) && mutually_exclusive_p (bufp, p1, p2 + mcnt)); break; } Re-reading the code I see that `skip_noops` can return a position smaller than its argument, which makes it possible for `p2` to be smaller (or equal) to `p2_orig` and hence explain that inf-loop (that's the only path I can see that explains the inf-loop you're seeing). So, the patch below should hopefully fix your problem. Stefan diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 7e75f0ac597..3a14c10771d 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3832,7 +3832,8 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, EXTRACT_NUMBER_AND_INCR (mcnt, p2); /* Don't just test `mcnt > 0` because non-greedy loops have their test at the end with an unconditional jump at the start. */ - if (p2 + mcnt > p2_orig) /* Ensure forward progress. */ + if (p2 + mcnt > p2_orig /* Ensure forward progress. */ + && p2 > p2_orig) /* Bug#65726 */ return (mutually_exclusive_p (bufp, p1, p2) && mutually_exclusive_p (bufp, p1, p2 + mcnt)); break; ^ permalink raw reply related [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 14:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-04 15:57 ` Eli Zaretskii 2023-09-04 17:12 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-05 7:14 ` martin rudalics 1 sibling, 1 reply; 36+ messages in thread From: Eli Zaretskii @ 2023-09-04 15:57 UTC (permalink / raw) To: Stefan Monnier; +Cc: rudalics, 65726 > Cc: 65726@debbugs.gnu.org > Date: Mon, 04 Sep 2023 10:32:38 -0400 > From: Stefan Monnier via "Bug reports for GNU Emacs, > the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org> > > Re-reading the code I see that `skip_noops` can return a position > smaller than its argument, which makes it possible for `p2` to > be smaller (or equal) to `p2_orig` and hence explain that inf-loop > (that's the only path I can see that explains the inf-loop you're > seeing). > > So, the patch below should hopefully fix your problem. Yes, with this patch the problem is gone. Please install on the release branch, when you are satisfied with the solution, and thanks. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 15:57 ` Eli Zaretskii @ 2023-09-04 17:12 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-10 7:50 ` Stefan Kangas 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-04 17:12 UTC (permalink / raw) To: Eli Zaretskii; +Cc: rudalics, 65726-done > Yes, with this patch the problem is gone. Thanks, pushed to `emacs-29`. Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 17:12 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-10 7:50 ` Stefan Kangas 2023-09-10 7:55 ` Eli Zaretskii 0 siblings, 1 reply; 36+ messages in thread From: Stefan Kangas @ 2023-09-10 7:50 UTC (permalink / raw) To: Stefan Monnier; +Cc: rudalics, Eli Zaretskii, 65726, 65726-done Stefan Monnier <monnier@iro.umontreal.ca> writes: >> Yes, with this patch the problem is gone. > > Thanks, pushed to `emacs-29`. Is there anything more to do here, or can this bug report be closed? ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-10 7:50 ` Stefan Kangas @ 2023-09-10 7:55 ` Eli Zaretskii 2023-09-10 23:09 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 1 reply; 36+ messages in thread From: Eli Zaretskii @ 2023-09-10 7:55 UTC (permalink / raw) To: Stefan Kangas; +Cc: rudalics, monnier, 65726 > From: Stefan Kangas <stefankangas@gmail.com> > Date: Sun, 10 Sep 2023 00:50:48 -0700 > Cc: Eli Zaretskii <eliz@gnu.org>, 65726@debbugs.gnu.org, rudalics@gmx.at, > 65726-done@debbugs.gnu.org > > Stefan Monnier <monnier@iro.umontreal.ca> writes: > > >> Yes, with this patch the problem is gone. > > > > Thanks, pushed to `emacs-29`. > > Is there anything more to do here, or can this bug report be closed? I believe this is still being worked on and discussed by Stefan and Mattias. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-10 7:55 ` Eli Zaretskii @ 2023-09-10 23:09 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-11 14:46 ` Stefan Kangas 0 siblings, 1 reply; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-10 23:09 UTC (permalink / raw) To: Eli Zaretskii; +Cc: rudalics, Stefan Kangas, 65726 >> >> Yes, with this patch the problem is gone. >> > Thanks, pushed to `emacs-29`. >> Is there anything more to do here, or can this bug report be closed? > I believe this is still being worked on and discussed by Stefan and > Mattias. I think we can close it. What is being discussed is how to recover the optimization that had to be significantly weakened to fix this patch, but it was a new optimization in Emacs-29, so it's not urgent nor a bug fix. Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-10 23:09 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-11 14:46 ` Stefan Kangas 0 siblings, 0 replies; 36+ messages in thread From: Stefan Kangas @ 2023-09-11 14:46 UTC (permalink / raw) To: Stefan Monnier, Eli Zaretskii; +Cc: rudalics, 65726-done Stefan Monnier <monnier@iro.umontreal.ca> writes: > I think we can close it. What is being discussed is how to recover the > optimization that had to be significantly weakened to fix this patch, > but it was a new optimization in Emacs-29, so it's not urgent nor > a bug fix. Thanks, I'm therefore closing this bug report. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 14:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-04 15:57 ` Eli Zaretskii @ 2023-09-05 7:14 ` martin rudalics 1 sibling, 0 replies; 36+ messages in thread From: martin rudalics @ 2023-09-05 7:14 UTC (permalink / raw) To: Stefan Monnier; +Cc: 65726 > So, the patch below should hopefully fix your problem. It fixes the problem. Thanks, martin ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-04 7:46 bug#65726: 29.1.50; Crash in regexp engine martin rudalics 2023-09-04 8:44 ` Mattias Engdegård 2023-09-04 14:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-11 8:10 ` Mattias Engdegård 2023-09-11 13:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2 siblings, 1 reply; 36+ messages in thread From: Mattias Engdegård @ 2023-09-11 8:10 UTC (permalink / raw) To: Stefan Monnier; +Cc: martin rudalics, Eli Zaretskii, Stefan Kangas, 65726 > I think we can close it. What is being discussed is how to recover the optimization that had to be significantly weakened to fix this patch, but it was a new optimization in Emacs-29, so it's not urgent nor a bug fix. Would you confirm that bug#61514 remains 'fixed' for practical purposes? I ran a few of the cases in that bug and it seems that we have them still covered, but I may have missed something. ^ permalink raw reply [flat|nested] 36+ messages in thread
* bug#65726: 29.1.50; Crash in regexp engine 2023-09-11 8:10 ` Mattias Engdegård @ 2023-09-11 13:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 0 siblings, 0 replies; 36+ messages in thread From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-09-11 13:41 UTC (permalink / raw) To: Mattias Engdegård Cc: martin rudalics, Eli Zaretskii, Stefan Kangas, 65726 > Would you confirm that bug#61514 remains 'fixed' for practical purposes? I believe it does: the real fix was to use narrowing. Stefan ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2023-09-23 11:56 UTC | newest] Thread overview: 36+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-09-04 7:46 bug#65726: 29.1.50; Crash in regexp engine martin rudalics 2023-09-04 8:44 ` Mattias Engdegård 2023-09-04 12:12 ` Eli Zaretskii 2023-09-04 13:18 ` Mattias Engdegård 2023-09-04 13:26 ` Mattias Engdegård 2023-09-04 13:28 ` Eli Zaretskii 2023-09-04 15:47 ` Mattias Engdegård 2023-09-05 12:23 ` Mattias Engdegård 2023-09-05 13:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-05 13:50 ` Mattias Engdegård 2023-09-05 15:33 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-06 12:03 ` Mattias Engdegård 2023-09-09 15:55 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-09 16:34 ` Mattias Engdegård 2023-09-14 14:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-15 20:03 ` Mattias Engdegård 2023-09-15 22:20 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-16 3:45 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-16 10:49 ` Mattias Engdegård 2023-09-16 15:48 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-18 2:14 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-18 3:59 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-18 12:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-21 17:23 ` Mattias Engdegård 2023-09-21 18:08 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-23 11:56 ` Mattias Engdegård 2023-09-04 14:32 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-04 15:57 ` Eli Zaretskii 2023-09-04 17:12 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-10 7:50 ` Stefan Kangas 2023-09-10 7:55 ` Eli Zaretskii 2023-09-10 23:09 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors 2023-09-11 14:46 ` Stefan Kangas 2023-09-05 7:14 ` martin rudalics 2023-09-11 8:10 ` Mattias Engdegård 2023-09-11 13:41 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).