* bug#36609: 27.0.50; Possible race-condition in threading implementation @ 2019-07-11 20:51 Andreas Politz 2019-07-12 7:47 ` Eli Zaretskii ` (3 more replies) 0 siblings, 4 replies; 53+ messages in thread From: Andreas Politz @ 2019-07-11 20:51 UTC (permalink / raw) To: 36609 [-- Attachment #1: Type: text/plain, Size: 22047 bytes --] I think there is a race-condition in the implementation of threads. I tried to find a minimal test-case, without success. Thus, I've attached a lengthy source-file. Loading that file should trigger this bug and may freeze your session. Indications: 1. The main-thread has the name of one of created threads (XEmacs in this case), instead of "emacs". 2. Emacs stops processing all keyboard/mouse input while looping in wait_reading_process_output. Sending commands via emacsclient still works. GDB output: (gdb) info threads Id Target Id Frame * 1 Thread 0x7ffff17f5d40 (LWP 26264) "XEmacs" 0x000055555576eac0 in XPNTR (a=XIL(0x7ffff1312533)) at alloc.c:535 2 Thread 0x7ffff0ac4700 (LWP 26265) "gmain" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6 3 Thread 0x7fffebd1a700 (LWP 26266) "gdbus" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6 4 Thread 0x7fffeb519700 (LWP 26267) "dconf worker" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6 (gdb) bt full #0 0x00007ffff50d3f76 in pselect () at /usr/lib/libc.so.6 #1 0x0000555555832e48 in really_call_select (arg=0x7fffffffd150) at thread.c:586 sa = 0x7fffffffd150 self = 0x555555c154e0 <main_thread> oldset = { __val = {0, 93825011232085, 140737488343064, 31632, 31632, 51840, 140737488343136, 93824994672716, 4294967298, 140737488343184, 93824999850720, 0, 0, 140737488343136, 93824993869565, 4041340661} } #2 0x0000555555774449 in flush_stack_call_func (func=0x555555832d81 <really_call_select>, arg=0x7fffffffd150) at alloc.c:4969 end = 0x7fffffffd100 self = 0x555555c154e0 <main_thread> sentry = { o = { __max_align_ll = 140737488343296, __max_align_ld = <invalid float value> } } #3 0x0000555555832f40 in thread_select (func=0x7ffff50d3eb0 <pselect>, max_fds=6, rfds=0x7fffffffd260, wfds=0x7fffffffd2e0, efds=0x0, timeout=0x7fffffffd8b0, sigmask=0x0) at thread.c:616 sa = { func = 0x7ffff50d3eb0 <pselect>, max_fds = 6, rfds = 0x7fffffffd260, wfds = 0x7fffffffd2e0, efds = 0x0, timeout = 0x7fffffffd8b0, sigmask = 0x0, result = 210347336 } #4 0x000055555585fea3 in xg_select (fds_lim=6, rfds=0x7fffffffd920, wfds=0x7fffffffd9a0, efds=0x0, timeout=0x7fffffffd8b0, sigmask=0x0) at xgselect.c:117 all_rfds = { fds_bits = {32, 0 <repeats 15 times>} } all_wfds = { fds_bits = {0 <repeats 16 times>} } tmo = { tv_sec = 7, tv_nsec = 140737488343264 } tmop = 0x7fffffffd8b0 context = 0x555555dae320 have_wfds = true gfds_buf = {{ fd = 0, events = 0, revents = 0 }, { fd = 32, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 7, events = 0, revents = 0 }, { fd = 8, events = 0, revents = 0 }, { fd = 15, events = 0, revents = 0 }, { fd = -11072, events = 32767, revents = 0 }, { fd = 1460278864, events = 21845, revents = 0 }, { fd = 64, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 2, events = 48, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 91, events = 124, revents = 0 }, { fd = -397131776, events = 35414, revents = 11125 }, { fd = 1460278864, events = 21845, revents = 0 }, { fd = 1, events = 0, revents = 0 }, { fd = 2, events = 0, revents = 0 }, { fd = -11072, events = 32767, revents = 0 }, { fd = 1439161536, events = 21845, revents = 0 }, { fd = -11024, events = 32767, revents = 0 
}, { fd = -11088, events = 32767, revents = 0 }, { fd = -182671191, events = 32767, revents = 0 }, { fd = 1, events = 0, revents = 0 }, { fd = -182670875, events = 32767, revents = 0 }, { fd = 194871296, events = 59160, revents = 48198 }, { fd = -1175563804, events = 19, revents = 0 }, { fd = 24, events = 0, revents = 0 }, { fd = 1460278864, events = 21845, revents = 0 }, { fd = 2, events = 0, revents = 0 }, { fd = 1439882448, events = 21845, revents = 0 }, { fd = 2, events = 0, revents = 0 }, { fd = 2, events = 0, revents = 0 }, { fd = 1, events = 2, revents = 0 }, { fd = 1460278864, events = 21845, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -397131776, events = 35414, revents = 11125 }, { fd = 24, events = 0, revents = 0 }, { fd = 1439161536, events = 21845, revents = 0 }, { fd = 2, events = 0, revents = 0 }, { fd = 1, events = 0, revents = 0 }, { fd = 1439161536, events = 21845, revents = 0 }, { fd = -182606722, events = 32767, revents = 0 }, { fd = -10960, events = 32767, revents = 0 }, { fd = 1433487750, events = 21845, revents = 0 }, { fd = -10976, events = 32767, revents = 0 }, { fd = 1439161536, events = 21845, revents = 0 }, { fd = -30, events = 0, revents = 0 }, { fd = 1, events = 0, revents = 0 }, { fd = 1, events = 64, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 31, events = 0, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = -10960, events = 32767, revents = 0 }, { fd = 1434542700, events = 21845, revents = 0 }, { fd = -10912, events = 32767, revents = 0 }, { fd = 1439161536, events = 21845, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = -397131776, events = 35414, revents = 11125 }, { fd = -10848, events = 32767, revents = 0 }, { fd = 1434543288, events = 21845, revents = 0 }, { fd = 1460274293, events = 21845, revents = 0 }, { fd = 1385447426, events = 931, revents = 0 }, { fd = 95390, events = 0, revents = 0 }, { fd = 1433865373, events = 7256, revents = 10134 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 1439161536, events = 21845, revents = 0 }, { fd = 664149, events = 0, revents = 0 }, { fd = 80000, events = 0, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 664149080, events = 0, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 664149080, events = 0, revents = 0 }, { fd = -10752, events = 32767, revents = 0 }, { fd = 1434543494, events = 21845, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -10544, events = 32767, revents = 0 }, { fd = 499622378, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 499622378, events = 0, revents = 0 }, { fd = -10688, events = 32767, revents = 0 }, { fd = 1434939197, events = 21845, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 164526702, events = 0, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 664149080, events = 0, revents = 0 }, { fd = -10544, events = 32767, revents = 0 }, { fd = 499622378, events = 41450, revents = 7623 }, { fd = 0, events = 0, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -1, events = 65535, revents = 65535 }, { fd = -10432, events = 32767, revents = 0 }, { fd = 1433359581, events = 21845, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 1443725184, events = 85, revents = 0 }, { fd = 1450650965, events = 21845, revents = 0 }, { fd = 
1450650965, events = 21845, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 16384, events = 0, revents = 0 }, { fd = 5, events = 0, revents = 0 }, { fd = -10496, events = 32767, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -1, events = 65535, revents = 65535 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 164526702, events = 0, revents = 0 }, { fd = 719, events = 0, revents = 0 }, { fd = 893997157, events = 0, revents = 0 }, { fd = 1439269600, events = 21845, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -10496, events = 32767, revents = 0 }, { fd = 1433288445, events = 21845, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 164532384, events = 0, revents = 0 }, { fd = 1562977925, events = 0, revents = 0 }, { fd = 1562977925, events = 0, revents = 0 }, { fd = 164532384, events = 0, revents = 0 }, { fd = -10368, events = 32767, revents = 0 }, { fd = 1434938909, events = 21845, revents = 0 }, { fd = 100000, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 1562877925, events = 0, revents = 0 }, { fd = 164532384, events = 0, revents = 0 }, { fd = -1, events = 65535, revents = 65535 }, { fd = 0, events = 0, revents = 0 }} gfds = 0x7fffffffd360 gfds_size = 128 n_gfds = -1 retval = 0 our_fds = 0 max_fds = 5 context_acquired = false i = 0 nfds = 135 tmo_in_millisec = 1030 must_free = 0 need_to_dispatch = false #5 0x0000555555802a78 in wait_reading_process_output (time_limit=0, nsecs=0, read_kbd=-1, do_display=true, wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at process.c:5423 process_skipped = false channel = 6 nfds = 1 Available = { fds_bits = {32, 0 <repeats 15 times>} } Writeok = { fds_bits = {0 <repeats 16 times>} } check_write = true check_delay = 0 no_avail = false xerrno = 11 proc = XIL(0x7fffffffd9a0) timeout = { tv_sec = 0, tv_nsec = 499622378 } end_time = { tv_sec = 93824993869565, tv_nsec = 0 } timer_delay = { tv_sec = 0, tv_nsec = 499622378 } got_output_end_time = { tv_sec = 1562977205, tv_nsec = 270561059 } wait = FOREVER got_some_output = -1 prev_wait_proc_nbytes_read = 0 retry_for_async = false count = 4 now = { tv_sec = 0, tv_nsec = -1 } #6 0x00005555556f4718 in kbd_buffer_get_event (kbp=0x7fffffffdc70, used_mouse_menu=0x7fffffffe235, end_time=0x0) at keyboard.c:3836 do_display = true obj = make_fixnum(1073741816) #7 0x00005555556f06c5 in read_event_from_main_queue (end_time=0x0, local_getcjmp=0x7fffffffe040, used_mouse_menu=0x7fffffffe235) at keyboard.c:2138 c = XIL(0) save_jump = {{ __jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = { __val = {0 <repeats 16 times>} } }} kb = 0x7fffffffdcd0 count = 3 #8 0x00005555556f099b in read_decoded_event_from_main_queue (end_time=0x0, local_getcjmp=0x7fffffffe040, prev_event=XIL(0), used_mouse_menu=0x7fffffffe235) at keyboard.c:2202 nextevt = XIL(0) frame = 0x1dcd30d9 terminal = 0x9 events = {XIL(0), XIL(0x1dcd30d9), XIL(0xca80), XIL(0x2b758a56e8544000), XIL(0), XIL(0x555556772ff5), XIL(0x7fffffffde50), XIL(0x5555556f57d8), XIL(0x555555c982e0), XIL(0), XIL(0), XIL(0x7fffffffde50), XIL(0x5555556e3efd), XIL(0x1dcd30d9), XIL(0x7fffffffde90), XIL(0x5555556f3b29)} n = 0 #9 0x00005555556f2236 in read_char (commandflag=1, map=XIL(0x5555567756e3), prev_event=XIL(0), used_mouse_menu=0x7fffffffe235, end_time=0x0) at keyboard.c:2810 c = XIL(0) jmpcount = 3 local_getcjmp = {{ __jmpbuf = {0, -245616553674816983, 93825011232757, 51840, 0, 0, -245616553305718231, -6214351577293929943}, __mask_was_saved = 0, 
__saved_mask = { __val = {0, 0, 140737488347312, 93824993869565, 0, 140737488347424, 93824994673374, 93825011242707, 3, 0, 0, 140737488347424, 93824994446493, 93824999850720, 0, 0} } }} save_jump = {{ __jmpbuf = {93824993869565, 0, 140737488347584, 93824994020698, 0, 51840, -1, 0}, __mask_was_saved = 1450661587, __saved_mask = { __val = {93825011242723, 140737488347520, 93824993870282, 93825011242707, 93825011242723, 140737488347584, 93824994446493, 93824999885184, 34464, 34464, 140737488347584, 93824993869565, 40105367251, 93825004306304, 140737488347624, 93824999850720} } }} tem = XIL(0x2aaa9b29c970) save = XIL(0) previous_echo_area_message = XIL(0) also_record = XIL(0) reread = false recorded = false polling_stopped_here = true orig_kboard = 0x555555dfa570 #10 0x00005555556ff6c8 in read_key_sequence (keybuf=0x7fffffffe440, prompt=XIL(0), dont_downcase_last=false, can_return_switch_frame=true, fix_current_buffer=true, prevent_redisplay=false) at keyboard.c:9124 interrupted_kboard = 0x555555dfa570 interrupted_frame = 0x5555560d7f80 key = XIL(0x5555557a7eff) used_mouse_menu = false echo_local_start = 0 last_real_key_start = 0 keys_local_start = 0 new_binding = XIL(0x7fffffffe310) count = 3 t = 0 echo_start = 0 keys_start = 0 current_binding = XIL(0x5555567756e3) first_unbound = 31 mock_input = 0 used_mouse_menu_history = {false <repeats 30 times>} fkey = { parent = XIL(0x555555d6ce03), map = XIL(0x555555d6ce03), start = 0, end = 0 } keytran = { parent = XIL(0x7ffff14a7dbb), map = XIL(0x7ffff14a7dbb), start = 0, end = 0 } indec = { parent = XIL(0x555555d6cdf3), map = XIL(0x555555d6cdf3), start = 0, end = 0 } shift_translated = false delayed_switch_frame = XIL(0) original_uppercase = XIL(0) original_uppercase_position = -1 dummyflag = false starting_buffer = 0x7ffff0e1f6f0 fake_prefixed_keys = XIL(0) first_event = XIL(0) second_event = XIL(0) #11 0x00005555556ee5fb in command_loop_1 () at keyboard.c:1348 cmd = XIL(0) keybuf = {XIL(0x7fffffffe490), XIL(0x5555557a804c), make_fixnum(11545945833472), XIL(0x7fffffffe4c0), XIL(0x555555c982e0), XIL(0), XIL(0), XIL(0x7fffffffe490), XIL(0x5555556e3efd), XIL(0xf0e1f6f5), XIL(0x7fffffffe500), make_fixnum(23456248668343), XIL(0x555555d18953), XIL(0x3), XIL(0x555555c982e0), XIL(0), XIL(0), XIL(0x7fffffffe4e0), XIL(0x5555556e3efd), XIL(0xf0e1f6f5), XIL(0x7fffffffe520), XIL(0x5555557a2e69), XIL(0x1556e3efd), XIL(0x5760), XIL(0x7fffffffe540), XIL(0x555555d53470), XIL(0), XIL(0), XIL(0x7fffffffe550), make_fixnum(23456248662876)} i = 32767 prev_modiff = 0 prev_buffer = 0x0 already_adjusted = false #12 0x00005555557a2a64 in internal_condition_case (bfun=0x5555556ee1b3 <command_loop_1>, handlers=XIL(0x5760), hfun=0x5555556ed968 <cmd_error>) at eval.c:1351 val = XIL(0x5555556e3efd) c = 0x555555d53470 #13 0x00005555556ede9b in command_loop_2 (ignore=XIL(0)) at keyboard.c:1091 val = XIL(0) #14 0x00005555557a22d5 in internal_catch (tag=XIL(0xd110), func=0x5555556ede6e <command_loop_2>, arg=XIL(0)) at eval.c:1112 val = XIL(0x45b00000000) c = 0x555555d53340 #15 0x00005555556ede39 in command_loop () at keyboard.c:1070 #16 0x00005555556ed537 in recursive_edit_1 () at keyboard.c:714 count = 1 val = make_fixnum(23456248667937) #17 0x00005555556ed6bb in Frecursive_edit () at keyboard.c:786 count = 0 buffer = XIL(0) #18 0x00005555556eb49d in main (argc=4, argv=0x7fffffffe8b8) at emacs.c:2086 stack_bottom_variable = 0x20 do_initial_setlocale = true no_loadup = false junk = 0x0 dname_arg = 0x0 ch_to_dir = 0x0 original_pwd = 0x0 dump_mode = 0x0 skip_args = 0 temacs = 0x0 
attempt_load_pdump = true rlim = { rlim_cur = 10022912, rlim_max = 18446744073709551615 } sockfd = -1 [-- Attachment #2: thread-bug-2.el --] [-- Type: text/plain, Size: 6199 bytes --] ;; -*- lexical-binding: t -*- (require 'threads) (require 'eieio) (require 'cl-lib) (require 'ring) (defun debug (fmt &rest args) (princ (apply #'format fmt args) #'external-debugging-output) (terpri #'external-debugging-output)) (define-error 'thread-utils-thread-interrupted "Thread was interrupted" 'error) (defun thread-utils-main-thread-p (&optional object) (let ((object (or object (current-thread)))) (and (threadp object) (eq object (car (all-threads)))))) (defun thread-utils-quitable-apply (fn &rest args) (let* ((this-thread (current-thread)) (quit-thread (make-thread (lambda nil (condition-case nil (cl-loop (sleep-for 3)) (quit (thread-signal this-thread 'quit nil)) (thread-utils-thread-interrupted nil)))))) (unwind-protect (apply fn args) (thread-signal quit-thread 'thread-utils-thread-interrupted nil)))) (defun thread-utils-condition-quitable-wait (condition) (cl-check-type condition condition-variable) (thread-utils-quitable-apply #'condition-wait condition)) (defun thread-utils-condition-wait (condition) (if (thread-utils-main-thread-p) (thread-utils-condition-quitable-wait condition) (condition-wait condition))) (defconst channel-default-capacity 16) (defclass channel-terminal nil ((mutex :initarg :mutex :type mutex) (condition :initarg :condition :type condition-variable) (msg-queue :initarg :msg-queue :type ring) (closed-p :initform nil) (other-terminal :type (or null channel-terminal)))) (defclass channel-source (channel-terminal) nil) (defclass channel-sink (channel-terminal) nil) (define-error 'channel-closed "Trying to send/recv from a closed channel") (defun make-channel (&optional capacity) (unless capacity (setq capacity channel-default-capacity)) (cl-check-type capacity (integer 1 *)) (let* ((mutex (make-mutex "channel")) (condition (make-condition-variable mutex "channel")) (msg-queue (make-ring capacity)) (source (channel-source :mutex mutex :condition condition :msg-queue msg-queue)) (sink (channel-sink :mutex mutex :condition condition :msg-queue msg-queue))) (oset source other-terminal sink) (oset sink other-terminal source) (cons source sink))) (cl-defgeneric channel-send ((source channel-source) message) (with-mutex (oref source mutex) (with-slots (condition msg-queue) source (while (and (not (channel-closed-p source)) (= (ring-size msg-queue) (ring-length msg-queue))) (thread-utils-condition-wait condition)) (when (channel-closed-p source) (signal 'channel-closed (list source))) (let ((inhibit-quit t)) (ring-insert msg-queue message) (when (= 1 (ring-length msg-queue)) (condition-notify condition t))) nil))) (cl-defgeneric channel-recv ((sink channel-terminal)) (with-mutex (oref sink mutex) (with-slots (condition msg-queue) sink (while (and (not (channel-closed-p sink)) (ring-empty-p msg-queue)) (thread-utils-condition-wait condition)) (when (channel-closed-p sink) (signal 'channel-closed (list sink))) (let ((inhibit-quit t)) (prog1 (ring-remove msg-queue) (when (= 1 (- (ring-size msg-queue) (ring-length msg-queue))) (condition-notify condition t))))))) (cl-defgeneric channel-peek ((sink channel-terminal)) (with-mutex (oref sink mutex) (with-slots (condition msg-queue) sink (while (and (not (channel-closed-p sink)) (ring-empty-p msg-queue)) (thread-utils-condition-wait condition)) (when (channel-closed-p sink) (signal 'channel-closed (list sink))) (ring-ref msg-queue -1)))) 
(cl-defgeneric channel-close ((terminal channel-terminal)) (with-mutex (oref terminal mutex) (with-slots (closed-p condition) terminal (setq closed-p t) (condition-notify condition t)) nil)) (cl-defmethod channel-closed-p ((source channel-source)) (with-mutex (oref source mutex) (with-slots (closed-p other-terminal) source (or closed-p (oref other-terminal closed-p))))) (cl-defmethod channel-closed-p ((sink channel-sink)) (with-mutex (oref sink mutex) (with-slots (closed-p other-terminal msg-queue) sink (or closed-p (and (oref other-terminal closed-p) (ring-empty-p msg-queue)))))) (defclass future nil ((channel :initform (make-channel 1)))) (defun make-future () (make-instance 'future)) (cl-defgeneric future-set ((future future) value) (with-slots (channel) future (let ((inhibit-quit t)) (condition-case nil (progn (debug "Sending future") (channel-send (car channel) value) (debug "Future send")) (channel-closed (signal 'error (list future)))) (debug "Closing future") (channel-close (car channel)) (debug "Future closed")))) (cl-defgeneric future-get ((future future)) (with-slots (channel) future (debug "Getting future") (channel-peek (cdr channel)) (debug "Future got"))) (defclass future-deferred (future) ((producer :initarg :producer :type function) (value-produced-p :initform nil) (mutex :initform (make-mutex "future-deferred")))) (defun make-deferred-future (producer) (make-instance 'future-deferred :producer producer)) (cl-defmethod future-get :before ((future future-deferred)) (with-slots (mutex value-produced-p producer) future (with-mutex mutex (unless value-produced-p (unwind-protect (make-thread (lambda nil (debug "Setting Future") (future-set future (funcall producer)) (debug "Future set")) "XEmacs") (setq value-produced-p t)))))) (let ((future (make-deferred-future (lambda nil 42)))) (future-get future)) ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-11 20:51 bug#36609: 27.0.50; Possible race-condition in threading implementation Andreas Politz @ 2019-07-12 7:47 ` Eli Zaretskii 2019-07-12 8:05 ` Eli Zaretskii 2019-07-12 9:02 ` Pip Cet ` (2 subsequent siblings) 3 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 7:47 UTC (permalink / raw) To: Andreas Politz; +Cc: 36609 > From: Andreas Politz <politza@hochschule-trier.de> > Date: Thu, 11 Jul 2019 22:51:10 +0200 > > I think there is a race-condition in the implementation of threads. Not sure what you mean by "race condition". Since only one Lisp thread can run at any given time, where could this race happen, and between what threads? > I tried to find a minimal test-case, without success. Thus, I've > attached a lengthy source-file. Loading that file should trigger > this bug and may freeze your session. FWIW, it doesn't freeze my Emacs here, neither on GNU/Linux nor on MS-Windows (with today's master). Are you getting freezes with 100% reproducibility? If so, please show all the details of your build, as collected by report-emacs-bug. > Indications: > > 1. The main-thread has the name of one of created threads (XEmacs in > this case), instead of "emacs". I don't think this is related to the problem. I think we have a bug in the implementation of the 'name' attribute of threads: we call prctl(PR_SET_NAME), which AFAIU changes the name of the _calling_ thread, and the calling thread at that point is the main thread, see sys_thread_create. If I evaluate this: (make-thread 'ignore "XEmacs") and then attach a debugger, I see that the name of the main thread has changed to "XEmacs". > 2. Emacs stops processing all keyboard/mouse input while looping in > wait_reading_process_output. Sending commands via emacsclient still > works. > > GDB output: > > (gdb) info threads > Id Target Id Frame > * 1 Thread 0x7ffff17f5d40 (LWP 26264) "XEmacs" 0x000055555576eac0 in XPNTR (a=XIL(0x7ffff1312533)) at alloc.c:535 > 2 Thread 0x7ffff0ac4700 (LWP 26265) "gmain" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6 > 3 Thread 0x7fffebd1a700 (LWP 26266) "gdbus" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6 > 4 Thread 0x7fffeb519700 (LWP 26267) "dconf worker" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6 > > (gdb) bt full This "bt full" is not enough, because it shows the backtrace of one thread only, the main thread. Please show the backtrace of the other 3 threads by typing "thread apply all bt full" instead. > #5 0x0000555555802a78 in wait_reading_process_output (time_limit=0, nsecs=0, read_kbd=-1, do_display=true, wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at process.c:5423 > process_skipped = false > channel = 6 > nfds = 1 > Available = { > fds_bits = {32, 0 <repeats 15 times>} > } Is the keyboard descriptor's bit in Available set or not? What is the contents of the fd_callback_info array at this point? > ;; -*- lexical-binding: t -*- > > (require 'threads) > (require 'eieio) > (require 'cl-lib) > (require 'ring) > > (defun debug (fmt &rest args) > (princ (apply #'format fmt args) #'external-debugging-output) > (terpri #'external-debugging-output)) Please describe what this program tries to accomplish, and how. It's not easy to reverse-engineer that from 200 lines of non-trivial code. It's possible that the reason(s) for the freeze will be apparent from describing your implementation ideas. Thanks. ^ permalink raw reply [flat|nested] 53+ messages in thread
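For concreteness, a minimal sketch of the naming behavior described above, assuming a Linux/glibc build (the helper names are illustrative and are not part of sys_thread_create): prctl (PR_SET_NAME, ...) always renames the thread that executes it, so naming the new thread instead requires either running prctl inside the new thread's start routine or using pthread_setname_np on the new thread's handle.

#define _GNU_SOURCE
#include <pthread.h>
#include <string.h>
#include <sys/prctl.h>

/* Illustrative only.  Linux truncates thread names to 15 characters
   plus the terminating NUL.  */
static void
name_current_thread (const char *name)
{
  /* Renames whichever thread calls this -- hence the main thread
     ending up as "XEmacs" when it is called from the creating thread.  */
  prctl (PR_SET_NAME, name);
}

static int
name_other_thread (pthread_t tid, const char *name)
{
  char buf[16];
  strncpy (buf, name, sizeof buf - 1);
  buf[sizeof buf - 1] = '\0';
  return pthread_setname_np (tid, buf);  /* glibc-specific */
}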
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 7:47 ` Eli Zaretskii @ 2019-07-12 8:05 ` Eli Zaretskii 0 siblings, 0 replies; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 8:05 UTC (permalink / raw) To: politza; +Cc: 36609 > Date: Fri, 12 Jul 2019 10:47:37 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 36609@debbugs.gnu.org > > Please describe what this program tries to accomplish, and how. It's > not easy to reverse-engineer that from 200 lines of non-trivial code. > It's possible that the reason(s) for the freeze will be apparent from > describing your implementation ideas. Oh, and one more question: does the value of the global variable last_thread_error tell something interesting at that point? ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-11 20:51 bug#36609: 27.0.50; Possible race-condition in threading implementation Andreas Politz 2019-07-12 7:47 ` Eli Zaretskii @ 2019-07-12 9:02 ` Pip Cet 2019-07-12 12:42 ` Eli Zaretskii 2019-07-12 12:44 ` Pip Cet 2021-06-06 15:50 ` dick.r.chiang [not found] ` <87fsxv8182.fsf@dick> 3 siblings, 2 replies; 53+ messages in thread From: Pip Cet @ 2019-07-12 9:02 UTC (permalink / raw) To: Andreas Politz; +Cc: 36609 [-- Attachment #1: Type: text/plain, Size: 666 bytes --] On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz <politza@hochschule-trier.de> wrote: > I think there is a race-condition in the implementation of threads. I > tried to find a minimal test-case, without success. Thus, I've attached > a lengthy source-file. Loading that file should trigger this bug and > may freeze your session. It does here, so I can provide further debugging information if needed. On first glance, it appears that xgselect returns abnormally with g_main_context acquired in one thread, and then other threads fail to acquire it and loop endlessly. Can you confirm that this very dirty patch avoids (it's not a fix by any means!) the problem? [-- Attachment #2: Avoid-glib-issue.patch --] [-- Type: text/x-patch, Size: 1349 bytes --] diff --git a/src/xgselect.c b/src/xgselect.c index 9982a1f0e9..9ffeb6c66c 100644 --- a/src/xgselect.c +++ b/src/xgselect.c @@ -58,8 +58,16 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, int i, nfds, tmo_in_millisec, must_free = 0; bool need_to_dispatch; + ptrdiff_t count = SPECPDL_INDEX (); context = g_main_context_default (); - context_acquired = g_main_context_acquire (context); + context_acquired = true; + + auto void release_g_main_context (void) + { + } + + if (context_acquired) + record_unwind_protect_void (release_g_main_context); /* FIXME: If we couldn't acquire the context, we just silently proceed because this function handles more than just glib file descriptors. Note that, as implemented, this failure is completely silent: there is @@ -164,9 +172,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, errno = pselect_errno; } - if (context_acquired) - g_main_context_release (context); - /* To not have to recalculate timeout, return like this. */ if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0)) { @@ -174,6 +179,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, errno = EINTR; } - return retval; + return unbind_to (count, Qnil), retval; } #endif /* HAVE_GLIB */ ^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 9:02 ` Pip Cet @ 2019-07-12 12:42 ` Eli Zaretskii 2019-07-12 12:57 ` Pip Cet 2019-07-12 12:44 ` Pip Cet 1 sibling, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 12:42 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 09:02:22 +0000 > Cc: 36609@debbugs.gnu.org > > On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz > <politza@hochschule-trier.de> wrote: > > I think there is a race-condition in the implementation of threads. I > > tried to find a minimal test-case, without success. Thus, I've attached > > a lengthy source-file. Loading that file should trigger this bug and > > may freeze your session. > > It does here, so I can provide further debugging information if > needed. Thanks, can you provide the info I asked for? > On first glance, it appears that xgselect returns abnormally with > g_main_context acquired in one thread, and then other threads fail > to acquire it and loop endlessly. If you can describe what causes this to happen, I think we might be half-way to a solution. > + ptrdiff_t count = SPECPDL_INDEX (); I don't think we should do that at this low level. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 12:42 ` Eli Zaretskii @ 2019-07-12 12:57 ` Pip Cet 2019-07-12 13:31 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 12:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza On Fri, Jul 12, 2019 at 12:42 PM Eli Zaretskii <eliz@gnu.org> wrote: > > > From: Pip Cet <pipcet@gmail.com> > > Date: Fri, 12 Jul 2019 09:02:22 +0000 > > Cc: 36609@debbugs.gnu.org > > > > On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz > > <politza@hochschule-trier.de> wrote: > > > I think there is a race-condition in the implementation of threads. I > > > tried to find a minimal test-case, without success. Thus, I've attached > > > a lengthy source-file. Loading that file should trigger this bug and > > > may freeze your session. > > > > It does here, so I can provide further debugging information if > > needed. > > Thanks, can you provide the info I asked for? Yes, albeit not right now. > > On first glance, it appears that xgselect returns abnormally with > > g_main_context acquired in one thread, and then other threads fail > > to acquire it and loop endlessly. > > If you can describe what causes this to happen, I think we might be > half-way to a solution. Here's the backtrace of the abnormal exit I see with the patch attached: (gdb) bt full #0 0x00000000006bf987 in release_g_main_context (ptr=0xc1d070) at xgselect.c:36 context = 0x7fffedf79710 #1 0x0000000000616f03 in do_one_unbind (this_binding=0x7fffedf79770, unwinding=true, bindflag=SET_INTERNAL_UNBIND) at eval.c:3446 #2 0x0000000000617245 in unbind_to (count=0, value=XIL(0)) at eval.c:3567 this_binding = { kind = SPECPDL_UNWIND_PTR, unwind = { kind = SPECPDL_UNWIND_PTR, func = 0x6bf97b <release_g_main_context>, arg = XIL(0xc1d070), eval_depth = 0 }, unwind_array = { kind = SPECPDL_UNWIND_PTR, nelts = 7076219, array = 0xc1d070 }, unwind_ptr = { kind = SPECPDL_UNWIND_PTR, func = 0x6bf97b <release_g_main_context>, arg = 0xc1d070 }, unwind_int = { kind = SPECPDL_UNWIND_PTR, func = 0x6bf97b <release_g_main_context>, arg = 12701808 }, unwind_excursion = { kind = SPECPDL_UNWIND_PTR, marker = XIL(0x6bf97b), window = XIL(0xc1d070) }, unwind_void = { kind = SPECPDL_UNWIND_PTR, func = 0x6bf97b <release_g_main_context> }, let = { kind = SPECPDL_UNWIND_PTR, symbol = XIL(0x6bf97b), old_value = XIL(0xc1d070), where = XIL(0), saved_value = XIL(0xef26a0) }, bt = { kind = SPECPDL_UNWIND_PTR, debug_on_exit = false, function = XIL(0x6bf97b), args = 0xc1d070, nargs = 0 } } quitf = XIL(0) #3 0x00000000006116df in unwind_to_catch (catch=0x7fffd8000c50, type=NONLOCAL_EXIT_SIGNAL, value=XIL(0x14d3653)) at eval.c:1162 last_time = false #4 0x00000000006126d9 in signal_or_quit (error_symbol=XIL(0x90), data=XIL(0), keyboard_quit=false) at eval.c:1674 unwind_data = XIL(0x14d3653) conditions = XIL(0x7ffff05d676b) string = XIL(0x5f5e77) real_error_symbol = XIL(0x90) clause = XIL(0x30) h = 0x7fffd8000c50 #5 0x00000000006122e9 in Fsignal (error_symbol=XIL(0x90), data=XIL(0)) at eval.c:1564 #6 0x0000000000698901 in post_acquire_global_lock (self=0xe09db0) at thread.c:115 sym = XIL(0x90) data = XIL(0) prev_thread = 0xa745c0 <main_thread> #7 0x000000000069892b in acquire_global_lock (self=0xe09db0) at thread.c:123 #8 0x0000000000699303 in really_call_select (arg=0x7fffedf79a70) at thread.c:596 sa = 0x7fffedf79a70 self = 0xe09db0 oldset = { __val = {0, 0, 7, 0, 80, 140736817269952, 2031, 2080, 18446744073709550952, 32, 343597383808, 4, 0, 472446402655, 511101108348, 
0} } #9 0x00000000005e5ee0 in flush_stack_call_func (func=0x699239 <really_call_select>, arg=0x7fffedf79a70) at alloc.c:4969 end = 0x7fffedf79a30 self = 0xe09db0 sentry = { o = { __max_align_ll = 0, __max_align_ld = <invalid float value> } } #10 0x0000000000699389 in thread_select (func=0x419320 <pselect@plt>, max_fds=9, rfds=0x7fffedf79fa0, wfds=0x7fffedf79f20, efds=0x0, timeout=0x7fffedf7a260, sigmask=0x0) at thread.c:616 sa = { func = 0x419320 <pselect@plt>, max_fds = 9, rfds = 0x7fffedf79fa0, wfds = 0x7fffedf79f20, efds = 0x0, timeout = 0x7fffedf7a260, sigmask = 0x0, result = 1 } #11 0x00000000006bfef5 in xg_select (fds_lim=9, rfds=0x7fffedf7a300, wfds=0x7fffedf7a280, efds=0x0, timeout=0x7fffedf7a260, sigmask=0x0) at xgselect.c:130 all_rfds = { fds_bits = {8, 0 <repeats 15 times>} } all_wfds = { fds_bits = {0 <repeats 16 times>} } tmo = { tv_sec = 0, tv_nsec = 0 } tmop = 0x7fffedf7a260 context = 0xc1d070 have_wfds = true gfds_buf = {{ fd = 5, events = 1, revents = 0 }, { fd = 6, events = 1, revents = 0 }, { fd = 8, events = 1, revents = 0 }, { fd = 0, events = 0, revents = 0 } <repeats 125 times>} gfds = 0x7fffedf79b10 gfds_size = 128 n_gfds = 3 retval = 0 our_fds = 0 max_fds = 8 context_acquired = true i = 3 nfds = 0 tmo_in_millisec = -1 must_free = 0 need_to_dispatch = false count = 3 #12 0x000000000066b757 in wait_reading_process_output (time_limit=3, nsecs=0, read_kbd=0, do_display=false, wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at process.c:5423 process_skipped = false channel = 0 nfds = 0 Available = { fds_bits = {8, 0 <repeats 15 times>} } Writeok = { fds_bits = {0 <repeats 16 times>} } check_write = true check_delay = 0 no_avail = false xerrno = 0 proc = XIL(0x7fffedf7a440) timeout = { tv_sec = 3, tv_nsec = 0 } end_time = { tv_sec = 1562935633, tv_nsec = 911868453 } timer_delay = { tv_sec = 0, tv_nsec = -1 } got_output_end_time = { tv_sec = 0, tv_nsec = -1 } wait = TIMEOUT got_some_output = -1 prev_wait_proc_nbytes_read = 0 retry_for_async = false count = 2 now = { tv_sec = 0, tv_nsec = -1 } #13 0x0000000000429bf6 in Fsleep_for (seconds=make_fixnum(3), milliseconds=XIL(0)) at dispnew.c:5825 t = { tv_sec = 3, tv_nsec = 0 } tend = { tv_sec = 1562935633, tv_nsec = 911868112 } duration = 3 #14 0x0000000000613e99 in eval_sub (form=XIL(0xf6df73)) at eval.c:2273 i = 2 maxargs = 2 args_left = XIL(0) numargs = 1 original_fun = XIL(0x7fffefa9fb98) original_args = XIL(0xf6df83) count = 1 fun = XIL(0xa756a5) val = XIL(0) funcar = make_fixnum(35184372085343) argvals = {make_fixnum(3), XIL(0), XIL(0), XIL(0), XIL(0), XIL(0), XIL(0), XIL(0)} #15 0x0000000000610032 in Fprogn (body=XIL(0)) at eval.c:462 form = XIL(0xf6df73) val = XIL(0) #16 0x0000000000616102 in funcall_lambda (fun=XIL(0xf6da43), nargs=0, arg_vector=0xe09dd8) at eval.c:3065 val = XIL(0xc0) syms_left = XIL(0) next = XIL(0x3400000013) lexenv = XIL(0) count = 1 i = 0 optional = false rest = false #17 0x0000000000615542 in Ffuncall (nargs=1, args=0xe09dd0) at eval.c:2813 fun = XIL(0xf6da43) original_fun = XIL(0xf6da43) funcar = XIL(0xc0) numargs = 0 val = XIL(0xaf72e0) count = 0 #18 0x000000000069956f in invoke_thread_function () at thread.c:702 count = 0 #19 0x0000000000611d61 in internal_condition_case (bfun=0x69953e <invoke_thread_function>, handlers=XIL(0x30), hfun=0x699596 <record_thread_error>) at eval.c:1351 val = make_fixnum(1405386) c = 0x7fffd8000c50 #20 0x0000000000699697 in run_thread (state=0xe09db0) at thread.c:741 stack_pos = { __max_align_ll = 0, __max_align_ld = 0 } self = 0xe09db0 iter = 0x0 c = 
0x7fffd8000b20 #21 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486 ret = <optimized out> pd = <optimized out> now = <optimized out> unwind_buf = { cancel_jmp_buf = {{ jmp_buf = {140737185822464, -1249422724209328276, 140737488341374, 140737488341375, 140737185822464, 0, 1249453444682727276, 1249398985402204012}, mask_was_saved = 0 }}, priv = { pad = {0x0, 0x0, 0x0, 0x0}, data = { prev = 0x0, cleanup = 0x0, canceltype = 0 } } } not_first_call = <optimized out> #22 0x00007ffff49724cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Lisp Backtrace: "sleep-for" (0xedf7a530) 0xf6da40 Lisp type 3 post_acquire_global_lock () can return abnormally (I didn't know that), so really_call_select() can, too, so thread_select() can, too. > > + ptrdiff_t count = SPECPDL_INDEX (); > > I don't think we should do that at this low level. You're right, it does stick out. I think we're safe because we're calling Fsignal with the global lock held, but it's not a pretty or well-documented situation. ^ permalink raw reply [flat|nested] 53+ messages in thread
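For reference, a condensed sketch of the mechanism described above, loosely following src/thread.c (simplified, not the verbatim source): once a thread re-acquires the global lock, any error queued for it by thread-signal is raised immediately, which is why post_acquire_global_lock, really_call_select and thread_select can all exit nonlocally through Fsignal.

/* Simplified sketch, not verbatim thread.c.  */
static void
post_acquire_global_lock (struct thread_state *self)
{
  current_thread = self;
  /* ... unbind/rebind thread-local let-bindings on a thread switch ... */

  /* If another thread queued an error for us via thread-signal,
     raise it now.  In a non-main thread Fsignal longjmps to the
     handler installed by run_thread, so this call never returns.  */
  if (!NILP (current_thread->error_symbol))
    {
      Lisp_Object sym = current_thread->error_symbol;
      Lisp_Object data = current_thread->error_data;
      current_thread->error_symbol = Qnil;
      current_thread->error_data = Qnil;
      Fsignal (sym, data);
    }
}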
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 12:57 ` Pip Cet @ 2019-07-12 13:31 ` Eli Zaretskii 2019-07-12 13:51 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 13:31 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 12:57:51 +0000 > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > Lisp Backtrace: > "sleep-for" (0xedf7a530) > 0xf6da40 Lisp type 3 > > post_acquire_global_lock () can return abnormally (I didn't know > that), so really_call_select() can, too, so thread_select() can, too. Do you know which code sets current_thread->error_symbol, and what is that error symbol? > > > + ptrdiff_t count = SPECPDL_INDEX (); > > > > I don't think we should do that at this low level. > > You're right, it does stick out. I think we're safe because we're > calling Fsignal with the global lock held, but it's not a pretty or > well-documented situation. Is this the main thread which does that? if so, there should be no problem with holding the global lock in this case, is there? If this is not the main thread, then the error handlers should be set so as to log the error in last_thread_error, and then terminate the thread (via exiting the thread function, AFAIR). If this is not what happens, I'd like to know what does happen here, and why. Thanks. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 13:31 ` Eli Zaretskii @ 2019-07-12 13:51 ` Pip Cet 2019-07-12 15:05 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 13:51 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza On Fri, Jul 12, 2019 at 1:31 PM Eli Zaretskii <eliz@gnu.org> wrote:> > > From: Pip Cet <pipcet@gmail.com> > > Date: Fri, 12 Jul 2019 12:57:51 +0000 > > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > > > Lisp Backtrace: > > "sleep-for" (0xedf7a530) > > 0xf6da40 Lisp type 3 > > > > post_acquire_global_lock () can return abnormally (I didn't know > > that), so really_call_select() can, too, so thread_select() can, too. > > Do you know which code sets current_thread->error_symbol, and what is > that error symbol? thread-signal, which my example code calls directly. I passed 'error, and that's what error_symbol is set to. last_thread_error is (error) when the Emacs session freezes. > > > > + ptrdiff_t count = SPECPDL_INDEX (); > > > > > > I don't think we should do that at this low level. > > > > You're right, it does stick out. I think we're safe because we're > > calling Fsignal with the global lock held, but it's not a pretty or > > well-documented situation. > > Is this the main thread which does that? if so, there should be no > problem with holding the global lock in this case, is there? > > If this is not the main thread, then the error handlers should be set > so as to log the error in last_thread_error, and then terminate the > thread (via exiting the thread function, AFAIR). Indeed, that's what happens; but the thread now fails to release the GLib lock it had previously acquired, so other threads cannot acquire it again, ever. > If this is not what happens, I'd like to know what does happen here, > and why. Sure, we should understand what's going on here; even if the fix turns out to be simple. ^ permalink raw reply [flat|nested] 53+ messages in thread
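To make the failure mode concrete, here is a heavily condensed sketch of the xg_select flow in question (illustrative only, not the verbatim xgselect.c; includes and most details omitted): the release at the bottom is exactly what gets skipped when thread_select exits nonlocally because of the queued signal.

/* Condensed sketch, not verbatim xgselect.c.  */
int
xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
           struct timespec *timeout, sigset_t *sigmask)
{
  GMainContext *context = g_main_context_default ();
  bool context_acquired = g_main_context_acquire (context);

  /* ... merge Glib's descriptors and timeout into the fd sets ... */

  /* Blocks in pselect; if a thread-signal is pending when the global
     lock is re-acquired, this call longjmps out of xg_select, the
     release below never runs, and no thread can acquire the default
     context again.  */
  int retval = thread_select (pselect, fds_lim, rfds, wfds, efds,
                              timeout, sigmask);

  /* ... dispatch any pending Glib events ... */

  if (context_acquired)
    g_main_context_release (context);
  return retval;
}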
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 13:51 ` Pip Cet @ 2019-07-12 15:05 ` Eli Zaretskii 2019-07-12 18:06 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 15:05 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 13:51:48 +0000 > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > > If this is not the main thread, then the error handlers should be set > > so as to log the error in last_thread_error, and then terminate the > > thread (via exiting the thread function, AFAIR). > > Indeed, that's what happens; but the thread now fails to release the > GLib lock it had previously acquired, so other threads cannot acquire > it again, ever. Then we should make sure we release it when the thread function exits, or before we call Fsignal from post_acquire_global_lock. There's no reason to use the unwind-protect machinery for that, I think. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 15:05 ` Eli Zaretskii @ 2019-07-12 18:06 ` Pip Cet 2019-07-12 18:27 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 18:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza On Fri, Jul 12, 2019 at 3:06 PM Eli Zaretskii <eliz@gnu.org> wrote: > > From: Pip Cet <pipcet@gmail.com> > > Date: Fri, 12 Jul 2019 13:51:48 +0000 > > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > > > > If this is not the main thread, then the error handlers should be set > > > so as to log the error in last_thread_error, and then terminate the > > > thread (via exiting the thread function, AFAIR). > > > > Indeed, that's what happens; but the thread now fails to release the > > GLib lock it had previously acquired, so other threads cannot acquire > > it again, ever. > > Then we should make sure we release it when the thread function exits, That's too late, it's legitimate for the thread to have a signal handler. We need to release the lock at precisely the time that unbind_to would release it. > or before we call Fsignal from post_acquire_global_lock. There's no > reason to use the unwind-protect machinery for that, I think. If we call Fsignal, surely the unwind-protect machinery is already set up and we can safely call it? So unbind_to would work again... I guess I've changed my mind: the unwind-protect machinery does precisely what we want, we should simply document that thread_select can exit nonlocally and that the select functions need to be written to cater to that. Two places call xg_select, and they both run long after unbind_to works. Doing this without the unwind-protect machinery is difficult: for example, a comment in post_acquire_global_lock claims unbind_for_thread_switch can exit nonlocally, though I don't think that actually happens or we would have seen the bug report. What's your proposal for doing this? ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 18:06 ` Pip Cet @ 2019-07-12 18:27 ` Eli Zaretskii 2019-07-12 18:34 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 18:27 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 18:06:29 +0000 > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > > Then we should make sure we release it when the thread function exits, > > That's too late, it's legitimate for the thread to have a signal > handler. We need to release the lock at precisely the time that > unbind_to would release it. I had also another proposal: > > or before we call Fsignal from post_acquire_global_lock. There's no > > reason to use the unwind-protect machinery for that, I think. > > If we call Fsignal, surely the unwind-protect machinery is already set > up and we can safely call it? So unbind_to would work again... Sorry, I don't want to call unwind-protect there. Call me paranoid, if you want. > I guess I've changed my mind: the unwind-protect machinery does > precisely what we want, we should simply document that thread_select > can exit nonlocally and that the select functions need to be written > to cater to that. > > Two places call xg_select, and they both run long after unbind_to works. > > Doing this without the unwind-protect machinery is difficult: for > example, a comment in post_acquire_global_lock claims > unbind_for_thread_switch can exit nonlocally, though I don't think > that actually happens or we would have seen the bug report. > > What's your proposal for doing this? Like I said: release the lock before calling Fsignal. We could do that before calling unbind_for_thread_switch, if needed. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 18:27 ` Eli Zaretskii @ 2019-07-12 18:34 ` Eli Zaretskii 2019-07-12 19:24 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 18:34 UTC (permalink / raw) To: pipcet; +Cc: 36609, politza > Date: Fri, 12 Jul 2019 21:27:40 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de > > Sorry, I don't want to call unwind-protect there. Call me paranoid, > if you want. Maybe I should explain the rationale behind that paranoia. The main reason is that you are proposing to do that inside code that can switch threads. Switching threads means switching to another stack and also to another set of handlers. So using the unwind-protect machinery in this situation is IMO asking for trouble. And then there's the TTY frame case, where C-g triggers SIGINT, and we actually longjmp from wherever we were... ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 18:34 ` Eli Zaretskii @ 2019-07-12 19:24 ` Pip Cet 2019-07-12 19:57 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 19:24 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza [-- Attachment #1: Type: text/plain, Size: 1069 bytes --] On Fri, Jul 12, 2019 at 6:34 PM Eli Zaretskii <eliz@gnu.org> wrote: > > Sorry, I don't want to call unwind-protect there. Call me paranoid, > > if you want. > > Maybe I should explain the rationale behind that paranoia. The main I don't think it's paranoia, I just wasn't sure there was a good way to avoid it. I'm now convinced that there simply is no safe way to call select() from two threads at once when using glib. I think our options are hacking around it and hoping nothing breaks (this is what the attached patch does; it releases the main context glib lock from the wrong thread soon "after" the other thread called select, but there's actually no way to ensure that "after" is accurate), or rewriting things so we have a single thread that does all the select()ing. > reason is that you are proposing to do that inside code that can > switch threads. Switching threads means switching to another stack > and also to another set of handlers. So using the unwind-protect > machinery in this situation is IMO asking for trouble. Thanks for explaining. [-- Attachment #2: glib-hack.diff --] [-- Type: application/x-patch, Size: 4445 bytes --] ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 19:24 ` Pip Cet @ 2019-07-12 19:57 ` Eli Zaretskii 2019-07-13 14:37 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 19:57 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 19:24:38 +0000 > Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de > > I'm now convinced that there simply is no safe way to call select() > from two threads at once when using glib. I hope not, although GTK with its idiosyncrasies caused a lot of pain for the threads implementation in Emacs. > I think our options are hacking around it and hoping nothing breaks > (this is what the attached patch does; it releases the main context > glib lock from the wrong thread soon "after" the other thread called > select, but there's actually no way to ensure that "after" is > accurate), or rewriting things so we have a single thread that does > all the select()ing. Hmm... how would this work with your patch? Suppose one thread calls xg_select, acquires the Glib lock, sets its holding_glib_lock flag, then releases the global Lisp lock and calls pselect. Since the global Lisp lock is now up for grabs, some other Lisp thread can acquire it and start running. If that other thread then calls xg_select, it will hang forever trying to acquire the Glib lock, because the first thread that holds it is stuck in pselect. Am I missing something? I know very little about GTK and the Glib context lock, but AFAIR we really must treat that lock as a global one, not a thread-local one. So I think it's okay for one thread to take the Glib lock, and another to release it, because Glib just wants to know whether the "rest of the program" has it, it doesn't care which thread is that which holds the lock. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 19:57 ` Eli Zaretskii @ 2019-07-13 14:37 ` Pip Cet 2019-07-13 15:03 ` Eli Zaretskii 2019-07-13 15:04 ` Andreas Politz 0 siblings, 2 replies; 53+ messages in thread From: Pip Cet @ 2019-07-13 14:37 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza [-- Attachment #1: Type: text/plain, Size: 2523 bytes --] On Fri, Jul 12, 2019 at 7:57 PM Eli Zaretskii <eliz@gnu.org> wrote: > > I'm now convinced that there simply is no safe way to call select() > > from two threads at once when using glib. > > I hope not, although GTK with its idiosyncrasies caused a lot of pain > for the threads implementation in Emacs. Well, I think we're going to have to do one or more of the following: - have a race condition - access glib-locked data from the "wrong" thread (another Emacs thread) - release the glib lock from the "wrong" thread (another Emacs thread) Of these, the second is the best alternative, I think: we simply grab the g_main_context lock globally, acting for all Emacs threads, and the last thread to require it releases it when it leaves xg_select. As long as there's at least one thread in the critical section of xg_select, we hold the lock, but access to the context isn't necessarily from the thread which locked it. > > I think our options are hacking around it and hoping nothing breaks > > (this is what the attached patch does; it releases the main context > > glib lock from the wrong thread soon "after" the other thread called > > select, but there's actually no way to ensure that "after" is > > accurate), or rewriting things so we have a single thread that does > > all the select()ing. > > Hmm... how would this work with your patch? Suppose one thread calls > xg_select, acquires the Glib lock, sets its holding_glib_lock flag, > then releases the global Lisp lock and calls pselect. Since the > global Lisp lock is now up for grabs, some other Lisp thread can > acquire it and start running. And when it starts running, it releases the Glib lock. > If that other thread then calls > xg_select, it will hang forever trying to acquire the Glib lock, > because the first thread that holds it is stuck in pselect. The first thread no longer holds the Glib lock, it was released when we switched threads. > I know very little about GTK and the Glib context lock, but AFAIR we > really must treat that lock as a global one, not a thread-local one. > So I think it's okay for one thread to take the Glib lock, and another > to release it, because Glib just wants to know whether the "rest of > the program" has it, it doesn't care which thread is that which holds > the lock. Okay, that sounds like option #2 above. The attached patch exposes glib externals to the generic code, but it appears to work. If you think the approach is okay, I'll move the glib-specific parts to xgselect.c (if that's okay). [-- Attachment #2: glib-hack-002.diff --] [-- Type: text/x-patch, Size: 5894 bytes --] diff --git a/src/thread.c b/src/thread.c index e2deadd7a8..0ddb79460b 100644 --- a/src/thread.c +++ b/src/thread.c @@ -19,6 +19,9 @@ Copyright (C) 2012-2019 Free Software Foundation, Inc. 
#include <config.h> #include <setjmp.h> +#ifdef HAVE_GLIB +#include <glib.h> +#endif #include "lisp.h" #include "character.h" #include "buffer.h" @@ -82,7 +85,7 @@ post_acquire_global_lock (struct thread_state *self) /* Do this early on, so that code below could signal errors (e.g., unbind_for_thread_switch might) correctly, because we are already - running in the context of the thread pointed by SELF. */ + running in the context of the thread pointed to by SELF. */ current_thread = self; if (prev_thread != current_thread) @@ -586,6 +589,17 @@ really_call_select (void *arg) sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds, sa->timeout, sa->sigmask); +#ifdef HAVE_GLIB + /* Release the Glib lock, if there are no other threads in the + critical section. */ + if (current_thread != NULL && current_thread->holding_glib_lock) + { + current_thread->holding_glib_lock = false; + if (--threads_holding_glib_lock == 0) + g_main_context_release (glib_main_context); + } +#endif + block_interrupt_signal (&oldset); /* If we were interrupted by C-g while inside sa->func above, the signal handler could have called maybe_reacquire_global_lock, in @@ -756,6 +770,13 @@ run_thread (void *state) } } +#ifdef HAVE_GLIB + /* Remember to release the Glib lock we might still be holding + (?) */ + if (current_thread->holding_glib_lock) + if (--threads_holding_glib_lock == 0) + g_main_context_release (glib_main_context); +#endif current_thread = NULL; sys_cond_broadcast (&self->thread_condvar); diff --git a/src/thread.h b/src/thread.h index 498b9909c9..1a58f65c88 100644 --- a/src/thread.h +++ b/src/thread.h @@ -29,9 +29,18 @@ #define THREAD_H #include <signal.h> /* sigset_t */ #endif +#ifdef HAVE_GLIB +#include <glib.h> +#endif + #include "sysselect.h" /* FIXME */ #include "systhread.h" +#ifdef HAVE_GLIB +extern ptrdiff_t threads_holding_glib_lock; +extern GMainContext *glib_main_context; +#endif + struct thread_state { union vectorlike_header header; @@ -169,6 +178,9 @@ #define getcjmp (current_thread->m_getcjmp) interrupter should broadcast to this condition. */ sys_cond_t *wait_condvar; +#ifdef HAVE_GLIB + bool holding_glib_lock; +#endif /* This thread might have released the global lock. If so, this is non-zero. When a thread runs outside thread_select with this flag non-zero, it means it has been interrupted by SIGINT while diff --git a/src/xgselect.c b/src/xgselect.c index 9982a1f0e9..0c95857ef9 100644 --- a/src/xgselect.c +++ b/src/xgselect.c @@ -29,6 +29,9 @@ Copyright (C) 2009-2019 Free Software Foundation, Inc. #include "blockinput.h" #include "systime.h" +ptrdiff_t threads_holding_glib_lock; +GMainContext *glib_main_context; + /* `xg_select' is a `pselect' replacement. Why do we need a separate function? 1. Timeouts. Glib and Gtk rely on timer events. If we did pselect with a greater timeout then the one scheduled by Glib, we would @@ -54,26 +57,28 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, GPollFD *gfds = gfds_buf; int gfds_size = ARRAYELTS (gfds_buf); int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1; - bool context_acquired = false; int i, nfds, tmo_in_millisec, must_free = 0; bool need_to_dispatch; context = g_main_context_default (); - context_acquired = g_main_context_acquire (context); - /* FIXME: If we couldn't acquire the context, we just silently proceed - because this function handles more than just glib file descriptors. - Note that, as implemented, this failure is completely silent: there is - no feedback to the caller. 
*/ + /* Acquire the lock. This is a busy wait for testing. */ + if (current_thread != NULL && !current_thread->holding_glib_lock) + { + if (threads_holding_glib_lock++ == 0) + while (!g_main_context_acquire (context)) + { + } + current_thread->holding_glib_lock = true; + glib_main_context = context; + } if (rfds) all_rfds = *rfds; else FD_ZERO (&all_rfds); if (wfds) all_wfds = *wfds; else FD_ZERO (&all_wfds); - n_gfds = (context_acquired - ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec, - gfds, gfds_size) - : -1); + n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec, + gfds, gfds_size); if (gfds_size < n_gfds) { @@ -151,8 +156,19 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, #else need_to_dispatch = true; #endif - if (need_to_dispatch && context_acquired) + if (need_to_dispatch) { + /* Acquire the lock. This is a busy wait for testing. */ + glib_main_context = context; + if (!current_thread->holding_glib_lock) + { + if (threads_holding_glib_lock++ == 0) + while (!g_main_context_acquire (context)) + { + } + current_thread->holding_glib_lock = true; + } + int pselect_errno = errno; /* Prevent g_main_dispatch recursion, that would occur without block_input wrapper, because event handlers call @@ -164,8 +180,12 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, errno = pselect_errno; } - if (context_acquired) - g_main_context_release (context); + if (current_thread != NULL && current_thread->holding_glib_lock) + if (--threads_holding_glib_lock == 0) + { + g_main_context_release (context); + current_thread->holding_glib_lock = false; + } /* To not have to recalculate timeout, return like this. */ if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0)) ^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 14:37 ` Pip Cet @ 2019-07-13 15:03 ` Eli Zaretskii 2019-07-13 15:13 ` Eli Zaretskii 2019-07-13 15:54 ` Eli Zaretskii 2019-07-13 15:04 ` Andreas Politz 1 sibling, 2 replies; 53+ messages in thread From: Eli Zaretskii @ 2019-07-13 15:03 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Sat, 13 Jul 2019 14:37:25 +0000 > Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de > > Well, I think we're going to have to do one or more of the following: > > - have a race condition > - access glib-locked data from the "wrong" thread (another Emacs thread) > - release the glib lock from the "wrong" thread (another Emacs thread) > > Of these, the second is the best alternative, I think: we simply grab > the g_main_context lock globally, acting for all Emacs threads, and > the last thread to require it releases it when it leaves xg_select. I agree. > As long as there's at least one thread in the critical section of > xg_select, we hold the lock, but access to the context isn't > necessarily from the thread which locked it. Hmm... wouldn't this cause us to always hold that lock when there's more than one Lisp thread? And if so, what will that do to GTK and its own threads, which also want to acquire the Glib context? IOW, I guess I don't understand what problem the proposed changes would solve that the current code doesn't. Can you tell? > > Hmm... how would this work with your patch? Suppose one thread calls > > xg_select, acquires the Glib lock, sets its holding_glib_lock flag, > > then releases the global Lisp lock and calls pselect. Since the > > global Lisp lock is now up for grabs, some other Lisp thread can > > acquire it and start running. > > And when it starts running, it releases the Glib lock. But only if its holding_glib_lock flag is set, which is not necessarily true. For example, what about a thread that was just created, and never called xg_select yet? > Okay, that sounds like option #2 above. The attached patch exposes > glib externals to the generic code, but it appears to work. If you > think the approach is okay, I'll move the glib-specific parts to > xgselect.c (if that's okay). Yes, we should have accessor functions in xgselect.c instead of letting thread.c etc. access the GTK/Glib data directly. But I still would like to understand why we need to maintain a flag per thread. Also, I think I lost the context: does this attempt to solve the original problem reported by Andreas, or only the problem reported by you later in the discussion? Thanks. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 15:03 ` Eli Zaretskii @ 2019-07-13 15:13 ` Eli Zaretskii 2019-07-13 15:54 ` Eli Zaretskii 1 sibling, 0 replies; 53+ messages in thread From: Eli Zaretskii @ 2019-07-13 15:13 UTC (permalink / raw) To: pipcet; +Cc: 36609, politza > Date: Sat, 13 Jul 2019 18:03:35 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de > > Also, I think I lost the context: does this attempt to solve the > original problem reported by Andreas, or only the problem reported by > you later in the discussion? I guess Andreas just answered this question. Thanks. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 15:03 ` Eli Zaretskii 2019-07-13 15:13 ` Eli Zaretskii @ 2019-07-13 15:54 ` Eli Zaretskii 2019-07-13 15:57 ` Pip Cet 1 sibling, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-13 15:54 UTC (permalink / raw) To: pipcet; +Cc: 36609, politza > Date: Sat, 13 Jul 2019 18:03:35 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de > > But I still would like to understand why we need to maintain a flag > per thread. Specifically, why not make this flag thread-independent, and have any thread release the context lock if that variable is non-zero? I don't think it matters which thread locked the Glib context, we just need to release it once we emerge from thread_select, and do it before we might call Fsignal. That the lock might be released by a thread other than the one which locked it shouldn't matter, I think, as long as we consider all the Lisp threads acting together on behalf of Emacs. ^ permalink raw reply [flat|nested] 53+ messages in thread
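The scheme converged on here is what the patch further down in the thread implements as acquire_select_lock/release_select_lock. As a standalone illustration only, here is a trimmed sketch of the idea, assuming GLib (compile with pkg-config glib-2.0) and nothing Emacs-specific; the names are illustrative, not the ones Emacs uses, and the counter is left unprotected just as in the real patch (which becomes relevant later in the thread). The single-threaded main exists only so the sketch runs.

#include <glib.h>
#include <stdio.h>

/* Process-wide count of threads currently inside select: the first
   one in acquires the default main context, the last one out releases
   it, regardless of which thread that happens to be.  */
static int threads_in_select;
static GMainContext *held_context;

static void
enter_select (GMainContext *context)
{
  if (threads_in_select++ == 0)
    {
      /* g_main_context_acquire fails if another thread owns the
         context, so spin until it succeeds.  */
      while (!g_main_context_acquire (context))
        ;
      held_context = context;
    }
}

static void
leave_select (void)
{
  if (--threads_in_select == 0)
    g_main_context_release (held_context);
}

int
main (void)
{
  GMainContext *context = g_main_context_default ();
  enter_select (context);
  /* g_main_context_query and pselect would run here.  */
  leave_select ();
  printf ("threads_in_select = %d\n", threads_in_select);
  return 0;
}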
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 15:54 ` Eli Zaretskii @ 2019-07-13 15:57 ` Pip Cet 2019-07-13 16:02 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-13 15:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza On Sat, Jul 13, 2019 at 3:54 PM Eli Zaretskii <eliz@gnu.org> wrote: > > But I still would like to understand why we need to maintain a flag > > per thread. We don't. I had originally planned to make reacquiring the lock fail gracefully if we can't spin-lock, but that didn't make it into the patch, so it should be simplified as you suggest. > Specifically, why not make this flag thread-independent, and have any > thread release the context lock if that variable is non-zero? I don't > think it matters which thread locked the Glib context, we just need to > release it once we emerge from thread_select, and do it before we > might call Fsignal. That the lock might be released by a thread other > than the one which locked it shouldn't matter, I think, as long as we > consider all the Lisp threads acting together on behalf of Emacs. I agree entirely. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 15:57 ` Pip Cet @ 2019-07-13 16:02 ` Eli Zaretskii 2019-07-13 18:17 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-13 16:02 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Sat, 13 Jul 2019 15:57:17 +0000 > Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de > > > Specifically, why not make this flag thread-independent, and have any > > thread release the context lock if that variable is non-zero? I don't > > think it matters which thread locked the Glib context, we just need to > > release it once we emerge from thread_select, and do it before we > > might call Fsignal. That the lock might be released by a thread other > > than the one which locked it shouldn't matter, I think, as long as we > > consider all the Lisp threads acting together on behalf of Emacs. > > I agree entirely. Great, then I think we are in agreement. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 16:02 ` Eli Zaretskii @ 2019-07-13 18:17 ` Pip Cet 2020-08-21 12:57 ` Lars Ingebrigtsen 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-13 18:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza [-- Attachment #1: Type: text/plain, Size: 315 bytes --] Patch attached. It's possible this'll livelock certain glib libraries that want to modify the glib main context while another thread is running. I don't see a good way to fix that, but a bad way would be to apply some heuristic (such as a 10 ms delay between calling select() and assuming it actually was called). [-- Attachment #2: 0001-Fix-lock-failures-in-xg_select-bug-36609.patch --] [-- Type: text/x-patch, Size: 4691 bytes --] From d99222b1cf5b63e8361e9a7cdbe13faafb40c1d5 Mon Sep 17 00:00:00 2001 From: Pip Cet <pipcet@gmail.com> Date: Sat, 13 Jul 2019 18:09:38 +0000 Subject: [PATCH] Fix lock failures in xg_select (bug#36609) * src/xgselect.c (release_select_lock, acquire_select_lock): Introduce. (xg_select): Use `acquire_select_lock', `release_select_lock'. * src/thread.c (release_select_lock): Introduce for non-GLib builds. (really_call_select): Call `release_select_lock'. Simplify by ensuring acquisition of the lock always succeeds. --- src/thread.c | 8 ++++++++ src/xgselect.c | 42 ++++++++++++++++++++++++++++-------------- src/xgselect.h | 2 ++ 3 files changed, 38 insertions(+), 14 deletions(-) diff --git a/src/thread.c b/src/thread.c index e2deadd7a8..40a25fb5cd 100644 --- a/src/thread.c +++ b/src/thread.c @@ -28,6 +28,12 @@ Copyright (C) 2012-2019 Free Software Foundation, Inc. #include "pdumper.h" #include "keyboard.h" +#ifdef HAVE_GLIB +#include <xgselect.h> +#else +#define release_select_lock() do { } while (0) +#endif + union aligned_thread_state { struct thread_state s; @@ -586,6 +592,8 @@ really_call_select (void *arg) sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds, sa->timeout, sa->sigmask); + release_select_lock (); + block_interrupt_signal (&oldset); /* If we were interrupted by C-g while inside sa->func above, the signal handler could have called maybe_reacquire_global_lock, in diff --git a/src/xgselect.c b/src/xgselect.c index 9982a1f0e9..82f0a61882 100644 --- a/src/xgselect.c +++ b/src/xgselect.c @@ -29,6 +29,27 @@ Copyright (C) 2009-2019 Free Software Foundation, Inc. #include "blockinput.h" #include "systime.h" +static ptrdiff_t threads_holding_glib_lock; +static GMainContext *glib_main_context; + +void release_select_lock (void) +{ + if (--threads_holding_glib_lock == 0) + g_main_context_release (glib_main_context); +} + +static void acquire_select_lock (GMainContext *context) +{ + if (threads_holding_glib_lock++ == 0) + { + glib_main_context = context; + while (!g_main_context_acquire (context)) + { + /* Spin. */ + } + } +} + /* `xg_select' is a `pselect' replacement. Why do we need a separate function? 1. Timeouts. Glib and Gtk rely on timer events. 
If we did pselect with a greater timeout then the one scheduled by Glib, we would @@ -54,26 +75,19 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, GPollFD *gfds = gfds_buf; int gfds_size = ARRAYELTS (gfds_buf); int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1; - bool context_acquired = false; int i, nfds, tmo_in_millisec, must_free = 0; bool need_to_dispatch; context = g_main_context_default (); - context_acquired = g_main_context_acquire (context); - /* FIXME: If we couldn't acquire the context, we just silently proceed - because this function handles more than just glib file descriptors. - Note that, as implemented, this failure is completely silent: there is - no feedback to the caller. */ + acquire_select_lock (context); if (rfds) all_rfds = *rfds; else FD_ZERO (&all_rfds); if (wfds) all_wfds = *wfds; else FD_ZERO (&all_wfds); - n_gfds = (context_acquired - ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec, - gfds, gfds_size) - : -1); + n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec, + gfds, gfds_size); if (gfds_size < n_gfds) { @@ -151,8 +165,10 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, #else need_to_dispatch = true; #endif - if (need_to_dispatch && context_acquired) + if (need_to_dispatch) { + acquire_select_lock (context); + int pselect_errno = errno; /* Prevent g_main_dispatch recursion, that would occur without block_input wrapper, because event handlers call @@ -162,11 +178,9 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, g_main_context_dispatch (context); unblock_input (); errno = pselect_errno; + release_select_lock (); } - if (context_acquired) - g_main_context_release (context); - /* To not have to recalculate timeout, return like this. */ if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0)) { diff --git a/src/xgselect.h b/src/xgselect.h index 92c79c5f64..0d5c96cd21 100644 --- a/src/xgselect.h +++ b/src/xgselect.h @@ -29,4 +29,6 @@ #define XGSELECT_H fd_set *rfds, fd_set *wfds, fd_set *efds, struct timespec *timeout, sigset_t *sigmask); +extern void release_select_lock (void); + #endif /* XGSELECT_H */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 18:17 ` Pip Cet @ 2020-08-21 12:57 ` Lars Ingebrigtsen 0 siblings, 0 replies; 53+ messages in thread From: Lars Ingebrigtsen @ 2020-08-21 12:57 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza Pip Cet <pipcet@gmail.com> writes: > Patch attached. > > It's possible this'll livelock certain glib libraries that want to > modify the glib main context while another thread is running. I don't > see a good way to fix that, but a bad way would be to apply some > heuristic (such as a 10 ms delay between calling select() and assuming > it actually was called). Skimming this thread, it seems that Eli was in agreement with applying it, and Andreas has tried it extensively and it fixes the problem, so I've applied it to Emacs 28 now. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-13 14:37 ` Pip Cet 2019-07-13 15:03 ` Eli Zaretskii @ 2019-07-13 15:04 ` Andreas Politz 1 sibling, 0 replies; 53+ messages in thread From: Andreas Politz @ 2019-07-13 15:04 UTC (permalink / raw) To: Pip Cet; +Cc: 36609 I ran my problematic code a couple thousand times on this patch without freezes, while producing the expected results. Thanks for figuring this out. Andreas ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 9:02 ` Pip Cet 2019-07-12 12:42 ` Eli Zaretskii @ 2019-07-12 12:44 ` Pip Cet 2019-07-12 12:55 ` Eli Zaretskii 1 sibling, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 12:44 UTC (permalink / raw) To: Andreas Politz; +Cc: 36609 [-- Attachment #1: Type: text/plain, Size: 931 bytes --] On Fri, Jul 12, 2019 at 9:02 AM Pip Cet <pipcet@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz > <politza@hochschule-trier.de> wrote: > > I think there is a race-condition in the implementation of threads. I > > tried to find a minimal test-case, without success. Thus, I've attached > > a lengthy source-file. Loading that file should trigger this bug and > > may freeze your session. > > It does here, so I can provide further debugging information if > needed. On first glance, it appears that xgselect returns abnormally > with g_main_context acquired in one thread, and then other threads > fail to acquire it and loop endlessly. Okay, I'm still not sure this is really the problem Andreas was seeing, but this code fails to work with xg_select: (let ((thread (make-thread (lambda () (sleep-for 3))))) (thread-yield) (thread-signal thread 'error nil)) Proposed patch attached. [-- Attachment #2: 0001-Protect-against-abnormal-exit-in-xg_select.patch --] [-- Type: text/x-patch, Size: 2239 bytes --] From 012341a8a674b130c335cfdd8d24888406e98ece Mon Sep 17 00:00:00 2001 From: Pip Cet <pipcet@gmail.com> Date: Fri, 12 Jul 2019 12:39:29 +0000 Subject: [PATCH] Protect against abnormal exit in `xg_select' * src/xgselect.c (release_g_main_context): New function. (xg_select): Unwind-protect release of the GLib main context. --- src/xgselect.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/src/xgselect.c b/src/xgselect.c index 9982a1f0e9..e35310be34 100644 --- a/src/xgselect.c +++ b/src/xgselect.c @@ -29,6 +29,15 @@ Copyright (C) 2009-2019 Free Software Foundation, Inc. #include "blockinput.h" #include "systime.h" +/* Helper function for `unwind_protect'. */ + +static void release_g_main_context (void *ptr) +{ + GMainContext *context = (GMainContext *) ptr; + + g_main_context_release (context); +} + /* `xg_select' is a `pselect' replacement. Why do we need a separate function? 1. Timeouts. Glib and Gtk rely on timer events. If we did pselect with a greater timeout then the one scheduled by Glib, we would @@ -58,8 +67,12 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, int i, nfds, tmo_in_millisec, must_free = 0; bool need_to_dispatch; + ptrdiff_t count = SPECPDL_INDEX (); context = g_main_context_default (); context_acquired = g_main_context_acquire (context); + + if (context_acquired) + record_unwind_protect_ptr (release_g_main_context, (void *)context); /* FIXME: If we couldn't acquire the context, we just silently proceed because this function handles more than just glib file descriptors. Note that, as implemented, this failure is completely silent: there is @@ -164,9 +177,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, errno = pselect_errno; } - if (context_acquired) - g_main_context_release (context); - /* To not have to recalculate timeout, return like this. 
*/ if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0)) { @@ -174,6 +184,8 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, errno = EINTR; } + unbind_to (count, Qnil); + return retval; } #endif /* HAVE_GLIB */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 12:44 ` Pip Cet @ 2019-07-12 12:55 ` Eli Zaretskii 2019-07-12 13:40 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 12:55 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 12:44:35 +0000 > Cc: 36609@debbugs.gnu.org > > Okay, I'm still not sure this is really the problem Andreas was > seeing, but this code fails to work with xg_select: > > (let ((thread (make-thread (lambda () > (sleep-for 3))))) > (thread-yield) > (thread-signal thread 'error nil)) What do you mean by "fails"? What do you see when you eval this? > Proposed patch attached. As I said earlier, I don't think it's right to use stack unwinding at this level. (Did you try this in a -nw session?) I'd appreciate a description of what you think happens here, and why. Thanks. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 12:55 ` Eli Zaretskii @ 2019-07-12 13:40 ` Pip Cet 2019-07-12 13:51 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 13:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza On Fri, Jul 12, 2019 at 12:55 PM Eli Zaretskii <eliz@gnu.org> wrote: > > From: Pip Cet <pipcet@gmail.com> > > Date: Fri, 12 Jul 2019 12:44:35 +0000 > > Cc: 36609@debbugs.gnu.org > > > > Okay, I'm still not sure this is really the problem Andreas was > > seeing, but this code fails to work with xg_select: > > > > (let ((thread (make-thread (lambda () > > (sleep-for 3))))) > > (thread-yield) > > (thread-signal thread 'error nil)) > > What do you mean by "fails"? What do you see when you eval this? With `emacs -Q', a blinking cursor and an unresponsive emacs (no response to key presses, not even triple C-g). Sometimes, at least. Here's a full backtrace of a session where the hang did happen: Thread 4 (Thread 0x7fffeeb8a700 (LWP 26117)): #0 0x00007ffff4967819 in __GI___poll (fds=0xd93070, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 resultvar = 18446744073709551100 sc_cancel_oldtype = 0 #1 0x00007ffff698e136 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #2 0x00007ffff698e25c in g_main_context_iteration () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #3 0x00007fffeebd7ffd in () at /usr/lib/x86_64-linux-gnu/gio/modules/libdconfsettings.so #4 0x00007ffff69b6415 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #5 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486 ret = <optimized out> pd = <optimized out> now = <optimized out> unwind_buf = { cancel_jmp_buf = {{ jmp_buf = {140737198466816, -2332898995959237690, 140737488337566, 140737488337567, 140737198466816, 13933232, 2332932536188793798, 2332875456844002246}, mask_was_saved = 0 }}, priv = { pad = {0x0, 0x0, 0x0, 0x0}, data = { prev = 0x0, cleanup = 0x0, canceltype = 0 } } } not_first_call = <optimized out> #6 0x00007ffff49724cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 3 (Thread 0x7fffef59e700 (LWP 26116)): #0 0x00007ffff4967819 in __GI___poll (fds=0xcb7a20, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 resultvar = 18446744073709551100 sc_cancel_oldtype = 0 #1 0x00007ffff698e136 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #2 0x00007ffff698e4c2 in g_main_loop_run () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #3 0x00007ffff6bba0d6 in () at /lib/x86_64-linux-gnu/libgio-2.0.so.0 #4 0x00007ffff69b6415 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #5 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486 ret = <optimized out> pd = <optimized out> now = <optimized out> unwind_buf = { cancel_jmp_buf = {{ jmp_buf = {140737209034496, -2332898995959237690, 140737488336926, 140737488336927, 140737209034496, 13323152, 2332932821804118982, 2332875456844002246}, mask_was_saved = 0 }}, priv = { pad = {0x0, 0x0, 0x0, 0x0}, data = { prev = 0x0, cleanup = 0x0, canceltype = 0 } } } not_first_call = <optimized out> #6 0x00007ffff49724cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 2 (Thread 0x7ffff017d700 (LWP 26115)): #0 0x00007ffff4967819 in __GI___poll (fds=0xb400d0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 resultvar = 18446744073709551100 sc_cancel_oldtype = 0 #1 0x00007ffff698e136 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #2 0x00007ffff698e25c in g_main_context_iteration () at 
/lib/x86_64-linux-gnu/libglib-2.0.so.0 #3 0x00007ffff698e2a1 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #4 0x00007ffff69b6415 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0 #5 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486 ret = <optimized out> pd = <optimized out> now = <optimized out> unwind_buf = { cancel_jmp_buf = {{ jmp_buf = {140737221482240, -2332898995959237690, 140737488347710, 140737488347711, 140737221482240, 0, 2332865253915489222, 2332875456844002246}, mask_was_saved = 0 }}, priv = { pad = {0x0, 0x0, 0x0, 0x0}, data = { prev = 0x0, cleanup = 0x0, canceltype = 0 } } } not_first_call = <optimized out> #6 0x00007ffff49724cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 1 (Thread 0x7ffff0e30e00 (LWP 26108)): #0 0x00007ffff4b404c6 in pthread_sigmask (how=2, newmask=<optimized out>, oldmask=0x0) at ../sysdeps/unix/sysv/linux/pthread_sigmask.c:48 local_newmask = { __val = {7, 17697270867008781056, 11498208, 5790296, 0, 140737488342800, 2, 0, 0, 0, 0, 0, 0, 0, 0, 6922490} } result = 0 #1 0x0000000000585a7d in restore_signal_mask (oldset=0x7fffffffcf10) at sysdep.c:876 #2 0x0000000000699292 in really_call_select (arg=0x7fffffffd060) at thread.c:584 sa = 0x7fffffffd060 self = 0xa745c0 <main_thread> oldset = { __val = {0, 5710159, 8192, 126116988112, 140737488344608, 140737298839344, 7, 0, 11381952, 0, 8192, 1562936528, 140737488342720, 5968906924941129, 0, 16538645} } #3 0x00000000005e5ee0 in flush_stack_call_func (func=0x699239 <really_call_select>, arg=0x7fffffffd060) at alloc.c:4969 end = 0x7fffffffd020 self = 0xa745c0 <main_thread> sentry = { o = { __max_align_ll = 0, __max_align_ld = <invalid float value> } } #4 0x0000000000699389 in thread_select (func=0x419320 <pselect@plt>, max_fds=6, rfds=0x7fffffffd590, wfds=0x7fffffffd510, efds=0x0, timeout=0x7fffffffd840, sigmask=0x0) at thread.c:616 sa = { func = 0x419320 <pselect@plt>, max_fds = 6, rfds = 0x7fffffffd590, wfds = 0x7fffffffd510, efds = 0x0, timeout = 0x7fffffffd840, sigmask = 0x0, result = -157757095 } #5 0x00000000006bfeac in xg_select (fds_lim=6, rfds=0x7fffffffd8e0, wfds=0x7fffffffd860, efds=0x0, timeout=0x7fffffffd840, sigmask=0x0) at xgselect.c:117 all_rfds = { fds_bits = {32, 0 <repeats 15 times>} } all_wfds = { fds_bits = {0 <repeats 16 times>} } tmo = { tv_sec = 5621546, tv_nsec = 15785445 } tmop = 0x7fffffffd840 context = 0xc1c200 have_wfds = true gfds_buf = {{ fd = 895, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 8096, events = 65535, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 16, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 } <repeats 11 times>, { fd = -2057698731, events = 39706, revents = 28662 }, { fd = 3, events = 0, revents = 0 }, { fd = -2057698731, events = 39706, revents = 28662 }, { fd = -161788823, events = 32767, revents = 0 }, { fd = -2057698731, events = 39642, revents = 28662 }, { fd = -190624704, events = 32767, revents = 0 }, { fd = 16, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -664, events = 65535, revents = 65535 }, { fd = -504530176, events = 22038, revents = 62873 }, { fd = 12553216, events = 0, revents = 0 }, { fd = -1252392363, events = 44399, revents = 28662 }, { fd = 32, events = 0, revents = 0 }, { fd = 0, events = 0, revents 
= 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 12553216, events = 0, revents = 0 }, { fd = -191458327, events = 32767, revents = 0 }, { fd = -190624704, events = 32767, revents = 0 }, { fd = 139264, events = 0, revents = 0 }, { fd = 17472, events = 0, revents = 0 }, { fd = -191893943, events = 32767, revents = 0 }, { fd = 128, events = 0, revents = 0 }, { fd = -191909562, events = 32767, revents = 0 }, { fd = 48, events = 0, revents = 0 }, { fd = 1, events = 0, revents = 0 }, { fd = 7, events = 0, revents = 0 }, { fd = 64, events = 0, revents = 0 }, { fd = 4, events = 32767, revents = 0 }, { fd = 11649056, events = 0, revents = 0 }, { fd = 47, events = 0, revents = 0 }, { fd = 96, events = 0, revents = 0 }, { fd = -664, events = 65535, revents = 65535 }, { fd = 1, events = 0, revents = 0 }, { fd = 4, events = 49, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 91, events = 110, revents = 0 }, { fd = 124, events = 119, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 17456, events = 0, revents = 0 }, { fd = -190624704, events = 32767, revents = 0 }, { fd = 48, events = 0, revents = 0 }, { fd = 2, events = 0, revents = 0 }, { fd = -664, events = 65535, revents = 65535 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -191900310, events = 32767, revents = 0 }, { fd = -664, events = 65535, revents = 65535 }, { fd = 0, events = 0, revents = 0 }, { fd = 48, events = 0, revents = 0 }, { fd = 187264016, events = 0, revents = 0 }, { fd = 11498208, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -11392, events = 32767, revents = 0 }, { fd = 5621546, events = 0, revents = 0 }, { fd = 187264016, events = 0, revents = 0 }, { fd = -11360, events = 32767, revents = 0 }, { fd = 187280083, events = 0, revents = 0 }, { fd = -11344, events = 32767, revents = 0 }, { fd = 5622263, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 187280083, events = 0, revents = 0 }, { fd = -11280, events = 32767, revents = 0 }, { fd = 6170867, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 5627325, events = 0, revents = 0 }, { fd = 6, events = 0, revents = 0 }, { fd = 1, events = 0, revents = 0 }, { fd = 0, events = 1, revents = 0 }, { fd = 11498208, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -11232, events = 32767, revents = 0 }, { fd = 5621546, events = 0, revents = 0 }, { fd = -134398935, events = 0, revents = 0 }, { fd = -11200, events = 32767, revents = 0 }, { fd = 187280099, events = 0, revents = 0 }, { fd = -11184, events = 32767, revents = 0 }, { fd = 5622263, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 187280099, events = 0, revents = 0 }, { fd = -11120, events = 32767, revents = 0 }, { fd = 6170867, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 5627325, events = 0, revents = 0 }, { fd = 6, events = 0, revents = 0 }, { fd = 1, events = 0, revents = 0 }, { fd = 0, events = 1, revents = 0 }, { fd = 11498208, events = 0, revents = 0 }, { fd = 187280099, events = 0, revents = 0 }, { fd = -11072, events = 32767, revents = 0 }, { fd = 5622263, events = 0, revents = 0 }, { fd = -11024, events = 32767, revents = 0 }, { fd = -134398935, events = 32767, revents = 0 }, { fd = -10944, events = 32767, revents 
= 0 }, { fd = 15785445, events = 0, revents = 0 }, { fd = -11016, events = 32767, revents = 0 }, { fd = 5623097, events = 0, revents = 0 }, { fd = 11498208, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = 0, events = 0, revents = 0 }, { fd = -10992, events = 32767, revents = 0 }} gfds = 0x7fffffffd100 gfds_size = 128 n_gfds = -1 retval = 0 our_fds = 0 max_fds = 5 context_acquired = false i = 0 nfds = 0 tmo_in_millisec = 0 must_free = 0 need_to_dispatch = false #6 0x000000000066b757 in wait_reading_process_output (time_limit=0, nsecs=0, read_kbd=-1, do_display=true, wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at process.c:5423 process_skipped = false channel = 6 nfds = 1 Available = { fds_bits = {32, 0 <repeats 15 times>} } Writeok = { fds_bits = {0 <repeats 16 times>} } check_write = true check_delay = 0 no_avail = false xerrno = 11 proc = XIL(0xcf7c00) timeout = { tv_sec = 100000, tv_nsec = 0 } end_time = { tv_sec = 0, tv_nsec = 140737488345168 } timer_delay = { tv_sec = 0, tv_nsec = -1 } got_output_end_time = { tv_sec = 0, tv_nsec = -1 } wait = FOREVER got_some_output = -1 prev_wait_proc_nbytes_read = 0 retry_for_async = false count = 4 now = { tv_sec = 0, tv_nsec = -1 } #7 0x000000000056c31e in kbd_buffer_get_event (kbp=0x7fffffffdbd8, used_mouse_menu=0x7fffffffe1bf, end_time=0x0) at keyboard.c:3836 do_display = true obj = XIL(0x14bfd3fb) #8 0x0000000000568706 in read_event_from_main_queue (end_time=0x0, local_getcjmp=0x7fffffffdf80, used_mouse_menu=0x7fffffffe1bf) at keyboard.c:2138 c = XIL(0) save_jump = {{ __jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = { __val = {0 <repeats 16 times>} } }} kb = 0x55c9f7 <XSETCDR+28> count = 3 #9 0x0000000000568970 in read_decoded_event_from_main_queue (end_time=0x0, local_getcjmp=0x7fffffffdf80, prev_event=XIL(0), used_mouse_menu=0x7fffffffe1bf) at keyboard.c:2202 nextevt = XIL(0x7fffffffde30) frame = 0x0 terminal = 0x0 events = {XIL(0x7fffffffddb0), make_fixnum(1422534), XIL(0xaf72e0), XIL(0), XIL(0), XIL(0x7fffffffddb0), make_fixnum(1405386), XIL(0x135b623), XIL(0x7fffffffddf0), make_fixnum(1420782), XIL(0xaf72e0), XIL(0x100000000), XIL(0), XIL(0x7fffffffddf0), make_fixnum(1405386), XIL(0)} n = 0 #10 0x000000000056a002 in read_char (commandflag=1, map=XIL(0xfbaa33), prev_event=XIL(0), used_mouse_menu=0x7fffffffe1bf, end_time=0x0) at keyboard.c:2810 c = XIL(0) jmpcount = 3 local_getcjmp = {{ __jmpbuf = {0, 2332898996579464134, 16538645, 48, 0, 0, 2332898994838827974, -2332899704998988858}, __mask_was_saved = 0, __saved_mask = { __val = {140737214405912, 0, 140737225904168, 11498208, 0, 0, 140737488347152, 5621546, 4032516088, 11498208, 0, 0, 140737488347200, 5621546, 12363987, 140737488347296} } }} save_jump = {{ __jmpbuf = {0, 0, 140737224472304, 0, 0, 11529984, 5621546, 0}, __mask_was_saved = -8336, __saved_mask = { __val = {6254137, 140737231486936, 12884893472, 0, 31776, 140737224472304, 140737231486936, 5623453, 51539607552, 140737224472309, 140737224472304, 5632971, 11529984, 11529984, 0, 11498208} } }} tem = XIL(0x7fffefa759f0) save = XIL(0) previous_echo_area_message = XIL(0) also_record = XIL(0) reread = false recorded = false polling_stopped_here = true orig_kboard = 0xc863c0 #11 0x0000000000576842 in read_key_sequence (keybuf=0x7fffffffe3a0, prompt=XIL(0), dont_downcase_last=false, can_return_switch_frame=true, fix_current_buffer=true, prevent_redisplay=false) at keyboard.c:9124 interrupted_kboard = 0xc863c0 interrupted_frame = 0xf976a0 key = XIL(0xf97ac0) used_mouse_menu = 
false echo_local_start = 0 last_real_key_start = 0 keys_local_start = 0 new_binding = XIL(0xf97ac0) count = 3 t = 0 echo_start = 0 keys_start = 0 current_binding = XIL(0xfbaa33) first_unbound = 31 mock_input = 0 used_mouse_menu_history = {false <repeats 30 times>} fkey = { parent = XIL(0xbe3cd3), map = XIL(0xbe3cd3), start = 0, end = 0 } keytran = { parent = XIL(0x7ffff0ae0eeb), map = XIL(0x7ffff0ae0eeb), start = 0, end = 0 } indec = { parent = XIL(0xbe3cc3), map = XIL(0xbe3cc3), start = 0, end = 0 } shift_translated = false delayed_switch_frame = XIL(0) original_uppercase = XIL(0) original_uppercase_position = -1 dummyflag = false starting_buffer = 0x7ffff04576f0 fake_prefixed_keys = XIL(0) first_event = XIL(0) second_event = XIL(0) #12 0x00000000005667ae in command_loop_1 () at keyboard.c:1348 cmd = XIL(0x7fffffffe4f0) keybuf = {XIL(0x7fffffffe420), XIL(0x7ffff0b08008), XIL(0x100000007), XIL(0), XIL(0), XIL(0x80a0), XIL(0x11f), XIL(0xaff380), XIL(0xaff380), XIL(0), XIL(0x7fffffffe440), XIL(0x617021), make_fixnum(1073741824), XIL(0x7fffffffe460), XIL(0xaf72e0), XIL(0), XIL(0), XIL(0x7fffffffe440), make_fixnum(1405386), XIL(0xf04576f5), XIL(0x7fffffffe4a0), XIL(0x61729f), XIL(0xaf72e0), XIL(0), XIL(0), XIL(0x7fffffffe480), make_fixnum(1405386), XIL(0xf04576f5), XIL(0x7fffffffe4c0), make_fixnum(1591385)} i = 0 prev_modiff = 0 prev_buffer = 0x0 already_adjusted = false #13 0x0000000000611d61 in internal_condition_case (bfun=0x566384 <command_loop_1>, handlers=XIL(0x90), hfun=0x565b7d <cmd_error>) at eval.c:1351 val = make_fixnum(1405386) c = 0xb99e70 #14 0x000000000056607a in command_loop_2 (ignore=XIL(0)) at keyboard.c:1091 val = XIL(0) #15 0x000000000061161a in internal_catch (tag=XIL(0xd140), func=0x566051 <command_loop_2>, arg=XIL(0)) at eval.c:1112 val = XIL(0x7fffffffe5c0) c = 0xb94fd0 #16 0x000000000056601c in command_loop () at keyboard.c:1070 #17 0x0000000000565752 in recursive_edit_1 () at keyboard.c:714 count = 1 val = XIL(0x7fffffffe620) #18 0x00000000005658d4 in Frecursive_edit () at keyboard.c:786 count = 0 buffer = XIL(0) #19 0x0000000000563961 in main (argc=4, argv=0x7fffffffe878) at emacs.c:2086 stack_bottom_variable = 0x5e00 do_initial_setlocale = true no_loadup = false junk = 0x0 dname_arg = 0x0 ch_to_dir = 0x0 original_pwd = 0x0 dump_mode = 0x0 skip_args = 0 temacs = 0x0 attempt_load_pdump = true rlim = { rlim_cur = 10022912, rlim_max = 18446744073709551615 } sockfd = -1 i thr Id Target Id Frame * 1 Thread 0x7ffff0e30e00 (LWP 26108) "emacs" pthread_sigmask (how=0, newmask=<optimized out>, oldmask=0x7fffffffcf10) at ../sysdeps/unix/sysv/linux/pthread_sigmask.c:48 2 Thread 0x7ffff017d700 (LWP 26115) "gmain" 0x00007ffff4967819 in __GI___poll (fds=0xb400d0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 3 Thread 0x7fffef59e700 (LWP 26116) "gdbus" 0x00007ffff4967819 in __GI___poll (fds=0xcb7a20, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 4 Thread 0x7fffeeb8a700 (LWP 26117) "dconf worker" 0x00007ffff4967819 in __GI___poll (fds=0xd93070, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 > > Proposed patch attached. > > As I said earlier, I don't think it's right to use stack unwinding at > this level. (Did you try this in a -nw session?) Sorry, I didn't see that email until after I'd sent mine. I did try with an -nw session, now, but I really think thread_select should not return abnormally. > I'd appreciate a description of what you think happens here, and why. 
When a thread is signalled (by thread-signal, which sets another thread's error_symbol) while the signalled thread is inside a select(), thread_select() will return non-locally for that thread, and we fail to release an internal GLib lock through g_main_context_release(). That's the first bug. When xg_select() fails to acquire the internal GLib lock, it simply does a select() on the remaining file descriptors: context_acquired = g_main_context_acquire (context); /* FIXME: If we couldn't acquire the context, we just silently proceed because this function handles more than just glib file descriptors. Note that, as implemented, this failure is completely silent: there is no feedback to the caller. */ This seems like a second, albeit documented, bug to me. I think we're risking not waking up from the actual select because another (non-Emacs) thread happened to hold the main context at the time. ^ permalink raw reply [flat|nested] 53+ messages in thread
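The first bug can be seen in a self-contained way, with no Emacs or GLib involved: cleanup code that follows a blocking call is simply never reached when the stack is unwound non-locally, which is what happens to xg_select's g_main_context_release when the signalled thread's pending error throws past it. A minimal sketch using plain setjmp/longjmp (an analogy, not the Emacs code):

#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>

static jmp_buf unwind_target;
static bool context_held;

static void
blocking_call (void)
{
  /* Stands in for pselect returning with a thread-signal pending; the
     error then unwinds the stack non-locally.  */
  longjmp (unwind_target, 1);
}

int
main (void)
{
  if (setjmp (unwind_target) == 0)
    {
      context_held = true;      /* like g_main_context_acquire */
      blocking_call ();
      context_held = false;     /* like g_main_context_release: skipped */
    }
  printf ("context still held after unwind: %s\n",
          context_held ? "yes" : "no");
  return 0;
}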
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 13:40 ` Pip Cet @ 2019-07-12 13:51 ` Eli Zaretskii 2019-07-12 14:34 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 13:51 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 13:40:15 +0000 > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > When a thread is signalled (by thread-signal, which sets another > thread's error_symbol) while the signalled thread is inside a > select(), thread_select() will return non-locally for that thread, and > we fail to release an internal GLib lock through > g_main_context_release(). That's the first bug. We should either release the global lock before the thread exits, or defer the acting upon the signal until later. We cannot disable the signal handling altogether because it is entirely legitimate to signal another thread, and when we do, that other thread will _always_ be inside thread_select. For the main thread, handling the signal in that situation shouldn't be a problem, because it is not going to exit. Right? > When xg_select() fails to acquire the internal GLib lock, it simply > does a select() on the remaining file descriptors: Why does it fail to acquire that lock? > context_acquired = g_main_context_acquire (context); > /* FIXME: If we couldn't acquire the context, we just silently proceed > because this function handles more than just glib file descriptors. > Note that, as implemented, this failure is completely silent: there is > no feedback to the caller. */ > > This seems like a second, albeit documented, bug to me. I think we're > risking not waking up from the actual select because another > (non-Emacs) thread happened to hold the main context at the time. So what is the proposal for that? spin waiting for the lock? ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 13:51 ` Eli Zaretskii @ 2019-07-12 14:34 ` Pip Cet 2019-07-12 15:02 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 14:34 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza On Fri, Jul 12, 2019 at 1:51 PM Eli Zaretskii <eliz@gnu.org> wrote: > > From: Pip Cet <pipcet@gmail.com> > > Date: Fri, 12 Jul 2019 13:40:15 +0000 > > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > > > When a thread is signalled (by thread-signal, which sets another > > thread's error_symbol) while the signalled thread is inside a > > select(), thread_select() will return non-locally for that thread, and > > we fail to release an internal GLib lock through > > g_main_context_release(). That's the first bug. > > We should either release the global lock before the thread exits, or > defer the acting upon the signal until later. We cannot disable the > signal handling altogether because it is entirely legitimate to signal > another thread, and when we do, that other thread will _always_ be > inside thread_select. Really? What about thread-yield? > For the main thread, handling the signal in that situation shouldn't > be a problem, because it is not going to exit. Right? I think the main thread can still fail to release the lock... > > When xg_select() fails to acquire the internal GLib lock, it simply > > does a select() on the remaining file descriptors: > > Why does it fail to acquire that lock? Because another thread holds it, either an Emacs or a non-Emacs thread. In both cases, I think we might miss events unless we return with errno == EINTR. > > context_acquired = g_main_context_acquire (context) > > /* FIXME: If we couldn't acquire the context, we just silently proceed > > because this function handles more than just glib file descriptors. > > Note that, as implemented, this failure is completely silent: there is > > no feedback to the caller. */ > > > > This seems like a second, albeit documented, bug to me. I think we're > > risking not waking up from the actual select because another > > (non-Emacs) thread happened to hold the main context at the time. > > So what is the proposal for that? spin waiting for the lock? I don't know, to be honest, but I'm afraid there's currently no better way. There's a way to take the lock without spinning, but it's been deprecated. ^ permalink raw reply [flat|nested] 53+ messages in thread
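The deprecated, non-spinning route Pip presumably alludes to (the thread does not name it) is g_main_context_wait, which blocks on a caller-supplied GCond/GMutex pair until ownership of the context can be taken; GLib deprecated it in 2.58. Under that assumption, a sketch of its use, single-threaded here so it acquires immediately:

#include <glib.h>

int
main (void)
{
  GMainContext *context = g_main_context_default ();
  GMutex mutex;
  GCond cond;

  g_mutex_init (&mutex);
  g_cond_init (&cond);

  g_mutex_lock (&mutex);
  /* Blocks (dropping MUTEX) until the context can be owned, instead of
     spinning on g_main_context_acquire.  Deprecated since GLib 2.58.  */
  if (g_main_context_wait (context, &cond, &mutex))
    g_main_context_release (context);
  g_mutex_unlock (&mutex);

  g_cond_clear (&cond);
  g_mutex_clear (&mutex);
  return 0;
}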
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 14:34 ` Pip Cet @ 2019-07-12 15:02 ` Eli Zaretskii 2019-07-12 19:30 ` Pip Cet 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2019-07-12 15:02 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 14:34:44 +0000 > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > > > When a thread is signalled (by thread-signal, which sets another > > > thread's error_symbol) while the signalled thread is inside a > > > select(), thread_select() will return non-locally for that thread, and > > > we fail to release an internal GLib lock through > > > g_main_context_release(). That's the first bug. > > > > We should either release the global lock before the thread exits, or > > defer the acting upon the signal until later. We cannot disable the > > signal handling altogether because it is entirely legitimate to signal > > another thread, and when we do, that other thread will _always_ be > > inside thread_select. > > Really? What about thread-yield? What about it? You are asking whether, when thread-signal is executed, the thread which we are signaling is necessarily parked inside thread_select? If so, I don't understand your surprise: only one thread can ever be running, and that is by definition the thread which calls thread-signal. All the other threads cannot be running, which means they are parked either in thread_select or in sys_mutex_lock called from acquire_global_lock. Right? As for thread-yield, I'm not sure I understand how is it related to the issue we are discussing. > > For the main thread, handling the signal in that situation shouldn't > > be a problem, because it is not going to exit. Right? > > I think the main thread can still fail to release the lock... Since the main thread doesn't exit, but just longjmp's to top-level, and then continues to run, why does it need to release the lock? Any thread should hold the lock for as long as it runs. > > > When xg_select() fails to acquire the internal GLib lock, it simply > > > does a select() on the remaining file descriptors: > > > > Why does it fail to acquire that lock? > > Because another thread holds it, either an Emacs or a non-Emacs > thread. OK, and why is it a problem that we continue calling thread_select regardless of the failure to acquire the Glib lock? is the problem this one: > In both cases, I think we might miss events unless we return > with errno == EINTR. ? Or is there something else? If the problem with missing events, then which events are those, and what bad things will happen if we miss them? ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 15:02 ` Eli Zaretskii @ 2019-07-12 19:30 ` Pip Cet 2019-07-13 6:50 ` Eli Zaretskii 0 siblings, 1 reply; 53+ messages in thread From: Pip Cet @ 2019-07-12 19:30 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 36609, politza > > > We should either release the global lock before the thread exits, or > > > defer the acting upon the signal until later. We cannot disable the > > > signal handling altogether because it is entirely legitimate to signal > > > another thread, and when we do, that other thread will _always_ be > > > inside thread_select. > > > > Really? What about thread-yield? > > What about it? > > You are asking whether, when thread-signal is executed, the thread > which we are signaling is necessarily parked inside thread_select? If > so, I don't understand your surprise: only one thread can ever be > running, and that is by definition the thread which calls > thread-signal. All the other threads cannot be running, which means > they are parked either in thread_select or in sys_mutex_lock called > from acquire_global_lock. Right? No, they might also be in the sys_thread_yield syscall, having released the global lock but not yet reacquired it: release_global_lock (); sys_thread_yield (); <<<<< here acquire_global_lock (self); > As for thread-yield, I'm not sure I understand how is it related to > the issue we are discussing. I'm not sure how it's relevant to assert that "that other thread will _always_ be inside thread_select". I have an idea where you might be going with that, but that idea wouldn't work (to release the lock from the signalling thread, not the signalled thread that holds it). > If the problem with missing events, > then which events are those, and what bad things will happen if we > miss them? All events that glib knows about but Emacs doesn't. For example, a glib timeout is apparently used to achieve some kind of scroll effect on GTK menus, which is why we call xg_select from xmenu.c. I don't know which libraries use glib-based threads, but I think dbus does, too. I believe, but am not certain, that this also includes X events when using GTK. That would explain the frozen sessions. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-12 19:30 ` Pip Cet @ 2019-07-13 6:50 ` Eli Zaretskii 0 siblings, 0 replies; 53+ messages in thread From: Eli Zaretskii @ 2019-07-13 6:50 UTC (permalink / raw) To: Pip Cet; +Cc: 36609, politza > From: Pip Cet <pipcet@gmail.com> > Date: Fri, 12 Jul 2019 19:30:34 +0000 > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org > > > > > We should either release the global lock before the thread exits, or > > > > defer the acting upon the signal until later. We cannot disable the > > > > signal handling altogether because it is entirely legitimate to signal > > > > another thread, and when we do, that other thread will _always_ be > > > > inside thread_select. > > > > > > Really? What about thread-yield? > > > > What about it? > > > > You are asking whether, when thread-signal is executed, the thread > > which we are signaling is necessarily parked inside thread_select? If > > so, I don't understand your surprise: only one thread can ever be > > running, and that is by definition the thread which calls > > thread-signal. All the other threads cannot be running, which means > > they are parked either in thread_select or in sys_mutex_lock called > > from acquire_global_lock. Right? > > No, they might also be in the sys_thread_yield syscall, having > released the global lock but not yet reacquired it: > > release_global_lock (); > sys_thread_yield (); <<<<< here > acquire_global_lock (self); OK, but that, too, means the thread being signaled is not running, right? And I still think that a very frequent case, perhaps the most frequent, is that the thread being signaled is inside thread_select. > I'm not sure how it's relevant to assert that "that other thread will > _always_ be inside thread_select". OK, we've now established that the other thread could also be in acquire_global_lock or (for a very short time) in sys_thread_yield. > I have an idea where you might be going with that I was merely pointing out that we cannot disable the signal handling as a means to solve the problem. > but that idea wouldn't work (to release the lock from the signalling > thread, not the signalled thread that holds it). Maybe we have a misunderstanding here. I was talking about this part of post_acquire_global_lock: /* We could have been signaled while waiting to grab the global lock for the first time since this thread was created, in which case we didn't yet have the opportunity to set up the handlers. Delay raising the signal in that case (it will be actually raised when the thread comes here after acquiring the lock the next time). */ if (!NILP (current_thread->error_symbol) && handlerlist) { Lisp_Object sym = current_thread->error_symbol; Lisp_Object data = current_thread->error_data; current_thread->error_symbol = Qnil; current_thread->error_data = Qnil; Fsignal (sym, data); } In this part, we have already switched to the thread that has been signaled, so we are in the signaled thread, not in the signaling thread. I meant to release the lock before the call to Fsignal here. > > If the problem with missing events, > > then which events are those, and what bad things will happen if we > > miss them? > > All events that glib knows about but Emacs doesn't. For example, a > glib timeout is apparently used to achieve some kind of scroll effect > on GTK menus, which is why we call xg_select from xmenu.c. > > I don't know which libraries use glib-based threads, but I think dbus does, too. 
> > I believe, but am not certain, that this also includes X events when > using GTK. That would explain the frozen sessions. So is the problem that the Glib context is locked "forever", or is the problem that it's locked by another Lisp thread, even if this lock is short-lived? If the former, then arranging for the release of that lock when the signaled thread exits would solve the problem, right? And if the problem is the latter one, then why didn't we hear about this much earlier? Can you show the bad effect from missing these events without signaling a thread? Thanks. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation 2019-07-11 20:51 bug#36609: 27.0.50; Possible race-condition in threading implementation Andreas Politz 2019-07-12 7:47 ` Eli Zaretskii 2019-07-12 9:02 ` Pip Cet @ 2021-06-06 15:50 ` dick.r.chiang [not found] ` <87fsxv8182.fsf@dick> 3 siblings, 0 replies; 53+ messages in thread From: dick.r.chiang @ 2021-06-06 15:50 UTC (permalink / raw) To: Andreas Politz; +Cc: 36609 [-- Attachment #1: Type: text/plain, Size: 53 bytes --] Four attachments: #1. Want to revert commit 9c62ffb [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Revert-Fix-lock-failures-in-xg_select.patch --] [-- Type: text/x-diff, Size: 4545 bytes --] From 01b8b847f5fc4b65eaac06c7c2d85f8c5410c497 Mon Sep 17 00:00:00 2001 From: dickmao <none> Date: Sun, 6 Jun 2021 11:10:13 -0400 Subject: [PATCH] Revert "Fix lock failures in xg_select" This reverts commit 9c62ffb08262c82b7e38e6eb5767f2087424aa47. * src/thread.c (really_call_select): revert 9c62ffb * src/xgselect.c (xg_select): revert 9c62ffb --- src/thread.c | 8 -------- src/xgselect.c | 42 ++++++++++++++---------------------------- src/xgselect.h | 2 -- 3 files changed, 14 insertions(+), 38 deletions(-) diff --git a/src/thread.c b/src/thread.c index f74f611148..609cd7c5fc 100644 --- a/src/thread.c +++ b/src/thread.c @@ -28,12 +28,6 @@ Copyright (C) 2012-2021 Free Software Foundation, Inc. #include "pdumper.h" #include "keyboard.h" -#if defined HAVE_GLIB && ! defined (HAVE_NS) -#include <xgselect.h> -#else -#define release_select_lock() do { } while (0) -#endif - union aligned_thread_state { struct thread_state s; @@ -592,8 +586,6 @@ really_call_select (void *arg) sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds, sa->timeout, sa->sigmask); - release_select_lock (); - block_interrupt_signal (&oldset); /* If we were interrupted by C-g while inside sa->func above, the signal handler could have called maybe_reacquire_global_lock, in diff --git a/src/xgselect.c b/src/xgselect.c index 0d91d55bad..d7c63e3be1 100644 --- a/src/xgselect.c +++ b/src/xgselect.c @@ -29,27 +29,6 @@ Copyright (C) 2009-2021 Free Software Foundation, Inc. #include "blockinput.h" #include "systime.h" -static ptrdiff_t threads_holding_glib_lock; -static GMainContext *glib_main_context; - -void release_select_lock (void) -{ - if (--threads_holding_glib_lock == 0) - g_main_context_release (glib_main_context); -} - -static void acquire_select_lock (GMainContext *context) -{ - if (threads_holding_glib_lock++ == 0) - { - glib_main_context = context; - while (!g_main_context_acquire (context)) - { - /* Spin. */ - } - } -} - /* `xg_select' is a `pselect' replacement. Why do we need a separate function? 1. Timeouts. Glib and Gtk rely on timer events. If we did pselect with a greater timeout then the one scheduled by Glib, we would @@ -75,19 +54,26 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, GPollFD *gfds = gfds_buf; int gfds_size = ARRAYELTS (gfds_buf); int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1; + bool context_acquired = false; int i, nfds, tmo_in_millisec, must_free = 0; bool need_to_dispatch; context = g_main_context_default (); - acquire_select_lock (context); + context_acquired = g_main_context_acquire (context); + /* FIXME: If we couldn't acquire the context, we just silently proceed + because this function handles more than just glib file descriptors. 
+ Note that, as implemented, this failure is completely silent: there is + no feedback to the caller. */ if (rfds) all_rfds = *rfds; else FD_ZERO (&all_rfds); if (wfds) all_wfds = *wfds; else FD_ZERO (&all_wfds); - n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec, - gfds, gfds_size); + n_gfds = (context_acquired + ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec, + gfds, gfds_size) + : -1); if (gfds_size < n_gfds) { @@ -165,10 +151,8 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, #else need_to_dispatch = true; #endif - if (need_to_dispatch) + if (need_to_dispatch && context_acquired) { - acquire_select_lock (context); - int pselect_errno = errno; /* Prevent g_main_dispatch recursion, that would occur without block_input wrapper, because event handlers call @@ -178,9 +162,11 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds, g_main_context_dispatch (context); unblock_input (); errno = pselect_errno; - release_select_lock (); } + if (context_acquired) + g_main_context_release (context); + /* To not have to recalculate timeout, return like this. */ if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0)) { diff --git a/src/xgselect.h b/src/xgselect.h index 2142a236b2..e00dce1283 100644 --- a/src/xgselect.h +++ b/src/xgselect.h @@ -29,6 +29,4 @@ #define XGSELECT_H fd_set *rfds, fd_set *wfds, fd_set *efds, struct timespec *timeout, sigset_t *sigmask); -extern void release_select_lock (void); - #endif /* XGSELECT_H */ -- 2.26.2 [-- Attachment #3: Type: text/plain, Size: 57 bytes --] #2. Fails on tip of master, succeeds after patch in #1. [-- Attachment #4: 42.el --] [-- Type: application/emacs-lisp, Size: 5887 bytes --] [-- Attachment #5: Type: text/plain, Size: 57 bytes --] #3. Abridged original example, fails after patch in #1. [-- Attachment #6: report.el --] [-- Type: application/emacs-lisp, Size: 4943 bytes --] [-- Attachment #7: Type: text/plain, Size: 322 bytes --] Fails not necessarily because xgselect.c is wrong, but rather because channel-recv blocks on a mutex before channel-send can get its act together. This was hard for all to discern because OP seemed to have gone out of his way to obfuscate his "minimum" example. #4. What #3 probably intended, succeeds after patch in #1. [-- Attachment #8: report-2.el --] [-- Type: application/emacs-lisp, Size: 5014 bytes --] ^ permalink raw reply related [flat|nested] 53+ messages in thread
[parent not found: <87fsxv8182.fsf@dick>]
* bug#36609: 27.0.50; Possible race-condition in threading implementation [not found] ` <87fsxv8182.fsf@dick> @ 2021-06-06 16:35 ` Eli Zaretskii 2021-06-06 19:10 ` dick.r.chiang 0 siblings, 1 reply; 53+ messages in thread From: Eli Zaretskii @ 2021-06-06 16:35 UTC (permalink / raw) To: dick.r.chiang; +Cc: larsi, politza, pipcet, 36609 > From: dick.r.chiang@gmail.com > Cc: Pip Cet <pipcet@gmail.com>, larsi@gnus.org, 36609@debbugs.gnu.org, > Eli Zaretskii <eliz@gnu.org> > Date: Sun, 06 Jun 2021 11:25:01 -0400 > > #1. Want to revert commit 9c62ffb This will bring back bug#36609, so we cannot do that without discussing first why you think that commit was wrong. > #2. Fails on tip of master, succeeds after patch in #1. Please explain what does "fails" mean, and why do you think the above commit is the culprit. (A much simpler test case will be appreciated, btw.) > Fails not necessarily because xgselect.c is wrong, but rather because > channel-recv blocks on a mutex before channel-send can get its act together. You mean, in this code: (let ((channel (make-channel 1))) (make-thread (lambda nil (channel-send (car channel) 42)) "produce") (channel-recv (cdr channel)) (ignore-errors (enable-command 'list-threads)) (call-interactively #'list-threads)) ? Here, channel-send is called by a new thread, created by make-thread. In this code, it is _expected_ that channel-recv will be called (by the main thread) _before_ channel-send is called by the new thread, because make-thread creates a thread, but the newly created thread doesn't run until it can acquire the global lock. Meanwhile, the main thread continues running and calls channel-recv. The new thread will not begin running, AFAIU, until the main thread calls condition-wait inside channel-recv. By "blocks on a mutex", did you mean that channel-recv blocks trying to acquire the mutex here: (cl-defgeneric channel-recv ((sink channel-terminal)) (with-mutex (oref sink mutex) <<<<<<<<<<<<<<<<<<<<<<<<<< (with-slots (condition msg-queue) sink If so, which thread holds that mutex at this point? > #4. What #3 probably intended, succeeds after patch in #1. Yes, race conditions can be solved by using sleep-for, but that's not really a clean solution, at least not in my book. ^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-06 16:35 ` Eli Zaretskii
@ 2021-06-06 19:10   ` dick.r.chiang
  2021-06-06 19:27     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-06 19:10 UTC
To: Eli Zaretskii; +Cc: larsi, pipcet, politza, 36609

EZ> This will bring back bug#36609, so we cannot do that without
EZ> discussing first why you think that commit was wrong.

By enforcing bijection between acquires and releases in xgselect.c --
cavalierly so, since the releases now come out-of-band from another
module, thread.c -- commit 9c62ffb admits deadlock.  The conditions
under which that happens are not at all rarefied, as evidenced by
`for i in 1 2 3 ; do src/emacs -Q -l ./42.el & done ;`

Generally, a commit made to remedy a phantom problem should always be
suspect.  In this case OP was squatting on the main thread, so he
deserves what he got.

EZ> Yes, race conditions can be solved by using sleep-for, but that's not
EZ> really a clean solution, at least not in my book.

Case #4 was not meant to solve anything, only to set OP straight on what
he really wanted.  You can remove the sleep-for; it still works under
patch #1.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-06 19:10 ` dick.r.chiang
@ 2021-06-06 19:27   ` Eli Zaretskii
  2021-06-09 21:40     ` dick.r.chiang
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-06 19:27 UTC
To: dick.r.chiang; +Cc: larsi, pipcet, politza, 36609

> From: dick.r.chiang@gmail.com
> Cc: larsi@gnus.org, pipcet@gmail.com, politza@hochschule-trier.de,
>     36609@debbugs.gnu.org
> Date: Sun, 06 Jun 2021 15:10:43 -0400
>
> EZ> This will bring back bug#36609, so we cannot do that without
> EZ> discussing first why you think that commit was wrong.
>
> By enforcing bijection between acquires and releases in xgselect.c --
> cavalierly so as the releases now come out-of-band from another module
> thread.c -- commit 9c62ffb admits deadlock.  The conditions under which that
> happens are not at all rarefied as evidenced by `for i in 1 2 3 ; do src/emacs
> -Q -l ./42.el & done ;`

You will have to provide a more detailed analysis, sorry.  The fact that
some Lisp can cause a deadlock does not prove anything; there are too
many ways to write such deadlocking code.

> Generally, a commit made to remedy a phantom problem should always be
> suspect.  In this case OP was squatting on the main thread, so he deserves
> what he got.

That was not the analysis of that test case; see the discussion in the
bug.  The problems with the GTK lock are real, not phantom.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-06 19:27 ` Eli Zaretskii
@ 2021-06-09 21:40   ` dick.r.chiang
  2021-06-10  6:46     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-09 21:40 UTC
To: Eli Zaretskii; +Cc: larsi, politza, pipcet, 36609

[-- Attachment #1: Type: text/plain, Size: 497 bytes --]

EZ> You will have to provide a more detailed analysis, sorry.

There is no guarantee the static variable `threads_holding_glib_lock`
introduced in xgselect.c in commit 9c62ffb is synchronized across
threads.  As such, relying on it becoming zero in time for
release_select_lock() is fraught.  (If you add print statements, this
particular heisenbug disappears, at least on Linux kernel
4.15.0-99-generic.)

Four attachments:

1. Desired patch to master (reverts 9c62ffb, adds main-thread-p check).

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Only-acquire-glib-context-if-main-thread.patch --]
[-- Type: text/x-diff, Size: 4523 bytes --]

From 5b30b55f60844ada43b119640e052f8ed5ee6e9c Mon Sep 17 00:00:00 2001
From: dickmao <none>
Date: Wed, 9 Jun 2021 17:11:48 -0400
Subject: [PATCH] Only acquire glib context if main thread

* src/thread.c (really_call_select): revert 9c62ffb
* src/xgselect.c (xg_select): revert 9c62ffb, only acquire glib if main
---
 src/thread.c   |  8 --------
 src/xgselect.c | 44 ++++++++++++++++----------------------------
 src/xgselect.h |  2 --
 3 files changed, 16 insertions(+), 38 deletions(-)

diff --git a/src/thread.c b/src/thread.c
index f74f611148..609cd7c5fc 100644
--- a/src/thread.c
+++ b/src/thread.c
@@ -28,12 +28,6 @@ Copyright (C) 2012-2021 Free Software Foundation, Inc.
 #include "pdumper.h"
 #include "keyboard.h"
 
-#if defined HAVE_GLIB && ! defined (HAVE_NS)
-#include <xgselect.h>
-#else
-#define release_select_lock() do { } while (0)
-#endif
-
 union aligned_thread_state
 {
   struct thread_state s;
@@ -592,8 +586,6 @@ really_call_select (void *arg)
   sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds,
                            sa->timeout, sa->sigmask);
 
-  release_select_lock ();
-
   block_interrupt_signal (&oldset);
   /* If we were interrupted by C-g while inside sa->func above, the
      signal handler could have called maybe_reacquire_global_lock, in
diff --git a/src/xgselect.c b/src/xgselect.c
index 0d91d55bad..b40f75cb50 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -28,27 +28,7 @@ Copyright (C) 2009-2021 Free Software Foundation, Inc.
 #include "lisp.h"
 #include "blockinput.h"
 #include "systime.h"
-
-static ptrdiff_t threads_holding_glib_lock;
-static GMainContext *glib_main_context;
-
-void release_select_lock (void)
-{
-  if (--threads_holding_glib_lock == 0)
-    g_main_context_release (glib_main_context);
-}
-
-static void acquire_select_lock (GMainContext *context)
-{
-  if (threads_holding_glib_lock++ == 0)
-    {
-      glib_main_context = context;
-      while (!g_main_context_acquire (context))
-        {
-          /* Spin. */
-        }
-    }
-}
+#include "thread.h"
 
 /* `xg_select' is a `pselect' replacement.  Why do we need a separate
    function?
    1. Timeouts.  Glib and Gtk rely on timer events.  If we did pselect
@@ -75,19 +55,27 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
   GPollFD *gfds = gfds_buf;
   int gfds_size = ARRAYELTS (gfds_buf);
   int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1;
+  bool context_acquired = false;
   int i, nfds, tmo_in_millisec, must_free = 0;
   bool need_to_dispatch;
 
   context = g_main_context_default ();
-  acquire_select_lock (context);
+  if (main_thread_p (current_thread))
+    context_acquired = g_main_context_acquire (context);
+  /* FIXME: If we couldn't acquire the context, we just silently proceed
+     because this function handles more than just glib file descriptors.
+     Note that, as implemented, this failure is completely silent: there is
+     no feedback to the caller.  */
 
   if (rfds) all_rfds = *rfds;
   else FD_ZERO (&all_rfds);
   if (wfds) all_wfds = *wfds;
   else FD_ZERO (&all_wfds);
 
-  n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
-                                 gfds, gfds_size);
+  n_gfds = (context_acquired
+            ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
+                                    gfds, gfds_size)
+            : -1);
 
   if (gfds_size < n_gfds)
     {
@@ -165,10 +153,8 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
 #else
   need_to_dispatch = true;
 #endif
-  if (need_to_dispatch)
+  if (need_to_dispatch && context_acquired)
     {
-      acquire_select_lock (context);
-
       int pselect_errno = errno;
       /* Prevent g_main_dispatch recursion, that would occur without
          block_input wrapper, because event handlers call
@@ -178,9 +164,11 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
       g_main_context_dispatch (context);
       unblock_input ();
       errno = pselect_errno;
-      release_select_lock ();
     }
 
+  if (context_acquired)
+    g_main_context_release (context);
+
   /* To not have to recalculate timeout, return like this.  */
   if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0))
     {
diff --git a/src/xgselect.h b/src/xgselect.h
index 2142a236b2..e00dce1283 100644
--- a/src/xgselect.h
+++ b/src/xgselect.h
@@ -29,6 +29,4 @@ #define XGSELECT_H
                       fd_set *rfds, fd_set *wfds, fd_set *efds,
                       struct timespec *timeout, sigset_t *sigmask);
 
-extern void release_select_lock (void);
-
 #endif /* XGSELECT_H */
-- 
2.26.2

[-- Attachment #3: Type: text/plain, Size: 195 bytes --]

2. OP's original "my code doesn't work, here you figure it out", obfuscated MRE
   Run as: for i in 1 2 3 ; do src/emacs -Q -l ./report-orig.el & done
   Fails if: One of the emacsen stops responding

[-- Attachment #4: report-orig.el --]
[-- Type: application/emacs-lisp, Size: 6345 bytes --]

[-- Attachment #5: Type: text/plain, Size: 206 bytes --]

3. "Affine" redaction of OP's MRE *infrequently* breaks on tip of master
   (succeeds after patch)
   Run as: for i in 1 2 3 ; do src/emacs -Q -l ./report.el & done
   Fails if: One of the emacsen stops responding

[-- Attachment #6: report.el --]
[-- Type: application/emacs-lisp, Size: 704 bytes --]

[-- Attachment #7: Type: text/plain, Size: 68 bytes --]

4. My MRE *frequently* breaks tip of master (succeeds after patch)

[-- Attachment #8: 42.el --]
[-- Type: application/emacs-lisp, Size: 1551 bytes --]

[-- Attachment #9: Type: text/plain, Size: 105 bytes --]

Run as: for i in 1 2 3 ; do src/emacs -Q -l ./42.el & done
Fails if: One of the emacsen stops responding
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-09 21:40 ` dick.r.chiang
@ 2021-06-10  6:46   ` Eli Zaretskii
  2021-06-10 11:52     ` dick.r.chiang
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-10 6:46 UTC
To: dick.r.chiang; +Cc: larsi, politza, pipcet, 36609

> From: dick.r.chiang@gmail.com
> Cc: larsi@gnus.org, politza@hochschule-trier.de, pipcet@gmail.com,
>     36609@debbugs.gnu.org
> Date: Wed, 09 Jun 2021 17:40:28 -0400
>
> There is no guarantee the static variable `threads_holding_glib_lock`
> introduced in xgselect.c in commit 9c62ffb is synchronized across threads.

How can that happen?  It's a single word, and its change should
therefore be atomic.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10  6:46 ` Eli Zaretskii
@ 2021-06-10 11:52   ` dick.r.chiang
  2021-06-10 14:18     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-10 11:52 UTC
To: Eli Zaretskii; +Cc: 36609

[Standard disclaimer about not being a compiler semanticist]

> a single word... change should ... be atomic.

I've never heard that.  I googled around; one helpful and only mildly
sententious monograph on the topic is
https://stefansf.de/post/qualifier-volatile-c/

Adding the volatile qualifier to `threads_holding_glib_lock` did not
resolve my MRE on my particular architecture (whatever that may be).
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10 11:52 ` dick.r.chiang
@ 2021-06-10 14:18   ` Eli Zaretskii
  2021-06-10 14:55     ` dick.r.chiang
  2021-06-10 15:35     ` Andreas Schwab
  1 sibling, 2 replies; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-10 14:18 UTC
To: dick.r.chiang; +Cc: 36609

> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 07:52:45 -0400
>
> > a single word... change should ... be atomic.
>
> I've never heard that.

The idea is that a variable that is modified in a single machine
instruction will always hold the value assigned by the last thread that
set it.  You cannot have, for example, half of the bits of the variable
set by one thread and the other half by another thread.

> Adding the volatile qualifier to `threads_holding_glib_lock` did not
> resolve my MRE on my particular architecture (whatever that may be).

So can you describe what you think happens in your scenario as a result
of the offending change that added threads_holding_glib_lock?  Because I
still don't see the connection, to tell the truth.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10 14:18 ` Eli Zaretskii
@ 2021-06-10 14:55   ` dick.r.chiang
  2021-06-10 15:04     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-10 14:55 UTC
To: Eli Zaretskii; +Cc: 36609

> You cannot have, for example, half of the bits of the variable set by
> one thread and the other half by another thread.

I dunno, this seems like a sophomoric, rose-colored view of variable
sharing without regard to caching, or really any generally accepted
guarantees about POSIX threads.

> So can you describe what you think happens in your scenario that the
> offending change that added threads_holding_glib_lock causes?

Two threads see `threads_holding_glib_lock` as "0" and both attempt to
acquire the lock (line 43, xgselect.c).

Or, two threads see `threads_holding_glib_lock` as "2" and the glib
lock is never released (line 37, xgselect.c).
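(To make that failure mode concrete, here is a deliberately racy,
self-contained pthread sketch; it is not the Emacs sources.  The names
`holders`, `lock_taken`, and `worker` are illustrative only: the plain
counter stands in for threads_holding_glib_lock, and the flag stands in
for "we hold the glib context".  Build with `cc -pthread`.)

  /* Two threads run the same check-then-act sequence on a plain counter.
     The increment and decrement each compile to a load, an arithmetic op,
     and a store, so the two threads can interleave between those steps.  */
  #include <pthread.h>
  #include <stdio.h>

  static long holders;      /* non-atomic stand-in for the lock count */
  static int lock_taken;    /* stand-in for holding the glib context */

  static void *worker (void *arg)
  {
    if (holders++ == 0)     /* both threads may read 0 ...            */
      lock_taken++;         /* ... and both "acquire"                 */
    if (--holders == 0)     /* or the decrements interleave ...       */
      lock_taken--;         /* ... and the last "release" never runs  */
    return NULL;
  }

  int main (void)
  {
    for (int i = 0; i < 100000; i++)
      {
        pthread_t a, b;
        holders = 0;
        lock_taken = 0;
        pthread_create (&a, NULL, worker, NULL);
        pthread_create (&b, NULL, worker, NULL);
        pthread_join (a, NULL);
        pthread_join (b, NULL);
        if (lock_taken != 0)    /* imbalance: the "lock" was leaked */
          printf ("iteration %d: lock_taken = %d\n", i, lock_taken);
      }
    return 0;
  }

Whether any given run trips over this depends on scheduling and on the
hardware, which is consistent with the heisenbug behavior described
earlier in the thread.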
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10 14:55 ` dick.r.chiang
@ 2021-06-10 15:04   ` Eli Zaretskii
  2021-06-10 21:36     ` dick.r.chiang
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-10 15:04 UTC
To: dick.r.chiang; +Cc: 36609

> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 10:55:53 -0400
>
> > You cannot have, for example, half of the bits of the variable set by
> > one thread and the other half by another thread.
>
> I dunno, this seems like a sophomoric, rose-colored view of variable sharing
> without regard to caching, or really any generally accepted guarantees about
> POSIX threads.

Is it, really?

> > So can you describe what you think happens in your scenario that the
> > offending change that added threads_holding_glib_lock causes?
>
> Two threads see `threads_holding_glib_lock` is "0" and both attempt
> acquiring the lock (line 43, xgselect.c)
>
> Or, two threads see `threads_holding_glib_lock` as "2" and glib
> lock is never released. (line 37, xgselect.c)

And can you show a GDB backtrace, with values of the variable, that
supports that?  And explain how that accounts for the hang you see on
the Lisp level?  These are the details I have been asking for since the
first message you posted here.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10 15:04 ` Eli Zaretskii
@ 2021-06-10 21:36   ` dick.r.chiang
  2021-06-11  6:00     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-10 21:36 UTC
To: Eli Zaretskii; +Cc: 36609

EZ> Is it, really?

Yes.  While I can't claim to be any more knowledgeable, the naivete of
someone protesting thread atomicity based on word-size is stunningly
apparent.

EZ> And you can show a GDB backtrace with values of the variable that
EZ> supports that?

Alas, I cannot.  My gdb skills are less impressive in the presence of
threads, and heisenbugs of this nature tend to mysteriously disappear
when gdb's leisurely pace gives caches ample time to flush.  I mentioned
earlier that I could not reproduce my MRE failure when I merely inserted
print statements.

My patch isn't quite as studied as Pip Cet's (my patch merely skirts
locking the glib context unless you're main-thread-p), but it does fix
OP and, unlike the current state of the code, doesn't admit glaring
theoretical flaws, even though you seem to think those flaws could never
materialize in practice despite evidence (gdb-less as it is) to the
contrary.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10 21:36 ` dick.r.chiang
@ 2021-06-11  6:00   ` Eli Zaretskii
  2021-06-19 17:53     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-11 6:00 UTC
To: dick.r.chiang; +Cc: 36609

> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 17:36:13 -0400
>
> EZ> Is it, really?
>
> Yes.  While I can't claim to be any more knowledgeable, the naivete of someone
> protesting thread atomicity based on word-size is stunningly apparent.

I'm okay with replacing the simple increments and decrements of that
variable with atomic ones, using the likes of __atomic_fetch_add
intrinsics of GCC, if that solves the problem.  Did you try that?
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-11  6:00 ` Eli Zaretskii
@ 2021-06-19 17:53   ` Eli Zaretskii
  2021-06-19 19:14     ` dick.r.chiang
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-19 17:53 UTC
To: dick.r.chiang; +Cc: 36609

> Date: Fri, 11 Jun 2021 09:00:07 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org
>
> I'm okay with replacing the simple increments and decrements of that
> variable with atomic ones, using the likes of __atomic_fetch_add
> intrinsics of GCC, if that solves the problem.  Did you try that?

Ping!  Did you try that, and if so, did using the atomics solve the
original problem?

Thanks.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-19 17:53 ` Eli Zaretskii
@ 2021-06-19 19:14   ` dick.r.chiang
  2021-06-19 19:18     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-19 19:14 UTC
To: Eli Zaretskii; +Cc: 36609

I enter a plea of lacking the programming know-how for atomics.

I stand by everything I've said thus far: an obfuscated MRE led to a
misdiagnosis, which led to an unnecessary and highly thread-unsafe fix,
which can be remedied by the vulgar "end-run" patch I submitted, and
which I test every day with parallel-thread Gnus.

Unfortunately, in this and every urological contest of wills I seem to
have with project maintainers, I, by construction, must always lose.

>>>>> Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Fri, 11 Jun 2021 09:00:07 +0300
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: 36609@debbugs.gnu.org
>>
>> I'm okay with replacing the simple increments and decrements of that
>> variable with atomic ones, using the likes of __atomic_fetch_add
>> intrinsics of GCC, if that solves the problem.  Did you try that?

> Ping!

> Did you try that, and if so, did using the atomics solve the original
> problem?

> Thanks.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-19 19:14 ` dick.r.chiang
@ 2021-06-19 19:18   ` Eli Zaretskii
  2021-06-19 21:12     ` dick.r.chiang
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-19 19:18 UTC
To: dick.r.chiang; +Cc: 36609

> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Sat, 19 Jun 2021 15:14:19 -0400
>
> I enter a plea of lacking the programming know-how for atomics.

If I suggest a patch, would you be willing to give it a try?
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-19 19:18 ` Eli Zaretskii
@ 2021-06-19 21:12   ` dick.r.chiang
  2021-06-20 11:39     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-19 21:12 UTC
To: Eli Zaretskii; +Cc: 36609

"Suggest" or "Provide"?

If the former, well, my recalcitrance should be your answer.

If the latter, sure, although if you've written the patch, then running
my MREs would be epsilon more effort.

>>>>> Eli Zaretskii <eliz@gnu.org> writes:

>> From: dick.r.chiang@gmail.com
>> Cc: 36609@debbugs.gnu.org
>> Date: Sat, 19 Jun 2021 15:14:19 -0400
>>
>> I enter a plea of lacking the programming know-how for atomics.

> If I suggest a patch, would you be willing to give it a try?
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-19 21:12 ` dick.r.chiang
@ 2021-06-20 11:39   ` Eli Zaretskii
  2021-06-20 14:01     ` dick.r.chiang
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-20 11:39 UTC
To: dick.r.chiang; +Cc: 36609

> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Sat, 19 Jun 2021 17:12:32 -0400
>
> "Suggest" or "Provide"?
>
> If the former, well, my recalcitrance should be your answer.
>
> If the latter, sure

Thanks, please try the patch below.

> although if you've written the patch, then running my MREs would be
> epsilon more effort.

You have a use case and wrote code with which you are familiar, so you
are in a better position to test it.  Moreover, I don't have access to a
system where these problems could happen, so it's far from epsilon
effort for me.

diff --git a/src/xgselect.c b/src/xgselect.c
index 0d91d55bad..9a90670b0f 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -34,12 +34,27 @@ static GMainContext *glib_main_context;
 
 void release_select_lock (void)
 {
+#if GNUC_PREREQ (4, 7, 0)
+  if (__atomic_sub_fetch (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL))
+    g_main_context_release (glib_main_context);
+#else
   if (--threads_holding_glib_lock == 0)
     g_main_context_release (glib_main_context);
+#endif
 }
 
 static void acquire_select_lock (GMainContext *context)
 {
+#if GNUC_PREREQ (4, 7, 0)
+  if (__atomic_fetch_add (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
+    {
+      glib_main_context = context;
+      while (!g_main_context_acquire (context))
+        {
+          /* Spin. */
+        }
+    }
+#else
   if (threads_holding_glib_lock++ == 0)
     {
       glib_main_context = context;
@@ -48,6 +63,7 @@ static void acquire_select_lock (GMainContext *context)
           /* Spin. */
         }
     }
+#endif
 }
 
 /* `xg_select' is a `pselect' replacement.  Why do we need a separate
    function?
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-20 11:39 ` Eli Zaretskii
@ 2021-06-20 14:01   ` dick.r.chiang
  2021-06-20 15:53     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: dick.r.chiang @ 2021-06-20 14:01 UTC
To: Eli Zaretskii; +Cc: 36609

Yes!  If you just add a negation to the __atomic_sub_fetch call, this
passes all the MREs.

It's not clear how you want to deal with the #else branch of the
GNUC_PREREQ (4, 7, 0) conditional.

>>>>> "EZ" == Eli Zaretskii <eliz@gnu.org> writes:

>> From: dick.r.chiang@gmail.com
>> Cc: 36609@debbugs.gnu.org
>> Date: Sat, 19 Jun 2021 17:12:32 -0400
>>
>> "Suggest" or "Provide"?
>>
>> If the former, well, my recalcitrance should be your answer.
>>
>> If the latter, sure

EZ> Thanks, please try the patch below.

>> although if you've written the patch, then running my MREs would be epsilon
>> more effort.

EZ> You have a use case and wrote code with which you are familiar, so you are
EZ> in a better position to test it.  Moreover, I don't have access to a
EZ> system where these problems could happen, so it's far from epsilon effort
EZ> for me.

EZ> diff --git a/src/xgselect.c b/src/xgselect.c
EZ> index 0d91d55bad..9a90670b0f 100644
EZ> --- a/src/xgselect.c
EZ> +++ b/src/xgselect.c
EZ> @@ -34,12 +34,27 @@ static GMainContext *glib_main_context;
EZ>  void release_select_lock (void)
EZ>  {
EZ> +#if GNUC_PREREQ (4, 7, 0)
EZ> +  if (__atomic_sub_fetch (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL))
EZ> +    g_main_context_release (glib_main_context);
EZ> +#else
EZ>    if (--threads_holding_glib_lock == 0)
EZ>      g_main_context_release (glib_main_context);
EZ> +#endif
EZ>  }
EZ>  static void acquire_select_lock (GMainContext *context)
EZ>  {
EZ> +#if GNUC_PREREQ (4, 7, 0)
EZ> +  if (__atomic_fetch_add (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
EZ> +    {
EZ> +      glib_main_context = context;
EZ> +      while (!g_main_context_acquire (context))
EZ> +        {
EZ> +          /* Spin. */
EZ> +        }
EZ> +    }
EZ> +#else
EZ>    if (threads_holding_glib_lock++ == 0)
EZ>      {
EZ>        glib_main_context = context;
EZ> @@ -48,6 +63,7 @@ static void acquire_select_lock (GMainContext *context)
EZ>          /* Spin. */
EZ>        }
EZ>      }
EZ> +#endif
EZ>  }
EZ>  /* `xg_select' is a `pselect' replacement.  Why do we need a separate
EZ>     function?
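(A side note on why the negation matters, since the two builtins are easy
to confuse: __atomic_sub_fetch returns the *new* value after the
subtraction, while __atomic_fetch_sub returns the *old* one.  The snippet
below is purely illustrative GCC-builtin usage, not Emacs code.)

  #include <stdio.h>

  int main (void)
  {
    long counter = 1;

    /* New value after subtracting 1 is 0, so this prints 0; "release
       when this returns 0" is the condition the last holder needs.  */
    printf ("%ld\n", __atomic_sub_fetch (&counter, 1, __ATOMIC_ACQ_REL));

    counter = 1;
    /* Old value before the subtraction is 1, so this prints 1.  */
    printf ("%ld\n", __atomic_fetch_sub (&counter, 1, __ATOMIC_ACQ_REL));
    return 0;
  }

So `if (__atomic_sub_fetch (&threads_holding_glib_lock, 1, ...))` as
written above releases while other threads still hold the context and
never releases when the count reaches zero; testing the result against 0
(equivalently, negating it) restores the intended condition, as in the
revised patch in the next message.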
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-20 14:01 ` dick.r.chiang
@ 2021-06-20 15:53   ` Eli Zaretskii
  2021-06-25 13:54     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-20 15:53 UTC
To: dick.r.chiang; +Cc: 36609

> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Sun, 20 Jun 2021 10:01:37 -0400
>
> Yes!  If you just add a negation to the __atomic_sub_fetch call

You mean, use the patch below instead?

> this passes all the MREs.

Thanks, will install soon.

> It's not clear how you want to deal with the #else GNUC_PREREQ (4, 7, 0).

By hoping no one uses this and expects threads to be stable enough under
GTK.

diff --git a/src/xgselect.c b/src/xgselect.c
index 0d91d55bad..92b118b955 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -34,12 +34,27 @@ static GMainContext *glib_main_context;
 
 void release_select_lock (void)
 {
+#if GNUC_PREREQ (4, 7, 0)
+  if (__atomic_sub_fetch (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
+    g_main_context_release (glib_main_context);
+#else
   if (--threads_holding_glib_lock == 0)
     g_main_context_release (glib_main_context);
+#endif
 }
 
 static void acquire_select_lock (GMainContext *context)
 {
+#if GNUC_PREREQ (4, 7, 0)
+  if (__atomic_fetch_add (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
+    {
+      glib_main_context = context;
+      while (!g_main_context_acquire (context))
+        {
+          /* Spin. */
+        }
+    }
+#else
   if (threads_holding_glib_lock++ == 0)
     {
       glib_main_context = context;
@@ -48,6 +63,7 @@ static void acquire_select_lock (GMainContext *context)
           /* Spin. */
         }
     }
+#endif
 }
 
 /* `xg_select' is a `pselect' replacement.  Why do we need a separate
    function?
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-20 15:53 ` Eli Zaretskii
@ 2021-06-25 13:54   ` Eli Zaretskii
  0 siblings, 0 replies; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-25 13:54 UTC
To: dick.r.chiang; +Cc: 36609

> Date: Sun, 20 Jun 2021 18:53:40 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org
>
> > From: dick.r.chiang@gmail.com
> > Cc: 36609@debbugs.gnu.org
> > Date: Sun, 20 Jun 2021 10:01:37 -0400
> >
> > Yes!  If you just add a negation to the __atomic_sub_fetch call
>
> You mean, use the patch below instead?
>
> > this passes all the MREs.
>
> Thanks, will install soon.

Now done.
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10 14:18 ` Eli Zaretskii
  2021-06-10 14:55   ` dick.r.chiang
@ 2021-06-10 15:35   ` Andreas Schwab
  2021-06-10 15:39     ` Eli Zaretskii
  1 sibling, 1 reply; 53+ messages in thread

From: Andreas Schwab @ 2021-06-10 15:35 UTC
To: Eli Zaretskii; +Cc: dick.r.chiang, 36609

On Jun 10 2021, Eli Zaretskii wrote:

> The idea is that a variable that is modified in a single machine
> instruction will always have the value assigned by the last thread
> which set that.

You cannot have that without explicit atomic operations.  An ordinary
RMW access is never atomic across cpus.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* bug#36609: 27.0.50; Possible race-condition in threading implementation
  2021-06-10 15:35 ` Andreas Schwab
@ 2021-06-10 15:39   ` Eli Zaretskii
  0 siblings, 0 replies; 53+ messages in thread

From: Eli Zaretskii @ 2021-06-10 15:39 UTC
To: Andreas Schwab; +Cc: dick.r.chiang, 36609

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: dick.r.chiang@gmail.com, 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 17:35:54 +0200
>
> On Jun 10 2021, Eli Zaretskii wrote:
>
> > The idea is that a variable that is modified in a single machine
> > instruction will always have the value assigned by the last thread
> > which set that.
>
> You cannot have that without explicit atomic operations.

We can do that if it turns out to be an issue.