* bug#36609: 27.0.50; Possible race-condition in threading implementation
@ 2019-07-11 20:51 Andreas Politz
2019-07-12 7:47 ` Eli Zaretskii
` (3 more replies)
0 siblings, 4 replies; 53+ messages in thread
From: Andreas Politz @ 2019-07-11 20:51 UTC (permalink / raw)
To: 36609
[-- Attachment #1: Type: text/plain, Size: 22047 bytes --]
I think there is a race-condition in the implementation of threads. I
tried to find a minimal test-case, without success. Thus, I've attached
a lengthy source-file. Loading that file should trigger this bug and
may freeze your session.
Indications:
1. The main-thread has the name of one of created threads (XEmacs in
this case), instead of "emacs".
2. Emacs stops processing all keyboard/mouse input while looping in
wait_reading_process_output. Sending commands via emacsclient still
works.
GDB output:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff17f5d40 (LWP 26264) "XEmacs" 0x000055555576eac0 in XPNTR (a=XIL(0x7ffff1312533)) at alloc.c:535
2 Thread 0x7ffff0ac4700 (LWP 26265) "gmain" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6
3 Thread 0x7fffebd1a700 (LWP 26266) "gdbus" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6
4 Thread 0x7fffeb519700 (LWP 26267) "dconf worker" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6
(gdb) bt full
#0 0x00007ffff50d3f76 in pselect () at /usr/lib/libc.so.6
#1 0x0000555555832e48 in really_call_select (arg=0x7fffffffd150) at thread.c:586
sa = 0x7fffffffd150
self = 0x555555c154e0 <main_thread>
oldset = {
__val = {0, 93825011232085, 140737488343064, 31632, 31632, 51840, 140737488343136, 93824994672716, 4294967298, 140737488343184, 93824999850720, 0, 0, 140737488343136, 93824993869565, 4041340661}
}
#2 0x0000555555774449 in flush_stack_call_func (func=0x555555832d81 <really_call_select>, arg=0x7fffffffd150) at alloc.c:4969
end = 0x7fffffffd100
self = 0x555555c154e0 <main_thread>
sentry = {
o = {
__max_align_ll = 140737488343296,
__max_align_ld = <invalid float value>
}
}
#3 0x0000555555832f40 in thread_select (func=0x7ffff50d3eb0 <pselect>, max_fds=6, rfds=0x7fffffffd260, wfds=0x7fffffffd2e0, efds=0x0, timeout=0x7fffffffd8b0, sigmask=0x0) at thread.c:616
sa = {
func = 0x7ffff50d3eb0 <pselect>,
max_fds = 6,
rfds = 0x7fffffffd260,
wfds = 0x7fffffffd2e0,
efds = 0x0,
timeout = 0x7fffffffd8b0,
sigmask = 0x0,
result = 210347336
}
#4 0x000055555585fea3 in xg_select (fds_lim=6, rfds=0x7fffffffd920, wfds=0x7fffffffd9a0, efds=0x0, timeout=0x7fffffffd8b0, sigmask=0x0) at xgselect.c:117
all_rfds = {
fds_bits = {32, 0 <repeats 15 times>}
}
all_wfds = {
fds_bits = {0 <repeats 16 times>}
}
tmo = {
tv_sec = 7,
tv_nsec = 140737488343264
}
tmop = 0x7fffffffd8b0
context = 0x555555dae320
have_wfds = true
gfds_buf = {{
fd = 0,
events = 0,
revents = 0
}, {
fd = 32,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 7,
events = 0,
revents = 0
}, {
fd = 8,
events = 0,
revents = 0
}, {
fd = 15,
events = 0,
revents = 0
}, {
fd = -11072,
events = 32767,
revents = 0
}, {
fd = 1460278864,
events = 21845,
revents = 0
}, {
fd = 64,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 2,
events = 48,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 91,
events = 124,
revents = 0
}, {
fd = -397131776,
events = 35414,
revents = 11125
}, {
fd = 1460278864,
events = 21845,
revents = 0
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = 2,
events = 0,
revents = 0
}, {
fd = -11072,
events = 32767,
revents = 0
}, {
fd = 1439161536,
events = 21845,
revents = 0
}, {
fd = -11024,
events = 32767,
revents = 0
}, {
fd = -11088,
events = 32767,
revents = 0
}, {
fd = -182671191,
events = 32767,
revents = 0
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = -182670875,
events = 32767,
revents = 0
}, {
fd = 194871296,
events = 59160,
revents = 48198
}, {
fd = -1175563804,
events = 19,
revents = 0
}, {
fd = 24,
events = 0,
revents = 0
}, {
fd = 1460278864,
events = 21845,
revents = 0
}, {
fd = 2,
events = 0,
revents = 0
}, {
fd = 1439882448,
events = 21845,
revents = 0
}, {
fd = 2,
events = 0,
revents = 0
}, {
fd = 2,
events = 0,
revents = 0
}, {
fd = 1,
events = 2,
revents = 0
}, {
fd = 1460278864,
events = 21845,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -397131776,
events = 35414,
revents = 11125
}, {
fd = 24,
events = 0,
revents = 0
}, {
fd = 1439161536,
events = 21845,
revents = 0
}, {
fd = 2,
events = 0,
revents = 0
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = 1439161536,
events = 21845,
revents = 0
}, {
fd = -182606722,
events = 32767,
revents = 0
}, {
fd = -10960,
events = 32767,
revents = 0
}, {
fd = 1433487750,
events = 21845,
revents = 0
}, {
fd = -10976,
events = 32767,
revents = 0
}, {
fd = 1439161536,
events = 21845,
revents = 0
}, {
fd = -30,
events = 0,
revents = 0
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = 1,
events = 64,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 31,
events = 0,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = -10960,
events = 32767,
revents = 0
}, {
fd = 1434542700,
events = 21845,
revents = 0
}, {
fd = -10912,
events = 32767,
revents = 0
}, {
fd = 1439161536,
events = 21845,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = -397131776,
events = 35414,
revents = 11125
}, {
fd = -10848,
events = 32767,
revents = 0
}, {
fd = 1434543288,
events = 21845,
revents = 0
}, {
fd = 1460274293,
events = 21845,
revents = 0
}, {
fd = 1385447426,
events = 931,
revents = 0
}, {
fd = 95390,
events = 0,
revents = 0
}, {
fd = 1433865373,
events = 7256,
revents = 10134
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 1439161536,
events = 21845,
revents = 0
}, {
fd = 664149,
events = 0,
revents = 0
}, {
fd = 80000,
events = 0,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 664149080,
events = 0,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 664149080,
events = 0,
revents = 0
}, {
fd = -10752,
events = 32767,
revents = 0
}, {
fd = 1434543494,
events = 21845,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -10544,
events = 32767,
revents = 0
}, {
fd = 499622378,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 499622378,
events = 0,
revents = 0
}, {
fd = -10688,
events = 32767,
revents = 0
}, {
fd = 1434939197,
events = 21845,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 164526702,
events = 0,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 664149080,
events = 0,
revents = 0
}, {
fd = -10544,
events = 32767,
revents = 0
}, {
fd = 499622378,
events = 41450,
revents = 7623
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -1,
events = 65535,
revents = 65535
}, {
fd = -10432,
events = 32767,
revents = 0
}, {
fd = 1433359581,
events = 21845,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 1443725184,
events = 85,
revents = 0
}, {
fd = 1450650965,
events = 21845,
revents = 0
}, {
fd = 1450650965,
events = 21845,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 16384,
events = 0,
revents = 0
}, {
fd = 5,
events = 0,
revents = 0
}, {
fd = -10496,
events = 32767,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -1,
events = 65535,
revents = 65535
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 164526702,
events = 0,
revents = 0
}, {
fd = 719,
events = 0,
revents = 0
}, {
fd = 893997157,
events = 0,
revents = 0
}, {
fd = 1439269600,
events = 21845,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -10496,
events = 32767,
revents = 0
}, {
fd = 1433288445,
events = 21845,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 164532384,
events = 0,
revents = 0
}, {
fd = 1562977925,
events = 0,
revents = 0
}, {
fd = 1562977925,
events = 0,
revents = 0
}, {
fd = 164532384,
events = 0,
revents = 0
}, {
fd = -10368,
events = 32767,
revents = 0
}, {
fd = 1434938909,
events = 21845,
revents = 0
}, {
fd = 100000,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 1562877925,
events = 0,
revents = 0
}, {
fd = 164532384,
events = 0,
revents = 0
}, {
fd = -1,
events = 65535,
revents = 65535
}, {
fd = 0,
events = 0,
revents = 0
}}
gfds = 0x7fffffffd360
gfds_size = 128
n_gfds = -1
retval = 0
our_fds = 0
max_fds = 5
context_acquired = false
i = 0
nfds = 135
tmo_in_millisec = 1030
must_free = 0
need_to_dispatch = false
#5 0x0000555555802a78 in wait_reading_process_output (time_limit=0, nsecs=0, read_kbd=-1, do_display=true, wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at process.c:5423
process_skipped = false
channel = 6
nfds = 1
Available = {
fds_bits = {32, 0 <repeats 15 times>}
}
Writeok = {
fds_bits = {0 <repeats 16 times>}
}
check_write = true
check_delay = 0
no_avail = false
xerrno = 11
proc = XIL(0x7fffffffd9a0)
timeout = {
tv_sec = 0,
tv_nsec = 499622378
}
end_time = {
tv_sec = 93824993869565,
tv_nsec = 0
}
timer_delay = {
tv_sec = 0,
tv_nsec = 499622378
}
got_output_end_time = {
tv_sec = 1562977205,
tv_nsec = 270561059
}
wait = FOREVER
got_some_output = -1
prev_wait_proc_nbytes_read = 0
retry_for_async = false
count = 4
now = {
tv_sec = 0,
tv_nsec = -1
}
#6 0x00005555556f4718 in kbd_buffer_get_event (kbp=0x7fffffffdc70, used_mouse_menu=0x7fffffffe235, end_time=0x0) at keyboard.c:3836
do_display = true
obj = make_fixnum(1073741816)
#7 0x00005555556f06c5 in read_event_from_main_queue (end_time=0x0, local_getcjmp=0x7fffffffe040, used_mouse_menu=0x7fffffffe235) at keyboard.c:2138
c = XIL(0)
save_jump = {{
__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0},
__mask_was_saved = 0,
__saved_mask = {
__val = {0 <repeats 16 times>}
}
}}
kb = 0x7fffffffdcd0
count = 3
#8 0x00005555556f099b in read_decoded_event_from_main_queue (end_time=0x0, local_getcjmp=0x7fffffffe040, prev_event=XIL(0), used_mouse_menu=0x7fffffffe235) at keyboard.c:2202
nextevt = XIL(0)
frame = 0x1dcd30d9
terminal = 0x9
events = {XIL(0), XIL(0x1dcd30d9), XIL(0xca80), XIL(0x2b758a56e8544000), XIL(0), XIL(0x555556772ff5), XIL(0x7fffffffde50), XIL(0x5555556f57d8), XIL(0x555555c982e0), XIL(0), XIL(0), XIL(0x7fffffffde50), XIL(0x5555556e3efd), XIL(0x1dcd30d9), XIL(0x7fffffffde90), XIL(0x5555556f3b29)}
n = 0
#9 0x00005555556f2236 in read_char (commandflag=1, map=XIL(0x5555567756e3), prev_event=XIL(0), used_mouse_menu=0x7fffffffe235, end_time=0x0) at keyboard.c:2810
c = XIL(0)
jmpcount = 3
local_getcjmp = {{
__jmpbuf = {0, -245616553674816983, 93825011232757, 51840, 0, 0, -245616553305718231, -6214351577293929943},
__mask_was_saved = 0,
__saved_mask = {
__val = {0, 0, 140737488347312, 93824993869565, 0, 140737488347424, 93824994673374, 93825011242707, 3, 0, 0, 140737488347424, 93824994446493, 93824999850720, 0, 0}
}
}}
save_jump = {{
__jmpbuf = {93824993869565, 0, 140737488347584, 93824994020698, 0, 51840, -1, 0},
__mask_was_saved = 1450661587,
__saved_mask = {
__val = {93825011242723, 140737488347520, 93824993870282, 93825011242707, 93825011242723, 140737488347584, 93824994446493, 93824999885184, 34464, 34464, 140737488347584, 93824993869565, 40105367251, 93825004306304, 140737488347624, 93824999850720}
}
}}
tem = XIL(0x2aaa9b29c970)
save = XIL(0)
previous_echo_area_message = XIL(0)
also_record = XIL(0)
reread = false
recorded = false
polling_stopped_here = true
orig_kboard = 0x555555dfa570
#10 0x00005555556ff6c8 in read_key_sequence (keybuf=0x7fffffffe440, prompt=XIL(0), dont_downcase_last=false, can_return_switch_frame=true, fix_current_buffer=true, prevent_redisplay=false) at keyboard.c:9124
interrupted_kboard = 0x555555dfa570
interrupted_frame = 0x5555560d7f80
key = XIL(0x5555557a7eff)
used_mouse_menu = false
echo_local_start = 0
last_real_key_start = 0
keys_local_start = 0
new_binding = XIL(0x7fffffffe310)
count = 3
t = 0
echo_start = 0
keys_start = 0
current_binding = XIL(0x5555567756e3)
first_unbound = 31
mock_input = 0
used_mouse_menu_history = {false <repeats 30 times>}
fkey = {
parent = XIL(0x555555d6ce03),
map = XIL(0x555555d6ce03),
start = 0,
end = 0
}
keytran = {
parent = XIL(0x7ffff14a7dbb),
map = XIL(0x7ffff14a7dbb),
start = 0,
end = 0
}
indec = {
parent = XIL(0x555555d6cdf3),
map = XIL(0x555555d6cdf3),
start = 0,
end = 0
}
shift_translated = false
delayed_switch_frame = XIL(0)
original_uppercase = XIL(0)
original_uppercase_position = -1
dummyflag = false
starting_buffer = 0x7ffff0e1f6f0
fake_prefixed_keys = XIL(0)
first_event = XIL(0)
second_event = XIL(0)
#11 0x00005555556ee5fb in command_loop_1 () at keyboard.c:1348
cmd = XIL(0)
keybuf = {XIL(0x7fffffffe490), XIL(0x5555557a804c), make_fixnum(11545945833472), XIL(0x7fffffffe4c0), XIL(0x555555c982e0), XIL(0), XIL(0), XIL(0x7fffffffe490), XIL(0x5555556e3efd), XIL(0xf0e1f6f5), XIL(0x7fffffffe500), make_fixnum(23456248668343), XIL(0x555555d18953), XIL(0x3), XIL(0x555555c982e0), XIL(0), XIL(0), XIL(0x7fffffffe4e0), XIL(0x5555556e3efd), XIL(0xf0e1f6f5), XIL(0x7fffffffe520), XIL(0x5555557a2e69), XIL(0x1556e3efd), XIL(0x5760), XIL(0x7fffffffe540), XIL(0x555555d53470), XIL(0), XIL(0), XIL(0x7fffffffe550), make_fixnum(23456248662876)}
i = 32767
prev_modiff = 0
prev_buffer = 0x0
already_adjusted = false
#12 0x00005555557a2a64 in internal_condition_case (bfun=0x5555556ee1b3 <command_loop_1>, handlers=XIL(0x5760), hfun=0x5555556ed968 <cmd_error>) at eval.c:1351
val = XIL(0x5555556e3efd)
c = 0x555555d53470
#13 0x00005555556ede9b in command_loop_2 (ignore=XIL(0)) at keyboard.c:1091
val = XIL(0)
#14 0x00005555557a22d5 in internal_catch (tag=XIL(0xd110), func=0x5555556ede6e <command_loop_2>, arg=XIL(0)) at eval.c:1112
val = XIL(0x45b00000000)
c = 0x555555d53340
#15 0x00005555556ede39 in command_loop () at keyboard.c:1070
#16 0x00005555556ed537 in recursive_edit_1 () at keyboard.c:714
count = 1
val = make_fixnum(23456248667937)
#17 0x00005555556ed6bb in Frecursive_edit () at keyboard.c:786
count = 0
buffer = XIL(0)
#18 0x00005555556eb49d in main (argc=4, argv=0x7fffffffe8b8) at emacs.c:2086
stack_bottom_variable = 0x20
do_initial_setlocale = true
no_loadup = false
junk = 0x0
dname_arg = 0x0
ch_to_dir = 0x0
original_pwd = 0x0
dump_mode = 0x0
skip_args = 0
temacs = 0x0
attempt_load_pdump = true
rlim = {
rlim_cur = 10022912,
rlim_max = 18446744073709551615
}
sockfd = -1
[-- Attachment #2: thread-bug-2.el --]
[-- Type: text/plain, Size: 6199 bytes --]
;; -*- lexical-binding: t -*-
(require 'threads)
(require 'eieio)
(require 'cl-lib)
(require 'ring)
(defun debug (fmt &rest args)
(princ (apply #'format fmt args) #'external-debugging-output)
(terpri #'external-debugging-output))
(define-error 'thread-utils-thread-interrupted
"Thread was interrupted" 'error)
(defun thread-utils-main-thread-p (&optional object)
(let ((object (or object (current-thread))))
(and (threadp object)
(eq object (car (all-threads))))))
(defun thread-utils-quitable-apply (fn &rest args)
(let* ((this-thread (current-thread))
(quit-thread
(make-thread
(lambda nil
(condition-case nil
(cl-loop (sleep-for 3))
(quit
(thread-signal this-thread 'quit nil))
(thread-utils-thread-interrupted nil))))))
(unwind-protect
(apply fn args)
(thread-signal
quit-thread 'thread-utils-thread-interrupted nil))))
(defun thread-utils-condition-quitable-wait (condition)
(cl-check-type condition condition-variable)
(thread-utils-quitable-apply #'condition-wait condition))
(defun thread-utils-condition-wait (condition)
(if (thread-utils-main-thread-p)
(thread-utils-condition-quitable-wait condition)
(condition-wait condition)))
(defconst channel-default-capacity 16)
(defclass channel-terminal nil
((mutex :initarg :mutex :type mutex)
(condition :initarg :condition :type condition-variable)
(msg-queue :initarg :msg-queue :type ring)
(closed-p :initform nil)
(other-terminal :type (or null channel-terminal))))
(defclass channel-source (channel-terminal) nil)
(defclass channel-sink (channel-terminal) nil)
(define-error 'channel-closed
"Trying to send/recv from a closed channel")
(defun make-channel (&optional capacity)
(unless capacity
(setq capacity channel-default-capacity))
(cl-check-type capacity (integer 1 *))
(let* ((mutex (make-mutex "channel"))
(condition (make-condition-variable mutex "channel"))
(msg-queue (make-ring capacity))
(source
(channel-source :mutex mutex
:condition condition
:msg-queue msg-queue))
(sink
(channel-sink :mutex mutex
:condition condition
:msg-queue msg-queue)))
(oset source other-terminal sink)
(oset sink other-terminal source)
(cons source sink)))
(cl-defgeneric channel-send ((source channel-source) message)
(with-mutex (oref source mutex)
(with-slots (condition msg-queue) source
(while (and (not (channel-closed-p source))
(= (ring-size msg-queue)
(ring-length msg-queue)))
(thread-utils-condition-wait condition))
(when (channel-closed-p source)
(signal 'channel-closed (list source)))
(let ((inhibit-quit t))
(ring-insert msg-queue message)
(when (= 1 (ring-length msg-queue))
(condition-notify condition t)))
nil)))
(cl-defgeneric channel-recv ((sink channel-terminal))
(with-mutex (oref sink mutex)
(with-slots (condition msg-queue) sink
(while (and (not (channel-closed-p sink))
(ring-empty-p msg-queue))
(thread-utils-condition-wait condition))
(when (channel-closed-p sink)
(signal 'channel-closed (list sink)))
(let ((inhibit-quit t))
(prog1 (ring-remove msg-queue)
(when (= 1 (- (ring-size msg-queue)
(ring-length msg-queue)))
(condition-notify condition t)))))))
(cl-defgeneric channel-peek ((sink channel-terminal))
(with-mutex (oref sink mutex)
(with-slots (condition msg-queue) sink
(while (and (not (channel-closed-p sink))
(ring-empty-p msg-queue))
(thread-utils-condition-wait condition))
(when (channel-closed-p sink)
(signal 'channel-closed (list sink)))
(ring-ref msg-queue -1))))
(cl-defgeneric channel-close ((terminal channel-terminal))
(with-mutex (oref terminal mutex)
(with-slots (closed-p condition) terminal
(setq closed-p t)
(condition-notify condition t))
nil))
(cl-defmethod channel-closed-p ((source channel-source))
(with-mutex (oref source mutex)
(with-slots (closed-p other-terminal) source
(or closed-p
(oref other-terminal closed-p)))))
(cl-defmethod channel-closed-p ((sink channel-sink))
(with-mutex (oref sink mutex)
(with-slots (closed-p other-terminal msg-queue) sink
(or closed-p
(and (oref other-terminal closed-p)
(ring-empty-p msg-queue))))))
(defclass future nil
((channel :initform (make-channel 1))))
(defun make-future ()
(make-instance 'future))
(cl-defgeneric future-set ((future future) value)
(with-slots (channel) future
(let ((inhibit-quit t))
(condition-case nil
(progn
(debug "Sending future")
(channel-send (car channel) value)
(debug "Future send"))
(channel-closed
(signal 'error (list future))))
(debug "Closing future")
(channel-close (car channel))
(debug "Future closed"))))
(cl-defgeneric future-get ((future future))
(with-slots (channel) future
(debug "Getting future")
(channel-peek (cdr channel))
(debug "Future got")))
(defclass future-deferred (future)
((producer :initarg :producer :type function)
(value-produced-p :initform nil)
(mutex :initform (make-mutex "future-deferred"))))
(defun make-deferred-future (producer)
(make-instance 'future-deferred :producer producer))
(cl-defmethod future-get :before ((future future-deferred))
(with-slots (mutex value-produced-p producer) future
(with-mutex mutex
(unless value-produced-p
(unwind-protect
(make-thread
(lambda nil
(debug "Setting Future")
(future-set future (funcall producer))
(debug "Future set"))
"XEmacs")
(setq value-produced-p t))))))
(let ((future (make-deferred-future (lambda nil 42))))
(future-get future))
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-11 20:51 bug#36609: 27.0.50; Possible race-condition in threading implementation Andreas Politz
@ 2019-07-12 7:47 ` Eli Zaretskii
2019-07-12 8:05 ` Eli Zaretskii
2019-07-12 9:02 ` Pip Cet
` (2 subsequent siblings)
3 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 7:47 UTC (permalink / raw)
To: Andreas Politz; +Cc: 36609
> From: Andreas Politz <politza@hochschule-trier.de>
> Date: Thu, 11 Jul 2019 22:51:10 +0200
>
> I think there is a race-condition in the implementation of threads.
Not sure what you mean by "race condition". Since only one Lisp
thread can run at any given time, where could this race happen, and
between what threads?
> I tried to find a minimal test-case, without success. Thus, I've
> attached a lengthy source-file. Loading that file should trigger
> this bug and may freeze your session.
FWIW, it doesn't freeze my Emacs here, neither on GNU/Linux nor on
MS-Windows (with today's master). Are you getting freezes with 100%
reproducibility? If so, please show all the details of your build, as
collected by report-emacs-bug.
> Indications:
>
> 1. The main-thread has the name of one of created threads (XEmacs in
> this case), instead of "emacs".
I don't think this is related to the problem. I think we have a bug
in the implementation of the 'name' attribute of threads: we call
prctl(PR_SET_NAME), which AFAIU changes the name of the _calling_
thread, and the calling thread at that point is the main thread, see
sys_thread_create. If I evaluate this:
(make-thread 'ignore "XEmacs")
and then attach a debugger, I see that the name of the main thread has
changed to "XEmacs".
> 2. Emacs stops processing all keyboard/mouse input while looping in
> wait_reading_process_output. Sending commands via emacsclient still
> works.
>
> GDB output:
>
> (gdb) info threads
> Id Target Id Frame
> * 1 Thread 0x7ffff17f5d40 (LWP 26264) "XEmacs" 0x000055555576eac0 in XPNTR (a=XIL(0x7ffff1312533)) at alloc.c:535
> 2 Thread 0x7ffff0ac4700 (LWP 26265) "gmain" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6
> 3 Thread 0x7fffebd1a700 (LWP 26266) "gdbus" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6
> 4 Thread 0x7fffeb519700 (LWP 26267) "dconf worker" 0x00007ffff50d1667 in poll () from /usr/lib/libc.so.6
>
> (gdb) bt full
This "bt full" is not enough, because it shows the backtrace of one
thread only, the main thread. Please show the backtrace of the other
3 threads by typing "thread apply all bt full" instead.
> #5 0x0000555555802a78 in wait_reading_process_output (time_limit=0, nsecs=0, read_kbd=-1, do_display=true, wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at process.c:5423
> process_skipped = false
> channel = 6
> nfds = 1
> Available = {
> fds_bits = {32, 0 <repeats 15 times>}
> }
Is the keyboard descriptor's bit in Available set or not? What is the
contents of the fd_callback_info array at this point?
> ;; -*- lexical-binding: t -*-
>
> (require 'threads)
> (require 'eieio)
> (require 'cl-lib)
> (require 'ring)
>
> (defun debug (fmt &rest args)
> (princ (apply #'format fmt args) #'external-debugging-output)
> (terpri #'external-debugging-output))
Please describe what this program tries to accomplish, and how. It's
not easy to reverse-engineer that from 200 lines of non-trivial code.
It's possible that the reason(s) for the freeze will be apparent from
describing your implementation ideas.
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 7:47 ` Eli Zaretskii
@ 2019-07-12 8:05 ` Eli Zaretskii
0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 8:05 UTC (permalink / raw)
To: politza; +Cc: 36609
> Date: Fri, 12 Jul 2019 10:47:37 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org
>
> Please describe what this program tries to accomplish, and how. It's
> not easy to reverse-engineer that from 200 lines of non-trivial code.
> It's possible that the reason(s) for the freeze will be apparent from
> describing your implementation ideas.
Oh, and one more question: does the value of the global variable
last_thread_error tell something interesting at that point?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-11 20:51 bug#36609: 27.0.50; Possible race-condition in threading implementation Andreas Politz
2019-07-12 7:47 ` Eli Zaretskii
@ 2019-07-12 9:02 ` Pip Cet
2019-07-12 12:42 ` Eli Zaretskii
2019-07-12 12:44 ` Pip Cet
2021-06-06 15:50 ` dick.r.chiang
[not found] ` <87fsxv8182.fsf@dick>
3 siblings, 2 replies; 53+ messages in thread
From: Pip Cet @ 2019-07-12 9:02 UTC (permalink / raw)
To: Andreas Politz; +Cc: 36609
[-- Attachment #1: Type: text/plain, Size: 666 bytes --]
On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz
<politza@hochschule-trier.de> wrote:
> I think there is a race-condition in the implementation of threads. I
> tried to find a minimal test-case, without success. Thus, I've attached
> a lengthy source-file. Loading that file should trigger this bug and
> may freeze your session.
It does here, so I can provide further debugging information if
needed. On first glance, it appears that xgselect returns abnormally
with g_main_context acquired in one thread, and then other threads
fail to acquire it and loop endlessly.
Can you confirm that this very dirty patch avoids (it's not a fix by
any means!) the problem?
[-- Attachment #2: Avoid-glib-issue.patch --]
[-- Type: text/x-patch, Size: 1349 bytes --]
diff --git a/src/xgselect.c b/src/xgselect.c
index 9982a1f0e9..9ffeb6c66c 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -58,8 +58,16 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
int i, nfds, tmo_in_millisec, must_free = 0;
bool need_to_dispatch;
+ ptrdiff_t count = SPECPDL_INDEX ();
context = g_main_context_default ();
- context_acquired = g_main_context_acquire (context);
+ context_acquired = true;
+
+ auto void release_g_main_context (void)
+ {
+ }
+
+ if (context_acquired)
+ record_unwind_protect_void (release_g_main_context);
/* FIXME: If we couldn't acquire the context, we just silently proceed
because this function handles more than just glib file descriptors.
Note that, as implemented, this failure is completely silent: there is
@@ -164,9 +172,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
errno = pselect_errno;
}
- if (context_acquired)
- g_main_context_release (context);
-
/* To not have to recalculate timeout, return like this. */
if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0))
{
@@ -174,6 +179,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
errno = EINTR;
}
- return retval;
+ return unbind_to (count, Qnil), retval;
}
#endif /* HAVE_GLIB */
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 9:02 ` Pip Cet
@ 2019-07-12 12:42 ` Eli Zaretskii
2019-07-12 12:57 ` Pip Cet
2019-07-12 12:44 ` Pip Cet
1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 12:42 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 09:02:22 +0000
> Cc: 36609@debbugs.gnu.org
>
> On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz
> <politza@hochschule-trier.de> wrote:
> > I think there is a race-condition in the implementation of threads. I
> > tried to find a minimal test-case, without success. Thus, I've attached
> > a lengthy source-file. Loading that file should trigger this bug and
> > may freeze your session.
>
> It does here, so I can provide further debugging information if
> needed.
Thanks, can you provide the info I asked for?
> On first glance, it appears that xgselect returns abnormally with
> g_main_context acquired in one thread, and then other threads fail
> to acquire it and loop endlessly.
If you can describe what causes this to happen, I think we might be
half-way to a solution.
> + ptrdiff_t count = SPECPDL_INDEX ();
I don't think we should do that at this low level.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 9:02 ` Pip Cet
2019-07-12 12:42 ` Eli Zaretskii
@ 2019-07-12 12:44 ` Pip Cet
2019-07-12 12:55 ` Eli Zaretskii
1 sibling, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 12:44 UTC (permalink / raw)
To: Andreas Politz; +Cc: 36609
[-- Attachment #1: Type: text/plain, Size: 931 bytes --]
On Fri, Jul 12, 2019 at 9:02 AM Pip Cet <pipcet@gmail.com> wrote:
>
> On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz
> <politza@hochschule-trier.de> wrote:
> > I think there is a race-condition in the implementation of threads. I
> > tried to find a minimal test-case, without success. Thus, I've attached
> > a lengthy source-file. Loading that file should trigger this bug and
> > may freeze your session.
>
> It does here, so I can provide further debugging information if
> needed. On first glance, it appears that xgselect returns abnormally
> with g_main_context acquired in one thread, and then other threads
> fail to acquire it and loop endlessly.
Okay, I'm still not sure this is really the problem Andreas was
seeing, but this code fails to work with xg_select:
(let ((thread (make-thread (lambda ()
(sleep-for 3)))))
(thread-yield)
(thread-signal thread 'error nil))
Proposed patch attached.
[-- Attachment #2: 0001-Protect-against-abnormal-exit-in-xg_select.patch --]
[-- Type: text/x-patch, Size: 2239 bytes --]
From 012341a8a674b130c335cfdd8d24888406e98ece Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet@gmail.com>
Date: Fri, 12 Jul 2019 12:39:29 +0000
Subject: [PATCH] Protect against abnormal exit in `xg_select'
* src/xgselect.c (release_g_main_context): New function.
(xg_select): Unwind-protect release of the GLib main context.
---
src/xgselect.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/src/xgselect.c b/src/xgselect.c
index 9982a1f0e9..e35310be34 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -29,6 +29,15 @@ Copyright (C) 2009-2019 Free Software Foundation, Inc.
#include "blockinput.h"
#include "systime.h"
+/* Helper function for `unwind_protect'. */
+
+static void release_g_main_context (void *ptr)
+{
+ GMainContext *context = (GMainContext *) ptr;
+
+ g_main_context_release (context);
+}
+
/* `xg_select' is a `pselect' replacement. Why do we need a separate function?
1. Timeouts. Glib and Gtk rely on timer events. If we did pselect
with a greater timeout then the one scheduled by Glib, we would
@@ -58,8 +67,12 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
int i, nfds, tmo_in_millisec, must_free = 0;
bool need_to_dispatch;
+ ptrdiff_t count = SPECPDL_INDEX ();
context = g_main_context_default ();
context_acquired = g_main_context_acquire (context);
+
+ if (context_acquired)
+ record_unwind_protect_ptr (release_g_main_context, (void *)context);
/* FIXME: If we couldn't acquire the context, we just silently proceed
because this function handles more than just glib file descriptors.
Note that, as implemented, this failure is completely silent: there is
@@ -164,9 +177,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
errno = pselect_errno;
}
- if (context_acquired)
- g_main_context_release (context);
-
/* To not have to recalculate timeout, return like this. */
if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0))
{
@@ -174,6 +184,8 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
errno = EINTR;
}
+ unbind_to (count, Qnil);
+
return retval;
}
#endif /* HAVE_GLIB */
--
2.20.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 12:44 ` Pip Cet
@ 2019-07-12 12:55 ` Eli Zaretskii
2019-07-12 13:40 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 12:55 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 12:44:35 +0000
> Cc: 36609@debbugs.gnu.org
>
> Okay, I'm still not sure this is really the problem Andreas was
> seeing, but this code fails to work with xg_select:
>
> (let ((thread (make-thread (lambda ()
> (sleep-for 3)))))
> (thread-yield)
> (thread-signal thread 'error nil))
What do you mean by "fails"? What do you see when you eval this?
> Proposed patch attached.
As I said earlier, I don't think it's right to use stack unwinding at
this level. (Did you try this in a -nw session?)
I'd appreciate a description of what you think happens here, and why.
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 12:42 ` Eli Zaretskii
@ 2019-07-12 12:57 ` Pip Cet
2019-07-12 13:31 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 12:57 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
On Fri, Jul 12, 2019 at 12:42 PM Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Pip Cet <pipcet@gmail.com>
> > Date: Fri, 12 Jul 2019 09:02:22 +0000
> > Cc: 36609@debbugs.gnu.org
> >
> > On Thu, Jul 11, 2019 at 8:52 PM Andreas Politz
> > <politza@hochschule-trier.de> wrote:
> > > I think there is a race-condition in the implementation of threads. I
> > > tried to find a minimal test-case, without success. Thus, I've attached
> > > a lengthy source-file. Loading that file should trigger this bug and
> > > may freeze your session.
> >
> > It does here, so I can provide further debugging information if
> > needed.
>
> Thanks, can you provide the info I asked for?
Yes, albeit not right now.
> > On first glance, it appears that xgselect returns abnormally with
> > g_main_context acquired in one thread, and then other threads fail
> > to acquire it and loop endlessly.
>
> If you can describe what causes this to happen, I think we might be
> half-way to a solution.
Here's the backtrace of the abnormal exit I see with the patch attached:
(gdb) bt full
#0 0x00000000006bf987 in release_g_main_context (ptr=0xc1d070) at xgselect.c:36
context = 0x7fffedf79710
#1 0x0000000000616f03 in do_one_unbind
(this_binding=0x7fffedf79770, unwinding=true, bindflag=SET_INTERNAL_UNBIND)
at eval.c:3446
#2 0x0000000000617245 in unbind_to (count=0, value=XIL(0)) at eval.c:3567
this_binding = {
kind = SPECPDL_UNWIND_PTR,
unwind = {
kind = SPECPDL_UNWIND_PTR,
func = 0x6bf97b <release_g_main_context>,
arg = XIL(0xc1d070),
eval_depth = 0
},
unwind_array = {
kind = SPECPDL_UNWIND_PTR,
nelts = 7076219,
array = 0xc1d070
},
unwind_ptr = {
kind = SPECPDL_UNWIND_PTR,
func = 0x6bf97b <release_g_main_context>,
arg = 0xc1d070
},
unwind_int = {
kind = SPECPDL_UNWIND_PTR,
func = 0x6bf97b <release_g_main_context>,
arg = 12701808
},
unwind_excursion = {
kind = SPECPDL_UNWIND_PTR,
marker = XIL(0x6bf97b),
window = XIL(0xc1d070)
},
unwind_void = {
kind = SPECPDL_UNWIND_PTR,
func = 0x6bf97b <release_g_main_context>
},
let = {
kind = SPECPDL_UNWIND_PTR,
symbol = XIL(0x6bf97b),
old_value = XIL(0xc1d070),
where = XIL(0),
saved_value = XIL(0xef26a0)
},
bt = {
kind = SPECPDL_UNWIND_PTR,
debug_on_exit = false,
function = XIL(0x6bf97b),
args = 0xc1d070,
nargs = 0
}
}
quitf = XIL(0)
#3 0x00000000006116df in unwind_to_catch
(catch=0x7fffd8000c50, type=NONLOCAL_EXIT_SIGNAL, value=XIL(0x14d3653))
at eval.c:1162
last_time = false
#4 0x00000000006126d9 in signal_or_quit
(error_symbol=XIL(0x90), data=XIL(0), keyboard_quit=false) at eval.c:1674
unwind_data = XIL(0x14d3653)
conditions = XIL(0x7ffff05d676b)
string = XIL(0x5f5e77)
real_error_symbol = XIL(0x90)
clause = XIL(0x30)
h = 0x7fffd8000c50
#5 0x00000000006122e9 in Fsignal (error_symbol=XIL(0x90), data=XIL(0))
at eval.c:1564
#6 0x0000000000698901 in post_acquire_global_lock (self=0xe09db0) at
thread.c:115
sym = XIL(0x90)
data = XIL(0)
prev_thread = 0xa745c0 <main_thread>
#7 0x000000000069892b in acquire_global_lock (self=0xe09db0) at thread.c:123
#8 0x0000000000699303 in really_call_select (arg=0x7fffedf79a70) at
thread.c:596
sa = 0x7fffedf79a70
self = 0xe09db0
oldset = {
__val = {0, 0, 7, 0, 80, 140736817269952, 2031, 2080,
18446744073709550952, 32, 343597383808, 4, 0, 472446402655,
511101108348, 0}
}
#9 0x00000000005e5ee0 in flush_stack_call_func
(func=0x699239 <really_call_select>, arg=0x7fffedf79a70) at alloc.c:4969
end = 0x7fffedf79a30
self = 0xe09db0
sentry = {
o = {
__max_align_ll = 0,
__max_align_ld = <invalid float value>
}
}
#10 0x0000000000699389 in thread_select
(func=0x419320 <pselect@plt>, max_fds=9, rfds=0x7fffedf79fa0,
wfds=0x7fffedf79f20, efds=0x0, timeout=0x7fffedf7a260, sigmask=0x0) at
thread.c:616
sa = {
func = 0x419320 <pselect@plt>,
max_fds = 9,
rfds = 0x7fffedf79fa0,
wfds = 0x7fffedf79f20,
efds = 0x0,
timeout = 0x7fffedf7a260,
sigmask = 0x0,
result = 1
}
#11 0x00000000006bfef5 in xg_select
(fds_lim=9, rfds=0x7fffedf7a300, wfds=0x7fffedf7a280, efds=0x0,
timeout=0x7fffedf7a260, sigmask=0x0) at xgselect.c:130
all_rfds = {
fds_bits = {8, 0 <repeats 15 times>}
}
all_wfds = {
fds_bits = {0 <repeats 16 times>}
}
tmo = {
tv_sec = 0,
tv_nsec = 0
}
tmop = 0x7fffedf7a260
context = 0xc1d070
have_wfds = true
gfds_buf = {{
fd = 5,
events = 1,
revents = 0
}, {
fd = 6,
events = 1,
revents = 0
}, {
fd = 8,
events = 1,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
} <repeats 125 times>}
gfds = 0x7fffedf79b10
gfds_size = 128
n_gfds = 3
retval = 0
our_fds = 0
max_fds = 8
context_acquired = true
i = 3
nfds = 0
tmo_in_millisec = -1
must_free = 0
need_to_dispatch = false
count = 3
#12 0x000000000066b757 in wait_reading_process_output
(time_limit=3, nsecs=0, read_kbd=0, do_display=false,
wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at
process.c:5423
process_skipped = false
channel = 0
nfds = 0
Available = {
fds_bits = {8, 0 <repeats 15 times>}
}
Writeok = {
fds_bits = {0 <repeats 16 times>}
}
check_write = true
check_delay = 0
no_avail = false
xerrno = 0
proc = XIL(0x7fffedf7a440)
timeout = {
tv_sec = 3,
tv_nsec = 0
}
end_time = {
tv_sec = 1562935633,
tv_nsec = 911868453
}
timer_delay = {
tv_sec = 0,
tv_nsec = -1
}
got_output_end_time = {
tv_sec = 0,
tv_nsec = -1
}
wait = TIMEOUT
got_some_output = -1
prev_wait_proc_nbytes_read = 0
retry_for_async = false
count = 2
now = {
tv_sec = 0,
tv_nsec = -1
}
#13 0x0000000000429bf6 in Fsleep_for (seconds=make_fixnum(3),
milliseconds=XIL(0))
at dispnew.c:5825
t = {
tv_sec = 3,
tv_nsec = 0
}
tend = {
tv_sec = 1562935633,
tv_nsec = 911868112
}
duration = 3
#14 0x0000000000613e99 in eval_sub (form=XIL(0xf6df73)) at eval.c:2273
i = 2
maxargs = 2
args_left = XIL(0)
numargs = 1
original_fun = XIL(0x7fffefa9fb98)
original_args = XIL(0xf6df83)
count = 1
fun = XIL(0xa756a5)
val = XIL(0)
funcar = make_fixnum(35184372085343)
argvals =
{make_fixnum(3), XIL(0), XIL(0), XIL(0), XIL(0), XIL(0),
XIL(0), XIL(0)}
#15 0x0000000000610032 in Fprogn (body=XIL(0)) at eval.c:462
form = XIL(0xf6df73)
val = XIL(0)
#16 0x0000000000616102 in funcall_lambda
(fun=XIL(0xf6da43), nargs=0, arg_vector=0xe09dd8) at eval.c:3065
val = XIL(0xc0)
syms_left = XIL(0)
next = XIL(0x3400000013)
lexenv = XIL(0)
count = 1
i = 0
optional = false
rest = false
#17 0x0000000000615542 in Ffuncall (nargs=1, args=0xe09dd0) at eval.c:2813
fun = XIL(0xf6da43)
original_fun = XIL(0xf6da43)
funcar = XIL(0xc0)
numargs = 0
val = XIL(0xaf72e0)
count = 0
#18 0x000000000069956f in invoke_thread_function () at thread.c:702
count = 0
#19 0x0000000000611d61 in internal_condition_case
(bfun=0x69953e <invoke_thread_function>, handlers=XIL(0x30),
hfun=0x699596 <record_thread_error>) at eval.c:1351
val = make_fixnum(1405386)
c = 0x7fffd8000c50
#20 0x0000000000699697 in run_thread (state=0xe09db0) at thread.c:741
stack_pos = {
__max_align_ll = 0,
__max_align_ld = 0
}
self = 0xe09db0
iter = 0x0
c = 0x7fffd8000b20
#21 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>)
at pthread_create.c:486
ret = <optimized out>
pd = <optimized out>
now = <optimized out>
unwind_buf = {
cancel_jmp_buf = {{
jmp_buf = {140737185822464, -1249422724209328276,
140737488341374, 140737488341375, 140737185822464, 0,
1249453444682727276, 1249398985402204012},
mask_was_saved = 0
}},
priv = {
pad = {0x0, 0x0, 0x0, 0x0},
data = {
prev = 0x0,
cleanup = 0x0,
canceltype = 0
}
}
}
not_first_call = <optimized out>
#22 0x00007ffff49724cf in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Lisp Backtrace:
"sleep-for" (0xedf7a530)
0xf6da40 Lisp type 3
post_acquire_global_lock () can return abnormally (I didn't know
that), so really_call_select() can, too, so thread_select() can, too.
> > + ptrdiff_t count = SPECPDL_INDEX ();
>
> I don't think we should do that at this low level.
You're right, it does stick out. I think we're safe because we're
calling Fsignal with the global lock held, but it's not a pretty or
well-documented situation.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 12:57 ` Pip Cet
@ 2019-07-12 13:31 ` Eli Zaretskii
2019-07-12 13:51 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 13:31 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 12:57:51 +0000
> Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
>
> Lisp Backtrace:
> "sleep-for" (0xedf7a530)
> 0xf6da40 Lisp type 3
>
> post_acquire_global_lock () can return abnormally (I didn't know
> that), so really_call_select() can, too, so thread_select() can, too.
Do you know which code sets current_thread->error_symbol, and what is
that error symbol?
> > > + ptrdiff_t count = SPECPDL_INDEX ();
> >
> > I don't think we should do that at this low level.
>
> You're right, it does stick out. I think we're safe because we're
> calling Fsignal with the global lock held, but it's not a pretty or
> well-documented situation.
Is this the main thread which does that? if so, there should be no
problem with holding the global lock in this case, is there?
If this is not the main thread, then the error handlers should be set
so as to log the error in last_thread_error, and then terminate the
thread (via exiting the thread function, AFAIR).
If this is not what happens, I'd like to know what does happen here,
and why.
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 12:55 ` Eli Zaretskii
@ 2019-07-12 13:40 ` Pip Cet
2019-07-12 13:51 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 13:40 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
On Fri, Jul 12, 2019 at 12:55 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > From: Pip Cet <pipcet@gmail.com>
> > Date: Fri, 12 Jul 2019 12:44:35 +0000
> > Cc: 36609@debbugs.gnu.org
> >
> > Okay, I'm still not sure this is really the problem Andreas was
> > seeing, but this code fails to work with xg_select:
> >
> > (let ((thread (make-thread (lambda ()
> > (sleep-for 3)))))
> > (thread-yield)
> > (thread-signal thread 'error nil))
>
> What do you mean by "fails"? What do you see when you eval this?
With `emacs -Q', a blinking cursor and an unresponsive emacs (no
response to key presses, not even triple C-g). Sometimes, at least.
Here's a full backtrace of a session where the hang did happen:
Thread 4 (Thread 0x7fffeeb8a700 (LWP 26117)):
#0 0x00007ffff4967819 in __GI___poll (fds=0xd93070, nfds=1, timeout=-1)
at ../sysdeps/unix/sysv/linux/poll.c:29
resultvar = 18446744073709551100
sc_cancel_oldtype = 0
#1 0x00007ffff698e136 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007ffff698e25c in g_main_context_iteration ()
at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007fffeebd7ffd in ()
at /usr/lib/x86_64-linux-gnu/gio/modules/libdconfsettings.so
#4 0x00007ffff69b6415 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>)
at pthread_create.c:486
ret = <optimized out>
pd = <optimized out>
now = <optimized out>
unwind_buf = {
cancel_jmp_buf = {{
jmp_buf = {140737198466816, -2332898995959237690,
140737488337566, 140737488337567, 140737198466816, 13933232,
2332932536188793798, 2332875456844002246},
mask_was_saved = 0
}},
priv = {
pad = {0x0, 0x0, 0x0, 0x0},
data = {
prev = 0x0,
cleanup = 0x0,
canceltype = 0
}
}
}
not_first_call = <optimized out>
#6 0x00007ffff49724cf in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 3 (Thread 0x7fffef59e700 (LWP 26116)):
#0 0x00007ffff4967819 in __GI___poll (fds=0xcb7a20, nfds=2, timeout=-1)
at ../sysdeps/unix/sysv/linux/poll.c:29
resultvar = 18446744073709551100
sc_cancel_oldtype = 0
#1 0x00007ffff698e136 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007ffff698e4c2 in g_main_loop_run ()
at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007ffff6bba0d6 in () at /lib/x86_64-linux-gnu/libgio-2.0.so.0
#4 0x00007ffff69b6415 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>)
at pthread_create.c:486
ret = <optimized out>
pd = <optimized out>
now = <optimized out>
unwind_buf = {
cancel_jmp_buf = {{
jmp_buf = {140737209034496, -2332898995959237690,
140737488336926, 140737488336927, 140737209034496, 13323152,
2332932821804118982, 2332875456844002246},
mask_was_saved = 0
}},
priv = {
pad = {0x0, 0x0, 0x0, 0x0},
data = {
prev = 0x0,
cleanup = 0x0,
canceltype = 0
}
}
}
not_first_call = <optimized out>
#6 0x00007ffff49724cf in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7ffff017d700 (LWP 26115)):
#0 0x00007ffff4967819 in __GI___poll (fds=0xb400d0, nfds=1, timeout=-1)
at ../sysdeps/unix/sysv/linux/poll.c:29
resultvar = 18446744073709551100
sc_cancel_oldtype = 0
#1 0x00007ffff698e136 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007ffff698e25c in g_main_context_iteration ()
at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007ffff698e2a1 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4 0x00007ffff69b6415 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007ffff4b38fa3 in start_thread (arg=<optimized out>)
at pthread_create.c:486
ret = <optimized out>
pd = <optimized out>
now = <optimized out>
unwind_buf = {
cancel_jmp_buf = {{
jmp_buf = {140737221482240, -2332898995959237690,
140737488347710, 140737488347711, 140737221482240, 0,
2332865253915489222, 2332875456844002246},
mask_was_saved = 0
}},
priv = {
pad = {0x0, 0x0, 0x0, 0x0},
data = {
prev = 0x0,
cleanup = 0x0,
canceltype = 0
}
}
}
not_first_call = <optimized out>
#6 0x00007ffff49724cf in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7ffff0e30e00 (LWP 26108)):
#0 0x00007ffff4b404c6 in pthread_sigmask
(how=2, newmask=<optimized out>, oldmask=0x0)
at ../sysdeps/unix/sysv/linux/pthread_sigmask.c:48
local_newmask = {
__val = {7, 17697270867008781056, 11498208, 5790296, 0,
140737488342800, 2, 0, 0, 0, 0, 0, 0, 0, 0, 6922490}
}
result = 0
#1 0x0000000000585a7d in restore_signal_mask (oldset=0x7fffffffcf10)
at sysdep.c:876
#2 0x0000000000699292 in really_call_select (arg=0x7fffffffd060) at
thread.c:584
sa = 0x7fffffffd060
self = 0xa745c0 <main_thread>
oldset = {
__val = {0, 5710159, 8192, 126116988112, 140737488344608,
140737298839344, 7, 0, 11381952, 0, 8192, 1562936528, 140737488342720,
5968906924941129, 0, 16538645}
}
#3 0x00000000005e5ee0 in flush_stack_call_func
(func=0x699239 <really_call_select>, arg=0x7fffffffd060) at alloc.c:4969
end = 0x7fffffffd020
self = 0xa745c0 <main_thread>
sentry = {
o = {
__max_align_ll = 0,
__max_align_ld = <invalid float value>
}
}
#4 0x0000000000699389 in thread_select
(func=0x419320 <pselect@plt>, max_fds=6, rfds=0x7fffffffd590,
wfds=0x7fffffffd510, efds=0x0, timeout=0x7fffffffd840, sigmask=0x0) at
thread.c:616
sa = {
func = 0x419320 <pselect@plt>,
max_fds = 6,
rfds = 0x7fffffffd590,
wfds = 0x7fffffffd510,
efds = 0x0,
timeout = 0x7fffffffd840,
sigmask = 0x0,
result = -157757095
}
#5 0x00000000006bfeac in xg_select
(fds_lim=6, rfds=0x7fffffffd8e0, wfds=0x7fffffffd860, efds=0x0,
timeout=0x7fffffffd840, sigmask=0x0) at xgselect.c:117
all_rfds = {
fds_bits = {32, 0 <repeats 15 times>}
}
all_wfds = {
fds_bits = {0 <repeats 16 times>}
}
tmo = {
tv_sec = 5621546,
tv_nsec = 15785445
}
tmop = 0x7fffffffd840
context = 0xc1c200
have_wfds = true
gfds_buf = {{
fd = 895,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 8096,
events = 65535,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 16,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
} <repeats 11 times>, {
fd = -2057698731,
events = 39706,
revents = 28662
}, {
fd = 3,
events = 0,
revents = 0
}, {
fd = -2057698731,
events = 39706,
revents = 28662
}, {
fd = -161788823,
events = 32767,
revents = 0
}, {
fd = -2057698731,
events = 39642,
revents = 28662
}, {
fd = -190624704,
events = 32767,
revents = 0
}, {
fd = 16,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -664,
events = 65535,
revents = 65535
}, {
fd = -504530176,
events = 22038,
revents = 62873
}, {
fd = 12553216,
events = 0,
revents = 0
}, {
fd = -1252392363,
events = 44399,
revents = 28662
}, {
fd = 32,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 12553216,
events = 0,
revents = 0
}, {
fd = -191458327,
events = 32767,
revents = 0
}, {
fd = -190624704,
events = 32767,
revents = 0
}, {
fd = 139264,
events = 0,
revents = 0
}, {
fd = 17472,
events = 0,
revents = 0
}, {
fd = -191893943,
events = 32767,
revents = 0
}, {
fd = 128,
events = 0,
revents = 0
}, {
fd = -191909562,
events = 32767,
revents = 0
}, {
fd = 48,
events = 0,
revents = 0
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = 7,
events = 0,
revents = 0
}, {
fd = 64,
events = 0,
revents = 0
}, {
fd = 4,
events = 32767,
revents = 0
}, {
fd = 11649056,
events = 0,
revents = 0
}, {
fd = 47,
events = 0,
revents = 0
}, {
fd = 96,
events = 0,
revents = 0
}, {
fd = -664,
events = 65535,
revents = 65535
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = 4,
events = 49,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 91,
events = 110,
revents = 0
}, {
fd = 124,
events = 119,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 17456,
events = 0,
revents = 0
}, {
fd = -190624704,
events = 32767,
revents = 0
}, {
fd = 48,
events = 0,
revents = 0
}, {
fd = 2,
events = 0,
revents = 0
}, {
fd = -664,
events = 65535,
revents = 65535
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -191900310,
events = 32767,
revents = 0
}, {
fd = -664,
events = 65535,
revents = 65535
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 48,
events = 0,
revents = 0
}, {
fd = 187264016,
events = 0,
revents = 0
}, {
fd = 11498208,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -11392,
events = 32767,
revents = 0
}, {
fd = 5621546,
events = 0,
revents = 0
}, {
fd = 187264016,
events = 0,
revents = 0
}, {
fd = -11360,
events = 32767,
revents = 0
}, {
fd = 187280083,
events = 0,
revents = 0
}, {
fd = -11344,
events = 32767,
revents = 0
}, {
fd = 5622263,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 187280083,
events = 0,
revents = 0
}, {
fd = -11280,
events = 32767,
revents = 0
}, {
fd = 6170867,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 5627325,
events = 0,
revents = 0
}, {
fd = 6,
events = 0,
revents = 0
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = 0,
events = 1,
revents = 0
}, {
fd = 11498208,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -11232,
events = 32767,
revents = 0
}, {
fd = 5621546,
events = 0,
revents = 0
}, {
fd = -134398935,
events = 0,
revents = 0
}, {
fd = -11200,
events = 32767,
revents = 0
}, {
fd = 187280099,
events = 0,
revents = 0
}, {
fd = -11184,
events = 32767,
revents = 0
}, {
fd = 5622263,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 187280099,
events = 0,
revents = 0
}, {
fd = -11120,
events = 32767,
revents = 0
}, {
fd = 6170867,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 5627325,
events = 0,
revents = 0
}, {
fd = 6,
events = 0,
revents = 0
}, {
fd = 1,
events = 0,
revents = 0
}, {
fd = 0,
events = 1,
revents = 0
}, {
fd = 11498208,
events = 0,
revents = 0
}, {
fd = 187280099,
events = 0,
revents = 0
}, {
fd = -11072,
events = 32767,
revents = 0
}, {
fd = 5622263,
events = 0,
revents = 0
}, {
fd = -11024,
events = 32767,
revents = 0
}, {
fd = -134398935,
events = 32767,
revents = 0
}, {
fd = -10944,
events = 32767,
revents = 0
}, {
fd = 15785445,
events = 0,
revents = 0
}, {
fd = -11016,
events = 32767,
revents = 0
}, {
fd = 5623097,
events = 0,
revents = 0
}, {
fd = 11498208,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = 0,
events = 0,
revents = 0
}, {
fd = -10992,
events = 32767,
revents = 0
}}
gfds = 0x7fffffffd100
gfds_size = 128
n_gfds = -1
retval = 0
our_fds = 0
max_fds = 5
context_acquired = false
i = 0
nfds = 0
tmo_in_millisec = 0
must_free = 0
need_to_dispatch = false
#6 0x000000000066b757 in wait_reading_process_output
(time_limit=0, nsecs=0, read_kbd=-1, do_display=true,
wait_for_cell=XIL(0), wait_proc=0x0, just_wait_proc=0) at
process.c:5423
process_skipped = false
channel = 6
nfds = 1
Available = {
fds_bits = {32, 0 <repeats 15 times>}
}
Writeok = {
fds_bits = {0 <repeats 16 times>}
}
check_write = true
check_delay = 0
no_avail = false
xerrno = 11
proc = XIL(0xcf7c00)
timeout = {
tv_sec = 100000,
tv_nsec = 0
}
end_time = {
tv_sec = 0,
tv_nsec = 140737488345168
}
timer_delay = {
tv_sec = 0,
tv_nsec = -1
}
got_output_end_time = {
tv_sec = 0,
tv_nsec = -1
}
wait = FOREVER
got_some_output = -1
prev_wait_proc_nbytes_read = 0
retry_for_async = false
count = 4
now = {
tv_sec = 0,
tv_nsec = -1
}
#7 0x000000000056c31e in kbd_buffer_get_event
(kbp=0x7fffffffdbd8, used_mouse_menu=0x7fffffffe1bf, end_time=0x0)
at keyboard.c:3836
do_display = true
obj = XIL(0x14bfd3fb)
#8 0x0000000000568706 in read_event_from_main_queue
(end_time=0x0, local_getcjmp=0x7fffffffdf80, used_mouse_menu=0x7fffffffe1bf)
at keyboard.c:2138
c = XIL(0)
save_jump = {{
__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0},
__mask_was_saved = 0,
__saved_mask = {
__val = {0 <repeats 16 times>}
}
}}
kb = 0x55c9f7 <XSETCDR+28>
count = 3
#9 0x0000000000568970 in read_decoded_event_from_main_queue
(end_time=0x0, local_getcjmp=0x7fffffffdf80, prev_event=XIL(0),
used_mouse_menu=0x7fffffffe1bf) at keyboard.c:2202
nextevt = XIL(0x7fffffffde30)
frame = 0x0
terminal = 0x0
events =
{XIL(0x7fffffffddb0), make_fixnum(1422534), XIL(0xaf72e0),
XIL(0), XIL(0), XIL(0x7fffffffddb0), make_fixnum(1405386),
XIL(0x135b623), XIL(0x7fffffffddf0), make_fixnum(1420782),
XIL(0xaf72e0), XIL(0x100000000), XIL(0), XIL(0x7fffffffddf0),
make_fixnum(1405386), XIL(0)}
n = 0
#10 0x000000000056a002 in read_char
(commandflag=1, map=XIL(0xfbaa33), prev_event=XIL(0),
used_mouse_menu=0x7fffffffe1bf, end_time=0x0) at keyboard.c:2810
c = XIL(0)
jmpcount = 3
local_getcjmp = {{
__jmpbuf = {0, 2332898996579464134, 16538645, 48, 0, 0,
2332898994838827974, -2332899704998988858},
__mask_was_saved = 0,
__saved_mask = {
__val = {140737214405912, 0, 140737225904168, 11498208,
0, 0, 140737488347152, 5621546, 4032516088, 11498208, 0, 0,
140737488347200, 5621546, 12363987, 140737488347296}
}
}}
save_jump = {{
__jmpbuf = {0, 0, 140737224472304, 0, 0, 11529984, 5621546, 0},
__mask_was_saved = -8336,
__saved_mask = {
__val = {6254137, 140737231486936, 12884893472, 0,
31776, 140737224472304, 140737231486936, 5623453, 51539607552,
140737224472309, 140737224472304, 5632971, 11529984, 11529984, 0,
11498208}
}
}}
tem = XIL(0x7fffefa759f0)
save = XIL(0)
previous_echo_area_message = XIL(0)
also_record = XIL(0)
reread = false
recorded = false
polling_stopped_here = true
orig_kboard = 0xc863c0
#11 0x0000000000576842 in read_key_sequence
(keybuf=0x7fffffffe3a0, prompt=XIL(0), dont_downcase_last=false,
can_return_switch_frame=true, fix_current_buffer=true,
prevent_redisplay=false)
at keyboard.c:9124
interrupted_kboard = 0xc863c0
interrupted_frame = 0xf976a0
key = XIL(0xf97ac0)
used_mouse_menu = false
echo_local_start = 0
last_real_key_start = 0
keys_local_start = 0
new_binding = XIL(0xf97ac0)
count = 3
t = 0
echo_start = 0
keys_start = 0
current_binding = XIL(0xfbaa33)
first_unbound = 31
mock_input = 0
used_mouse_menu_history = {false <repeats 30 times>}
fkey = {
parent = XIL(0xbe3cd3),
map = XIL(0xbe3cd3),
start = 0,
end = 0
}
keytran = {
parent = XIL(0x7ffff0ae0eeb),
map = XIL(0x7ffff0ae0eeb),
start = 0,
end = 0
}
indec = {
parent = XIL(0xbe3cc3),
map = XIL(0xbe3cc3),
start = 0,
end = 0
}
shift_translated = false
delayed_switch_frame = XIL(0)
original_uppercase = XIL(0)
original_uppercase_position = -1
dummyflag = false
starting_buffer = 0x7ffff04576f0
fake_prefixed_keys = XIL(0)
first_event = XIL(0)
second_event = XIL(0)
#12 0x00000000005667ae in command_loop_1 () at keyboard.c:1348
cmd = XIL(0x7fffffffe4f0)
keybuf =
{XIL(0x7fffffffe420), XIL(0x7ffff0b08008), XIL(0x100000007),
XIL(0), XIL(0), XIL(0x80a0), XIL(0x11f), XIL(0xaff380), XIL(0xaff380),
XIL(0), XIL(0x7fffffffe440), XIL(0x617021), make_fixnum(1073741824),
XIL(0x7fffffffe460), XIL(0xaf72e0), XIL(0), XIL(0),
XIL(0x7fffffffe440), make_fixnum(1405386), XIL(0xf04576f5),
XIL(0x7fffffffe4a0), XIL(0x61729f), XIL(0xaf72e0), XIL(0), XIL(0),
XIL(0x7fffffffe480), make_fixnum(1405386), XIL(0xf04576f5),
XIL(0x7fffffffe4c0), make_fixnum(1591385)}
i = 0
prev_modiff = 0
prev_buffer = 0x0
already_adjusted = false
#13 0x0000000000611d61 in internal_condition_case
(bfun=0x566384 <command_loop_1>, handlers=XIL(0x90), hfun=0x565b7d
<cmd_error>) at eval.c:1351
val = make_fixnum(1405386)
c = 0xb99e70
#14 0x000000000056607a in command_loop_2 (ignore=XIL(0)) at keyboard.c:1091
val = XIL(0)
#15 0x000000000061161a in internal_catch
(tag=XIL(0xd140), func=0x566051 <command_loop_2>, arg=XIL(0)) at eval.c:1112
val = XIL(0x7fffffffe5c0)
c = 0xb94fd0
#16 0x000000000056601c in command_loop () at keyboard.c:1070
#17 0x0000000000565752 in recursive_edit_1 () at keyboard.c:714
count = 1
val = XIL(0x7fffffffe620)
#18 0x00000000005658d4 in Frecursive_edit () at keyboard.c:786
count = 0
buffer = XIL(0)
#19 0x0000000000563961 in main (argc=4, argv=0x7fffffffe878) at emacs.c:2086
stack_bottom_variable = 0x5e00
do_initial_setlocale = true
no_loadup = false
junk = 0x0
dname_arg = 0x0
ch_to_dir = 0x0
original_pwd = 0x0
dump_mode = 0x0
skip_args = 0
temacs = 0x0
attempt_load_pdump = true
rlim = {
rlim_cur = 10022912,
rlim_max = 18446744073709551615
}
sockfd = -1
i thr
Id Target Id Frame
* 1 Thread 0x7ffff0e30e00 (LWP 26108) "emacs" pthread_sigmask (how=0,
newmask=<optimized out>, oldmask=0x7fffffffcf10)
at ../sysdeps/unix/sysv/linux/pthread_sigmask.c:48
2 Thread 0x7ffff017d700 (LWP 26115) "gmain"
0x00007ffff4967819 in __GI___poll (fds=0xb400d0, nfds=1, timeout=-1)
at ../sysdeps/unix/sysv/linux/poll.c:29
3 Thread 0x7fffef59e700 (LWP 26116) "gdbus"
0x00007ffff4967819 in __GI___poll (fds=0xcb7a20, nfds=2, timeout=-1)
at ../sysdeps/unix/sysv/linux/poll.c:29
4 Thread 0x7fffeeb8a700 (LWP 26117) "dconf worker"
0x00007ffff4967819 in __GI___poll (fds=0xd93070, nfds=1, timeout=-1)
at ../sysdeps/unix/sysv/linux/poll.c:29
> > Proposed patch attached.
>
> As I said earlier, I don't think it's right to use stack unwinding at
> this level. (Did you try this in a -nw session?)
Sorry, I didn't see that email until after I'd sent mine. I did try
with an -nw session, now, but I really think thread_select should not
return abnormally.
> I'd appreciate a description of what you think happens here, and why.
When a thread is signalled (by thread-signal, which sets another
thread's error_symbol) while the signalled thread is inside a
select(), thread_select() will return non-locally for that thread, and
we fail to release an internal GLib lock through
g_main_context_release(). That's the first bug.
When xg_select() fails to acquire the internal GLib lock, it simply
does a select() on the remaining file descriptors:
context_acquired = g_main_context_acquire (context);
/* FIXME: If we couldn't acquire the context, we just silently proceed
because this function handles more than just glib file descriptors.
Note that, as implemented, this failure is completely silent: there is
no feedback to the caller. */
This seems like a second, albeit documented, bug to me. I think we're
risking not waking up from the actual select because another
(non-Emacs) thread happened to hold the main context at the time.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 13:40 ` Pip Cet
@ 2019-07-12 13:51 ` Eli Zaretskii
2019-07-12 14:34 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 13:51 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 13:40:15 +0000
> Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
>
> When a thread is signalled (by thread-signal, which sets another
> thread's error_symbol) while the signalled thread is inside a
> select(), thread_select() will return non-locally for that thread, and
> we fail to release an internal GLib lock through
> g_main_context_release(). That's the first bug.
We should either release the global lock before the thread exits, or
defer the acting upon the signal until later. We cannot disable the
signal handling altogether because it is entirely legitimate to signal
another thread, and when we do, that other thread will _always_ be
inside thread_select.
For the main thread, handling the signal in that situation shouldn't
be a problem, because it is not going to exit. Right?
> When xg_select() fails to acquire the internal GLib lock, it simply
> does a select() on the remaining file descriptors:
Why does it fail to acquire that lock?
> context_acquired = g_main_context_acquire (context);
> /* FIXME: If we couldn't acquire the context, we just silently proceed
> because this function handles more than just glib file descriptors.
> Note that, as implemented, this failure is completely silent: there is
> no feedback to the caller. */
>
> This seems like a second, albeit documented, bug to me. I think we're
> risking not waking up from the actual select because another
> (non-Emacs) thread happened to hold the main context at the time.
So what is the proposal for that? spin waiting for the lock?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 13:31 ` Eli Zaretskii
@ 2019-07-12 13:51 ` Pip Cet
2019-07-12 15:05 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 13:51 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
On Fri, Jul 12, 2019 at 1:31 PM Eli Zaretskii <eliz@gnu.org> wrote:>
> > From: Pip Cet <pipcet@gmail.com>
> > Date: Fri, 12 Jul 2019 12:57:51 +0000
> > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
> >
> > Lisp Backtrace:
> > "sleep-for" (0xedf7a530)
> > 0xf6da40 Lisp type 3
> >
> > post_acquire_global_lock () can return abnormally (I didn't know
> > that), so really_call_select() can, too, so thread_select() can, too.
>
> Do you know which code sets current_thread->error_symbol, and what is
> that error symbol?
thread-signal, which my example code calls directly. I passed 'error,
and that's what error_symbol is set to.
last_thread_error is (error) when the Emacs session freezes.
> > > > + ptrdiff_t count = SPECPDL_INDEX ();
> > >
> > > I don't think we should do that at this low level.
> >
> > You're right, it does stick out. I think we're safe because we're
> > calling Fsignal with the global lock held, but it's not a pretty or
> > well-documented situation.
>
> Is this the main thread which does that? if so, there should be no
> problem with holding the global lock in this case, is there?
>
> If this is not the main thread, then the error handlers should be set
> so as to log the error in last_thread_error, and then terminate the
> thread (via exiting the thread function, AFAIR).
Indeed, that's what happens; but the thread now fails to release the
GLib lock it had previously acquired, so other threads cannot acquire
it again, ever.
> If this is not what happens, I'd like to know what does happen here,
> and why.
Sure, we should understand what's going on here; even if the fix turns
out to be simple.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 13:51 ` Eli Zaretskii
@ 2019-07-12 14:34 ` Pip Cet
2019-07-12 15:02 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 14:34 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
On Fri, Jul 12, 2019 at 1:51 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > From: Pip Cet <pipcet@gmail.com>
> > Date: Fri, 12 Jul 2019 13:40:15 +0000
> > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
> >
> > When a thread is signalled (by thread-signal, which sets another
> > thread's error_symbol) while the signalled thread is inside a
> > select(), thread_select() will return non-locally for that thread, and
> > we fail to release an internal GLib lock through
> > g_main_context_release(). That's the first bug.
>
> We should either release the global lock before the thread exits, or
> defer the acting upon the signal until later. We cannot disable the
> signal handling altogether because it is entirely legitimate to signal
> another thread, and when we do, that other thread will _always_ be
> inside thread_select.
Really? What about thread-yield?
> For the main thread, handling the signal in that situation shouldn't
> be a problem, because it is not going to exit. Right?
I think the main thread can still fail to release the lock...
> > When xg_select() fails to acquire the internal GLib lock, it simply
> > does a select() on the remaining file descriptors:
>
> Why does it fail to acquire that lock?
Because another thread holds it, either an Emacs or a non-Emacs
thread. In both cases, I think we might miss events unless we return
with errno == EINTR.
> > context_acquired = g_main_context_acquire (context)
> > /* FIXME: If we couldn't acquire the context, we just silently proceed
> > because this function handles more than just glib file descriptors.
> > Note that, as implemented, this failure is completely silent: there is
> > no feedback to the caller. */
> >
> > This seems like a second, albeit documented, bug to me. I think we're
> > risking not waking up from the actual select because another
> > (non-Emacs) thread happened to hold the main context at the time.
>
> So what is the proposal for that? spin waiting for the lock?
I don't know, to be honest, but I'm afraid there's currently no better
way. There's a way to take the lock without spinning, but it's been
deprecated.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 14:34 ` Pip Cet
@ 2019-07-12 15:02 ` Eli Zaretskii
2019-07-12 19:30 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 15:02 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 14:34:44 +0000
> Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
>
> > > When a thread is signalled (by thread-signal, which sets another
> > > thread's error_symbol) while the signalled thread is inside a
> > > select(), thread_select() will return non-locally for that thread, and
> > > we fail to release an internal GLib lock through
> > > g_main_context_release(). That's the first bug.
> >
> > We should either release the global lock before the thread exits, or
> > defer the acting upon the signal until later. We cannot disable the
> > signal handling altogether because it is entirely legitimate to signal
> > another thread, and when we do, that other thread will _always_ be
> > inside thread_select.
>
> Really? What about thread-yield?
What about it?
You are asking whether, when thread-signal is executed, the thread
which we are signaling is necessarily parked inside thread_select? If
so, I don't understand your surprise: only one thread can ever be
running, and that is by definition the thread which calls
thread-signal. All the other threads cannot be running, which means
they are parked either in thread_select or in sys_mutex_lock called
from acquire_global_lock. Right?
As for thread-yield, I'm not sure I understand how is it related to
the issue we are discussing.
> > For the main thread, handling the signal in that situation shouldn't
> > be a problem, because it is not going to exit. Right?
>
> I think the main thread can still fail to release the lock...
Since the main thread doesn't exit, but just longjmp's to top-level,
and then continues to run, why does it need to release the lock? Any
thread should hold the lock for as long as it runs.
> > > When xg_select() fails to acquire the internal GLib lock, it simply
> > > does a select() on the remaining file descriptors:
> >
> > Why does it fail to acquire that lock?
>
> Because another thread holds it, either an Emacs or a non-Emacs
> thread.
OK, and why is it a problem that we continue calling thread_select
regardless of the failure to acquire the Glib lock? is the problem
this one:
> In both cases, I think we might miss events unless we return
> with errno == EINTR.
? Or is there something else? If the problem with missing events,
then which events are those, and what bad things will happen if we
miss them?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 13:51 ` Pip Cet
@ 2019-07-12 15:05 ` Eli Zaretskii
2019-07-12 18:06 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 15:05 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 13:51:48 +0000
> Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
>
> > If this is not the main thread, then the error handlers should be set
> > so as to log the error in last_thread_error, and then terminate the
> > thread (via exiting the thread function, AFAIR).
>
> Indeed, that's what happens; but the thread now fails to release the
> GLib lock it had previously acquired, so other threads cannot acquire
> it again, ever.
Then we should make sure we release it when the thread function exits,
or before we call Fsignal from post_acquire_global_lock. There's no
reason to use the unwind-protect machinery for that, I think.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 15:05 ` Eli Zaretskii
@ 2019-07-12 18:06 ` Pip Cet
2019-07-12 18:27 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 18:06 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
On Fri, Jul 12, 2019 at 3:06 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > From: Pip Cet <pipcet@gmail.com>
> > Date: Fri, 12 Jul 2019 13:51:48 +0000
> > Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
> >
> > > If this is not the main thread, then the error handlers should be set
> > > so as to log the error in last_thread_error, and then terminate the
> > > thread (via exiting the thread function, AFAIR).
> >
> > Indeed, that's what happens; but the thread now fails to release the
> > GLib lock it had previously acquired, so other threads cannot acquire
> > it again, ever.
>
> Then we should make sure we release it when the thread function exits,
That's too late, it's legitimate for the thread to have a signal
handler. We need to release the lock at precisely the time that
unbind_to would release it.
> or before we call Fsignal from post_acquire_global_lock. There's no
> reason to use the unwind-protect machinery for that, I think.
If we call Fsignal, surely the unwind-protect machinery is already set
up and we can safely call it? So unbind_to would work again...
I guess I've changed my mind: the unwind-protect machinery does
precisely what we want, we should simply document that thread_select
can exit nonlocally and that the select functions need to be written
to cater to that.
Two places call xg_select, and they both run long after unbind_to works.
Doing this without the unwind-protect machinery is difficult: for
example, a comment in post_acquire_global_lock claims
unbind_for_thread_switch can exit nonlocally, though I don't think
that actually happens or we would have seen the bug report.
What's your proposal for doing this?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 18:06 ` Pip Cet
@ 2019-07-12 18:27 ` Eli Zaretskii
2019-07-12 18:34 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 18:27 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 18:06:29 +0000
> Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
>
> > Then we should make sure we release it when the thread function exits,
>
> That's too late, it's legitimate for the thread to have a signal
> handler. We need to release the lock at precisely the time that
> unbind_to would release it.
I had also another proposal:
> > or before we call Fsignal from post_acquire_global_lock. There's no
> > reason to use the unwind-protect machinery for that, I think.
>
> If we call Fsignal, surely the unwind-protect machinery is already set
> up and we can safely call it? So unbind_to would work again...
Sorry, I don't want to call unwind-protect there. Call me paranoid,
if you want.
> I guess I've changed my mind: the unwind-protect machinery does
> precisely what we want, we should simply document that thread_select
> can exit nonlocally and that the select functions need to be written
> to cater to that.
>
> Two places call xg_select, and they both run long after unbind_to works.
>
> Doing this without the unwind-protect machinery is difficult: for
> example, a comment in post_acquire_global_lock claims
> unbind_for_thread_switch can exit nonlocally, though I don't think
> that actually happens or we would have seen the bug report.
>
> What's your proposal for doing this?
Like I said: release the lock before calling Fsignal. We could do
that before calling unbind_for_thread_switch, if needed.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 18:27 ` Eli Zaretskii
@ 2019-07-12 18:34 ` Eli Zaretskii
2019-07-12 19:24 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 18:34 UTC (permalink / raw)
To: pipcet; +Cc: 36609, politza
> Date: Fri, 12 Jul 2019 21:27:40 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de
>
> Sorry, I don't want to call unwind-protect there. Call me paranoid,
> if you want.
Maybe I should explain the rationale behind that paranoia. The main
reason is that you are proposing to do that inside code that can
switch threads. Switching threads means switching to another stack
and also to another set of handlers. So using the unwind-protect
machinery in this situation is IMO asking for trouble.
And then there's the TTY frame case, where C-g triggers SIGINT, and we
actually longjmp from wherever we were...
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 18:34 ` Eli Zaretskii
@ 2019-07-12 19:24 ` Pip Cet
2019-07-12 19:57 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 19:24 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
[-- Attachment #1: Type: text/plain, Size: 1069 bytes --]
On Fri, Jul 12, 2019 at 6:34 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > Sorry, I don't want to call unwind-protect there. Call me paranoid,
> > if you want.
>
> Maybe I should explain the rationale behind that paranoia. The main
I don't think it's paranoia, I just wasn't sure there was a good way
to avoid it.
I'm now convinced that there simply is no safe way to call select()
from two threads at once when using glib.
I think our options are hacking around it and hoping nothing breaks
(this is what the attached patch does; it releases the main context
glib lock from the wrong thread soon "after" the other thread called
select, but there's actually no way to ensure that "after" is
accurate), or rewriting things so we have a single thread that does
all the select()ing.
> reason is that you are proposing to do that inside code that can
> switch threads. Switching threads means switching to another stack
> and also to another set of handlers. So using the unwind-protect
> machinery in this situation is IMO asking for trouble.
Thanks for explaining.
[-- Attachment #2: glib-hack.diff --]
[-- Type: application/x-patch, Size: 4445 bytes --]
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 15:02 ` Eli Zaretskii
@ 2019-07-12 19:30 ` Pip Cet
2019-07-13 6:50 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-12 19:30 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
> > > We should either release the global lock before the thread exits, or
> > > defer the acting upon the signal until later. We cannot disable the
> > > signal handling altogether because it is entirely legitimate to signal
> > > another thread, and when we do, that other thread will _always_ be
> > > inside thread_select.
> >
> > Really? What about thread-yield?
>
> What about it?
>
> You are asking whether, when thread-signal is executed, the thread
> which we are signaling is necessarily parked inside thread_select? If
> so, I don't understand your surprise: only one thread can ever be
> running, and that is by definition the thread which calls
> thread-signal. All the other threads cannot be running, which means
> they are parked either in thread_select or in sys_mutex_lock called
> from acquire_global_lock. Right?
No, they might also be in the sys_thread_yield syscall, having
released the global lock but not yet reacquired it:
release_global_lock ();
sys_thread_yield (); <<<<< here
acquire_global_lock (self);
> As for thread-yield, I'm not sure I understand how is it related to
> the issue we are discussing.
I'm not sure how it's relevant to assert that "that other thread will
_always_ be inside thread_select". I have an idea where you might be
going with that, but that idea wouldn't work (to release the lock from
the signalling thread, not the signalled thread that holds it).
> If the problem with missing events,
> then which events are those, and what bad things will happen if we
> miss them?
All events that glib knows about but Emacs doesn't. For example, a
glib timeout is apparently used to achieve some kind of scroll effect
on GTK menus, which is why we call xg_select from xmenu.c.
I don't know which libraries use glib-based threads, but I think dbus does, too.
I believe, but am not certain, that this also includes X events when
using GTK. That would explain the frozen sessions.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 19:24 ` Pip Cet
@ 2019-07-12 19:57 ` Eli Zaretskii
2019-07-13 14:37 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-12 19:57 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 19:24:38 +0000
> Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de
>
> I'm now convinced that there simply is no safe way to call select()
> from two threads at once when using glib.
I hope not, although GTK with its idiosyncrasies caused a lot of pain
for the threads implementation in Emacs.
> I think our options are hacking around it and hoping nothing breaks
> (this is what the attached patch does; it releases the main context
> glib lock from the wrong thread soon "after" the other thread called
> select, but there's actually no way to ensure that "after" is
> accurate), or rewriting things so we have a single thread that does
> all the select()ing.
Hmm... how would this work with your patch? Suppose one thread calls
xg_select, acquires the Glib lock, sets its holding_glib_lock flag,
then releases the global Lisp lock and calls pselect. Since the
global Lisp lock is now up for grabs, some other Lisp thread can
acquire it and start running. If that other thread then calls
xg_select, it will hang forever trying to acquire the Glib lock,
because the first thread that holds it is stuck in pselect.
Am I missing something?
I know very little about GTK and the Glib context lock, but AFAIR we
really must treat that lock as a global one, not a thread-local one.
So I think it's okay for one thread to take the Glib lock, and another
to release it, because Glib just wants to know whether the "rest of
the program" has it, it doesn't care which thread is that which holds
the lock.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 19:30 ` Pip Cet
@ 2019-07-13 6:50 ` Eli Zaretskii
0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-13 6:50 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Fri, 12 Jul 2019 19:30:34 +0000
> Cc: politza@hochschule-trier.de, 36609@debbugs.gnu.org
>
> > > > We should either release the global lock before the thread exits, or
> > > > defer the acting upon the signal until later. We cannot disable the
> > > > signal handling altogether because it is entirely legitimate to signal
> > > > another thread, and when we do, that other thread will _always_ be
> > > > inside thread_select.
> > >
> > > Really? What about thread-yield?
> >
> > What about it?
> >
> > You are asking whether, when thread-signal is executed, the thread
> > which we are signaling is necessarily parked inside thread_select? If
> > so, I don't understand your surprise: only one thread can ever be
> > running, and that is by definition the thread which calls
> > thread-signal. All the other threads cannot be running, which means
> > they are parked either in thread_select or in sys_mutex_lock called
> > from acquire_global_lock. Right?
>
> No, they might also be in the sys_thread_yield syscall, having
> released the global lock but not yet reacquired it:
>
> release_global_lock ();
> sys_thread_yield (); <<<<< here
> acquire_global_lock (self);
OK, but that, too, means the thread being signaled is not running,
right? And I still think that a very frequent case, perhaps the most
frequent, is that the thread being signaled is inside thread_select.
> I'm not sure how it's relevant to assert that "that other thread will
> _always_ be inside thread_select".
OK, we've now established that the other thread could also be in
acquire_global_lock or (for a very short time) in sys_thread_yield.
> I have an idea where you might be going with that
I was merely pointing out that we cannot disable the signal handling
as a means to solve the problem.
> but that idea wouldn't work (to release the lock from the signalling
> thread, not the signalled thread that holds it).
Maybe we have a misunderstanding here. I was talking about this part
of post_acquire_global_lock:
/* We could have been signaled while waiting to grab the global lock
for the first time since this thread was created, in which case
we didn't yet have the opportunity to set up the handlers. Delay
raising the signal in that case (it will be actually raised when
the thread comes here after acquiring the lock the next time). */
if (!NILP (current_thread->error_symbol) && handlerlist)
{
Lisp_Object sym = current_thread->error_symbol;
Lisp_Object data = current_thread->error_data;
current_thread->error_symbol = Qnil;
current_thread->error_data = Qnil;
Fsignal (sym, data);
}
In this part, we have already switched to the thread that has been
signaled, so we are in the signaled thread, not in the signaling
thread. I meant to release the lock before the call to Fsignal here.
> > If the problem with missing events,
> > then which events are those, and what bad things will happen if we
> > miss them?
>
> All events that glib knows about but Emacs doesn't. For example, a
> glib timeout is apparently used to achieve some kind of scroll effect
> on GTK menus, which is why we call xg_select from xmenu.c.
>
> I don't know which libraries use glib-based threads, but I think dbus does, too.
>
> I believe, but am not certain, that this also includes X events when
> using GTK. That would explain the frozen sessions.
So is the problem that the Glib context is locked "forever", or is the
problem that it's locked by another Lisp thread, even if this lock is
short-lived? If the former, then arranging for the release of that
lock when the signaled thread exits would solve the problem, right?
And if the problem is the latter one, then why didn't we hear about
this much earlier? Can you show the bad effect from missing these
events without signaling a thread?
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-12 19:57 ` Eli Zaretskii
@ 2019-07-13 14:37 ` Pip Cet
2019-07-13 15:03 ` Eli Zaretskii
2019-07-13 15:04 ` Andreas Politz
0 siblings, 2 replies; 53+ messages in thread
From: Pip Cet @ 2019-07-13 14:37 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
[-- Attachment #1: Type: text/plain, Size: 2523 bytes --]
On Fri, Jul 12, 2019 at 7:57 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > I'm now convinced that there simply is no safe way to call select()
> > from two threads at once when using glib.
>
> I hope not, although GTK with its idiosyncrasies caused a lot of pain
> for the threads implementation in Emacs.
Well, I think we're going to have to do one or more of the following:
- have a race condition
- access glib-locked data from the "wrong" thread (another Emacs thread)
- release the glib lock from the "wrong" thread (another Emacs thread)
Of these, the second is the best alternative, I think: we simply grab
the g_main_context lock globally, acting for all Emacs threads, and
the last thread to require it releases it when it leaves xg_select. As
long as there's at least one thread in the critical section of
xg_select, we hold the lock, but access to the context isn't
necessarily from the thread which locked it.
> > I think our options are hacking around it and hoping nothing breaks
> > (this is what the attached patch does; it releases the main context
> > glib lock from the wrong thread soon "after" the other thread called
> > select, but there's actually no way to ensure that "after" is
> > accurate), or rewriting things so we have a single thread that does
> > all the select()ing.
>
> Hmm... how would this work with your patch? Suppose one thread calls
> xg_select, acquires the Glib lock, sets its holding_glib_lock flag,
> then releases the global Lisp lock and calls pselect. Since the
> global Lisp lock is now up for grabs, some other Lisp thread can
> acquire it and start running.
And when it starts running, it releases the Glib lock.
> If that other thread then calls
> xg_select, it will hang forever trying to acquire the Glib lock,
> because the first thread that holds it is stuck in pselect.
The first thread no longer holds the Glib lock, it was released when
we switched threads.
> I know very little about GTK and the Glib context lock, but AFAIR we
> really must treat that lock as a global one, not a thread-local one.
> So I think it's okay for one thread to take the Glib lock, and another
> to release it, because Glib just wants to know whether the "rest of
> the program" has it, it doesn't care which thread is that which holds
> the lock.
Okay, that sounds like option #2 above. The attached patch exposes
glib externals to the generic code, but it appears to work. If you
think the approach is okay, I'll move the glib-specific parts to
xgselect.c (if that's okay).
[-- Attachment #2: glib-hack-002.diff --]
[-- Type: text/x-patch, Size: 5894 bytes --]
diff --git a/src/thread.c b/src/thread.c
index e2deadd7a8..0ddb79460b 100644
--- a/src/thread.c
+++ b/src/thread.c
@@ -19,6 +19,9 @@ Copyright (C) 2012-2019 Free Software Foundation, Inc.
#include <config.h>
#include <setjmp.h>
+#ifdef HAVE_GLIB
+#include <glib.h>
+#endif
#include "lisp.h"
#include "character.h"
#include "buffer.h"
@@ -82,7 +85,7 @@ post_acquire_global_lock (struct thread_state *self)
/* Do this early on, so that code below could signal errors (e.g.,
unbind_for_thread_switch might) correctly, because we are already
- running in the context of the thread pointed by SELF. */
+ running in the context of the thread pointed to by SELF. */
current_thread = self;
if (prev_thread != current_thread)
@@ -586,6 +589,17 @@ really_call_select (void *arg)
sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds,
sa->timeout, sa->sigmask);
+#ifdef HAVE_GLIB
+ /* Release the Glib lock, if there are no other threads in the
+ critical section. */
+ if (current_thread != NULL && current_thread->holding_glib_lock)
+ {
+ current_thread->holding_glib_lock = false;
+ if (--threads_holding_glib_lock == 0)
+ g_main_context_release (glib_main_context);
+ }
+#endif
+
block_interrupt_signal (&oldset);
/* If we were interrupted by C-g while inside sa->func above, the
signal handler could have called maybe_reacquire_global_lock, in
@@ -756,6 +770,13 @@ run_thread (void *state)
}
}
+#ifdef HAVE_GLIB
+ /* Remember to release the Glib lock we might still be holding
+ (?) */
+ if (current_thread->holding_glib_lock)
+ if (--threads_holding_glib_lock == 0)
+ g_main_context_release (glib_main_context);
+#endif
current_thread = NULL;
sys_cond_broadcast (&self->thread_condvar);
diff --git a/src/thread.h b/src/thread.h
index 498b9909c9..1a58f65c88 100644
--- a/src/thread.h
+++ b/src/thread.h
@@ -29,9 +29,18 @@ #define THREAD_H
#include <signal.h> /* sigset_t */
#endif
+#ifdef HAVE_GLIB
+#include <glib.h>
+#endif
+
#include "sysselect.h" /* FIXME */
#include "systhread.h"
+#ifdef HAVE_GLIB
+extern ptrdiff_t threads_holding_glib_lock;
+extern GMainContext *glib_main_context;
+#endif
+
struct thread_state
{
union vectorlike_header header;
@@ -169,6 +178,9 @@ #define getcjmp (current_thread->m_getcjmp)
interrupter should broadcast to this condition. */
sys_cond_t *wait_condvar;
+#ifdef HAVE_GLIB
+ bool holding_glib_lock;
+#endif
/* This thread might have released the global lock. If so, this is
non-zero. When a thread runs outside thread_select with this
flag non-zero, it means it has been interrupted by SIGINT while
diff --git a/src/xgselect.c b/src/xgselect.c
index 9982a1f0e9..0c95857ef9 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -29,6 +29,9 @@ Copyright (C) 2009-2019 Free Software Foundation, Inc.
#include "blockinput.h"
#include "systime.h"
+ptrdiff_t threads_holding_glib_lock;
+GMainContext *glib_main_context;
+
/* `xg_select' is a `pselect' replacement. Why do we need a separate function?
1. Timeouts. Glib and Gtk rely on timer events. If we did pselect
with a greater timeout then the one scheduled by Glib, we would
@@ -54,26 +57,28 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
GPollFD *gfds = gfds_buf;
int gfds_size = ARRAYELTS (gfds_buf);
int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1;
- bool context_acquired = false;
int i, nfds, tmo_in_millisec, must_free = 0;
bool need_to_dispatch;
context = g_main_context_default ();
- context_acquired = g_main_context_acquire (context);
- /* FIXME: If we couldn't acquire the context, we just silently proceed
- because this function handles more than just glib file descriptors.
- Note that, as implemented, this failure is completely silent: there is
- no feedback to the caller. */
+ /* Acquire the lock. This is a busy wait for testing. */
+ if (current_thread != NULL && !current_thread->holding_glib_lock)
+ {
+ if (threads_holding_glib_lock++ == 0)
+ while (!g_main_context_acquire (context))
+ {
+ }
+ current_thread->holding_glib_lock = true;
+ glib_main_context = context;
+ }
if (rfds) all_rfds = *rfds;
else FD_ZERO (&all_rfds);
if (wfds) all_wfds = *wfds;
else FD_ZERO (&all_wfds);
- n_gfds = (context_acquired
- ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
- gfds, gfds_size)
- : -1);
+ n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
+ gfds, gfds_size);
if (gfds_size < n_gfds)
{
@@ -151,8 +156,19 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
#else
need_to_dispatch = true;
#endif
- if (need_to_dispatch && context_acquired)
+ if (need_to_dispatch)
{
+ /* Acquire the lock. This is a busy wait for testing. */
+ glib_main_context = context;
+ if (!current_thread->holding_glib_lock)
+ {
+ if (threads_holding_glib_lock++ == 0)
+ while (!g_main_context_acquire (context))
+ {
+ }
+ current_thread->holding_glib_lock = true;
+ }
+
int pselect_errno = errno;
/* Prevent g_main_dispatch recursion, that would occur without
block_input wrapper, because event handlers call
@@ -164,8 +180,12 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
errno = pselect_errno;
}
- if (context_acquired)
- g_main_context_release (context);
+ if (current_thread != NULL && current_thread->holding_glib_lock)
+ if (--threads_holding_glib_lock == 0)
+ {
+ g_main_context_release (context);
+ current_thread->holding_glib_lock = false;
+ }
/* To not have to recalculate timeout, return like this. */
if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0))
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 14:37 ` Pip Cet
@ 2019-07-13 15:03 ` Eli Zaretskii
2019-07-13 15:13 ` Eli Zaretskii
2019-07-13 15:54 ` Eli Zaretskii
2019-07-13 15:04 ` Andreas Politz
1 sibling, 2 replies; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-13 15:03 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Sat, 13 Jul 2019 14:37:25 +0000
> Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de
>
> Well, I think we're going to have to do one or more of the following:
>
> - have a race condition
> - access glib-locked data from the "wrong" thread (another Emacs thread)
> - release the glib lock from the "wrong" thread (another Emacs thread)
>
> Of these, the second is the best alternative, I think: we simply grab
> the g_main_context lock globally, acting for all Emacs threads, and
> the last thread to require it releases it when it leaves xg_select.
I agree.
> As long as there's at least one thread in the critical section of
> xg_select, we hold the lock, but access to the context isn't
> necessarily from the thread which locked it.
Hmm... wouldn't this cause us always hold that lock when there's more
than one Lisp thread? And if so, what will that do to GTK and its own
threads, which also want to acquire the Glib context?
IOW, I guess I don't understand what problem would the proposed
changes solve that the current code doesn't. Can you tell?
> > Hmm... how would this work with your patch? Suppose one thread calls
> > xg_select, acquires the Glib lock, sets its holding_glib_lock flag,
> > then releases the global Lisp lock and calls pselect. Since the
> > global Lisp lock is now up for grabs, some other Lisp thread can
> > acquire it and start running.
>
> And when it starts running, it releases the Glib lock.
But only if its holding_glib_lock flag is set, which is not
necessarily true. For example, what about a thread that was just
created, and never called xg_select yet?
> Okay, that sounds like option #2 above. The attached patch exposes
> glib externals to the generic code, but it appears to work. If you
> think the approach is okay, I'll move the glib-specific parts to
> xgselect.c (if that's okay).
Yes, we should have accessor functions in xgselect.c instead of
letting thread.c etc. access the GTK/Glib data directly.
But I still would like to understand why we need to maintain a flag
per thread.
Also, I think I lost the context: does this attempt to solve the
original problem reported by Andreas, or only the problem reported by
you later in the discussion?
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 14:37 ` Pip Cet
2019-07-13 15:03 ` Eli Zaretskii
@ 2019-07-13 15:04 ` Andreas Politz
1 sibling, 0 replies; 53+ messages in thread
From: Andreas Politz @ 2019-07-13 15:04 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609
I ran my problematic code a couple thousand times on this patch without
freezes, while producing the expected results.
Thanks for figuring this out.
Andreas
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 15:03 ` Eli Zaretskii
@ 2019-07-13 15:13 ` Eli Zaretskii
2019-07-13 15:54 ` Eli Zaretskii
1 sibling, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-13 15:13 UTC (permalink / raw)
To: pipcet; +Cc: 36609, politza
> Date: Sat, 13 Jul 2019 18:03:35 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de
>
> Also, I think I lost the context: does this attempt to solve the
> original problem reported by Andreas, or only the problem reported by
> you later in the discussion?
I guess Andreas just answered this question.
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 15:03 ` Eli Zaretskii
2019-07-13 15:13 ` Eli Zaretskii
@ 2019-07-13 15:54 ` Eli Zaretskii
2019-07-13 15:57 ` Pip Cet
1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-13 15:54 UTC (permalink / raw)
To: pipcet; +Cc: 36609, politza
> Date: Sat, 13 Jul 2019 18:03:35 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de
>
> But I still would like to understand why we need to maintain a flag
> per thread.
Specifically, why not make this flag thread-independent, and have any
thread release the context lock if that variable is non-zero? I don't
think it matters which thread locked the Glib context, we just need to
release it once we emerge from thread_select, and do it before we
might call Fsignal. That the lock might be released by a thread other
than the one which locked it shouldn't matter, I think, as long as we
consider all the Lisp threads acting together on behalf of Emacs.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 15:54 ` Eli Zaretskii
@ 2019-07-13 15:57 ` Pip Cet
2019-07-13 16:02 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-13 15:57 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
On Sat, Jul 13, 2019 at 3:54 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > But I still would like to understand why we need to maintain a flag
> > per thread.
We don't. I had originally planned to make reacquiring the lock fail
gracefully if we can't spin-lock, but that didn't make it into the
patch, so it should be simplified as you suggest.
> Specifically, why not make this flag thread-independent, and have any
> thread release the context lock if that variable is non-zero? I don't
> think it matters which thread locked the Glib context, we just need to
> release it once we emerge from thread_select, and do it before we
> might call Fsignal. That the lock might be released by a thread other
> than the one which locked it shouldn't matter, I think, as long as we
> consider all the Lisp threads acting together on behalf of Emacs.
I agree entirely.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 15:57 ` Pip Cet
@ 2019-07-13 16:02 ` Eli Zaretskii
2019-07-13 18:17 ` Pip Cet
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2019-07-13 16:02 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
> From: Pip Cet <pipcet@gmail.com>
> Date: Sat, 13 Jul 2019 15:57:17 +0000
> Cc: 36609@debbugs.gnu.org, politza@hochschule-trier.de
>
> > Specifically, why not make this flag thread-independent, and have any
> > thread release the context lock if that variable is non-zero? I don't
> > think it matters which thread locked the Glib context, we just need to
> > release it once we emerge from thread_select, and do it before we
> > might call Fsignal. That the lock might be released by a thread other
> > than the one which locked it shouldn't matter, I think, as long as we
> > consider all the Lisp threads acting together on behalf of Emacs.
>
> I agree entirely.
Great, then I think we are in agreement.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 16:02 ` Eli Zaretskii
@ 2019-07-13 18:17 ` Pip Cet
2020-08-21 12:57 ` Lars Ingebrigtsen
0 siblings, 1 reply; 53+ messages in thread
From: Pip Cet @ 2019-07-13 18:17 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609, politza
[-- Attachment #1: Type: text/plain, Size: 315 bytes --]
Patch attached.
It's possible this'll livelock certain glib libraries that want to
modify the glib main context while another thread is running. I don't
see a good way to fix that, but a bad way would be to apply some
heuristic (such as a 10 ms delay between calling select() and assuming
it actually was called).
[-- Attachment #2: 0001-Fix-lock-failures-in-xg_select-bug-36609.patch --]
[-- Type: text/x-patch, Size: 4691 bytes --]
From d99222b1cf5b63e8361e9a7cdbe13faafb40c1d5 Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet@gmail.com>
Date: Sat, 13 Jul 2019 18:09:38 +0000
Subject: [PATCH] Fix lock failures in xg_select (bug#36609)
* src/xgselect.c (release_select_lock, acquire_select_lock):
Introduce.
(xg_select): Use `acquire_select_lock', `release_select_lock'.
* src/thread.c (release_select_lock): Introduce for non-GLib builds.
(really_call_select): Call `release_select_lock'. Simplify by
ensuring acquisition of the lock always succeeds.
---
src/thread.c | 8 ++++++++
src/xgselect.c | 42 ++++++++++++++++++++++++++++--------------
src/xgselect.h | 2 ++
3 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/src/thread.c b/src/thread.c
index e2deadd7a8..40a25fb5cd 100644
--- a/src/thread.c
+++ b/src/thread.c
@@ -28,6 +28,12 @@ Copyright (C) 2012-2019 Free Software Foundation, Inc.
#include "pdumper.h"
#include "keyboard.h"
+#ifdef HAVE_GLIB
+#include <xgselect.h>
+#else
+#define release_select_lock() do { } while (0)
+#endif
+
union aligned_thread_state
{
struct thread_state s;
@@ -586,6 +592,8 @@ really_call_select (void *arg)
sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds,
sa->timeout, sa->sigmask);
+ release_select_lock ();
+
block_interrupt_signal (&oldset);
/* If we were interrupted by C-g while inside sa->func above, the
signal handler could have called maybe_reacquire_global_lock, in
diff --git a/src/xgselect.c b/src/xgselect.c
index 9982a1f0e9..82f0a61882 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -29,6 +29,27 @@ Copyright (C) 2009-2019 Free Software Foundation, Inc.
#include "blockinput.h"
#include "systime.h"
+static ptrdiff_t threads_holding_glib_lock;
+static GMainContext *glib_main_context;
+
+void release_select_lock (void)
+{
+ if (--threads_holding_glib_lock == 0)
+ g_main_context_release (glib_main_context);
+}
+
+static void acquire_select_lock (GMainContext *context)
+{
+ if (threads_holding_glib_lock++ == 0)
+ {
+ glib_main_context = context;
+ while (!g_main_context_acquire (context))
+ {
+ /* Spin. */
+ }
+ }
+}
+
/* `xg_select' is a `pselect' replacement. Why do we need a separate function?
1. Timeouts. Glib and Gtk rely on timer events. If we did pselect
with a greater timeout then the one scheduled by Glib, we would
@@ -54,26 +75,19 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
GPollFD *gfds = gfds_buf;
int gfds_size = ARRAYELTS (gfds_buf);
int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1;
- bool context_acquired = false;
int i, nfds, tmo_in_millisec, must_free = 0;
bool need_to_dispatch;
context = g_main_context_default ();
- context_acquired = g_main_context_acquire (context);
- /* FIXME: If we couldn't acquire the context, we just silently proceed
- because this function handles more than just glib file descriptors.
- Note that, as implemented, this failure is completely silent: there is
- no feedback to the caller. */
+ acquire_select_lock (context);
if (rfds) all_rfds = *rfds;
else FD_ZERO (&all_rfds);
if (wfds) all_wfds = *wfds;
else FD_ZERO (&all_wfds);
- n_gfds = (context_acquired
- ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
- gfds, gfds_size)
- : -1);
+ n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
+ gfds, gfds_size);
if (gfds_size < n_gfds)
{
@@ -151,8 +165,10 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
#else
need_to_dispatch = true;
#endif
- if (need_to_dispatch && context_acquired)
+ if (need_to_dispatch)
{
+ acquire_select_lock (context);
+
int pselect_errno = errno;
/* Prevent g_main_dispatch recursion, that would occur without
block_input wrapper, because event handlers call
@@ -162,11 +178,9 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
g_main_context_dispatch (context);
unblock_input ();
errno = pselect_errno;
+ release_select_lock ();
}
- if (context_acquired)
- g_main_context_release (context);
-
/* To not have to recalculate timeout, return like this. */
if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0))
{
diff --git a/src/xgselect.h b/src/xgselect.h
index 92c79c5f64..0d5c96cd21 100644
--- a/src/xgselect.h
+++ b/src/xgselect.h
@@ -29,4 +29,6 @@ #define XGSELECT_H
fd_set *rfds, fd_set *wfds, fd_set *efds,
struct timespec *timeout, sigset_t *sigmask);
+extern void release_select_lock (void);
+
#endif /* XGSELECT_H */
--
2.20.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-13 18:17 ` Pip Cet
@ 2020-08-21 12:57 ` Lars Ingebrigtsen
0 siblings, 0 replies; 53+ messages in thread
From: Lars Ingebrigtsen @ 2020-08-21 12:57 UTC (permalink / raw)
To: Pip Cet; +Cc: 36609, politza
Pip Cet <pipcet@gmail.com> writes:
> Patch attached.
>
> It's possible this'll livelock certain glib libraries that want to
> modify the glib main context while another thread is running. I don't
> see a good way to fix that, but a bad way would be to apply some
> heuristic (such as a 10 ms delay between calling select() and assuming
> it actually was called).
Skimming this thread, it seems that Eli was in agreement with applying
it, and Andreas has tried it extensively and it fixes the problem, so
I've applied it to Emacs 28 now.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2019-07-11 20:51 bug#36609: 27.0.50; Possible race-condition in threading implementation Andreas Politz
2019-07-12 7:47 ` Eli Zaretskii
2019-07-12 9:02 ` Pip Cet
@ 2021-06-06 15:50 ` dick.r.chiang
[not found] ` <87fsxv8182.fsf@dick>
3 siblings, 0 replies; 53+ messages in thread
From: dick.r.chiang @ 2021-06-06 15:50 UTC (permalink / raw)
To: Andreas Politz; +Cc: 36609
[-- Attachment #1: Type: text/plain, Size: 53 bytes --]
Four attachments:
#1. Want to revert commit 9c62ffb
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Revert-Fix-lock-failures-in-xg_select.patch --]
[-- Type: text/x-diff, Size: 4545 bytes --]
From 01b8b847f5fc4b65eaac06c7c2d85f8c5410c497 Mon Sep 17 00:00:00 2001
From: dickmao <none>
Date: Sun, 6 Jun 2021 11:10:13 -0400
Subject: [PATCH] Revert "Fix lock failures in xg_select"
This reverts commit 9c62ffb08262c82b7e38e6eb5767f2087424aa47.
* src/thread.c (really_call_select): revert 9c62ffb
* src/xgselect.c (xg_select): revert 9c62ffb
---
src/thread.c | 8 --------
src/xgselect.c | 42 ++++++++++++++----------------------------
src/xgselect.h | 2 --
3 files changed, 14 insertions(+), 38 deletions(-)
diff --git a/src/thread.c b/src/thread.c
index f74f611148..609cd7c5fc 100644
--- a/src/thread.c
+++ b/src/thread.c
@@ -28,12 +28,6 @@ Copyright (C) 2012-2021 Free Software Foundation, Inc.
#include "pdumper.h"
#include "keyboard.h"
-#if defined HAVE_GLIB && ! defined (HAVE_NS)
-#include <xgselect.h>
-#else
-#define release_select_lock() do { } while (0)
-#endif
-
union aligned_thread_state
{
struct thread_state s;
@@ -592,8 +586,6 @@ really_call_select (void *arg)
sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds,
sa->timeout, sa->sigmask);
- release_select_lock ();
-
block_interrupt_signal (&oldset);
/* If we were interrupted by C-g while inside sa->func above, the
signal handler could have called maybe_reacquire_global_lock, in
diff --git a/src/xgselect.c b/src/xgselect.c
index 0d91d55bad..d7c63e3be1 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -29,27 +29,6 @@ Copyright (C) 2009-2021 Free Software Foundation, Inc.
#include "blockinput.h"
#include "systime.h"
-static ptrdiff_t threads_holding_glib_lock;
-static GMainContext *glib_main_context;
-
-void release_select_lock (void)
-{
- if (--threads_holding_glib_lock == 0)
- g_main_context_release (glib_main_context);
-}
-
-static void acquire_select_lock (GMainContext *context)
-{
- if (threads_holding_glib_lock++ == 0)
- {
- glib_main_context = context;
- while (!g_main_context_acquire (context))
- {
- /* Spin. */
- }
- }
-}
-
/* `xg_select' is a `pselect' replacement. Why do we need a separate function?
1. Timeouts. Glib and Gtk rely on timer events. If we did pselect
with a greater timeout then the one scheduled by Glib, we would
@@ -75,19 +54,26 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
GPollFD *gfds = gfds_buf;
int gfds_size = ARRAYELTS (gfds_buf);
int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1;
+ bool context_acquired = false;
int i, nfds, tmo_in_millisec, must_free = 0;
bool need_to_dispatch;
context = g_main_context_default ();
- acquire_select_lock (context);
+ context_acquired = g_main_context_acquire (context);
+ /* FIXME: If we couldn't acquire the context, we just silently proceed
+ because this function handles more than just glib file descriptors.
+ Note that, as implemented, this failure is completely silent: there is
+ no feedback to the caller. */
if (rfds) all_rfds = *rfds;
else FD_ZERO (&all_rfds);
if (wfds) all_wfds = *wfds;
else FD_ZERO (&all_wfds);
- n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
- gfds, gfds_size);
+ n_gfds = (context_acquired
+ ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
+ gfds, gfds_size)
+ : -1);
if (gfds_size < n_gfds)
{
@@ -165,10 +151,8 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
#else
need_to_dispatch = true;
#endif
- if (need_to_dispatch)
+ if (need_to_dispatch && context_acquired)
{
- acquire_select_lock (context);
-
int pselect_errno = errno;
/* Prevent g_main_dispatch recursion, that would occur without
block_input wrapper, because event handlers call
@@ -178,9 +162,11 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
g_main_context_dispatch (context);
unblock_input ();
errno = pselect_errno;
- release_select_lock ();
}
+ if (context_acquired)
+ g_main_context_release (context);
+
/* To not have to recalculate timeout, return like this. */
if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0))
{
diff --git a/src/xgselect.h b/src/xgselect.h
index 2142a236b2..e00dce1283 100644
--- a/src/xgselect.h
+++ b/src/xgselect.h
@@ -29,6 +29,4 @@ #define XGSELECT_H
fd_set *rfds, fd_set *wfds, fd_set *efds,
struct timespec *timeout, sigset_t *sigmask);
-extern void release_select_lock (void);
-
#endif /* XGSELECT_H */
--
2.26.2
[-- Attachment #3: Type: text/plain, Size: 57 bytes --]
#2. Fails on tip of master, succeeds after patch in #1.
[-- Attachment #4: 42.el --]
[-- Type: application/emacs-lisp, Size: 5887 bytes --]
[-- Attachment #5: Type: text/plain, Size: 57 bytes --]
#3. Abridged original example, fails after patch in #1.
[-- Attachment #6: report.el --]
[-- Type: application/emacs-lisp, Size: 4943 bytes --]
[-- Attachment #7: Type: text/plain, Size: 322 bytes --]
Fails not necessarily because xgselect.c is wrong, but rather because
channel-recv blocks on a mutex before channel-send can get its act together.
This was hard for all to discern because OP seemed to have gone out of his way
to obfuscate his "minimum" example.
#4. What #3 probably intended, succeeds after patch in #1.
[-- Attachment #8: report-2.el --]
[-- Type: application/emacs-lisp, Size: 5014 bytes --]
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
[not found] ` <87fsxv8182.fsf@dick>
@ 2021-06-06 16:35 ` Eli Zaretskii
2021-06-06 19:10 ` dick.r.chiang
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-06 16:35 UTC (permalink / raw)
To: dick.r.chiang; +Cc: larsi, politza, pipcet, 36609
> From: dick.r.chiang@gmail.com
> Cc: Pip Cet <pipcet@gmail.com>, larsi@gnus.org, 36609@debbugs.gnu.org,
> Eli Zaretskii <eliz@gnu.org>
> Date: Sun, 06 Jun 2021 11:25:01 -0400
>
> #1. Want to revert commit 9c62ffb
This will bring back bug#36609, so we cannot do that without
discussing first why you think that commit was wrong.
> #2. Fails on tip of master, succeeds after patch in #1.
Please explain what does "fails" mean, and why do you think the above
commit is the culprit. (A much simpler test case will be
appreciated, btw.)
> Fails not necessarily because xgselect.c is wrong, but rather because
> channel-recv blocks on a mutex before channel-send can get its act together.
You mean, in this code:
(let ((channel (make-channel 1)))
(make-thread
(lambda nil
(channel-send (car channel) 42))
"produce")
(channel-recv (cdr channel))
(ignore-errors (enable-command 'list-threads))
(call-interactively #'list-threads))
? Here, channel-send is called by a new thread, created by
make-thread. In this code, it is _expected_ that channel-recv will be
called (by the main thread) _before_ channel-send is called by the new
thread, because make-thread creates a thread, but the newly created
thread doesn't run until it can acquire the global lock. Meanwhile,
the main thread continues running and calls channel-recv. The new
thread will not begin running, AFAIU, until the main thread calls
condition-wait inside channel-recv.
By "blocks on a mutex", did you mean that channel-recv blocks trying
to acquire the mutex here:
(cl-defgeneric channel-recv ((sink channel-terminal))
(with-mutex (oref sink mutex) <<<<<<<<<<<<<<<<<<<<<<<<<<
(with-slots (condition msg-queue) sink
If so, which thread holds that mutex at this point?
> #4. What #3 probably intended, succeeds after patch in #1.
Yes, race conditions can be solved by using sleep-for, but that's not
really a clean solution, at least not in my book.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-06 16:35 ` Eli Zaretskii
@ 2021-06-06 19:10 ` dick.r.chiang
2021-06-06 19:27 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-06 19:10 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: larsi, pipcet, politza, 36609
EZ> This will bring back bug#36609, so we cannot do that without
EZ> discussing first why you think that commit was wrong.
By enforcing bijection between acquires and releases in xgselect.c --
cavalierly so as the releases now come out-of-band from another module
thread.c -- commit 9c62ffb admits deadlock. The conditions under which that
happens are not at all rarefied as evidenced by `for i in 1 2 3 ; do src/emacs
-Q -l ./42.el & done ;`
Generally, a commit made to remedy a phantom problem should always be
suspect. In this case OP was squatting on the main thread, so he deserves
what he got.
EZ> Yes, race conditions can be solved by using sleep-for, but that's not
EZ> really a clean solution, at least not in my book.
Case #4 was not meant to solve anything, only to set OP straight on what he
really wanted. You can remove the sleep-for. Still works under patch #1.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-06 19:10 ` dick.r.chiang
@ 2021-06-06 19:27 ` Eli Zaretskii
2021-06-09 21:40 ` dick.r.chiang
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-06 19:27 UTC (permalink / raw)
To: dick.r.chiang; +Cc: larsi, pipcet, politza, 36609
> From: dick.r.chiang@gmail.com
> Cc: larsi@gnus.org, pipcet@gmail.com, politza@hochschule-trier.de,
> 36609@debbugs.gnu.org
> Date: Sun, 06 Jun 2021 15:10:43 -0400
>
> EZ> This will bring back bug#36609, so we cannot do that without
> EZ> discussing first why you think that commit was wrong.
>
> By enforcing bijection between acquires and releases in xgselect.c --
> cavalierly so as the releases now come out-of-band from another module
> thread.c -- commit 9c62ffb admits deadlock. The conditions under which that
> happens are not at all rarefied as evidenced by `for i in 1 2 3 ; do src/emacs
> -Q -l ./42.el & done ;`
You will have to provide a more detailed analysis, sorry. The fact
that some Lisp can cause deadlock does not prove anything, there are
too many ways to write such deadlocking code.
> Generally, a commit made to remedy a phantom problem should always be
> suspect. In this case OP was squatting on the main thread, so he deserves
> what he got.
That was not the analysis of that test case. See the discussion of
the bug. The problems with the GTK lock are real, not phantom.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-06 19:27 ` Eli Zaretskii
@ 2021-06-09 21:40 ` dick.r.chiang
2021-06-10 6:46 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-09 21:40 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: larsi, politza, pipcet, 36609
[-- Attachment #1: Type: text/plain, Size: 497 bytes --]
EZ> You will have to provide a more detailed analysis, sorry.
There is no guarantee the static variable `threads_holding_glib_lock`
introduced in xgselect.c in commit 9c62ffb is synchronized across threads. As
such, relying on it becoming zero in time for release_select_lock() is fraught.
(If you add print statements, this particular heisenbug disappears, at least
on Linux kernel 4.15.0-99-generic).
Four attachments:
1. Desired patch to master (reverts 9c62ffb, adds main-thread-p check).
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Only-acquire-glib-context-if-main-thread.patch --]
[-- Type: text/x-diff, Size: 4523 bytes --]
From 5b30b55f60844ada43b119640e052f8ed5ee6e9c Mon Sep 17 00:00:00 2001
From: dickmao <none>
Date: Wed, 9 Jun 2021 17:11:48 -0400
Subject: [PATCH] Only acquire glib context if main thread
* src/thread.c (really_call_select): revert 9c62ffb
* src/xgselect.c (xg_select): revert 9c62ffb, only acquire glib if main
---
src/thread.c | 8 --------
src/xgselect.c | 44 ++++++++++++++++----------------------------
src/xgselect.h | 2 --
3 files changed, 16 insertions(+), 38 deletions(-)
diff --git a/src/thread.c b/src/thread.c
index f74f611148..609cd7c5fc 100644
--- a/src/thread.c
+++ b/src/thread.c
@@ -28,12 +28,6 @@ Copyright (C) 2012-2021 Free Software Foundation, Inc.
#include "pdumper.h"
#include "keyboard.h"
-#if defined HAVE_GLIB && ! defined (HAVE_NS)
-#include <xgselect.h>
-#else
-#define release_select_lock() do { } while (0)
-#endif
-
union aligned_thread_state
{
struct thread_state s;
@@ -592,8 +586,6 @@ really_call_select (void *arg)
sa->result = (sa->func) (sa->max_fds, sa->rfds, sa->wfds, sa->efds,
sa->timeout, sa->sigmask);
- release_select_lock ();
-
block_interrupt_signal (&oldset);
/* If we were interrupted by C-g while inside sa->func above, the
signal handler could have called maybe_reacquire_global_lock, in
diff --git a/src/xgselect.c b/src/xgselect.c
index 0d91d55bad..b40f75cb50 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -28,27 +28,7 @@ Copyright (C) 2009-2021 Free Software Foundation, Inc.
#include "lisp.h"
#include "blockinput.h"
#include "systime.h"
-
-static ptrdiff_t threads_holding_glib_lock;
-static GMainContext *glib_main_context;
-
-void release_select_lock (void)
-{
- if (--threads_holding_glib_lock == 0)
- g_main_context_release (glib_main_context);
-}
-
-static void acquire_select_lock (GMainContext *context)
-{
- if (threads_holding_glib_lock++ == 0)
- {
- glib_main_context = context;
- while (!g_main_context_acquire (context))
- {
- /* Spin. */
- }
- }
-}
+#include "thread.h"
/* `xg_select' is a `pselect' replacement. Why do we need a separate function?
1. Timeouts. Glib and Gtk rely on timer events. If we did pselect
@@ -75,19 +55,27 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
GPollFD *gfds = gfds_buf;
int gfds_size = ARRAYELTS (gfds_buf);
int n_gfds, retval = 0, our_fds = 0, max_fds = fds_lim - 1;
+ bool context_acquired = false;
int i, nfds, tmo_in_millisec, must_free = 0;
bool need_to_dispatch;
context = g_main_context_default ();
- acquire_select_lock (context);
+ if (main_thread_p (current_thread))
+ context_acquired = g_main_context_acquire (context);
+ /* FIXME: If we couldn't acquire the context, we just silently proceed
+ because this function handles more than just glib file descriptors.
+ Note that, as implemented, this failure is completely silent: there is
+ no feedback to the caller. */
if (rfds) all_rfds = *rfds;
else FD_ZERO (&all_rfds);
if (wfds) all_wfds = *wfds;
else FD_ZERO (&all_wfds);
- n_gfds = g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
- gfds, gfds_size);
+ n_gfds = (context_acquired
+ ? g_main_context_query (context, G_PRIORITY_LOW, &tmo_in_millisec,
+ gfds, gfds_size)
+ : -1);
if (gfds_size < n_gfds)
{
@@ -165,10 +153,8 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
#else
need_to_dispatch = true;
#endif
- if (need_to_dispatch)
+ if (need_to_dispatch && context_acquired)
{
- acquire_select_lock (context);
-
int pselect_errno = errno;
/* Prevent g_main_dispatch recursion, that would occur without
block_input wrapper, because event handlers call
@@ -178,9 +164,11 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
g_main_context_dispatch (context);
unblock_input ();
errno = pselect_errno;
- release_select_lock ();
}
+ if (context_acquired)
+ g_main_context_release (context);
+
/* To not have to recalculate timeout, return like this. */
if ((our_fds > 0 || (nfds == 0 && tmop == &tmo)) && (retval == 0))
{
diff --git a/src/xgselect.h b/src/xgselect.h
index 2142a236b2..e00dce1283 100644
--- a/src/xgselect.h
+++ b/src/xgselect.h
@@ -29,6 +29,4 @@ #define XGSELECT_H
fd_set *rfds, fd_set *wfds, fd_set *efds,
struct timespec *timeout, sigset_t *sigmask);
-extern void release_select_lock (void);
-
#endif /* XGSELECT_H */
--
2.26.2
[-- Attachment #3: Type: text/plain, Size: 195 bytes --]
2. OP's original "my code doesn't work, here you figure it out", obfuscated MRE
Run as:
for i in 1 2 3 ; do src/emacs -Q -l ./report-orig.el & done
Fails if:
One of the emacsen stops responding
[-- Attachment #4: report-orig.el --]
[-- Type: application/emacs-lisp, Size: 6345 bytes --]
[-- Attachment #5: Type: text/plain, Size: 206 bytes --]
3. "Affine" redaction of OP's MRE *infrequently* breaks on tip of master (succeeds
after patch)
Run as:
for i in 1 2 3 ; do src/emacs -Q -l ./report.el & done
Fails if:
One of the emacsen stops responding
[-- Attachment #6: report.el --]
[-- Type: application/emacs-lisp, Size: 704 bytes --]
[-- Attachment #7: Type: text/plain, Size: 68 bytes --]
4. My MRE *frequently* breaks tip of master (succeeds after patch)
[-- Attachment #8: 42.el --]
[-- Type: application/emacs-lisp, Size: 1551 bytes --]
[-- Attachment #9: Type: text/plain, Size: 105 bytes --]
Run as:
for i in 1 2 3 ; do src/emacs -Q -l ./42.el & done
Fails if:
One of the emacsen stops responding
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-09 21:40 ` dick.r.chiang
@ 2021-06-10 6:46 ` Eli Zaretskii
2021-06-10 11:52 ` dick.r.chiang
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-10 6:46 UTC (permalink / raw)
To: dick.r.chiang; +Cc: larsi, politza, pipcet, 36609
> From: dick.r.chiang@gmail.com
> Cc: larsi@gnus.org, politza@hochschule-trier.de, pipcet@gmail.com,
> 36609@debbugs.gnu.org
> Date: Wed, 09 Jun 2021 17:40:28 -0400
>
> There is no guarantee the static variable `threads_holding_glib_lock`
> introduced in xgselect.c in commit 9c62ffb is synchronized across threads.
How can that happen? It's a single word, and its change should
therefore be atomic.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 6:46 ` Eli Zaretskii
@ 2021-06-10 11:52 ` dick.r.chiang
2021-06-10 14:18 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-10 11:52 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609
[Standard disclaimer about not being a compiler semanticist]
> a single word... change should ... be atomic.
I've never heard that. I googled around; one helpful and only
mildly sententious monograph on the topic is
https://stefansf.de/post/qualifier-volatile-c/
Adding the volatile qualifier to `threads_holding_glib_lock` did not
resolve my MRE on my particular architecture (whatever that may be).
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 11:52 ` dick.r.chiang
@ 2021-06-10 14:18 ` Eli Zaretskii
2021-06-10 14:55 ` dick.r.chiang
2021-06-10 15:35 ` Andreas Schwab
0 siblings, 2 replies; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-10 14:18 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 07:52:45 -0400
>
> > a single word... change should ... be atomic.
>
> I've never heard that.
The idea is that a variable that is modified in a single machine
instruction will always have the value assigned by the last thread
which set that. You cannot have, for example, half of the bits of the
variable set by one thread and the other half by another thread.
> Adding the volatile qualifier to `threads_holding_glib_lock` did not
> resolve my MRE on my particular architecture (whatever that may be).
So can you describe what you think happens in your scenario that the
offending change that added threads_holding_glib_lock causes? Because
I still don't see the connection, to tell the truth.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 14:18 ` Eli Zaretskii
@ 2021-06-10 14:55 ` dick.r.chiang
2021-06-10 15:04 ` Eli Zaretskii
2021-06-10 15:35 ` Andreas Schwab
1 sibling, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-10 14:55 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609
> You cannot have, for example, half of the bits of the variable set by one thread
> and the other half by another thread.
I dunno, this seems like a sophomoric, rose-colored view of variable sharing
without regard to caching, or really any generally accepted guarantees about
POSIX threads.
> So can you describe what you think happens in your scenario that the
> offending change that added threads_holding_glib_lock causes?
Two threads see `threads_holding_glib_lock` is "0" and both attempt
acquiring the lock (line 43, xgselect.c)
Or, two threads see `threads_holding_glib_lock` as "2" and glib
lock is never released. (line 37, xgselect.c)
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 14:55 ` dick.r.chiang
@ 2021-06-10 15:04 ` Eli Zaretskii
2021-06-10 21:36 ` dick.r.chiang
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-10 15:04 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 10:55:53 -0400
>
> > You cannot have, for example, half of the bits of the variable set by one thread
> > and the other half by another thread.
>
> I dunno, this seems like a sophomoric, rose-colored view of variable sharing
> without regard to caching, or really any generally accepted guarantees about
> POSIX threads.
Is it, really?
> > So can you describe what you think happens in your scenario that the
> > offending change that added threads_holding_glib_lock causes?
>
> Two threads see `threads_holding_glib_lock` is "0" and both attempt
> acquiring the lock (line 43, xgselect.c)
>
> Or, two threads see `threads_holding_glib_lock` as "2" and glib
> lock is never released. (line 37, xgselect.c)
And you can show a GDB backtrace with values of the variable that
supports that? And tell how that explains the hang that you see on
the Lisp level? These are the details that I was asking from the
first message you posted here.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 14:18 ` Eli Zaretskii
2021-06-10 14:55 ` dick.r.chiang
@ 2021-06-10 15:35 ` Andreas Schwab
2021-06-10 15:39 ` Eli Zaretskii
1 sibling, 1 reply; 53+ messages in thread
From: Andreas Schwab @ 2021-06-10 15:35 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: dick.r.chiang, 36609
On Jun 10 2021, Eli Zaretskii wrote:
> The idea is that a variable that is modified in a single machine
> instruction will always have the value assigned by the last thread
> which set that.
You cannot have that without explicit atomic operations. An ordinary
RMW access is never atomic across cpus.
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 15:35 ` Andreas Schwab
@ 2021-06-10 15:39 ` Eli Zaretskii
0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-10 15:39 UTC (permalink / raw)
To: Andreas Schwab; +Cc: dick.r.chiang, 36609
> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: dick.r.chiang@gmail.com, 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 17:35:54 +0200
>
> On Jun 10 2021, Eli Zaretskii wrote:
>
> > The idea is that a variable that is modified in a single machine
> > instruction will always have the value assigned by the last thread
> > which set that.
>
> You cannot have that without explicit atomic operations.
We can do that if it turns out to be an issue.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 15:04 ` Eli Zaretskii
@ 2021-06-10 21:36 ` dick.r.chiang
2021-06-11 6:00 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-10 21:36 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609
EZ> Is it, really?
Yes. While I can't claim to be any more knowledgeable, the naivete of someone
protesting thread atomicity based on word-size is stunningly apparent.
EZ> And you can show a GDB backtrace with values of the variable that supports
EZ> that?
Alas I cannot. My gdb skills are less impressive in the presence of threads,
and heisenbugs of this nature tend to mysteriously disappear when gdb's
leisurely pace gives caches ample time to flush.
I mentioned earlier I could not reproduce my MRE failure when I merely inserted
print statements.
My patch isn't quite as studied as Pip Cet's (my patch merely skirts locking
the glib unless you're main-thread-p), but it does fix OP and, unlike the
current state of the code, doesn't admit glaring theoretical flaws, even
though you seem to think those flaws could never materialize in practice
despite evidence (gdb-less as it is) to the contrary.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-10 21:36 ` dick.r.chiang
@ 2021-06-11 6:00 ` Eli Zaretskii
2021-06-19 17:53 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-11 6:00 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Thu, 10 Jun 2021 17:36:13 -0400
>
> EZ> Is it, really?
>
> Yes. While I can't claim to be any more knowledgeable, the naivete of someone
> protesting thread atomicity based on word-size is stunningly apparent.
I'm okay with replacing the simple increments and decrements of that
variable with atomic ones, using the likes of __atomic_fetch_add
intrinsics of GCC, if that solves the problem. Did you try that?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-11 6:00 ` Eli Zaretskii
@ 2021-06-19 17:53 ` Eli Zaretskii
2021-06-19 19:14 ` dick.r.chiang
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-19 17:53 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> Date: Fri, 11 Jun 2021 09:00:07 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org
>
> I'm okay with replacing the simple increments and decrements of that
> variable with atomic ones, using the likes of __atomic_fetch_add
> intrinsics of GCC, if that solves the problem. Did you try that?
Ping!
Did you try that, and if so, did using the atomics solve the original
problem?
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-19 17:53 ` Eli Zaretskii
@ 2021-06-19 19:14 ` dick.r.chiang
2021-06-19 19:18 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-19 19:14 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609
I enter a plea of lacking the programming know-how for atomics.
I stand by everything I've said thus far: an obfuscated MRE led to a
misdiagnosis, led to an unnecessary and highly thread-unsafe fix, which can be
remedied by the vulgar "end-run" patch I submitted, and which I test every day
with parallel-thread Gnus.
Unfortunately in this and every urological contest of wills I seem to have with
project maintainers, I, by construction, must always lose.
>>>>> Eli Zaretskii <eliz@gnu.org> writes:
>> Date: Fri, 11 Jun 2021 09:00:07 +0300
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: 36609@debbugs.gnu.org
>>
>> I'm okay with replacing the simple increments and decrements of that
>> variable with atomic ones, using the likes of __atomic_fetch_add
>> intrinsics of GCC, if that solves the problem. Did you try that?
> Ping!
> Did you try that, and if so, did using the atomics solve the original
> problem?
> Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-19 19:14 ` dick.r.chiang
@ 2021-06-19 19:18 ` Eli Zaretskii
2021-06-19 21:12 ` dick.r.chiang
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-19 19:18 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Sat, 19 Jun 2021 15:14:19 -0400
>
> I enter a plea of lacking the programming know-how for atomics.
If I suggest a patch, would you be willing to give it a try?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-19 19:18 ` Eli Zaretskii
@ 2021-06-19 21:12 ` dick.r.chiang
2021-06-20 11:39 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-19 21:12 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609
"Suggest" or "Provide"?
If the former, well, my recalcitrance should be your answer.
If the latter, sure, although if you've written the patch, then running my
MREs would be epsilon more effort.
>>>>> Eli Zaretskii <eliz@gnu.org> writes:
>> From: dick.r.chiang@gmail.com
>> Cc: 36609@debbugs.gnu.org
>> Date: Sat, 19 Jun 2021 15:14:19 -0400
>>
>> I enter a plea of lacking the programming know-how for atomics.
> If I suggest a patch, would you be willing to give it a try?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-19 21:12 ` dick.r.chiang
@ 2021-06-20 11:39 ` Eli Zaretskii
2021-06-20 14:01 ` dick.r.chiang
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-20 11:39 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Sat, 19 Jun 2021 17:12:32 -0400
>
> "Suggest" or "Provide"?
>
> If the former, well, my recalcitrance should be your answer.
>
> If the latter, sure
Thanks, please try the patch below.
> although if you've written the patch, then running my MREs would be
> epsilon more effort.
You have a use case and wrote code with which you are familiar, so you
are in a better position to test it. Moreover, I don't have access to
a system where these problems could happen, so it's far from epsilon
effort for me.
diff --git a/src/xgselect.c b/src/xgselect.c
index 0d91d55bad..9a90670b0f 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -34,12 +34,27 @@ static GMainContext *glib_main_context;
void release_select_lock (void)
{
+#if GNUC_PREREQ (4, 7, 0)
+ if (__atomic_sub_fetch (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL))
+ g_main_context_release (glib_main_context);
+#else
if (--threads_holding_glib_lock == 0)
g_main_context_release (glib_main_context);
+#endif
}
static void acquire_select_lock (GMainContext *context)
{
+#if GNUC_PREREQ (4, 7, 0)
+ if (__atomic_fetch_add (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
+ {
+ glib_main_context = context;
+ while (!g_main_context_acquire (context))
+ {
+ /* Spin. */
+ }
+ }
+#else
if (threads_holding_glib_lock++ == 0)
{
glib_main_context = context;
@@ -48,6 +63,7 @@ static void acquire_select_lock (GMainContext *context)
/* Spin. */
}
}
+#endif
}
/* `xg_select' is a `pselect' replacement. Why do we need a separate function?
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-20 11:39 ` Eli Zaretskii
@ 2021-06-20 14:01 ` dick.r.chiang
2021-06-20 15:53 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: dick.r.chiang @ 2021-06-20 14:01 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36609
Yes! If you just add a negation to the __atomic_sub_fetch call, this passes
all the MREs.
It's not clear how you want to deal with the #else GNUC_PREREQ (4, 7, 0).
>>>>> "EZ" == Eli Zaretskii <eliz@gnu.org> writes:
>> From: dick.r.chiang@gmail.com
>> Cc: 36609@debbugs.gnu.org
>> Date: Sat, 19 Jun 2021 17:12:32 -0400
>>
>> "Suggest" or "Provide"?
>>
>> If the former, well, my recalcitrance should be your answer.
>>
>> If the latter, sure
EZ> Thanks, please try the patch below.
>> although if you've written the patch, then running my MREs would be epsilon
>> more effort.
EZ> You have a use case and wrote code with which you are familiar, so you are
EZ> in a better position to test it. Moreover, I don't have access to a
EZ> system where these problems could happen, so it's far from epsilon effort
EZ> for me.
EZ> diff --git a/src/xgselect.c b/src/xgselect.c
EZ> index 0d91d55bad..9a90670b0f 100644
EZ> --- a/src/xgselect.c
EZ> +++ b/src/xgselect.c
EZ> @@ -34,12 +34,27 @@ static GMainContext *glib_main_context;
EZ> void release_select_lock (void)
EZ> {
EZ> +#if GNUC_PREREQ (4, 7, 0)
EZ> + if (__atomic_sub_fetch (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL))
EZ> + g_main_context_release (glib_main_context);
EZ> +#else
EZ> if (--threads_holding_glib_lock == 0)
EZ> g_main_context_release (glib_main_context);
EZ> +#endif
EZ> }
EZ> static void acquire_select_lock (GMainContext *context)
EZ> {
EZ> +#if GNUC_PREREQ (4, 7, 0)
EZ> + if (__atomic_fetch_add (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
EZ> + {
EZ> + glib_main_context = context;
EZ> + while (!g_main_context_acquire (context))
EZ> + {
EZ> + /* Spin. */
EZ> + }
EZ> + }
EZ> +#else
EZ> if (threads_holding_glib_lock++ == 0)
EZ> {
EZ> glib_main_context = context;
EZ> @@ -48,6 +63,7 @@ static void acquire_select_lock (GMainContext *context)
EZ> /* Spin. */
EZ> }
EZ> }
EZ> +#endif
EZ> }
EZ> /* `xg_select' is a `pselect' replacement. Why do we need a separate
EZ> function?
^ permalink raw reply [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-20 14:01 ` dick.r.chiang
@ 2021-06-20 15:53 ` Eli Zaretskii
2021-06-25 13:54 ` Eli Zaretskii
0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-20 15:53 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> From: dick.r.chiang@gmail.com
> Cc: 36609@debbugs.gnu.org
> Date: Sun, 20 Jun 2021 10:01:37 -0400
>
> Yes! If you just add a negation to the __atomic_sub_fetch call
You mean, use the patch below instead?
> this passes all the MREs.
Thanks, will install soon.
> It's not clear how you want to deal with the #else GNUC_PREREQ (4, 7, 0).
By hoping no one uses this and expects threads to be stable enough
under GTK.
diff --git a/src/xgselect.c b/src/xgselect.c
index 0d91d55bad..92b118b955 100644
--- a/src/xgselect.c
+++ b/src/xgselect.c
@@ -34,12 +34,27 @@ static GMainContext *glib_main_context;
void release_select_lock (void)
{
+#if GNUC_PREREQ (4, 7, 0)
+ if (__atomic_sub_fetch (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
+ g_main_context_release (glib_main_context);
+#else
if (--threads_holding_glib_lock == 0)
g_main_context_release (glib_main_context);
+#endif
}
static void acquire_select_lock (GMainContext *context)
{
+#if GNUC_PREREQ (4, 7, 0)
+ if (__atomic_fetch_add (&threads_holding_glib_lock, 1, __ATOMIC_ACQ_REL) == 0)
+ {
+ glib_main_context = context;
+ while (!g_main_context_acquire (context))
+ {
+ /* Spin. */
+ }
+ }
+#else
if (threads_holding_glib_lock++ == 0)
{
glib_main_context = context;
@@ -48,6 +63,7 @@ static void acquire_select_lock (GMainContext *context)
/* Spin. */
}
}
+#endif
}
/* `xg_select' is a `pselect' replacement. Why do we need a separate function?
^ permalink raw reply related [flat|nested] 53+ messages in thread
* bug#36609: 27.0.50; Possible race-condition in threading implementation
2021-06-20 15:53 ` Eli Zaretskii
@ 2021-06-25 13:54 ` Eli Zaretskii
0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2021-06-25 13:54 UTC (permalink / raw)
To: dick.r.chiang; +Cc: 36609
> Date: Sun, 20 Jun 2021 18:53:40 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 36609@debbugs.gnu.org
>
> > From: dick.r.chiang@gmail.com
> > Cc: 36609@debbugs.gnu.org
> > Date: Sun, 20 Jun 2021 10:01:37 -0400
> >
> > Yes! If you just add a negation to the __atomic_sub_fetch call
>
> You mean, use the patch below instead?
>
> > this passes all the MREs.
>
> Thanks, will install soon.
Now done.
^ permalink raw reply [flat|nested] 53+ messages in thread
end of thread, other threads:[~2021-06-25 13:54 UTC | newest]
Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-07-11 20:51 bug#36609: 27.0.50; Possible race-condition in threading implementation Andreas Politz
2019-07-12 7:47 ` Eli Zaretskii
2019-07-12 8:05 ` Eli Zaretskii
2019-07-12 9:02 ` Pip Cet
2019-07-12 12:42 ` Eli Zaretskii
2019-07-12 12:57 ` Pip Cet
2019-07-12 13:31 ` Eli Zaretskii
2019-07-12 13:51 ` Pip Cet
2019-07-12 15:05 ` Eli Zaretskii
2019-07-12 18:06 ` Pip Cet
2019-07-12 18:27 ` Eli Zaretskii
2019-07-12 18:34 ` Eli Zaretskii
2019-07-12 19:24 ` Pip Cet
2019-07-12 19:57 ` Eli Zaretskii
2019-07-13 14:37 ` Pip Cet
2019-07-13 15:03 ` Eli Zaretskii
2019-07-13 15:13 ` Eli Zaretskii
2019-07-13 15:54 ` Eli Zaretskii
2019-07-13 15:57 ` Pip Cet
2019-07-13 16:02 ` Eli Zaretskii
2019-07-13 18:17 ` Pip Cet
2020-08-21 12:57 ` Lars Ingebrigtsen
2019-07-13 15:04 ` Andreas Politz
2019-07-12 12:44 ` Pip Cet
2019-07-12 12:55 ` Eli Zaretskii
2019-07-12 13:40 ` Pip Cet
2019-07-12 13:51 ` Eli Zaretskii
2019-07-12 14:34 ` Pip Cet
2019-07-12 15:02 ` Eli Zaretskii
2019-07-12 19:30 ` Pip Cet
2019-07-13 6:50 ` Eli Zaretskii
2021-06-06 15:50 ` dick.r.chiang
[not found] ` <87fsxv8182.fsf@dick>
2021-06-06 16:35 ` Eli Zaretskii
2021-06-06 19:10 ` dick.r.chiang
2021-06-06 19:27 ` Eli Zaretskii
2021-06-09 21:40 ` dick.r.chiang
2021-06-10 6:46 ` Eli Zaretskii
2021-06-10 11:52 ` dick.r.chiang
2021-06-10 14:18 ` Eli Zaretskii
2021-06-10 14:55 ` dick.r.chiang
2021-06-10 15:04 ` Eli Zaretskii
2021-06-10 21:36 ` dick.r.chiang
2021-06-11 6:00 ` Eli Zaretskii
2021-06-19 17:53 ` Eli Zaretskii
2021-06-19 19:14 ` dick.r.chiang
2021-06-19 19:18 ` Eli Zaretskii
2021-06-19 21:12 ` dick.r.chiang
2021-06-20 11:39 ` Eli Zaretskii
2021-06-20 14:01 ` dick.r.chiang
2021-06-20 15:53 ` Eli Zaretskii
2021-06-25 13:54 ` Eli Zaretskii
2021-06-10 15:35 ` Andreas Schwab
2021-06-10 15:39 ` Eli Zaretskii
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/emacs.git
https://git.savannah.gnu.org/cgit/emacs/org-mode.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.