* bug#17561: 24.4.50; Emacs can forget processes @ 2014-05-23 15:52 Jorgen Schaefer 2014-05-23 16:28 ` bug#17561: " Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-23 15:52 UTC (permalink / raw) To: 17561 From time to time, my Emacs tends to become very slow. I have tracked this down to a weird interaction with certain processes. Sometimes, when a process dies (e.g. after its buffer is killed), Emacs apparently does not clean up correctly. The sentinel is called, the process itself receives a termination signal, but the process object still shows up in M-x list-processes, and there is a zombie process in ps. When this happens, Emacs is in a tight loop trying to read from the dead process(es): 21:24:00.948071 pselect6(12, [3 4 9 11], [], NULL, {0, 1376013}, {NULL, 8}) = 2 (in [9 11], left {0, 1373467}) <0.000011> 21:24:00.948529 read(9, "", 4096) = 0 <0.000009> 21:24:00.948564 read(11, "", 4096) = 0 <0.000008> Killing the processes in M-x list-processes using `process-menu-delete-process' returns Emacs to a usable state, albeit the zombie processes stay around. I have been able to reliably reproduce the behavior using this snippet: (with-temp-buffer (let ((process-connection-type t)) (set-process-query-on-exit-flag (start-process "cat" (current-buffer) "cat") nil))) I.e. create a buffer, associate a process with it, remove the query on exit flag, and kill the buffer. The `process-connection-type' nil is the relevant part. With t, it works correctly. The snippet does *not* work on a freshly-started Emacs. A random time after the bug shows up, Emacs also cleans up the zombie processes and the bug can not be triggered anymore. Until a whole while later, it shows up again. I have seen this happen in 24.3.50, 24.3.91 (pretest 2), and current trunk. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-23 15:52 bug#17561: 24.4.50; Emacs can forget processes Jorgen Schaefer @ 2014-05-23 16:28 ` Paul Eggert 2014-05-23 16:44 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-23 16:28 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 [-- Attachment #1: Type: text/plain, Size: 604 bytes --] How was Emacs configured and built on your platform? What's your platform? Can you run the shell command 'strace -p' on an Emacs with the problem, and trace the system calls near the offending area? Are you using anything involving SIGUSR1 or SIGUSR2? I see an unlikely race condition there. That's part of the problem with this code: it's so racy that it seems that every time I look at it I find another unlikely race condition. I'm attaching a patch to emacs-24 for the unlikely races I found in a quick look at the area, not that I think it'll fix your problem. I haven't installed this. [-- Attachment #2: racy.patch --] [-- Type: text/plain, Size: 2791 bytes --] === modified file 'src/nsterm.m' --- src/nsterm.m 2014-04-04 16:32:24 +0000 +++ src/nsterm.m 2014-05-23 16:13:50 +0000 @@ -4427,7 +4427,7 @@ #ifdef NS_IMPL_GNUSTEP /* GNUstep steals SIGCHLD for use in NSTask, but we don't use NSTask. We must re-catch it so subprocess works. */ - catch_child_signal (); + catch_child_signal (false); #endif return dpyinfo; } === modified file 'src/process.c' --- src/process.c 2014-05-03 20:13:10 +0000 +++ src/process.c 2014-05-23 16:17:57 +0000 @@ -6254,7 +6254,7 @@ #ifdef NS_IMPL_GNUSTEP /* NSTask in GNUstep sets its child handler each time it is called. So we must re-set ours. */ - catch_child_signal(); + catch_child_signal (true); #endif } @@ -7062,16 +7062,18 @@ /* Arrange to catch SIGCHLD if this hasn't already been arranged. Invoke this after init_process_emacs, and after glib and/or GNUstep futz with the SIGCHLD handler, but before Emacs forks any children. - This function's caller should block SIGCHLD. */ + futz with the SIGCHLD handler, but before Emacs forks any children. + If ALREADY_BLOCKED, this function's caller has already blocked SIGCHLD. */ #ifndef NS_IMPL_GNUSTEP static #endif void -catch_child_signal (void) +catch_child_signal (bool already_blocked) { struct sigaction action, old_action; emacs_sigaction_init (&action, deliver_child_signal); + if (!already_blocked) block_child_signal (); sigaction (SIGCHLD, &action, &old_action); eassert (! (old_action.sa_flags & SA_SIGINFO)); @@ -7081,6 +7083,7 @@ = (old_action.sa_handler == SIG_DFL || old_action.sa_handler == SIG_IGN ? dummy_handler : old_action.sa_handler); + if (!already_blocked) unblock_child_signal (); } #endif /* subprocesses */ @@ -7107,7 +7110,7 @@ it into lib_child_handler. */ g_source_unref (g_child_watch_source_new (getpid ())); #endif - catch_child_signal (); + catch_child_signal (false); } FD_ZERO (&input_wait_mask); === modified file 'src/process.h' --- src/process.h 2014-01-01 07:43:34 +0000 +++ src/process.h 2014-05-23 16:12:58 +0000 @@ -241,7 +241,7 @@ extern void add_write_fd (int fd, fd_callback func, void *data); extern void delete_write_fd (int fd); #ifdef NS_IMPL_GNUSTEP -extern void catch_child_signal (void); +extern void catch_child_signal (bool); #endif #ifdef WINDOWSNT === modified file 'src/sysdep.c' --- src/sysdep.c 2014-04-16 13:27:28 +0000 +++ src/sysdep.c 2014-05-23 15:58:51 +0000 @@ -1515,6 +1515,12 @@ #ifdef PROFILER_CPU_SUPPORT sigaddset (&action->sa_mask, SIGPROF); #endif +#ifdef SIGUSR1 + sigaddset (&action->sa_mask, SIGUSR1); +#endif +#ifdef SIGUSR2 + sigaddset (&action->sa_mask, SIGUSR2); +#endif #ifdef SIGWINCH sigaddset (&action->sa_mask, SIGWINCH); #endif ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-23 16:28 ` bug#17561: " Paul Eggert @ 2014-05-23 16:44 ` Jorgen Schaefer 2014-05-24 23:01 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-23 16:44 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 On Fri, 23 May 2014 09:28:44 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > How was Emacs configured and built on your platform? What's your > platform? I have seen this bug happen on a Debian 7.5 Wheezy x86_64 with an Emacs configured with --without-x and an openSUSE 12.3 amd64 with a GUI version (can't currently check which toolkit it uses by default there). > Can you run the shell command 'strace -p' on an Emacs with the > problem, and trace the system calls near the offending area? The bug report includes an strace output. Emacs does the pselect6 call and then read calls for all "broken" sockets in a tight loop, and pretty much nothing else. > Are you using anything involving SIGUSR1 or SIGUSR2? I see an > unlikely race condition there. Not to my knowledge, unless Emacs sends those signals in some situation without it being obvious; I do not send any signals myself, all I do is use `kill-buffer' and `delete-process'. > That's part of the problem with this code: it's so racy that it seems > that every time I look at it I find another unlikely race condition. > I'm attaching a patch to emacs-24 for the unlikely races I found in a > quick look at the area, not that I think it'll fix your problem. I > haven't installed this. read(2) returning 0 indicates an EOF (which pselect6 indicates as waiting input), I suspect there could at least be a sanity check here to close the file descriptors instead of continuously polling. Regards, Jorgen ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-23 16:44 ` Jorgen Schaefer @ 2014-05-24 23:01 ` Paul Eggert 2014-05-25 7:57 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-24 23:01 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 Jorgen Schaefer wrote: >> Can you run the shell command 'strace -p' on an Emacs with the >> problem, and trace the system calls near the offending area? > > The bug report includes an strace output. Emacs does the pselect6 call Sorry, I should have been more specific. It'd be nice to see the strace output for when you run start-process in a temp buffer, not for later when Emacs is in a tight loop. > read(2) returning 0 indicates an EOF (which pselect6 indicates as > waiting input), I suspect there could at least be a sanity check here > to close the file descriptors instead of continuously polling. Maybe, though I worry that might mask the bug and might cause other, more serious failures down the road. It'd be nicer to fix the actual bug, if we can figure out what it is. For what it's worth I tried reproducing the bug with emacs-24 'emacs -Q' on Fedora, using your recipe, but this didn't have a problem. I ran it 1000 times rapidly in sequence and it was OK. I ran it 10000 times and ran out of file descriptors but that's to be expected. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-24 23:01 ` Paul Eggert @ 2014-05-25 7:57 ` Jorgen Schaefer 2014-05-26 17:08 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-25 7:57 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 On Sat, 24 May 2014 16:01:20 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Jorgen Schaefer wrote: > > >> Can you run the shell command 'strace -p' on an Emacs with the > >> problem, and trace the system calls near the offending area? > > > > The bug report includes an strace output. Emacs does the pselect6 > > call > > Sorry, I should have been more specific. It'd be nice to see the > strace output for when you run start-process in a temp buffer, not > for later when Emacs is in a tight loop. Oh, my bad. Luckily, the bug just occurred again (it's really non- deterministic): strace -fttT output The trace starts with a number of these triplets, which seem to be "Emacs behaving normally". After terminating the process and returning to "normal" later, Emacs does not issue these rt_sigprocmask calls anymore, though. 16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635> 16300 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000011> 16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010> Then there's a large bunch of syscalls related to my command to restart the process, with the "\r" now being me sending the M-x command: 16300 09:41:28.298391 read(3, "\r", 1) = 1 <0.000012> 16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009> 16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009> 16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021> 16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235> 16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000017> 16300 09:41:28.321841 write(3, "\33[K\33[H\n\n", 8) = 8 <0.000018> 16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000012> 16300 09:41:28.321975 --- SIGIO (I/O possible) @ 0 (0) --- 16300 09:41:28.322004 rt_sigreturn(0x1d) = 0 <0.000013> 16300 09:41:28.322056 ioctl(3, FIONREAD, [0]) = 0 <0.000014> 16300 09:41:28.322183 ioctl(3, FIONREAD, [0]) = 0 <0.000012> 16300 09:41:28.322253 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 499801124}, {NULL, 8}) = 1 (in [15], left {0, 499788244}) <0.000024> 16300 09:41:28.322337 read(15, "", 4096) = 0 <0.000012> 16300 09:41:28.322475 ioctl(3, FIONREAD, [0]) = 0 <0.000014> 16300 09:41:28.322533 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 499519197}, {NULL, 8}) = 1 (in [15], left {0, 499513475}) <0.000017> 16300 09:41:28.322603 read(15, "", 4096) = 0 <0.000010> Killing the process via the process list (hence the "d"): 16300 09:49:00.151571 read(3, "d", 1) = 1 <0.000009> 16300 09:49:00.151604 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 16300 09:49:00.151636 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 16300 09:49:00.151790 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000020> 16300 09:49:00.151849 kill(4294953129, SIGKILL) = 0 <0.000013> 16300 09:49:00.151896 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000007> 16300 09:49:00.151964 close(12) = 0 <0.000015> 16300 09:49:00.152003 close(15) = 0 <0.000010> 16300 09:49:00.153389 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000010> 16300 09:49:00.153531 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000008> > > read(2) returning 0 indicates an EOF (which pselect6 indicates as > > waiting input), I suspect there could at least be a sanity check > > here to close the file descriptors instead of continuously polling. > > Maybe, though I worry that might mask the bug and might cause other, > more serious failures down the road. It'd be nicer to fix the actual > bug, if we can figure out what it is. > > For what it's worth I tried reproducing the bug with emacs-24 'emacs > -Q' on Fedora, using your recipe, but this didn't have a problem. I > ran it 1000 times rapidly in sequence and it was OK. I ran it 10000 > times and ran out of file descriptors but that's to be expected. Yes, the bug is really random. I have not been able to intentionally reproduce it. It just happens once in a while. Until this bug report, I wasn't even sure what was causing my Emacs to stop responding in the first place. It just happened, once every few days, that Emacs suddenly becomes very sluggish. My Emacs sessions usually have a number of processes open. When the bug showed up just now, it was five processes and three network connections, for example. I'm not sure if that's related. Regards, Jorgen ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-25 7:57 ` Jorgen Schaefer @ 2014-05-26 17:08 ` Paul Eggert 2014-05-26 18:49 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-26 17:08 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 Jorgen Schaefer wrote: > The trace starts with a number of these triplets, which seem to be > "Emacs behaving normally". > > 16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635> > 16300 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000011> > 16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010> I don't observe this behavior when running Emacs on Fedora 20 (this is the latest emacs-24 version, running your little test). I ran: strace -o /tmp/tr -fttT src/bootstrap-emacs -nw -Q I did see this: 29589 09:44:10.979952 ioctl(4, FIONREAD, [0]) = 0 <0.000023> 29589 09:44:10.980031 ioctl(4, FIONREAD, [0]) = 0 <0.000024> 29589 09:44:10.980143 pselect6(6, [4 5], [], NULL, {0, 499810466}, {NULL, 8}) = 0 (Timeout) <0.500499> 29589 09:44:11.480745 poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout) <0.000030> 29589 09:44:11.480861 read(5, 0x7fff777b3820, 16) = -1 EAGAIN (Resource temporarily unavailable) <0.000029> 29589 09:44:11.481214 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000030> 29589 09:44:11.481505 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000027> 29589 09:44:11.481594 ioctl(4, FIONREAD, [0]) = 0 <0.000030> but that's not really the same. Could you please try running emacs -nw -Q with your test case, and see whether it behaves like the pattern on my platform? If so, we might try to investigate why Emacs changes from this pattern to the pattern that you observe. > Then there's a large bunch of syscalls related to my command to restart > the process, with the "\r" now being me sending the M-x command: Which M-x command was that? M-x list-processes? > 16300 09:41:28.298391 read(3, "\r", 1) = 1 <0.000012> > 16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009> > 16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009> > 16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021> > 16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235> > 16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000017> > 16300 09:41:28.321841 write(3, "\33[K\33[H\n\n", 8) = 8 <0.000018> > 16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000012> When I try your test, the child process is running in parallel with the parent, the 'kill' terminates the child, and the parent is signaled. In contrast your trace shows no child, leading me to guess that the child has already exited (so the parent is killing a zombie), which means it's not the test case you sent but some other process (since the test case you sent waits on a pty so the child shouldn't exit). Here's the trace I see around the kill: 29592 09:44:24.494089 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) <0.000028> 29589 09:44:24.494167 kill(4294937704, SIGHUP <unfinished ...> 29592 09:44:24.494199 open("/usr/lib64/alliance/lib/tls/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC <unfinished ...> 29589 09:44:24.494218 <... kill resumed> ) = 0 <0.000040> 29592 09:44:24.494236 <... open resumed> ) = -1 ENOENT (No such file or directory) <0.000025> 29592 09:44:24.494267 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=29589, si_uid=1000} --- 29592 09:44:24.494375 +++ killed by SIGHUP +++ 29589 09:44:24.494388 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=29592, si_status=SIGHUP, si_utime=0, si_stime=0} --- 29589 09:44:24.494435 wait4(29592, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGHUP}], WNOHANG|WSTOPPED|WCONTINUED, NULL) = 29592 <0.000037> 29589 09:44:24.494507 rt_sigreturn() = 12159536 <0.000021> 29589 09:44:24.494603 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0 <0.000019> 29589 09:44:24.494653 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0 <0.000017> 29589 09:44:24.495590 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000025> 29589 09:44:24.496152 write(4, "\r\n\33[?25lnil\33[48;34H\33[7m11\33[0m\33[3"..., 70) = 70 <0.000036> 29589 09:44:24.496236 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000015> 29589 09:44:24.496282 --- SIGIO {si_signo=SIGIO, si_code=SI_KERNEL} --- 29589 09:44:24.496306 rt_sigreturn() = 0 <0.000015> 29589 09:44:24.496350 ioctl(4, FIONREAD, [0]) = 0 <0.000016> Perhaps you could run your test on a fresh emacs -Q -nw and see whether it matches the behavior I'm seeing. > Killing the process via the process list (hence the "d"): You typed "d" in the window generated by M-x list-processes? When I try that, it says "d is undefined". > My Emacs sessions usually have a number of processes open. When the bug > showed up just now, it was five processes and three network connections, > for example. I'm not sure if that's related. I expect that it is, unfortunately. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-26 17:08 ` Paul Eggert @ 2014-05-26 18:49 ` Jorgen Schaefer 2014-05-26 23:58 ` Paul Eggert 2014-05-27 4:05 ` Paul Eggert 0 siblings, 2 replies; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-26 18:49 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 [-- Attachment #1: Type: text/plain, Size: 6728 bytes --] On Mon, 26 May 2014 10:08:38 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Jorgen Schaefer wrote: > > The trace starts with a number of these triplets, which seem to be > > "Emacs behaving normally". > > > > 16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [], > > NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635> 16300 > > 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 > > <0.000011> 16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH > > IO], NULL, 8) = 0 <0.000010> > > I don't observe this behavior when running Emacs on Fedora 20 (this > is the latest emacs-24 version, running your little test). I ran: > > strace -o /tmp/tr -fttT src/bootstrap-emacs -nw -Q > > I did see this: > > 29589 09:44:10.979952 ioctl(4, FIONREAD, [0]) = 0 <0.000023> > 29589 09:44:10.980031 ioctl(4, FIONREAD, [0]) = 0 <0.000024> > 29589 09:44:10.980143 pselect6(6, [4 5], [], NULL, {0, 499810466}, > {NULL, 8}) = 0 (Timeout) <0.500499> > 29589 09:44:11.480745 poll([{fd=5, events=POLLIN}], 1, 0) = 0 > (Timeout) <0.000030> > 29589 09:44:11.480861 read(5, 0x7fff777b3820, 16) = -1 EAGAIN > (Resource temporarily unavailable) <0.000029> > 29589 09:44:11.481214 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) > = 0 <0.000030> > 29589 09:44:11.481505 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, > 8) = 0 <0.000027> > 29589 09:44:11.481594 ioctl(4, FIONREAD, [0]) = 0 <0.000030> > > but that's not really the same. Could you please try running emacs > -nw -Q with your test case, and see whether it behaves like the > pattern on my platform? If so, we might try to investigate why Emacs > changes from this pattern to the pattern that you observe. In "normal mode" (with no processes running), Emacs simply hangs in a pselect6 with a "pretty long timeout". I have attached the full traceback of the emacs -Q session. > > Then there's a large bunch of syscalls related to my command to > > restart the process, with the "\r" now being me sending the M-x > > command: > > Which M-x command was that? M-x list-processes? M-x elpy-rpc-restart https://github.com/jorgenschaefer/elpy/blob/master/elpy.el#L1654-1662 The call to `kill-process' was added only recently as part of debugging this problem, the bug showed up with only the `kill-buffer' call, too. The Elpy RPC processes are where I notice this bug the most, as there are often quite a few of them, and I kill them regularly during development, but I have seen this with e.g. openssl processes, too. > > 16300 09:41:28.298391 read(3, "\r", 1) = 1 <0.000012> > > 16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009> > > 16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009> > > 16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021> > > 16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235> > > 16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, > > 8) = 0 <0.000017> 16300 09:41:28.321841 write(3, "\33[K\33[H\n\n", > > 8) = 8 <0.000018> 16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK, > > [WINCH IO], NULL, 8) = 0 <0.000012> > > When I try your test, the child process is running in parallel with > the parent, the 'kill' terminates the child, and the parent is > signaled. In contrast your trace shows no child, leading me to guess > that the child has already exited (so the parent is killing a > zombie), which means it's not the test case you sent but some other > process (since the test case you sent waits on a pty so the child > shouldn't exit). That was for the already-existing process that exhibited the problem. I wasn't able to reproduce it with the exact same code I sent this time. Sorry, I should have been more clear about this. > > Killing the process via the process list (hence the "d"): > > You typed "d" in the window generated by M-x list-processes? When I > try that, it says "d is undefined". "d runs the command process-menu-delete-process" This is on a recent trunk version. It seems this was added after the 24.4 branch. (As I can't reliably reproduce this, I have to use whichever Emacs I happen to be running at that point.) I have also attached a file the-bug.txt.gz, which shows the strace of a running Emacs session which starts to exhibit the bug, so we now have a trace that includes the time when the bug actually happens. It shows me starting Python subprocesses (elpy RPC processes) and killing them, in an attempt to reproduce the problem. Eventually, this succeeds, and Emacs goes into the loop reading from fd 15. I snipped a few megabytes of logs there (marked "[...]"). Interestingly, I did not kill anything there, I ran my "reproduction case", and Emacs cleaned up the broken process by itself. If I read this correctly, the "broken" process in question is 27012 (16300 is emacs). It gets killed at some point, right after a RET is read (the elpy-rpc-restart), and goes straight into the loop: 16300 20:01:25.347174 kill(4294940284, SIGKILL <unfinished ...> 27012 20:01:25.349483 +++ killed by SIGKILL +++ 16300 20:01:25.349515 <... kill resumed> ) = 0 <0.002304> 16300 20:01:25.349649 kill(4294940284, SIGHUP) = 0 <0.000019> 16300 20:01:25.354299 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.0000 16300 20:01:25.354585 write(3, "\33[K\33[16;1H", 10) = 10 <0.000025> 16300 20:01:25.354687 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.00 16300 20:01:25.354753 --- SIGIO (I/O possible) @ 0 (0) --- 16300 20:01:25.354789 rt_sigreturn(0x1d) = 0 <0.000016> 16300 20:01:25.354862 ioctl(3, FIONREAD, [0]) = 0 <0.000017> 16300 20:01:25.355015 ioctl(3, FIONREAD, [0]) = 0 <0.000017> 16300 20:01:25.355098 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 670925 16300 20:01:25.355222 read(15, "", 4096) = 0 <0.000014> 16300 20:01:25.355455 ioctl(3, FIONREAD, [0]) = 0 <0.000017> 16300 20:01:25.355541 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 626503 16300 20:01:25.355642 read(15, "", 4096) = 0 <0.000015> 16300 20:01:25.355710 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 609628 16300 20:01:25.355804 read(15, "", 4096) = 0 <0.000015> 16300 20:01:25.355869 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 593577 16300 20:01:25.355962 read(15, "", 4096) = 0 <0.000014> 16300 20:01:25.356027 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 577782 16300 20:01:25.356156 read(15, "", 4096) = 0 <0.000015> Comparing this with the "normal" code, I do not see a SIGCHLD. Emacs does a SIGKILL (delete-process) followed by a SIGHUP (from killing the buffer?), but there's no SIGCHLD. The last rt_sigprocmask with SIG_SETMASK before that sets QUIT ALRM PROF, so that does not seem to be the culprit. Strange stuff. (And I see you noticed the discussion emacs-devel about epa :-)) Regards, Jorgen [-- Attachment #2: emacs-Q-trace.txt --] [-- Type: text/plain, Size: 9467 bytes --] 26726 19:49:21.623623 pselect6(4, [3], [], NULL, {99984, 954975738}, {NULL, 8}) = 1 (in [3], left {99978, 764318633}) <6.190686> 26726 19:49:27.814492 --- SIGIO (I/O possible) @ 0 (0) --- 26726 19:49:27.814539 rt_sigreturn(0x1d) = 1 <0.000010> 26726 19:49:27.814604 ioctl(3, FIONREAD, [1]) = 0 <0.000010> 26726 19:49:27.814664 read(3, "\30", 1) = 1 <0.000009> 26726 19:49:27.814709 ioctl(3, FIONREAD, [0]) = 0 <0.000008> 26726 19:49:27.814742 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 26726 19:49:27.814910 ioctl(3, FIONREAD, [0]) = 0 <0.000009> 26726 19:49:27.814956 pselect6(4, [3], [], NULL, {0, 499942730}, {NULL, 8}) = 1 (in [3], left {0, 130792312}) <0.369205> 26726 19:49:28.184311 --- SIGIO (I/O possible) @ 0 (0) --- 26726 19:49:28.184394 rt_sigreturn(0x1d) = 1 <0.000032> 26726 19:49:28.184529 ioctl(3, FIONREAD, [1]) = 0 <0.000027> 26726 19:49:28.184654 read(3, "\5", 1) = 1 <0.000031> 26726 19:49:28.184777 ioctl(3, FIONREAD, [0]) = 0 <0.000025> 26726 19:49:28.184893 ioctl(3, FIONREAD, [0]) = 0 <0.000026> 26726 19:49:28.185605 faccessat(AT_FDCWD, "/home/forcer/.", F_OK) = 0 <0.000038> 26726 19:49:28.185814 faccessat(AT_FDCWD, "/home/forcer/bin/cat", X_OK) = -1 ENOENT (No such file or directory) <0.000036> 26726 19:49:28.186049 faccessat(AT_FDCWD, "/home/forcer/bin/cat", X_OK) = -1 ENOENT (No such file or directory) <0.000034> 26726 19:49:28.186235 faccessat(AT_FDCWD, "/usr/local/bin/cat", X_OK) = -1 ENOENT (No such file or directory) <0.000037> 26726 19:49:28.186416 faccessat(AT_FDCWD, "/usr/bin/cat", X_OK) = -1 ENOENT (No such file or directory) <0.000034> 26726 19:49:28.186602 faccessat(AT_FDCWD, "/bin/cat", X_OK) = 0 <0.000033> 26726 19:49:28.186732 stat("/bin/cat", {st_mode=S_IFREG|0755, st_size=51856, ...}) = 0 <0.000030> 26726 19:49:28.186983 open("/dev/ptmx", O_RDWR|O_NOCTTY|O_CLOEXEC) = 4 <0.000099> 26726 19:49:28.187360 statfs("/dev/pts", {f_type="DEVPTS_SUPER_MAGIC", f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0 <0.000036> 26726 19:49:28.187541 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000025> 26726 19:49:28.187653 rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0 <0.000027> 26726 19:49:28.187800 ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0 <0.000027> 26726 19:49:28.187927 ioctl(4, TIOCGPTN, [0]) = 0 <0.000023> 26726 19:49:28.188035 stat("/dev/pts/0", {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000031> 26726 19:49:28.188216 getuid() = 1000 <0.000023> 26726 19:49:28.188336 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 5 <0.000042> 26726 19:49:28.188482 connect(5, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) <0.000064> 26726 19:49:28.188679 close(5) = 0 <0.000040> 26726 19:49:28.188804 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 5 <0.000031> 26726 19:49:28.188920 connect(5, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) <0.000049> 26726 19:49:28.189104 close(5) = 0 <0.000044> 26726 19:49:28.189271 open("/etc/group", O_RDONLY|O_CLOEXEC) = 5 <0.000038> 26726 19:49:28.189411 lseek(5, 0, SEEK_CUR) = 0 <0.000027> 26726 19:49:28.189523 fstat(5, {st_mode=S_IFREG|0644, st_size=801, ...}) = 0 <0.000025> 26726 19:49:28.189684 mmap(NULL, 801, PROT_READ, MAP_SHARED, 5, 0) = 0x7f5a0c8b2000 <0.000032> 26726 19:49:28.189810 lseek(5, 801, SEEK_SET) = 801 <0.000026> 26726 19:49:28.189945 munmap(0x7f5a0c8b2000, 801) = 0 <0.000036> 26726 19:49:28.190063 close(5) = 0 <0.000028> 26726 19:49:28.190194 ioctl(4, TIOCSPTLCK, [0]) = 0 <0.000025> 26726 19:49:28.190319 ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0 <0.000027> 26726 19:49:28.190449 ioctl(4, TIOCGPTN, [0]) = 0 <0.000025> 26726 19:49:28.190563 stat("/dev/pts/0", {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000032> 26726 19:49:28.190735 rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0 <0.000027> 26726 19:49:28.190874 faccessat(AT_FDCWD, "/dev/pts/0", R_OK|W_OK) = 0 <0.000040> 26726 19:49:28.191050 ioctl(4, FIONBIO, [1]) = 0 <0.000023> 26726 19:49:28.191181 pipe2([5, 6], O_CLOEXEC) = 0 <0.000038> 26726 19:49:28.191321 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000027> 26726 19:49:28.191427 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000023> 26726 19:49:28.191537 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 <0.000026> 26726 19:49:28.191671 vfork() = 26824 <0.001036> 26726 19:49:28.192851 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000026> 26726 19:49:28.192979 close(6) = 0 <0.000025> 26726 19:49:28.193083 read(5, <unfinished ...> 26824 19:49:28.193348 setsid() = 26824 <0.000032> 26824 19:49:28.193479 open("/dev/tty", O_RDWR|O_CLOEXEC) = -1 ENXIO (No such device or address) <0.000048> 26824 19:49:28.193636 open("/dev/pts/0", O_RDWR|O_CLOEXEC) = 7 <0.000077> 26824 19:49:28.193814 rt_sigaction(SIGINT, {SIG_DFL, [INT], SA_RESTORER|SA_RESTART, 0x7f5a0b4f74f0}, {0x48e780, [INT QUIT ALRM CHLD PROF WINCH IO], SA_RESTORER, 0x7f5a0bcf8030}, 8) = 0 <0.000030> 26824 19:49:28.194012 rt_sigaction(SIGQUIT, {SIG_DFL, [QUIT], SA_RESTORER|SA_RESTART, 0x7f5a0b4f74f0}, {0x48e780, [INT QUIT ALRM CHLD PROF WINCH IO], SA_RESTORER, 0x7f5a0bcf8030}, 8) = 0 <0.000028> 26824 19:49:28.194217 rt_sigaction(SIGPROF, {SIG_DFL, [PROF], SA_RESTORER|SA_RESTART, 0x7f5a0b4f74f0}, {SIG_IGN, [PROF], SA_RESTORER|SA_RESTART, 0x7f5a0b4f74f0}, 8) = 0 <0.000027> 26824 19:49:28.194407 rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE], SA_RESTORER|SA_RESTART, 0x7f5a0b4f74f0}, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f5a0b4f74f0}, 8) = 0 <0.000030> 26824 19:49:28.194601 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000028> 26824 19:49:28.194731 ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0 <0.000032> 26824 19:49:28.194864 ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0 <0.000030> 26824 19:49:28.194995 ioctl(7, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon -echo ...}) = 0 <0.000035> 26824 19:49:28.195127 ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon -echo ...}) = 0 <0.000031> 26824 19:49:28.195280 ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon -echo ...}) = 0 <0.000032> 26824 19:49:28.195424 getpid() = 26824 <0.000025> 26824 19:49:28.195561 chdir("/home/forcer") = 0 <0.000033> 26824 19:49:28.195979 dup2(7, 0) = 0 <0.000029> 26824 19:49:28.196239 dup2(7, 1) = 1 <0.000028> 26824 19:49:28.196359 dup2(7, 2) = 2 <0.000027> 26824 19:49:28.196484 setpgid(0, 0) = -1 EPERM (Operation not permitted) <0.000027> 26824 19:49:28.196614 ioctl(0, TIOCSPGRP, [26824]) = 0 <0.000030> 26824 19:49:28.196750 execve("/bin/cat", ["/bin/cat"], [/* 46 vars */] <unfinished ...> 26726 19:49:28.197778 <... read resumed> "", 1) = 0 <0.004663> 26726 19:49:28.197862 close(5) = 0 <0.000034> 26726 19:49:28.198096 kill(4294940472, SIGHUP) = 0 <0.000042> 26726 19:49:28.198775 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000028> 26726 19:49:28.199293 write(3, "\33[50;1H\33[?25lnil\33[11;1H\33[34h\33[?2"..., 39) = 39 <0.000037> 26726 19:49:28.199457 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000027> 26726 19:49:28.199613 ioctl(3, FIONREAD, [0]) = 0 <0.000028> 26726 19:49:28.199745 pselect6(5, [3 4], [], NULL, {0, 499841222}, {NULL, 8} <unfinished ...> 26824 19:49:28.200000 <... execve resumed> ) = 0 <0.003081> 26824 19:49:28.200094 --- SIGHUP (Hangup) @ 0 (0) --- 26726 19:49:28.200449 <... pselect6 resumed> ) = 1 (in [4], left {0, 499233010}) <0.000639> 26726 19:49:28.200557 read(4, 0x7fff81fed790, 4096) = -1 EIO (Input/output error) <0.000027> 26726 19:49:28.200685 pselect6(5, [3], [], NULL, {0, 498900754}, {NULL, 8}) = ? ERESTARTNOHAND (To be restarted) <0.000101> 26726 19:49:28.200885 --- SIGCHLD (Child exited) @ 0 (0) --- 26726 19:49:28.200955 wait4(26824, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGHUP}], WNOHANG|WSTOPPED|WCONTINUED, NULL) = 26824 <0.000053> 26726 19:49:28.201114 rt_sigreturn(0x11) = -1 EINTR (Interrupted system call) <0.000028> 26726 19:49:28.201273 pselect6(5, [3], NULL, NULL, {0, 0}, {NULL, 8}) = 0 (Timeout) <0.000031> 26726 19:49:28.201425 read(4, 0x7fff81fed740, 4096) = -1 EIO (Input/output error) <0.000027> 26726 19:49:28.201550 futex(0x7f5a0b84c840, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000026> 26726 19:49:28.201730 close(4) = 0 <0.000268> 26726 19:49:28.202082 --- SIGIO (I/O possible) @ 0 (0) --- 26726 19:49:28.202151 rt_sigreturn(0x1d) = 0 <0.000028> 26726 19:49:28.202276 ioctl(3, FIONREAD, [0]) = 0 <0.000029> 26726 19:49:28.203317 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000030> 26726 19:49:28.203733 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000026> 26726 19:49:28.204260 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000030> 26726 19:49:28.204635 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000025> 26726 19:49:28.204767 ioctl(3, FIONREAD, [0]) = 0 <0.000026> 26726 19:49:28.204889 pselect6(4, [3], [], NULL, {0, 494695989}, {NULL, 8}) = 0 (Timeout) <0.495267> 26726 19:49:28.700416 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000009> 26726 19:49:28.700556 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000008> 26726 19:49:28.700602 pselect6(4, [3], [], NULL, {29, 504268459}, {NULL, 8} <unfinished ...> [-- Attachment #3: the-bug.txt.gz --] [-- Type: application/gzip, Size: 106262 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-26 18:49 ` Jorgen Schaefer @ 2014-05-26 23:58 ` Paul Eggert 2014-05-27 18:27 ` Jorgen Schaefer 2014-05-27 4:05 ` Paul Eggert 1 sibling, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-26 23:58 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 Jorgen Schaefer wrote: > the-bug.txt.gz ... shows the strace of a > running Emacs session which starts to exhibit the bug, so we now have a > trace that includes the time when the bug actually happens. It shows me > starting Python subprocesses (elpy RPC processes) and killing them, in > an attempt to reproduce the problem. Eventually, this succeeds, and > Emacs goes into the loop reading from fd 15. Thanks for that trace; it's helpful. A couple of things. First, what platform are you running on, exactly? I'm running on an AMD Phenom II X4 910e (according to /proc/cpuinfo), and using Fedora 20 x86-64 (Linux kernel 3.14.4-200.fc20.x86_64, glibc 2.18), and compiling with GCC 4.9.0. Second, looking at the-bug.txt I'm suspicious about the interaction between vfork, pthread_sigmask (aka rt_sigprocmask) and rt_sigreturn. Could you please try building Emacs with fork rather than vfork and see whether that fixes the problem? If it fixes things, great; if not I'd like to see another strace of the bug. To use fork instead of vfork, you can run "./configure ac_cv_func_fork_works=no", or (faster but flimsier) edit src/config.h to add "#define fork vfork". Here are some details as to why I think vfork is related to the problem. You see a bug on line 7717 of the trace in the-bug.txt, where read(15, ...) returns 0 and Emacs just keeps trying. I see bugs before that, including: * On trace line 1043, file descriptor 15 is created, as part of the preparation for vforking and execing subprocess 27012. So far so good. * On trace line 1048 vfork succeeds. Still good. * On trace line 1049 the parent process continues, and executes another system call! The child process has neither execed nor terminated. This is incorrect; if vfork is called, the Linux man page says "the calling thread is suspended until the child terminates (either normally, by calling _exit(2), or abnormally, after delivery of a fatal signal), or it makes a call to execve(2)". I'm reasonably sure that Emacs relies on this and will fail badly if the parent and child can stomp on the same memory in parallel. (Another possibility, alas, is that strace is buggy and its output unreliable....) * On trace line 1049, the signal mask is cleared except for QUIT, ALRM, PROF. * On trace line 5598, an INT is delivered just as the INT bit is removed from the signal mask. This strongly suggests that the INT bit was set in the signal mask -- but that bit was cleared on line 1049, so what gives? This suggests that the signal mask is messed up somehow. One possibility is that the buggy vfork is doing it somehow -- perhaps the signal mask is part of the shared memory? * On trace line 7706, the subprocess exits but no CHLD signal is delivered to Emacs. Again, this suggests that the signal mask is messed up. This last problem most likely leads to the symptoms you're observing: Emacs isn't told that a child died, so it doesn't check for it. FYI when I try your test, I get something like the following strace output near the vfork. This is the correct vfork behavior, i.e., the parent is suspended until the child execs: 6690 15:51:23.093780 rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0 <0.000025> 6690 15:51:23.093873 vfork( <unfinished ...> 6692 15:51:23.094078 setsid() = 6692 <0.000070> 6692 15:51:23.094202 open("/dev/tty", O_RDWR|O_CLOEXEC) = -1 ENXIO (No such device or address) <0.000048> 6692 15:51:23.094310 open("/dev/pts/9", O_RDWR|O_CLOEXEC) = 9 <0.000041> 6692 15:51:23.094409 rt_sigaction(SIGINT, {SIG_DFL, [INT], SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {0x4ea440, [INT QUIT ALRM CHLD PROF WINCH IO], SA_RESTORER, 0x7f9de73d7750}, 8) = 0 <0.000026> 6692 15:51:23.094507 rt_sigaction(SIGQUIT, {SIG_DFL, [QUIT], SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {0x4ea440, [INT QUIT ALRM CHLD PROF WINCH IO], SA_RESTORER, 0x7f9de73d7750}, 8) = 0 <0.000026> 6692 15:51:23.094602 rt_sigaction(SIGPROF, {SIG_DFL, [PROF], SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {SIG_IGN, [PROF], SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, 8) = 0 <0.000025> 6692 15:51:23.094694 rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE], SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7f9de6e2fcb0}, 8) = 0 <0.000025> 6692 15:51:23.094793 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000026> 6692 15:51:23.094877 ioctl(9, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0 <0.000029> 6692 15:51:23.094963 ioctl(9, SNDCTL_TMR_STOP or SNDRV_TIMER_IOCTL_GINFO or TCSETSW, {B38400 opost isig icanon -echo ...}) = 0 <0.000030> 6692 15:51:23.095080 ioctl(9, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon -echo ...}) = 0 <0.000029> 6692 15:51:23.095170 getpid() = 6692 <0.000025> 6692 15:51:23.095258 chdir("/home/eggert/src/gnu/emacs/emacs-24-sc") = 0 <0.000038> 6692 15:51:23.095466 dup2(9, 0) = 0 <0.000028> 6692 15:51:23.095545 dup2(9, 1) = 1 <0.000027> 6692 15:51:23.095621 dup2(9, 2) = 2 <0.000025> 6692 15:51:23.095696 setpgid(0, 0) = -1 EPERM (Operation not permitted) <0.000026> 6692 15:51:23.095776 ioctl(0, SNDRV_TIMER_IOCTL_SELECT or TIOCSPGRP, [6692]) = 0 <0.000027> 6692 15:51:23.095872 execve("/home/eggert/opt/Linux-x86_64/merged/bin/cat", ["/home/eggert/opt/Linux-x86_64/me"...], [/* 100 vars */] <unfinished ...> 6690 15:51:23.096361 <... vfork resumed> ) = 6692 <0.002462> 6690 15:51:23.096440 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000028> ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-26 23:58 ` Paul Eggert @ 2014-05-27 18:27 ` Jorgen Schaefer 2014-05-27 21:42 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-27 18:27 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 On Mon, 26 May 2014 16:58:45 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Jorgen Schaefer wrote: > > the-bug.txt.gz ... shows the strace of a > > running Emacs session which starts to exhibit the bug, so we now > > have a trace that includes the time when the bug actually happens. > > It shows me starting Python subprocesses (elpy RPC processes) and > > killing them, in an attempt to reproduce the problem. Eventually, > > this succeeds, and Emacs goes into the loop reading from fd 15. > > Thanks for that trace; it's helpful. A couple of things. First, > what platform are you running on, exactly? I'm running on an AMD > Phenom II X4 910e (according to /proc/cpuinfo), and using Fedora 20 > x86-64 (Linux kernel 3.14.4-200.fc20.x86_64, glibc 2.18), and > compiling with GCC 4.9.0. Thanks for making me check that. There was a bit of a wtf there for me :-D The strace is from an OpenVZ virtual host: Debian GNU/Linux 7.5 (wheezy) Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz Linux kernel 3.2.41-042stab085.20 eglibc 2.13 Note the embedded glibc. I have seen a similar behavior (Emacs suddenly responding badly, improving once processes are killed) on this type of system: openSUSE 12.2 (x86_64) Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz Linux kernel 3.4.63-2.44-desktop glibc-2.15 gcc 4.7.1 But I can't know for sure if it's the same bug. > Second, looking at the-bug.txt I'm suspicious about the interaction > between vfork, pthread_sigmask (aka rt_sigprocmask) and rt_sigreturn. > Could you please try building Emacs with fork rather than vfork and > see whether that fixes the problem? If it fixes things, great; if > not I'd like to see another strace of the bug. To use fork instead > of vfork, you can run "./configure ac_cv_func_fork_works=no", or > (faster but flimsier) edit src/config.h to add "#define fork vfork". Should that be ac_cv_func_vfork_works=no? I have no recompiled Emacs with the latter change, and it's using clone(2) to execute processes, which sounds right. I haven't been able to reproduce the bug so far, sadly that doesn't say much, I have gone for over a week before without noticing the bug until it reappeared. Regards, Jorgen ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-27 18:27 ` Jorgen Schaefer @ 2014-05-27 21:42 ` Paul Eggert 2014-05-27 22:16 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-27 21:42 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 [-- Attachment #1: Type: text/plain, Size: 1576 bytes --] On 05/27/2014 11:27 AM, Jorgen Schaefer wrote: > Should that be ac_cv_func_vfork_works=no? Yes, thanks for fixing the typo in my earlier message. > The strace is from an OpenVZ virtual host Perhaps OpenVZ is contributing to the problem? There are or were relevant bugs in its implementation of vfork; e.g., see <http://lkml.org/lkml/2012/5/31/364>. > I have no[w] recompiled Emacs > with the latter change, and it's using clone(2) to execute processes, > which sounds right. I haven't been able to reproduce the bug so far So far, so good anyway. That's the good news. However, some bad news: on some platforms strace is reportedly buggy in this area, and can cause vfork to misbehave even if vfork works correctly when it's not being straced. Please see <http://www.openwall.com/lists/musl/2013/02/03/3>. If we have found the problem, I'd like to modify Emacs to avoid the kernel bug. Let's start by trying to build a dynamic test for it. Can you please build and run the attached program, and see whether you can get it to output the message "vfork bug detected" on your platform? Please compile it with the same compiler and flags that you use to compile Emacs. You may need to invoke the test program with arguments, e.g., "./a.out 100 10000" to try the test 100 times with a delay of 10,000 nanoseconds. The goal is to detect the vfork bug as quickly as possible, so if you can detect it it'd be nice to see how small we can make the delay. You might also try to run the test program under strace to see whether that changes things. Thanks. [-- Attachment #2: vfork-test.c --] [-- Type: text/plain, Size: 892 bytes --] #include <stdio.h> #include <stdlib.h> #include <sys/select.h> #include <sys/wait.h> #include <unistd.h> int main (int argc, char **argv) { int test_case_count = argc < 2 ? 10000: atoi (argv[1]); int delay_in_nanoseconds = argc < 3 ? 1000 : atoi (argv[2]); int i; for (i = 0; i < test_case_count; i++) { pid_t p = vfork (); int ok = 0; int volatile child_status = 0; if (p < 0) { perror ("vfork"); return 2; } if (p == 0) { struct timespec timeout; timeout.tv_sec = 0; timeout.tv_nsec = delay_in_nanoseconds; pselect (0, 0, 0, 0, &timeout, 0); _exit (child_status); } if (p > 0) { int status; child_status = 1; if (waitpid (p, &status, 0) == p && WIFEXITED (status) && status == 0) ok = 1; if (!ok) { printf ("vfork bug detected\n"); return 1; } } } return 0; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-27 21:42 ` Paul Eggert @ 2014-05-27 22:16 ` Jorgen Schaefer 2014-05-28 0:47 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-27 22:16 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 On Tue, 27 May 2014 14:42:03 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Let's start by trying to build a dynamic test for it. > Can you please build and run the attached program, and see whether > you can get it to output the message "vfork bug detected" on your > platform? Please compile it with the same compiler and flags that you > use to compile Emacs. You may need to invoke the test program with > arguments, e.g., "./a.out 100 10000" to try the test 100 times with a > delay of 10,000 nanoseconds. The goal is to detect the vfork bug as > quickly as possible, so if you can detect it it'd be nice to see how > small we can make the delay. You might also try to run the test > program under strace to see whether that changes things. Thanks. Compiled using: gcc -std=gnu99 -g3 -O2 -MMD -Wl,-znocombreloc -o vfork-test vfork-test.c Ran as: ./vfork-test 1000 1000000 strace -ttf -o /dev/null ./vfork-test 10000 1000000 No output. I also changed the delay to 1s just to test that, but no dice, either. :-( Jorgen ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-27 22:16 ` Jorgen Schaefer @ 2014-05-28 0:47 ` Paul Eggert 2014-05-28 20:53 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-28 0:47 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 Jorgen Schaefer wrote: > No output. Ouch. It sounds like we may be barking up the wrong tree with the assumption that there is a kernel bug with vfork, then. It could be that it's just an strace bug. I'm still suspicious of the interaction between pthread_sigmask and vfork, though. What happens if you do the following shell session, where "$" is the shell prompt and "(gdb)" is the GDB prompt? $ grep -i pthread src/config.h $ ldd src/bootstrap-emacs | grep -i pthread $ gdb src/bootstrap-emacs (gdb) disas rpl_pthread_sigmask (gdb) disas pthread_sigmask (gdb) disas rpl_vfork (gdb) disas vfork (gdb) disas rpl_fork (gdb) disas fork ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-28 0:47 ` Paul Eggert @ 2014-05-28 20:53 ` Jorgen Schaefer 2014-05-28 23:00 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-28 20:53 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 On Tue, 27 May 2014 17:47:41 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > I'm still suspicious of the interaction between pthread_sigmask and > vfork, though. What happens if you do the following shell session, > where "$" is the shell prompt and "(gdb)" is the GDB prompt? Assuming you meant in an Emacs with vfork enabled (i.e. the default): > $ grep -i pthread src/config.h /* Define to 1 if you have pthread (-lpthread). */ #define HAVE_PTHREAD 1 /* Define to 1 if you have the <pthread.h> header file. */ #define HAVE_PTHREAD_H 1 /* Define to 1 if the pthread_sigmask function can be used (despite bugs). */ /* #undef HAVE_PTHREAD_SIGMASK */ /* Define to 1 if pthread_sigmask(), when it fails, returns -1 and sets errno. /* #undef PTHREAD_SIGMASK_FAILS_WITH_ERRNO */ /* Define to 1 if pthread_sigmask() may returns 0 and have no effect. */ /* #undef PTHREAD_SIGMASK_INEFFECTIVE */ /* Define to 1 if pthread_sigmask() unblocks signals incorrectly. */ /* #undef PTHREAD_SIGMASK_UNBLOCK_BUG */ #define PTY_TTY_NAME_SPRINTF { char *ptyname = 0; sigset_t blocked; sigemptyset (&blocked); sigaddset (&blocked, SIGCHLD); pthread_sigmask (SIG_BLOCK, &blocked, 0); if (grantpt (fd) != -1 && unlockpt (fd) != -1) ptyname = ptsname(fd); pthread_sigmask (SIG_UNBLOCK, &blocked, 0); if (!ptyname) { emacs_close (fd); return -1; } snprintf (pty_name, PTY_NAME_SIZE, "%s", ptyname); } #ifndef _POSIX_PTHREAD_SEMANTICS # define _POSIX_PTHREAD_SEMANTICS 1 > $ ldd src/bootstrap-emacs | grep -i pthread libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0c1cf79000) > $ gdb src/bootstrap-emacs > (gdb) disas rpl_pthread_sigmask No symbol "rpl_pthread_sigmask" in current context. > (gdb) disas pthread_sigmask Dump of assembler code for function pthread_sigmask: 0x00000000005642c0 <+0>: sub $0x8,%rsp 0x00000000005642c4 <+4>: callq 0x405790 <sigprocmask@plt> 0x00000000005642c9 <+9>: xor %edx,%edx 0x00000000005642cb <+11>: test %eax,%eax 0x00000000005642cd <+13>: jns 0x5642d6 <pthread_sigmask+22> 0x00000000005642cf <+15>: callq 0x405a10 <__errno_location@plt> 0x00000000005642d4 <+20>: mov (%rax),%edx 0x00000000005642d6 <+22>: mov %edx,%eax 0x00000000005642d8 <+24>: add $0x8,%rsp 0x00000000005642dc <+28>: retq End of assembler dump. > (gdb) disas rpl_vfork No symbol "rpl_vfork" in current context. > (gdb) disas vfork Dump of assembler code for function vfork@plt: 0x0000000000405ed0 <+0>: jmpq *0x39e662(%rip) # 0x7a4538 <vfork@got.plt> 0x0000000000405ed6 <+6>: pushq $0xfc 0x0000000000405edb <+11>: jmpq 0x404f00 End of assembler dump. > (gdb) disas rpl_fork No symbol "rpl_fork" in current context. > (gdb) disas fork Dump of assembler code for function fork@plt: 0x0000000000405cb0 <+0>: jmpq *0x39e772(%rip) # 0x7a4428 <fork@got.plt> 0x0000000000405cb6 <+6>: pushq $0xda 0x0000000000405cbb <+11>: jmpq 0x404f00 End of assembler dump. Regards, Jorgen ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-28 20:53 ` Jorgen Schaefer @ 2014-05-28 23:00 ` Paul Eggert 2014-05-28 23:35 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-28 23:00 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 On 05/28/2014 01:53 PM, Jorgen Schaefer wrote: > Dump of assembler code for function pthread_sigmask: > 0x00000000005642c0 <+0>: sub $0x8,%rsp > 0x00000000005642c4 <+4>: callq 0x405790 <sigprocmask@plt> > 0x00000000005642c9 <+9>: xor %edx,%edx > 0x00000000005642cb <+11>: test %eax,%eax > 0x00000000005642cd <+13>: jns 0x5642d6 <pthread_sigmask+22> > 0x00000000005642cf <+15>: callq 0x405a10 <__errno_location@plt> > 0x00000000005642d4 <+20>: mov (%rax),%edx > 0x00000000005642d6 <+22>: mov %edx,%eax > 0x00000000005642d8 <+24>: add $0x8,%rsp > 0x00000000005642dc <+28>: retq Bingo! We're on the right track. sigprocmask is *definitely* the wrong thing to call here, and would explain your symptoms. Can you please send the preprocessed output of pthread_sigmask.c? That is, if 'make' uses the following command to build pthread_sigmask.o: gcc -std=gnu99 -DHAVE_CONFIG_H -I../src -g3 -O2 -c pthread_sigmask.c then please send the output of: gcc -std=gnu99 -DHAVE_CONFIG_H -I../src -g3 -O2 -E pthread_sigmask.c It may be large; OK to send it as a compressed attachment. Thanks. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-28 23:00 ` Paul Eggert @ 2014-05-28 23:35 ` Jorgen Schaefer 2014-05-29 1:22 ` Paul Eggert 2014-05-29 4:17 ` Paul Eggert 0 siblings, 2 replies; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-28 23:35 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 [-- Attachment #1: Type: text/plain, Size: 1544 bytes --] On Wed, 28 May 2014 16:00:39 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > On 05/28/2014 01:53 PM, Jorgen Schaefer wrote: > > Dump of assembler code for function pthread_sigmask: > > 0x00000000005642c0 <+0>: sub $0x8,%rsp > > 0x00000000005642c4 <+4>: callq 0x405790 <sigprocmask@plt> > > 0x00000000005642c9 <+9>: xor %edx,%edx > > 0x00000000005642cb <+11>: test %eax,%eax > > 0x00000000005642cd <+13>: jns 0x5642d6 > > <pthread_sigmask+22> 0x00000000005642cf <+15>: callq 0x405a10 > > <__errno_location@plt> 0x00000000005642d4 <+20>: mov > > (%rax),%edx 0x00000000005642d6 <+22>: mov %edx,%eax > > 0x00000000005642d8 <+24>: add $0x8,%rsp > > 0x00000000005642dc <+28>: retq > > Bingo! We're on the right track. sigprocmask is *definitely* the > wrong thing to call here, and would explain your symptoms. Can you > please send the preprocessed output of pthread_sigmask.c? That is, > if 'make' uses the following command to build pthread_sigmask.o: > > gcc -std=gnu99 -DHAVE_CONFIG_H -I../src -g3 -O2 -c pthread_sigmask.c > > then please send the output of: > > gcc -std=gnu99 -DHAVE_CONFIG_H -I../src -g3 -O2 -E pthread_sigmask.c > > It may be large; OK to send it as a compressed attachment. Thanks. I bow before your debugging skills! gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I../lib -I../src -I../src -g3 -O2 -MT pthread_sigmask.o -MD -MP -MF .deps/pthread_sigmask.Tpo -E -o pthread_sigmask.E pthread_sigmask.c Output attached. Regards, Jorgen [-- Attachment #2: pthread_sigmask.E.gz --] [-- Type: application/gzip, Size: 24413 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-28 23:35 ` Jorgen Schaefer @ 2014-05-29 1:22 ` Paul Eggert 2014-05-29 10:08 ` Jorgen Schaefer 2014-05-29 4:17 ` Paul Eggert 1 sibling, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-29 1:22 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 Jorgen Schaefer wrote: > Output attached. OK, thanks, I think we need to fix things by arranging for Emacs to call pthread_sigmask rather than sigprocmask. So, I'm curious as to why HAVE_PTHREAD_SIGMASK is not defined in config.h. Can you please send the contents of config.log? gzipped and attached is fine. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 1:22 ` Paul Eggert @ 2014-05-29 10:08 ` Jorgen Schaefer 2014-05-29 23:15 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-29 10:08 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 [-- Attachment #1: Type: text/plain, Size: 418 bytes --] On Wed, 28 May 2014 18:22:48 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Jorgen Schaefer wrote: > > Output attached. > > OK, thanks, I think we need to fix things by arranging for Emacs to > call pthread_sigmask rather than sigprocmask. So, I'm curious as to > why HAVE_PTHREAD_SIGMASK is not defined in config.h. Can you please > send the contents of config.log? gzipped and attached is fine. See attachment [-- Attachment #2: config.log.gz --] [-- Type: application/gzip, Size: 34243 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 10:08 ` Jorgen Schaefer @ 2014-05-29 23:15 ` Paul Eggert 0 siblings, 0 replies; 33+ messages in thread From: Paul Eggert @ 2014-05-29 23:15 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 Jorgen Schaefer wrote: > See attachment Thanks, there's clearly a bug there, as 'configure' decides that pthread_sigmask doesn't exist even though it does, in -lpthread. I've installed a patch in trunk bzr 117200. I don't think this fixes your main bug, nor does it fix a bug in emacs-24, but one step at a time. I'll look at the main bug soon. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-28 23:35 ` Jorgen Schaefer 2014-05-29 1:22 ` Paul Eggert @ 2014-05-29 4:17 ` Paul Eggert 2014-05-29 11:39 ` Jorgen Schaefer 1 sibling, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-29 4:17 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 [-- Attachment #1: Type: text/plain, Size: 166 bytes --] Can you please apply the attached patch to the emacs-24 branch, run ./autogen.sh && configure && make clean && make, and then try it out on your platform? Thanks. [-- Attachment #2: sigmask.patch --] [-- Type: text/plain, Size: 8884 bytes --] === modified file 'ChangeLog' --- ChangeLog 2014-05-22 05:00:39 +0000 +++ ChangeLog 2014-05-29 04:12:16 +0000 @@ -1,3 +1,13 @@ +2014-05-29 Paul Eggert <eggert@cs.ucla.edu> + + Don't substitute sigprocmask for pthread_sigmask (Bug#17561). + This helps to fix a bug which caused Emacs to lose CHLD signals + and thus appear to forget processes. + * configure.ac (gl_THREADLIB): Remove dummy. + Merge from gnulib, incorporating: + 2014-05-28 pthread_sigmask, timer-time: use gl_THREADLIB only if needed + * m4/pthread_sigmask.m4, m4/timer_time.m4: Update from gnulib. + 2014-05-22 Paul Eggert <eggert@cs.ucla.edu> Fix port to 32-bit AIX (Bug#17540). === modified file 'configure.ac' --- configure.ac 2014-05-22 05:00:39 +0000 +++ configure.ac 2014-05-29 04:12:16 +0000 @@ -686,8 +686,6 @@ # Avoid gnulib's tests for HAVE_WORKING_O_NOATIME and HAVE_WORKING_O_NOFOLLOW, # as we don't use them. AC_DEFUN([gl_FCNTL_O_FLAGS]) -# Avoid gnulib's threadlib module, as we do threads our own way. -AC_DEFUN([gl_THREADLIB]) # Initialize gnulib right after choosing the compiler. dnl Amongst other things, this sets AR and ARFLAGS. === modified file 'm4/pthread_sigmask.m4' --- m4/pthread_sigmask.m4 2014-01-01 07:43:34 +0000 +++ m4/pthread_sigmask.m4 2014-05-29 04:12:16 +0000 @@ -1,4 +1,4 @@ -# pthread_sigmask.m4 serial 13 +# pthread_sigmask.m4 serial 14 dnl Copyright (C) 2011-2014 Free Software Foundation, Inc. dnl This file is free software; the Free Software Foundation dnl gives unlimited permission to copy and/or distribute it, @@ -97,39 +97,43 @@ dnl with -lpthread, the pthread_sigmask() function always returns 0 and has dnl no effect. if test -z "$LIB_PTHREAD_SIGMASK"; then - AC_CACHE_CHECK([whether pthread_sigmask works without -lpthread], - [gl_cv_func_pthread_sigmask_in_libc_works], - [ - AC_RUN_IFELSE( - [AC_LANG_SOURCE([[ -#include <pthread.h> -#include <signal.h> -#include <stddef.h> -int main () -{ - sigset_t set; - sigemptyset (&set); - return pthread_sigmask (1729, &set, NULL) != 0; -}]])], - [gl_cv_func_pthread_sigmask_in_libc_works=no], - [gl_cv_func_pthread_sigmask_in_libc_works=yes], - [ -changequote(,)dnl - case "$host_os" in - freebsd* | hpux* | solaris | solaris2.[2-9]*) - gl_cv_func_pthread_sigmask_in_libc_works="guessing no";; - *) - gl_cv_func_pthread_sigmask_in_libc_works="guessing yes";; - esac -changequote([,])dnl - ]) - ]) - case "$gl_cv_func_pthread_sigmask_in_libc_works" in - *no) - REPLACE_PTHREAD_SIGMASK=1 - AC_DEFINE([PTHREAD_SIGMASK_INEFFECTIVE], [1], - [Define to 1 if pthread_sigmask() may returns 0 and have no effect.]) - ;; + case " $LIBS " in + *' -lpthread '*) ;; + *) + AC_CACHE_CHECK([whether pthread_sigmask works without -lpthread], + [gl_cv_func_pthread_sigmask_in_libc_works], + [ + AC_RUN_IFELSE( + [AC_LANG_SOURCE([[ + #include <pthread.h> + #include <signal.h> + #include <stddef.h> + int main () + { + sigset_t set; + sigemptyset (&set); + return pthread_sigmask (1729, &set, NULL) != 0; + }]])], + [gl_cv_func_pthread_sigmask_in_libc_works=no], + [gl_cv_func_pthread_sigmask_in_libc_works=yes], + [ + changequote(,)dnl + case "$host_os" in + freebsd* | hpux* | solaris | solaris2.[2-9]*) + gl_cv_func_pthread_sigmask_in_libc_works="guessing no";; + *) + gl_cv_func_pthread_sigmask_in_libc_works="guessing yes";; + esac + changequote([,])dnl + ]) + ]) + case "$gl_cv_func_pthread_sigmask_in_libc_works" in + *no) + REPLACE_PTHREAD_SIGMASK=1 + AC_DEFINE([PTHREAD_SIGMASK_INEFFECTIVE], [1], + [Define to 1 if pthread_sigmask may return 0 and have no effect.]) + ;; + esac;; esac fi @@ -184,11 +188,12 @@ *) gl_cv_func_pthread_sigmask_unblock_works="guessing yes";; esac - dnl Here we link against $LIBMULTITHREAD, not only $LIB_PTHREAD_SIGMASK, - dnl otherwise we get a false positive on those platforms where - dnl $gl_cv_func_pthread_sigmask_in_libc_works is "no". - gl_save_LIBS="$LIBS" - LIBS="$LIBS $LIBMULTITHREAD" + m4_ifdef([gl_][THREADLIB], + [dnl Link against $LIBMULTITHREAD, not only $LIB_PTHREAD_SIGMASK. + dnl Otherwise we get a false positive on those platforms where + dnl $gl_cv_func_pthread_sigmask_in_libc_works is "no". + gl_save_LIBS=$LIBS + LIBS="$LIBS $LIBMULTITHREAD"]) AC_RUN_IFELSE( [AC_LANG_SOURCE([[ #include <pthread.h> @@ -227,7 +232,7 @@ [:], [gl_cv_func_pthread_sigmask_unblock_works=no], [:]) - LIBS="$gl_save_LIBS" + m4_ifdef([gl_][THREADLIB], [LIBS=$gl_save_LIBS]) ]) case "$gl_cv_func_pthread_sigmask_unblock_works" in *no) === modified file 'm4/timer_time.m4' --- m4/timer_time.m4 2014-02-24 07:12:42 +0000 +++ m4/timer_time.m4 2014-05-29 04:12:16 +0000 @@ -1,4 +1,4 @@ -# timer_time.m4 serial 2 +# timer_time.m4 serial 3 dnl Copyright (C) 2011-2014 Free Software Foundation, Inc. dnl This file is free software; the Free Software Foundation dnl gives unlimited permission to copy and/or distribute it, @@ -11,7 +11,12 @@ dnl Based on clock_time.m4. See details there. AC_REQUIRE([gl_USE_SYSTEM_EXTENSIONS]) - AC_REQUIRE([gl_THREADLIB]) + + dnl Test whether the gnulib module 'threadlib' is in use. + dnl Some packages like Emacs use --avoid=threadlib. + dnl Write the symbol in such a way that it does not cause 'aclocal' to pick + dnl the threadlib.m4 file that is installed in $PREFIX/share/aclocal/. + m4_ifdef([gl_][THREADLIB], [AC_REQUIRE([gl_][THREADLIB])]) LIB_TIMER_TIME= AC_SUBST([LIB_TIMER_TIME]) @@ -19,21 +24,21 @@ AC_SEARCH_LIBS([timer_settime], [rt posix4], [test "$ac_cv_search_timer_settime" = "none required" || LIB_TIMER_TIME=$ac_cv_search_timer_settime]) - dnl GLIBC uses threads to emulate posix timers when kernel support - dnl is not available (like Linux < 2.6 or when used with kFreeBSD) - dnl Now the pthread lib is linked automatically in the normal case, - dnl but when linking statically, it needs to be explicitly specified. - AC_EGREP_CPP([Thread], - [ -#include <features.h> -#ifdef __GNU_LIBRARY__ - #if ((__GLIBC__ == 2 && __GLIBC_MINOR__ >= 2) || (__GLIBC__ > 2)) \ - && !(__UCLIBC__ && __HAS_NO_THREADS__) - Thread emulation available - #endif -#endif - ], - [LIB_TIMER_TIME="$LIB_TIMER_TIME $LIBMULTITHREAD"]) + m4_ifdef([gl_][THREADLIB], + [dnl GLIBC uses threads to emulate posix timers when kernel support + dnl is not available (like Linux < 2.6 or when used with kFreeBSD) + dnl Now the pthread lib is linked automatically in the normal case, + dnl but when linking statically, it needs to be explicitly specified. + AC_EGREP_CPP([Thread], + [#include <features.h> + #ifdef __GNU_LIBRARY__ + #if ((__GLIBC__ == 2 && __GLIBC_MINOR__ >= 2) || (__GLIBC__ > 2)) \ + && !(__UCLIBC__ && __HAS_NO_THREADS__) + Thread emulation available + #endif + #endif + ], + [LIB_TIMER_TIME="$LIB_TIMER_TIME $LIBMULTITHREAD"])]) AC_CHECK_FUNCS([timer_settime]) LIBS=$gl_saved_libs ]) === modified file 'src/ChangeLog' --- src/ChangeLog 2014-05-26 15:48:28 +0000 +++ src/ChangeLog 2014-05-29 04:12:16 +0000 @@ -1,3 +1,7 @@ +2014-05-29 Paul Eggert <eggert@cs.ucla.edu> + + * Makefile.in (LIB_PTHREAD_SIGMASK): Remove; all uses removed. + 2014-05-26 Paul Eggert <eggert@cs.ucla.edu> Include sources used to create macuvs.h. === modified file 'src/Makefile.in' --- src/Makefile.in 2014-05-26 15:48:28 +0000 +++ src/Makefile.in 2014-05-29 04:12:16 +0000 @@ -121,7 +121,7 @@ ## -lm, or empty. LIB_MATH=@LIB_MATH@ -## -lpthreads, or empty. +## -lpthread, or empty. LIB_PTHREAD=@LIB_PTHREAD@ LIBIMAGE=@LIBTIFF@ @LIBJPEG@ @LIBPNG@ @LIBGIF@ @LIBXPM@ @@ -286,8 +286,6 @@ LIBGNUTLS_LIBS = @LIBGNUTLS_LIBS@ LIBGNUTLS_CFLAGS = @LIBGNUTLS_CFLAGS@ -LIB_PTHREAD_SIGMASK = @LIB_PTHREAD_SIGMASK@ - INTERVALS_H = dispextern.h intervals.h composite.h GETLOADAVG_LIBS = @GETLOADAVG_LIBS@ @@ -409,7 +407,7 @@ $(LIBXML2_LIBS) $(LIBGPM) $(LIBRESOLV) $(LIBS_SYSTEM) \ $(LIBS_TERMCAP) $(GETLOADAVG_LIBS) $(SETTINGS_LIBS) $(LIBSELINUX_LIBS) \ $(FREETYPE_LIBS) $(FONTCONFIG_LIBS) $(LIBOTF_LIBS) $(M17N_FLT_LIBS) \ - $(LIBGNUTLS_LIBS) $(LIB_PTHREAD) $(LIB_PTHREAD_SIGMASK) \ + $(LIBGNUTLS_LIBS) $(LIB_PTHREAD) \ $(GFILENOTIFY_LIBS) $(LIB_MATH) $(LIBZ) all: emacs$(EXEEXT) $(OTHER_FILES) ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 4:17 ` Paul Eggert @ 2014-05-29 11:39 ` Jorgen Schaefer 2014-05-29 15:09 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-29 11:39 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 On Wed, 28 May 2014 21:17:40 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Can you please apply the attached patch to the emacs-24 branch, run > ./autogen.sh && configure && make clean && make, and then try it out > on your platform? Thanks. I can not currently build the emacs-24 branch (and haven't found a commit in it that I can on a quick try), see bug#17629. Using trunk (git 40f5ec0 * alloc.c (Fgarbage_collect): Fix compilation with GC_MARK_STACK == GC_USE_GCPROS_AS_BEFORE.) now, built with your patch and this command: $ ./autogen.sh && ./configure --without-x && make clean && make When restarting Emacs, I had the bug happen again right away. Sadly without a running strace. When my Emacs starts, I connect to a few IRC servers, and my IRC client then starts flyspell in the channel buffers. I got this error right away: Error in post-command-hook (flyspell-post-command-hook): (error "process ispellno longer connected to pipe; closed it") After this, killing Python RPC processes reproduced it again as well. strace: 13:28:46 --- SIGIO (I/O possible) @ 0 (0) --- 13:28:46 rt_sigreturn(0x1d) = 1 <0.000020> 13:28:46 ioctl(3, FIONREAD, [1]) = 0 <0.000019> 13:28:46 read(3, "\r", 1) = 1 <0.000020> 13:28:46 ioctl(3, FIONREAD, [0]) = 0 <0.000016> 13:28:46 ioctl(3, FIONREAD, [0]) = 0 <0.000016> 13:28:46 write(3, "\r", 1) = 1 <0.000032> 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000020> 13:28:46 kill(4294962889, SIGKILL) = 0 <0.002883> 13:28:46 --- SIGIO (I/O possible) @ 0 (0) --- 13:28:46 rt_sigreturn(0x1d) = 0 <0.000021> 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016> 13:28:46 ioctl(3, FIONREAD, [0]) = 0 <0.000018> 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017> 13:28:46 kill(4294962889, SIGHUP) = 0 <0.000017> 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016> 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017> 13:28:46 kill(4294962821, SIGKILL) = 0 <0.004785> 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000017> 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017> 13:28:46 kill(4294962821, SIGHUP) = 0 <0.000019> 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000015> 13:28:46 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000022> 13:28:46 write(3, "\33[K\33[46;1H", 10) = 10 <0.000028> 13:28:46 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000016> 13:28:46 ioctl(3, FIONREAD, [0]) = 0 <0.000021> 13:28:46 pselect6(18, [3 4 5 7 8 11 13 17], [], NULL, {0, 262002641}, {NULL, 8}) = 2 (in [13 17], left {0, 261989573}) <0.000029> 13:28:46 read(13, "", 4096) = 0 <0.000017> 13:28:46 read(17, "", 4096) = 0 <0.000016> 13:28:46 ioctl(3, FIONREAD, [0]) = 0 <0.000019> 13:28:46 pselect6(18, [3 4 5 7 8 11 13 17], [], NULL, {0, 261495320}, {NULL, 8}) = 2 (in [13 17], left {0, 261485907}) <0.000030> 13:28:46 read(13, "", 4096) = 0 <0.000017> 13:28:46 read(17, "", 4096) = 0 <0.000015> [...] 13:289:02 --- SIGIO (I/O possible) @ 0 (0) --- 13:289:02 rt_sigreturn(0x1d) = 3 <0.000010> 13:289:02 ioctl(3, FIONREAD, [1]) = 0 <0.000008> 13:289:02 read(3, "d", 1) = 1 <0.000009> 13:289:02 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 13:289:02 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 13:289:02 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000014> 13:289:02 kill(4294962821, SIGKILL) = 0 <0.000010> 13:289:02 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000008> 13:289:02 close(12) = 0 <0.000015> 13:289:02 close(13) = 0 <0.000012> 13:289:02 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000012> 13:289:02 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010> 13:289:02 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.012141> 13:289:02 write(3, "\33[1;8r\33[3;1H\33[1M\33[1;50r", 23) = 23 <0.000018> 13:289:02 write(3, "\33[8;1H\33[K\33[H\n", 13) = 13 <0.000015> 13:289:02 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000010> 13:289:02 --- SIGIO (I/O possible) @ 0 (0) --- 13:289:02 rt_sigreturn(0x1d) = 0 <0.000011> 13:289:02 ioctl(3, FIONREAD, [0]) = 0 <0.000011> 13:289:02 ioctl(3, FIONREAD, [0]) = 0 <0.000010> 13:289:02 pselect6(18, [3 4 5 7 8 11 17], [], NULL, {0, 206805058}, {NULL, 8}) = 1 (in [17], left {0, 206797159}) <0.000017> 13:289:02 read(17, "", 4096) = 0 <0.000009> 13:289:02 pselect6(18, [3 4 5 7 8 11 17], [], NULL, {0, 206700816}, {NULL, 8}) = 1 (in [17], left {0, 206697015}) <0.000013> 13:289:02 read(17, "", 4096) = 0 <0.000008> [...] 13:289:03 --- SIGIO (I/O possible) @ 0 (0) --- 13:289:03 rt_sigreturn(0x1d) = 2 <0.000009> 13:289:03 ioctl(3, FIONREAD, [1]) = 0 <0.000008> 13:289:03 read(3, "d", 1) = 1 <0.000008> 13:289:03 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 13:289:03 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 13:289:03 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000007> 13:289:03 kill(4294962889, SIGKILL) = 0 <0.000011> 13:289:03 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000007> 13:289:03 close(16) = 0 <0.000013> 13:289:03 close(17) = 0 <0.000012> 13:289:03 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000010> 13:289:03 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000009> 13:289:03 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.000010> 13:289:03 write(3, "\33[1;7r\33[2;1H\33[1M\33[1;50r", 23) = 23 <0.000012> 13:289:03 write(3, "\33[7;1H\33[K\33[H\n", 13) = 13 <0.000012> 13:289:03 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.000007> Back to normal. Regards, Jorgen ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 11:39 ` Jorgen Schaefer @ 2014-05-29 15:09 ` Paul Eggert 2014-05-29 15:22 ` Andreas Schwab 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-29 15:09 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 Jorgen Schaefer wrote: > I can not currently build the emacs-24 branch (and haven't found a > commit in it that I can on a quick try), see bug#17629. That's an indication of a reasonably-serious problem with Emacs and your platform, which is worrisome and should get fixed. I've responded at Bug#17629. > Using trunk (git 40f5ec0 * alloc.c (Fgarbage_collect): Fix compilation > with GC_MARK_STACK == GC_USE_GCPROS_AS_BEFORE.) now, built with your > patch and this command: > > $ ./autogen.sh && ./configure --without-x && make clean && make > > When restarting Emacs, I had the bug happen again right away. Sadly > without a running strace. If the bug is easily reproducible, that's progress. The patch does fix a bug (even if it's not your bug), so I've pushed it into the trunk. Can you get an strace -f of when Emacs stops getting SIGCHLD? That's the crucial part. That is, early on in all the straces I've seen, child processes die, Emacs gets a SIGCHLD, and then it calls waitpid (wait4) to reap them. At some point this stops working, and Emacs no longer gets a SIGCHLD when a child terminates. We need to find out what this point is, and what's causing it. > 13:28:46 kill(4294962889, SIGKILL) = 0 <0.002883> > 13:28:46 --- SIGIO (I/O possible) @ 0 (0) --- > 13:28:46 rt_sigreturn(0x1d) = 0 <0.000021> > 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016> > 13:28:46 ioctl(3, FIONREAD, [0]) = 0 <0.000018> > 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017> > 13:28:46 kill(4294962889, SIGHUP) = 0 <0.000017> I'm surprised by the above trace. The first kill(-4407,SIGKILL) succeeds, indicating that Emacs still has a subprocess. But Emacs doesn't get a SIGCHLD, even after it unblocks SIGCHLD. It would be helpful to see an 'strace -f' to see what's going on in the children here. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 15:09 ` Paul Eggert @ 2014-05-29 15:22 ` Andreas Schwab 2014-05-29 15:26 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Andreas Schwab @ 2014-05-29 15:22 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561, Jorgen Schaefer Paul Eggert <eggert@cs.ucla.edu> writes: >> 13:28:46 kill(4294962889, SIGKILL) = 0 <0.002883> >> 13:28:46 --- SIGIO (I/O possible) @ 0 (0) --- >> 13:28:46 rt_sigreturn(0x1d) = 0 <0.000021> >> 13:28:46 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 <0.000016> >> 13:28:46 ioctl(3, FIONREAD, [0]) = 0 <0.000018> >> 13:28:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 <0.000017> >> 13:28:46 kill(4294962889, SIGHUP) = 0 <0.000017> > > I'm surprised by the above trace. The first kill(-4407,SIGKILL) succeeds, > indicating that Emacs still has a subprocess. But Emacs doesn't get a > SIGCHLD, even after it unblocks SIGCHLD. Nothing in that sequence unblocks SIGCHLD. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 15:22 ` Andreas Schwab @ 2014-05-29 15:26 ` Paul Eggert 2014-05-29 17:03 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-29 15:26 UTC (permalink / raw) To: Andreas Schwab; +Cc: 17561, Jorgen Schaefer Andreas Schwab wrote: > Nothing in that sequence unblocks SIGCHLD. Ouch; thanks, you're right, I misread that sequence. So I'd like to see an strace of a session to see where Emacs first blocked SIGCHLD, so that we can figure out why Emacs didn't soon unblock it. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 15:26 ` Paul Eggert @ 2014-05-29 17:03 ` Jorgen Schaefer 2014-05-29 17:55 ` Andreas Schwab 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-29 17:03 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561 [-- Attachment #1: Type: text/plain, Size: 2413 bytes --] On Thu, 29 May 2014 08:26:31 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Andreas Schwab wrote: > > Nothing in that sequence unblocks SIGCHLD. > > Ouch; thanks, you're right, I misread that sequence. > > So I'd like to see an strace of a session to see where Emacs first > blocked SIGCHLD, so that we can figure out why Emacs didn't soon > unblock it. Ok, let's see. --- SIGCHLD (Child exited) @ 0 (0) --- It works. A bit later: rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM PROF], 8) = 0 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM PROF], NULL, 8) = 0 That's fine, CHLD is not masked. A while later, then, this: --- SIGIO (I/O possible) @ 0 (0) --- rt_sigreturn(0x1d) = 11684722 ioctl(3, FIONREAD, [0]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 kill(4294962810, SIGKILL) = 0 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 kill(4294962810, SIGHUP) = 0 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 kill(4294942593, SIGKILL) = 0 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 kill(4294942593, SIGHUP) = 0 rt_sigprocmask(SIG_SETMASK, [QUIT ALRM CHLD PROF], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 write(3, "\33[K\33[13;12H", 11) = 11 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 --- SIGIO (I/O possible) @ 0 (0) --- rt_sigreturn(0x1d) = 101 ioctl(3, FIONREAD, [0]) = 0 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 write(3, "\33[9;1H\33[?25l\33[46m(\33[39;49m\33[13;1"..., 48) = 48 write(3, "\33[34h\33[?25h\33[34l", 16) = 16 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 --- SIGIO (I/O possible) @ 0 (0) --- If I read this right, this leaves CHLD blocked (SETMASK for CHLD, then BLOCK WINCH IO, UNBLOCK WINCH IO, etc.) Full trace attached. Uncompressed, that's 8.3M, mostly because of the tight select/read loop. What I did there was to start a process using my "test code" (temp buffer cat), then start a few Python RPC processes (which cause some noise), and then kill them all using the elpy-rpc-restart function I posted before. Regards, Jorgen [-- Attachment #2: emacs-bug.txt.xz --] [-- Type: application/x-xz, Size: 383516 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 17:03 ` Jorgen Schaefer @ 2014-05-29 17:55 ` Andreas Schwab 2014-05-29 18:23 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Andreas Schwab @ 2014-05-29 17:55 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561, Paul Eggert Jorgen Schaefer <forcer@forcix.cx> writes: > It works. A bit later: > > rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM PROF], 8) = 0 > rt_sigprocmask(SIG_SETMASK, [QUIT ALRM PROF], NULL, 8) = 0 > > That's fine, CHLD is not masked. A while later, then, this: > > --- SIGIO (I/O possible) @ 0 (0) --- > rt_sigreturn(0x1d) = 11684722 > ioctl(3, FIONREAD, [0]) = 0 > rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 So the damage happend somewhere between these blocks. Please try this patch to better see when CHLD became blocked: diff --git a/src/sysdep.c b/src/sysdep.c index e5b2920..35c2856 100644 --- a/src/sysdep.c +++ b/src/sysdep.c @@ -618,6 +618,7 @@ request_sigio (void) { #ifdef USABLE_SIGIO sigset_t unblocked; + sigset_t oldmask; if (noninteractive) return; @@ -627,7 +628,7 @@ request_sigio (void) sigaddset (&unblocked, SIGWINCH); # endif sigaddset (&unblocked, SIGIO); - pthread_sigmask (SIG_UNBLOCK, &unblocked, 0); + pthread_sigmask (SIG_UNBLOCK, &unblocked, &oldmask); interrupts_deferred = 0; #endif @@ -638,6 +639,7 @@ unrequest_sigio (void) { #ifdef USABLE_SIGIO sigset_t blocked; + sigset_t oldmask; if (noninteractive) return; @@ -647,7 +649,7 @@ unrequest_sigio (void) sigaddset (&blocked, SIGWINCH); # endif sigaddset (&blocked, SIGIO); - pthread_sigmask (SIG_BLOCK, &blocked, 0); + pthread_sigmask (SIG_BLOCK, &blocked, &oldmask); interrupts_deferred = 1; #endif } Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply related [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 17:55 ` Andreas Schwab @ 2014-05-29 18:23 ` Jorgen Schaefer 2014-05-29 19:06 ` Jorgen Schaefer 2014-05-29 19:15 ` Andreas Schwab 0 siblings, 2 replies; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-29 18:23 UTC (permalink / raw) To: Andreas Schwab; +Cc: 17561, Paul Eggert [-- Attachment #1: Type: text/plain, Size: 977 bytes --] On Thu, 29 May 2014 19:55:20 +0200 Andreas Schwab <schwab@linux-m68k.org> wrote: > Jorgen Schaefer <forcer@forcix.cx> writes: > > > It works. A bit later: > > > > rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM PROF], 8) = 0 > > rt_sigprocmask(SIG_SETMASK, [QUIT ALRM PROF], NULL, 8) = 0 > > > > That's fine, CHLD is not masked. A while later, then, this: > > > > --- SIGIO (I/O possible) @ 0 (0) --- > > rt_sigreturn(0x1d) = 11684722 > > ioctl(3, FIONREAD, [0]) = 0 > > rt_sigprocmask(SIG_BLOCK, [CHLD], [QUIT ALRM CHLD PROF], 8) = 0 > > So the damage happend somewhere between these blocks. Please try this > patch to better see when CHLD became blocked: See the attached trace. I spared you the few megs of read/select loop this time. I don't understand it, though. Line 4685 has a SIGCHLD. The next rt_sigprocmask that mentions CHLD is in line 4726, where SIGCHLD is in the oldset. It wasn't at 4714/4715. Huh? How did it get there? [-- Attachment #2: with-andreas-patch.txt.xz --] [-- Type: application/x-xz, Size: 40892 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 18:23 ` Jorgen Schaefer @ 2014-05-29 19:06 ` Jorgen Schaefer 2014-05-29 20:27 ` Andreas Schwab 2014-05-29 19:15 ` Andreas Schwab 1 sibling, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-29 19:06 UTC (permalink / raw) To: 17561; +Cc: Paul Eggert, Andreas Schwab On Thu, 29 May 2014 20:23:00 +0200 Jorgen Schaefer <forcer@forcix.cx> wrote: > See the attached trace. I spared you the few megs of read/select loop > this time. > > I don't understand it, though. > > Line 4685 has a SIGCHLD. The next rt_sigprocmask that mentions CHLD is > in line 4726, where SIGCHLD is in the oldset. It wasn't at 4714/4715. > Huh? How did it get there? I do not see the rt_sigreturn corresponding to the SIGINT in line 4724. Which would mean that the signals blocked in a signal handler (which I guess includes SIGCHLD, according to the SIG_BLOCK in line 4726) won't get restored. I see the same sequence a few times in the full (21M) strace output: SIGINT, then SIG_UNBLOCK for [INT], SIG_BLOCK for [WINCH IO] with oldmask [QUIT ALRM CHILD PROF WINCH IO], then two calls to write(), and then SIG_UNBLOCK with [WINCH IO], then a SIGIO, but no rt_sigreturn for the SIGINT. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 19:06 ` Jorgen Schaefer @ 2014-05-29 20:27 ` Andreas Schwab 0 siblings, 0 replies; 33+ messages in thread From: Andreas Schwab @ 2014-05-29 20:27 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561, Paul Eggert Jorgen Schaefer <forcer@forcix.cx> writes: > I do not see the rt_sigreturn corresponding to the SIGINT in line 4724. There is none, it longjmps out of the signal handler. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 18:23 ` Jorgen Schaefer 2014-05-29 19:06 ` Jorgen Schaefer @ 2014-05-29 19:15 ` Andreas Schwab 2014-05-30 6:07 ` Paul Eggert 1 sibling, 1 reply; 33+ messages in thread From: Andreas Schwab @ 2014-05-29 19:15 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561, Paul Eggert Jorgen Schaefer <forcer@forcix.cx> writes: > Line 4685 has a SIGCHLD. The next rt_sigprocmask that mentions CHLD is > in line 4726, where SIGCHLD is in the oldset. It wasn't at 4714/4715. > Huh? How did it get there? 31528 20:04:49.041423 --- SIGINT (Interrupt) @ 0 (0) --- 31528 20:04:49.041578 rt_sigprocmask(SIG_UNBLOCK, [INT], NULL, 8) = 0 <0.000016> 31528 20:04:49.044035 rt_sigprocmask(SIG_BLOCK, [WINCH IO], [QUIT ALRM CHLD PROF WINCH IO], 8) = 0 <0.000019> It's the first time a SIGINT is received, and the first time the default mask became non-empty. That's because quit_throw_to_read_char doesn't restore the signal mask. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-29 19:15 ` Andreas Schwab @ 2014-05-30 6:07 ` Paul Eggert 2014-05-30 20:41 ` Jorgen Schaefer 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2014-05-30 6:07 UTC (permalink / raw) To: Andreas Schwab, Jorgen Schaefer; +Cc: 17561 Andreas Schwab wrote: > quit_throw_to_read_char doesn't restore the signal mask. Thanks for the diagnosis. I installed trunk bzr 117201 to try to fix this problem. I don't see how this would explain the bug in emacs-24, though, as in emacs-24 the signal handler invokes pthread_sigmask (SIG_SETMASK, &empty_mask, 0) before calling quit_throw_to_read_char, and the pthread_sigmask call should restore the signal mask. ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-30 6:07 ` Paul Eggert @ 2014-05-30 20:41 ` Jorgen Schaefer 2014-05-30 21:29 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Jorgen Schaefer @ 2014-05-30 20:41 UTC (permalink / raw) To: Paul Eggert; +Cc: 17561, Andreas Schwab On Thu, 29 May 2014 23:07:06 -0700 Paul Eggert <eggert@cs.ucla.edu> wrote: > Andreas Schwab wrote: > > quit_throw_to_read_char doesn't restore the signal mask. > > Thanks for the diagnosis. I installed trunk bzr 117201 to try to fix > this problem. I don't see how this would explain the bug in > emacs-24, though, as in emacs-24 the signal handler invokes > pthread_sigmask (SIG_SETMASK, &empty_mask, 0) before calling > quit_throw_to_read_char, and the pthread_sigmask call should restore > the signal mask. I haven't been able to reproduce the bug since I am running the patched version. Does not say it's gone, but it's a good indication. It is entirely possible that the hangs I have witnessed in the past were caused by some other problem. As I said initially, it's only been the last few days that I actually had the luck to track down the behavior in the first place (being at work is a bad time to try and find weird behavior of your editor, sadly :-)). So I would assume that this bug is fixed, and if I come across weird behavior again, I'll create a new report or re-open this one. Thank you for the excellent debugging! Possibly related, while trying to reproduce the bug, I have noticed some other strange behavior. I managed to reduce it to the following reproduction code: (with-temp-buffer (let ((proc (start-process "test" (current-buffer) "bash" "-c" "echo foo ; sleep 5"))) (set-process-query-on-exit-flag proc nil) (accept-process-output nil 2))) This hangs my normal Emacs for 2 seconds. A newly-started Emacs returns immediately. I *suspect* that this is a timing issue, and strace *seems* to concur - Emacs reads the "foo" from the subprocess before running accept-process-output, and then just hangs for the two second timeout, instead of noticing that there already was some output. Should I create a new bug report for this? Regards, Jorgen ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-30 20:41 ` Jorgen Schaefer @ 2014-05-30 21:29 ` Paul Eggert 0 siblings, 0 replies; 33+ messages in thread From: Paul Eggert @ 2014-05-30 21:29 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561-done On 05/30/2014 01:41 PM, Jorgen Schaefer wrote: > So I would assume that this bug is fixed, and if I come across weird > behavior again, I'll create a new report or re-open this one. Thanks, closing the bug. > Possibly related, while trying to reproduce the bug, I have noticed > some other strange behavior. I managed to reduce it to the following > reproduction code: > > (with-temp-buffer > (let ((proc (start-process "test" (current-buffer) "bash" "-c" > "echo foo ; sleep 5"))) > (set-process-query-on-exit-flag proc nil) > (accept-process-output nil 2))) > > This hangs my normal Emacs for 2 seconds. A newly-started Emacs returns > immediately. I*suspect* that this is a timing issue, and strace > *seems* to concur - Emacs reads the "foo" from the subprocess before > running accept-process-output, and then just hangs for the two second > timeout, instead of noticing that there already was some output. Should > I create a new bug report for this? Yes, please. It'd be nice to get an strace of Emacs, particularly where it goes wrong (which may be well before the symptoms start being visible). ^ permalink raw reply [flat|nested] 33+ messages in thread
* bug#17561: Emacs can forget processes 2014-05-26 18:49 ` Jorgen Schaefer 2014-05-26 23:58 ` Paul Eggert @ 2014-05-27 4:05 ` Paul Eggert 1 sibling, 0 replies; 33+ messages in thread From: Paul Eggert @ 2014-05-27 4:05 UTC (permalink / raw) To: Jorgen Schaefer; +Cc: 17561 I see another bug in the-bug.txt line 399: 16300 20:00:21.012145 kill(4294935048, SIGHUP) = -1 ESRCH (No such process) <0.000009> Here, the subprocess has already been reaped, but Emacs is trying to kill it with SIGHUP. This is a no-no, as Emacs might kill some other random victim process. There's a similar bug in line 408. I doubt whether this bug is causing your problem, but we should fix it anyway. I doubt whether this bug is a regression, so I've installed a patch for this as trunk bzr 117161. ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2014-05-30 21:29 UTC | newest] Thread overview: 33+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-05-23 15:52 bug#17561: 24.4.50; Emacs can forget processes Jorgen Schaefer 2014-05-23 16:28 ` bug#17561: " Paul Eggert 2014-05-23 16:44 ` Jorgen Schaefer 2014-05-24 23:01 ` Paul Eggert 2014-05-25 7:57 ` Jorgen Schaefer 2014-05-26 17:08 ` Paul Eggert 2014-05-26 18:49 ` Jorgen Schaefer 2014-05-26 23:58 ` Paul Eggert 2014-05-27 18:27 ` Jorgen Schaefer 2014-05-27 21:42 ` Paul Eggert 2014-05-27 22:16 ` Jorgen Schaefer 2014-05-28 0:47 ` Paul Eggert 2014-05-28 20:53 ` Jorgen Schaefer 2014-05-28 23:00 ` Paul Eggert 2014-05-28 23:35 ` Jorgen Schaefer 2014-05-29 1:22 ` Paul Eggert 2014-05-29 10:08 ` Jorgen Schaefer 2014-05-29 23:15 ` Paul Eggert 2014-05-29 4:17 ` Paul Eggert 2014-05-29 11:39 ` Jorgen Schaefer 2014-05-29 15:09 ` Paul Eggert 2014-05-29 15:22 ` Andreas Schwab 2014-05-29 15:26 ` Paul Eggert 2014-05-29 17:03 ` Jorgen Schaefer 2014-05-29 17:55 ` Andreas Schwab 2014-05-29 18:23 ` Jorgen Schaefer 2014-05-29 19:06 ` Jorgen Schaefer 2014-05-29 20:27 ` Andreas Schwab 2014-05-29 19:15 ` Andreas Schwab 2014-05-30 6:07 ` Paul Eggert 2014-05-30 20:41 ` Jorgen Schaefer 2014-05-30 21:29 ` Paul Eggert 2014-05-27 4:05 ` Paul Eggert
Code repositories for project(s) associated with this external index https://git.savannah.gnu.org/cgit/emacs.git https://git.savannah.gnu.org/cgit/emacs/org-mode.git This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.