all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
* Fix to long-standing crashes in GC
@ 2004-05-12 13:19 Kim F. Storm
  2004-05-13 13:06 ` Kenichi Handa
  2004-05-13 15:45 ` Richard Stallman
  0 siblings, 2 replies; 55+ messages in thread
From: Kim F. Storm @ 2004-05-12 13:19 UTC (permalink / raw)



I have installed a change to read_process_output which fixes some
seemingly unexplainable crashes during GC in compact_small_strings and
allocate_string.

The bug was related to reading & decoding multibyte process output.

If some people who reported such crashes in the past still have test
cases for such crashes, could you please try to repeat those test.

Notice that the bug is also present in 21.1, 21.2, and 21.3.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-12 13:19 Kim F. Storm
@ 2004-05-13 13:06 ` Kenichi Handa
  2004-05-13 15:45 ` Richard Stallman
  1 sibling, 0 replies; 55+ messages in thread
From: Kenichi Handa @ 2004-05-13 13:06 UTC (permalink / raw)
  Cc: emacs-devel

In article <m3wu3hlr45.fsf@kfs-l.imdomain.dk>, storm@cua.dk (Kim F. Storm) writes:

> I have installed a change to read_process_output which
> fixes some seemingly unexplainable crashes during GC in
> compact_small_strings and allocate_string.

> The bug was related to reading & decoding multibyte
> process output.

> If some people who reported such crashes in the past still
> have test cases for such crashes, could you please try to
> repeat those test.

> Notice that the bug is also present in 21.1, 21.2, and
> 21.3.

Thank you for finding out this bug!!  It seems that I
embugged it when I changed how to decode the process output
between 20.7 and 21.1.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-12 13:19 Kim F. Storm
  2004-05-13 13:06 ` Kenichi Handa
@ 2004-05-13 15:45 ` Richard Stallman
  1 sibling, 0 replies; 55+ messages in thread
From: Richard Stallman @ 2004-05-13 15:45 UTC (permalink / raw)
  Cc: emacs-devel

Thank you very much for finding that bug.
I hope that one was the only one.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
@ 2004-05-13 18:19 Lars Hansen
  2004-05-13 19:09 ` Luc Teirlinck
                   ` (4 more replies)
  0 siblings, 5 replies; 55+ messages in thread
From: Lars Hansen @ 2004-05-13 18:19 UTC (permalink / raw)


For a long time I have experienced occasional crashes of CVS Emacs, 
usually during Tramp operations. I have four backtraces in case someone 
is interested, but I have not been able produce a test case. To me it 
looks as if GC is involved, but I am unexperienced with gdb. When Kim 
told he had fixed a bug involving GC and process output, it seemed at 
first very likely to be the cause of my troubles. However:

1. In one of the crashes Tramp was not involved.
2. CVS Emacs of today with Kim's fix, just crashed during a Tramp operation.

So now I don't know what to think.
Here is a backtrace from todays crash:

(gdb) run
Starting program: /home/lh/cvsroot/emacs/src/emacs

Program received signal SIGABRT, Aborted.
0x402ec781 in kill () from /lib/libc.so.6
(gdb) bt
#0  0x402ec781 in kill () from /lib/libc.so.6
#1  0x080dfe0a in abort () at emacs.c:433
#2  0x08128037 in mark_object (arg=1217041176) at alloc.c:5034
#3  0x0812809a in mark_object (arg=-1463681024) at alloc.c:5051
#4  0x08126b43 in mark_memory (start=0xbfffe070, end=0xbffff7e8)
    at alloc.c:3700
#5  0x08126f73 in mark_stack () at alloc.c:4055
#6  0x08127549 in Fgarbage_collect () at alloc.c:4429
#7  0x081627f5 in Fbyte_code (bytestr=1752418136, vector=-2006055672,
    maxdepth=53) at bytecode.c:522
#8  0x0813aec1 in funcall_lambda (fun=-2006055216, nargs=3,
    arg_vector=0xbfffe368) at eval.c:2913
#9  0x0813aa81 in Ffuncall (nargs=4, args=0xbfffe364) at eval.c:2783
#10 0x0813a3a5 in Fapply (nargs=3, args=0xbfffe418) at eval.c:2231
#11 0x0813a906 in Ffuncall (nargs=4, args=0xbfffe414) at eval.c:2707
#12 0x081629ec in Fbyte_code (bytestr=1752417784, vector=-2005715320,
    maxdepth=5) at bytecode.c:689
#13 0x0813aec1 in funcall_lambda (fun=-2006054856, nargs=3,
    arg_vector=0xbfffe548) at eval.c:2913
#14 0x0813aa81 in Ffuncall (nargs=4, args=0xbfffe544) at eval.c:2783
#15 0x0813a727 in call3 (fn=675593896, arg1=675187520, arg2=1751585648,
    arg3=675056464) at eval.c:2565
#16 0x08107d13 in Fexpand_file_name (name=1751585648,
    default_directory=675056464) at fileio.c:1705
#17 0x0813a9ad in Ffuncall (nargs=3, args=0xbfffe5f4) at eval.c:2729
#18 0x0813a3a5 in Fapply (nargs=2, args=0xbfffe6a8) at eval.c:2231
#19 0x0813a906 in Ffuncall (nargs=3, args=0xbfffe6a4) at eval.c:2707
#20 0x081629ec in Fbyte_code (bytestr=1752592784, vector=-2005500296,
    maxdepth=5) at bytecode.c:689
#21 0x0813aec1 in funcall_lambda (fun=-2005500152, nargs=3,
    arg_vector=0xbfffe7d8) at eval.c:2913
#22 0x0813aa81 in Ffuncall (nargs=4, args=0xbfffe7d4) at eval.c:2783
#23 0x0813a727 in call3 (fn=678851976, arg1=675187520, arg2=1751585648,
    arg3=675056464) at eval.c:2565
#24 0x08107d13 in Fexpand_file_name (name=1751585648,
    default_directory=675056464) at fileio.c:1705
#25 0x0810f5b0 in Fdirectory_files_and_attributes (directory=1751585648,
    full=675056464, match=675056464, nosort=675056512, id_format=675056464)
    at dired.c:383
#26 0x0813aa11 in Ffuncall (nargs=6, args=0xbfffe8c4) at eval.c:2759
#27 0x0813a3a5 in Fapply (nargs=2, args=0xbfffe988) at eval.c:2231
#28 0x0813a906 in Ffuncall (nargs=3, args=0xbfffe984) at eval.c:2707
#29 0x081629ec in Fbyte_code (bytestr=1752592784, vector=-2005500296,
    maxdepth=5) at bytecode.c:689
#30 0x0813aec1 in funcall_lambda (fun=-2005500152, nargs=6,
    arg_vector=0xbfffeab8) at eval.c:2913
#31 0x0813aa81 in Ffuncall (nargs=7, args=0xbfffeab4) at eval.c:2783
#32 0x0813a787 in call6 (fn=678851976, arg1=675173912, arg2=1751585648,
    arg3=675056464, arg4=675056464, arg5=675056512, arg6=675056464)
    at eval.c:2640
#33 0x0810f5fd in Fdirectory_files_and_attributes (directory=1757484384,
    full=675056464, match=675056464, nosort=675056512, id_format=675056464)
    at dired.c:389
#34 0x0813aa11 in Ffuncall (nargs=5, args=0xbfffeb94) at eval.c:2759
#35 0x081629ec in Fbyte_code (bytestr=1753402600, vector=-2004489696,
    maxdepth=9) at bytecode.c:689
#36 0x0813aec1 in funcall_lambda (fun=-2005624000, nargs=1,
    arg_vector=0xbfffecac) at eval.c:2913
#37 0x0813aa81 in Ffuncall (nargs=2, args=0xbfffeca8) at eval.c:2783
#38 0x081629ec in Fbyte_code (bytestr=1753388008, vector=-2004492704,
    maxdepth=3) at bytecode.c:689
#39 0x0813aec1 in funcall_lambda (fun=-2004860712, nargs=2,
    arg_vector=0xbfffeda8) at eval.c:2913
#40 0x0813aa81 in Ffuncall (nargs=3, args=0xbfffeda4) at eval.c:2783
#41 0x081629ec in Fbyte_code (bytestr=1753387640, vector=-2004626104,
    maxdepth=3) at bytecode.c:689
#42 0x0813aec1 in funcall_lambda (fun=-2004751064, nargs=0,
    arg_vector=0xbfffeea8) at eval.c:2913
#43 0x0813aa81 in Ffuncall (nargs=1, args=0xbfffeea4) at eval.c:2783
#44 0x081629ec in Fbyte_code (bytestr=1753387960, vector=-2004692136,
    maxdepth=7) at bytecode.c:689
#45 0x0813aec1 in funcall_lambda (fun=-2005043992, nargs=0,
    arg_vector=0xbfffefb8) at eval.c:2913
#46 0x0813aa81 in Ffuncall (nargs=1, args=0xbfffefb4) at eval.c:2783
#47 0x081629ec in Fbyte_code (bytestr=1753387448, vector=-2004623568,
    maxdepth=2) at bytecode.c:689
#48 0x0813aec1 in funcall_lambda (fun=-2004907320, nargs=0,
    arg_vector=0xbffff0b8) at eval.c:2913
#49 0x0813aa81 in Ffuncall (nargs=1, args=0xbffff0b4) at eval.c:2783
#50 0x081629ec in Fbyte_code (bytestr=1753223744, vector=-2004866968,
    maxdepth=3) at bytecode.c:689
#51 0x0813aec1 in funcall_lambda (fun=-2004917192, nargs=1,
    arg_vector=0xbffff1b8) at eval.c:2913
#52 0x0813aa81 in Ffuncall (nargs=2, args=0xbffff1b4) at eval.c:2783
#53 0x081629ec in Fbyte_code (bytestr=1753213760, vector=-2004869680,
    maxdepth=2) at bytecode.c:689
#54 0x0813aec1 in funcall_lambda (fun=-2004773840, nargs=0,
    arg_vector=0xbffff2d8) at eval.c:2913
#55 0x0813aa81 in Ffuncall (nargs=1, args=0xbffff2d4) at eval.c:2783
#56 0x0813a6b1 in apply1 (fn=679775736, arg=675056464) at eval.c:2476
#57 0x08136649 in Fcall_interactively (function=679775736,
    record_flag=675056464, keys=-2009241248) at callint.c:406
#58 0x080ecdc9 in Fcommand_execute (cmd=679775736, record_flag=675056464,
    keys=675056464, special=675056464) at keyboard.c:9670
#59 0x080e34ac in command_loop_1 () at keyboard.c:1727
#60 0x08138fad in internal_condition_case (bfun=0x80e27c0 <command_loop_1>,
    handlers=675117376, hfun=0x80e23c4 <cmd_error>) at eval.c:1333
#61 0x080e2688 in command_loop_2 () at keyboard.c:1264
#62 0x08138b25 in internal_catch (tag=675111408,
    func=0x80e2664 <command_loop_2>, arg=675056464) at eval.c:1094
#63 0x080e2633 in command_loop () at keyboard.c:1243
#64 0x080e2188 in recursive_edit_1 () at keyboard.c:959
#65 0x080e22b0 in Frecursive_edit () at keyboard.c:1015
#66 0x080e1122 in main (argc=1, argv=0xbffffa64) at emacs.c:1692
(gdb)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 18:19 Fix to long-standing crashes in GC Lars Hansen
@ 2004-05-13 19:09 ` Luc Teirlinck
  2004-05-13 19:29   ` Luc Teirlinck
  2004-05-13 19:30   ` Lars Hansen
  2004-05-13 19:19 ` Stefan Monnier
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-13 19:09 UTC (permalink / raw)
  Cc: emacs-devel

Lars Hansen wrote:

   For a long time I have experienced occasional crashes of CVS Emacs, 
   usually during Tramp operations.

Are you using global-auto-revert-mode?  I get all kinds of problems
using Tramp with global-auto-revert-mode enabled: bizarre error
messages, hangs, crashes.  For the crashes, see thread "Tramp with
global-auto-revert-mode."  My backtrace looks very similar to yours.
The crash is also during garbage collection.  I ever never been able
to reproduce any of them with global-auto-revert-mode disabled.  But
maybe I did not try for long enough.

I have a very slow connection.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 18:19 Fix to long-standing crashes in GC Lars Hansen
  2004-05-13 19:09 ` Luc Teirlinck
@ 2004-05-13 19:19 ` Stefan Monnier
  2004-05-13 22:16   ` Luc Teirlinck
  2004-05-14 21:02 ` Richard Stallman
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Stefan Monnier @ 2004-05-13 19:19 UTC (permalink / raw)
  Cc: emacs-devel

> interested, but I have not been able produce a test case. To me it looks as
> if GC is involved, but I am unexperienced with gdb. When Kim told he had

When the GC seems to be involved what it typically means that there's
a memory corruption somewhere.  The only reason why the GC gets involved is
because the GC traverses all your memory, so it basically does (as a side
effect) some kind of sanity check, so memory corruption is usually detected
during GC.  GC bugs exist as well, of course, but they're pretty rare (or
rather they tend to manifest pretty often, so they get fixed fairly quikly).


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 19:09 ` Luc Teirlinck
@ 2004-05-13 19:29   ` Luc Teirlinck
  2004-05-13 19:30   ` Lars Hansen
  1 sibling, 0 replies; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-13 19:29 UTC (permalink / raw)
  Cc: larsh, emacs-devel

>From my previous message:

   I ever never been able to reproduce any of them with
   global-auto-revert-mode disabled.

That was unclear, even without the typo ("ever" should have been
"have").  I meant that without global-auto-revert-mode enabled, I
never experienced _any_ of the bizarre error messages, hangs _or_
crashes.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 19:09 ` Luc Teirlinck
  2004-05-13 19:29   ` Luc Teirlinck
@ 2004-05-13 19:30   ` Lars Hansen
  1 sibling, 0 replies; 55+ messages in thread
From: Lars Hansen @ 2004-05-13 19:30 UTC (permalink / raw)
  Cc: emacs-devel

Luc Teirlinck wrote:

>Are you using global-auto-revert-mode?
>
Tanks for pointing my attention to that.
No, I do not use global-auto-revert-mode.

Lars

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 19:19 ` Stefan Monnier
@ 2004-05-13 22:16   ` Luc Teirlinck
  2004-05-13 23:04     ` Stefan Monnier
  2004-05-14 11:42     ` Kai Grossjohann
  0 siblings, 2 replies; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-13 22:16 UTC (permalink / raw)
  Cc: larsh, emacs-devel

Stefan Monnier wrote:

   When the GC seems to be involved what it typically means that there's
   a memory corruption somewhere.  The only reason why the GC gets involved is
   because the GC traverses all your memory, so it basically does (as a side
   effect) some kind of sanity check, so memory corruption is usually detected
   during GC.

I would guess that means that studying the backtrace would not be
terribly useful in this case, since the corruption could have happened
anywhere.  I know how to use GDB, but I do not have that much
experience debugging an aborting Emacs with GDB.  I have a live
process around.

The bug is of course not necessarily in Tramp, but could be in any of
the machinery used by Tramp.  I get crashes using Tramp with ssh the
entire time.  I have never observed a crash using an external transfer
method, like scp (although I tried), so en/decoding seems a likely
culprit.  I get strange error messages about invalid base64 data.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 22:16   ` Luc Teirlinck
@ 2004-05-13 23:04     ` Stefan Monnier
  2004-05-14 11:42     ` Kai Grossjohann
  1 sibling, 0 replies; 55+ messages in thread
From: Stefan Monnier @ 2004-05-13 23:04 UTC (permalink / raw)
  Cc: larsh, emacs-devel

> The bug is of course not necessarily in Tramp, but could be in any of
> the machinery used by Tramp.  I get crashes using Tramp with ssh the

If you get a crash, then the bug is in the C code.
Since Tramp is written 100% in Elisp, the bug can't be in Tramp.


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
@ 2004-05-13 23:34 Robert Anderson
  0 siblings, 0 replies; 55+ messages in thread
From: Robert Anderson @ 2004-05-13 23:34 UTC (permalink / raw)
  Cc: larsh, emacs-devel


--- Original Message ---
From: Luc Teirlinck <teirllm@dms.auburn.edu>
To: monnier@iro.umontreal.ca
CC: larsh@math.ku.dk, emacs-devel@gnu.org
Subject: Re: Fix to long-standing crashes in GC

>Stefan Monnier wrote:
>
>   When the GC seems to be involved what it typically means that
there's
>   a memory corruption somewhere.

Which begs the question again of:  did anyone get emacs running
under valgrind yet?  It's a godsend for such bugs.

Bob

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 22:16   ` Luc Teirlinck
  2004-05-13 23:04     ` Stefan Monnier
@ 2004-05-14 11:42     ` Kai Grossjohann
  2004-05-14 14:53       ` Luc Teirlinck
  2004-05-14 18:39       ` Luc Teirlinck
  1 sibling, 2 replies; 55+ messages in thread
From: Kai Grossjohann @ 2004-05-14 11:42 UTC (permalink / raw)


Luc Teirlinck <teirllm@dms.auburn.edu> writes:

> The bug is of course not necessarily in Tramp, but could be in any of
> the machinery used by Tramp.  I get crashes using Tramp with ssh the
> entire time.  I have never observed a crash using an external transfer
> method, like scp (although I tried), so en/decoding seems a likely
> culprit.  I get strange error messages about invalid base64 data.

Now that I've read the umpteenth explanation of this, the thought was
finally triggered in my head.

Please see the variable tramp-chunksize, try to set it to 500 to see
what happens, and also do what the docstring says.

What are your findings?

Thanks,
Kai

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-14 11:42     ` Kai Grossjohann
@ 2004-05-14 14:53       ` Luc Teirlinck
  2004-05-14 20:48         ` Kai Grossjohann
  2004-05-16  9:27         ` Kai Grossjohann
  2004-05-14 18:39       ` Luc Teirlinck
  1 sibling, 2 replies; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-14 14:53 UTC (permalink / raw)
  Cc: emacs-devel

Kai Grossjohann wrote:

   Please see the variable tramp-chunksize, try to set it to 500 to see
   what happens, and also do what the docstring says.

I executed the function in the tramp-chunksize docstring.  Result:

"Bytes sent: 1000	Bytes received: 1000"

That seems to indicate that my system does not have the problem
referred to in the docstring.

I set tramp-chunksize to 500.  Still the same:

tramp: Decoding remote file /ssh:raven.dms.auburn.edu:/home/teirllm/streams.texi with function base64-decode-region...
tramp-handle-file-local-copy: Invalid base64 data

(Yanked from *Messages*.)

Note that the error only occurs when using auto-revert-mode and when
there already is a remote file being visited (and hence
auto-reverted).  This problem and the crash do not occur if
auto-reverting of remote files is disabled using the option I posted.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-14 11:42     ` Kai Grossjohann
  2004-05-14 14:53       ` Luc Teirlinck
@ 2004-05-14 18:39       ` Luc Teirlinck
  2004-05-14 20:54         ` Kim F. Storm
  1 sibling, 1 reply; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-14 18:39 UTC (permalink / raw)
  Cc: emacs-devel

Here is another strange error occurring once in a while:

(find-file "/ssh:raven.dms.auburn.edu:sequences.texi" t)

(command yanked from minibuffer after C-x ESC ESC.)

result:

vc-rcs-fetch-master-state: File
/ssh:raven.dms.auburn.edu:/home/teirllm/sequences.texi,v is not an RCS
master file

(yanked from *Messages*).

That file is not under RCS control.

This was actually the _first_ remote file I tried to visit in that
session and it happens only every now and then.  The vast majority of
the time that I execute exactly the same command for exactly the same
file, I do not get that error.  In contrast, the base64 thing seems to
happen _consistently_ when I try to visit the _second_ remote file.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-14 14:53       ` Luc Teirlinck
@ 2004-05-14 20:48         ` Kai Grossjohann
  2004-05-16  9:27         ` Kai Grossjohann
  1 sibling, 0 replies; 55+ messages in thread
From: Kai Grossjohann @ 2004-05-14 20:48 UTC (permalink / raw)


Luc Teirlinck <teirllm@dms.auburn.edu> writes:

> Kai Grossjohann wrote:
>
>    Please see the variable tramp-chunksize, try to set it to 500 to see
>    what happens, and also do what the docstring says.
>
> I executed the function in the tramp-chunksize docstring.  Result:
>
> "Bytes sent: 1000	Bytes received: 1000"
>
> That seems to indicate that my system does not have the problem
> referred to in the docstring.
>
> I set tramp-chunksize to 500.  Still the same:
>
> tramp: Decoding remote file /ssh:raven.dms.auburn.edu:/home/teirllm/streams.texi with function base64-decode-region...
> tramp-handle-file-local-copy: Invalid base64 data
>
> (Yanked from *Messages*.)
>
> Note that the error only occurs when using auto-revert-mode and when
> there already is a remote file being visited (and hence
> auto-reverted).  This problem and the crash do not occur if
> auto-reverting of remote files is disabled using the option I posted.

Yeah, I guess it's two commands stepping on each other's toes:

You say C-x C-f, which transfers base64 data across the wire.
Auto-revert then gets triggered from a timer which just switches to
the connection buffer and issues its own base64 data transfer.

Now you get a happy mixture of base64 data from two different files,
and the consequences are no good :-/

It's really amazing that nobody has noticed these problems up to
now.  Tramp is not *that* young...

So I see two possible short-term kludges:

  - Disable auto-revert on remote files.

  - Have Tramp detect reentrant calls, and abort the call from the
    timer.  (I'm hoping that auto-revert will try again after a
    while, if one revert fails.)

I think the only sane long-term solution is to teach Tramp to use
multiple connections, and to use a new connection for each timer
function.  However, opening a new connection might entail user
interaction (for entering passwords), and I'm not sure that
auto-revert mode is prepared for this.

Actually, sane is the wrong word: it's the only solution where I have
at least some idea how to implement it.  My wetware is not up to
David's suggestion, I'm afraid.

Kai

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-14 18:39       ` Luc Teirlinck
@ 2004-05-14 20:54         ` Kim F. Storm
  0 siblings, 0 replies; 55+ messages in thread
From: Kim F. Storm @ 2004-05-14 20:54 UTC (permalink / raw)
  Cc: kai, emacs-devel

Luc Teirlinck <teirllm@dms.auburn.edu> writes:

> Here is another strange error occurring once in a while:
> 
> (find-file "/ssh:raven.dms.auburn.edu:sequences.texi" t)
> 
> (command yanked from minibuffer after C-x ESC ESC.)
> 
> result:
> 
> vc-rcs-fetch-master-state: File
> /ssh:raven.dms.auburn.edu:/home/teirllm/sequences.texi,v is not an RCS
> master file
> 
> (yanked from *Messages*).
> 
> That file is not under RCS control.
> 
> This was actually the _first_ remote file I tried to visit in that
> session and it happens only every now and then.  The vast majority of
> the time that I execute exactly the same command for exactly the same
> file, I do not get that error.  In contrast, the base64 thing seems to
> happen _consistently_ when I try to visit the _second_ remote file.

Could it be that when you have two files open, you always get in
trouble with global-auto-revert-mode sending tramp commands in the
middle of an active tramp request?

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 18:19 Fix to long-standing crashes in GC Lars Hansen
  2004-05-13 19:09 ` Luc Teirlinck
  2004-05-13 19:19 ` Stefan Monnier
@ 2004-05-14 21:02 ` Richard Stallman
  2004-05-22 18:09   ` Lars Hansen
  2004-05-15  4:39 ` Robert Marshall
  2004-05-17 14:43 ` Kim F. Storm
  4 siblings, 1 reply; 55+ messages in thread
From: Richard Stallman @ 2004-05-14 21:02 UTC (permalink / raw)
  Cc: emacs-devel

    #0  0x402ec781 in kill () from /lib/libc.so.6
    #1  0x080dfe0a in abort () at emacs.c:433
    #2  0x08128037 in mark_object (arg=1217041176) at alloc.c:5034
    #3  0x0812809a in mark_object (arg=-1463681024) at alloc.c:5051
    #4  0x08126b43 in mark_memory (start=0xbfffe070, end=0xbffff7e8)
	at alloc.c:3700
    #5  0x08126f73 in mark_stack () at alloc.c:4055

Can you figure out what data structure is being traversed by those
frames?  That's a lot of work, but it is the only way I know of to try
to debug a problem of this kind.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 18:19 Fix to long-standing crashes in GC Lars Hansen
                   ` (2 preceding siblings ...)
  2004-05-14 21:02 ` Richard Stallman
@ 2004-05-15  4:39 ` Robert Marshall
  2004-05-17 14:39   ` Kim F. Storm
  2004-05-17 14:43 ` Kim F. Storm
  4 siblings, 1 reply; 55+ messages in thread
From: Robert Marshall @ 2004-05-15  4:39 UTC (permalink / raw)


On Thu, 13 May 2004, Lars Hansen wrote:

> 
> For a long time I have experienced occasional crashes of CVS Emacs,
> usually during Tramp operations. I have four backtraces in case
> someone is interested, but I have not been able produce a test
> case. To me it looks as if GC is involved, but I am unexperienced
> with gdb. When Kim told he had fixed a bug involving GC and process
> output, it seemed at first very likely to be the cause of my
> troubles. However:
> 
> 1. In one of the crashes Tramp was not involved.
> 2. CVS Emacs of today with Kim's fix, just crashed during a Tramp
>    operation.
> 
> So now I don't know what to think.
> Here is a backtrace from todays crash:
> 
> (gdb) run
> Starting program: /home/lh/cvsroot/emacs/src/emacs
> 
> Program received signal SIGABRT, Aborted.
> 0x402ec781 in kill () from /lib/libc.so.6
> (gdb) bt
> #0  0x402ec781 in kill () from /lib/libc.so.6
> #1  0x080dfe0a in abort () at emacs.c:433
> #2  0x08128037 in mark_object (arg=1217041176) at alloc.c:5034
> #3  0x0812809a in mark_object (arg=-1463681024) at alloc.c:5051
> #4  0x08126b43 in mark_memory (start=0xbfffe070, end=0xbffff7e8)
>     at alloc.c:3700
> #5 0x08126f73 in mark_stack () at alloc.c:4055 6 0x08127549 in
> #Fgarbage_collect () at alloc.c:4429 7 0x081627f5 in Fbyte_code

I'm still getting what looks like this error with an old test case and the latest cvs

C-x C-f /su:root@localhost:/var/log/dansguardian/Access.log

it appears to be sensitive to the length of that file, it needs to be
somewhere around (at least for this machine!) 1.6M - it's currently
1684352

Robert
-- 
La grenouille songe..dans son château d'eau

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-14 14:53       ` Luc Teirlinck
  2004-05-14 20:48         ` Kai Grossjohann
@ 2004-05-16  9:27         ` Kai Grossjohann
  1 sibling, 0 replies; 55+ messages in thread
From: Kai Grossjohann @ 2004-05-16  9:27 UTC (permalink / raw)
  Cc: emacs-devel

Luc Teirlinck <teirllm@dms.auburn.edu> writes:

> Kai Grossjohann wrote:
>
>    Please see the variable tramp-chunksize, try to set it to 500 to see
>    what happens, and also do what the docstring says.
>
> I executed the function in the tramp-chunksize docstring.  Result:
>
> "Bytes sent: 1000	Bytes received: 1000"
>
> That seems to indicate that my system does not have the problem
> referred to in the docstring.

One person has reported that the Lisp needs to be applied to a much
larger number than 1000 to see the problem.

But I think that in your case it is the reentrant calls of Tramp that
are causing the problems.

Kai

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-15  4:39 ` Robert Marshall
@ 2004-05-17 14:39   ` Kim F. Storm
  2004-05-17 17:42     ` Robert Marshall
  0 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-17 14:39 UTC (permalink / raw)
  Cc: emacs-devel

Robert Marshall <robert@chezmarshall.freeserve.co.uk> writes:

> I'm still getting what looks like this error with an old test case and the latest cvs
> 
> C-x C-f /su:root@localhost:/var/log/dansguardian/Access.log
> 
> it appears to be sensitive to the length of that file, it needs to be
> somewhere around (at least for this machine!) 1.6M - it's currently
> 1684352

Please try again with the latest CVS.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-13 18:19 Fix to long-standing crashes in GC Lars Hansen
                   ` (3 preceding siblings ...)
  2004-05-15  4:39 ` Robert Marshall
@ 2004-05-17 14:43 ` Kim F. Storm
  2004-05-18  0:13   ` Luc Teirlinck
  4 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-17 14:43 UTC (permalink / raw)
  Cc: emacs-devel

Lars Hansen <larsh@math.ku.dk> writes:

> For a long time I have experienced occasional crashes of CVS Emacs,
> usually during Tramp operations. I have four backtraces in case
> someone is interested, but I have not been able produce a test
> case. To me it looks as if GC is involved, but I am unexperienced with
> gdb. When Kim told he had fixed a bug involving GC and process output,
> it seemed at first very likely to be the cause of my
> troubles. However:
> 
> 1. In one of the crashes Tramp was not involved.
> 2. CVS Emacs of today with Kim's fix, just crashed during a Tramp operation.
> 
> So now I don't know what to think.

I just installed a change which I hope will fix these crashes.
Please report if you stil experience crashes in mark_object.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-17 14:39   ` Kim F. Storm
@ 2004-05-17 17:42     ` Robert Marshall
  0 siblings, 0 replies; 55+ messages in thread
From: Robert Marshall @ 2004-05-17 17:42 UTC (permalink / raw)
  Cc: emacs-devel

On 17 May 2004, Kim F. Storm wrote:

> Robert Marshall <robert@chezmarshall.freeserve.co.uk> writes:
> 
>> I'm still getting what looks like this error with an old test case and
>> the latest cvs
>> 
>> C-x C-f /su:root@localhost:/var/log/dansguardian/Access.log
>> 
>> it appears to be sensitive to the length of that file, it needs to be
>> somewhere around (at least for this machine!) 1.6M - it's currently
>> 1684352
> 
> Please try again with the latest CVS.
> 

Looks good, that one now doesn't crash

Thank you!

Robert

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-17 14:43 ` Kim F. Storm
@ 2004-05-18  0:13   ` Luc Teirlinck
  2004-05-19  1:26     ` Richard Stallman
  0 siblings, 1 reply; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-18  0:13 UTC (permalink / raw)
  Cc: larsh, emacs-devel

Kim Storm wrote:

   I just installed a change which I hope will fix these crashes.
   Please report if you stil experience crashes in mark_object.

I still do and pretty easily in a fully customized Emacs.  If I try to
visit a remote file using Tramp-ssh, I get an immediate crash.  The
fact that the crash is now so immediate can not possibly be due to
your Lisp_Misc_Free change, but several other changes have occurred
since I last updated following Andreas' fix early Saturday (US Central).

When I previously replied, I had not yet updated my CVS and tried out
your fix.  I just _assumed_ that it would "obviously" take care of all
my crashes, because every single one of my many crashes was caused by
a Lisp_Misc_Free type.

The problem seems to be that after a while the data type itself of
these objects seems to get corrupted.  At that stage, your fix does
not work any more, because now they have an _invalid_ data type,
instead of being of type Lisp_Misc_Free.  Then we get the abort anyway.  

Apparently, your fix worked for Robert.  With me, my updated Emacs
crashes immediately when I try to visit a file using Tramp-ssh,
_when using a fully customized Emacs_.  Apparently, one needs enough
"activity" to get these marker types corrupted quickly.  In my fully
customized Emacs, there is much more timer activity than auto-revert.

I have plenty of information and could try to produce a "cleaner"
crash with emacs -q and only a few customizations, but at the moment
I will just give the information that I personally believe is
relevant.  The object with the invalid data type _is_ that same
marker, I recognize the -37.

One could completely fix the crashes by making gc not only fail to
abort when discovering still accessible Lisp objects whose memory has
been freed (as your fix currently does), but when detecting Lisp
objects with invalid data type as well.  I do not know enough about
very low level Emacs stuff like this, but would that not be dangerous?
Should we not try to fix the problem completely differently and try to
find out_why_ the memory for these markers was erroneously freed and
fix _that_ problem?  Then gc could continue to try to detect Lisp
objects whose memory was erroneously freed, which, at first sight,
would seem like a safer thing to do (but then again, I do not know
enough about this to judge).

(gdb) bt
#0  abort () at emacs.c:434
#1  0x0812a589 in mark_object (arg=143587538) at alloc.c:5042
#2  0x0812a5ea in mark_object (arg=143787141) at alloc.c:5059
#3  0x0812a5ea in mark_object (arg=143785989) at alloc.c:5059
#4  0x0812948a in mark_memory (start=0xbfffb6a0, end=0xbffff5ac)
    at alloc.c:3781
#5  0x081294f5 in mark_stack () at alloc.c:4055
#6  0x08129aba in Fgarbage_collect () at alloc.c:4429

(gdb) p last_marked_index
$1 = 18
(gdb) p last_marked[17]
$2 = 143587538
(gdb) pr
#<EMACS BUG: INVALID DATATYPE (MISC 0x0002) Save your buffers immediately and please report this bug>
(gdb) xtype
Lisp_Misc
2
(gdb) xmiscfree
$3 = (struct Lisp_Free *) 0x88ef8d0
(gdb) p *$3
$4 = {
  type = 2, 
  gcmarkbit = 1, 
  spacer = 0, 
  chain = 0x2, 
  padding = "\000\000\000\000<\332\216\b\002\000\000\000\002\000\000"
}
(gdb) p last_marked[16]
$5 = 143787141
(gdb) pr
(#<EMACS BUG: INVALID DATATYPE (MISC 0x0002) Save your buffers immediately and please report this bug> . -37)
(gdb) p last_marked[15]
$6 = 143787133
(gdb) pr
((#<EMACS BUG: INVALID DATATYPE (MISC 0x0002) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x001c) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0020) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers immediately and please report this bug> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 21 in *tramp/ssh raven.dms.auburn.edu*> . -20))
(gdb) p last_marked[14]
$7 = -296
(gdb) pr
-37
(gdb) p last_marked[13]
$10 = 143806834
(gdb) pr
#<marker in no buffer>
(gdb) p last_marked[12]
$11 = 143787125
(gdb) pr
(#<marker in no buffer> . -37)
(gdb) p last_marked[11]
$12 = 143787117
(gdb) pr
((#<marker in no buffer> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0002) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x001c) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0020) Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers immediately and please report this bug> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 21 in *tramp/ssh raven.dms.auburn.edu*> . -20))
(gdb) p last_marked[10]
$13 = -296
(gdb) pr
-37
(gdb) p last_marked[9]
$14 = 143806858
(gdb) pr
#<marker in no buffer>
(gdb) p last_marked[8]
$15 = 143787109
(gdb) pr
(#<marker in no buffer> . -37)
(gdb) p last_marked[7]
$18 = 143787101
(gdb) pr
((#<marker in no buffer> . -37) (#<marker in no buffer> . -37) (#<EMACS BUG: INV
ALID DATATYPE (MISC 0x0002) Save your buffers immediately and please report this
 bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers imme
diately and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 
0x001c) Save your buffers immediately and please report this bug> . -37) (#<EMAC
S BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers immediately and please r
eport this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0020) Save your bu
ffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DATAT
YPE (MISC 0x0000) Save your buffers immediately and please report this bug> . -3
7) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 21 in *tramp/ssh raven.dms.auburn.edu*> . -20))
(gdb) p last_marked[6]
$19 = -296
(gdb) pr
-37
(gdb) p last_marked[5]
$20 = 143807602
(gdb) pr
#<marker at 14 in sequences.texi>
(gdb) p last_marked[4]
$21 = 143787093
(gdb) pr              
(#<marker at 14 in sequences.texi> . -37)
(gdb) p last_marked[3]
$22 = 143787085
(gdb) pr
((#<marker at 14 in sequences.texi> . -37) (#<marker in no buffer> . -37) (#<mar
ker in no buffer> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0002) Save your 
buffers immediately and please report this bug> . -37) (#<EMACS BUG: INVALID DAT
ATYPE (MISC 0x0000) Save your buffers immediately and please report this bug> . 
-37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x001c) Save your buffers immediately 
and please report this bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) 
Save your buffers immediately and please report this bug> . -37) (#<EMACS BUG: I
NVALID DATATYPE (MISC 0x0020) Save your buffers immediately and please report th
is bug> . -37) (#<EMACS BUG: INVALID DATATYPE (MISC 0x0000) Save your buffers im
mediately and please report this bug> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 21 in *tramp/ssh raven.dms.auburn.edu*> . -20))
(gdb) p last_marked[2]
$23 = -296
(gdb) pr
-37
(gdb) p last_marked[1]
$24 = 143807626
(gdb) pr
#<marker at 7 in sequences.texi>
(gdb) p last_marked[0]
$25 = 143787077
(gdb) pr
(#<marker at 7 in sequences.texi> . -37)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-18  0:13   ` Luc Teirlinck
@ 2004-05-19  1:26     ` Richard Stallman
  2004-05-19 12:11       ` Kim F. Storm
  2004-05-19 12:52       ` Kim F. Storm
  0 siblings, 2 replies; 55+ messages in thread
From: Richard Stallman @ 2004-05-19  1:26 UTC (permalink / raw)
  Cc: larsh, emacs-devel, storm

Please try finding out *precisely* which stack slot
mark_memory is currently examining.  Which stack frame is it in?
What variable is it?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19  1:26     ` Richard Stallman
@ 2004-05-19 12:11       ` Kim F. Storm
  2004-05-19 19:32         ` Stefan Monnier
  2004-05-19 12:52       ` Kim F. Storm
  1 sibling, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-19 12:11 UTC (permalink / raw)
  Cc: Luc Teirlinck, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> Please try finding out *precisely* which stack slot
> mark_memory is currently examining.  Which stack frame is it in?
> What variable is it?

Got it.

In Feval, it points to the backtrace.evalargs member.

#4  0x0816cb3e in mark_maybe_pointer (p=0x8970000) at alloc.c:3803


(gdb) x/32 &backtrace
0xbffe6980:	0xbffe6b10	0xbffe69b4	0xbffe69b0	0xffffffff
0xbffe6990:	0x08970000	0xbffe6960	0x00000002	0x08185456
                ========== here
0xbffe69a0:	0xbffe6ac0	0xbffe69d4	0xbffe69d0	0xffffffff
0xbffe69b0:	0x087a3815	0x084458b9	0x08947f55	0x083eb11c
0xbffe69c0:	0xbffe6b10	0xbffe69f4	0xbffe6a08	0x0818127d
0xbffe69d0:	0x087a382d	0x084458e9	0x088c96f9	0x083eb14c
0xbffe69e0:	0x082188bc	0x08947f3d	0xbffe6a28	0x08430d79
0xbffe69f0:	0x087a38c5	0x08446a81	0x085e3725	0x08430d91

struct backtrace
{
  struct backtrace *next;
  Lisp_Object *function;
  Lisp_Object *args;	/* Points to vector of args. */
  int nargs;		/* Length of vector.
			   If nargs is UNEVALLED, args points to slot holding
			   list of unevalled args */
  char evalargs;
  /* Nonzero means call value of debugger when done with this operation. */
  char debug_on_exit;
};

So setting evalargs and debug_on_exit clears the lower part of the Lisp_Object
which happened to be on the stack on entry, but not the rest of it, so it
points into a legal cons_block cell, but one which probably isn't in use
anymore.

I will install a fix for this specific issue shortly.


However, this reveals a more serious issue:

YOU REALLY CANNOT RELY 100% ON STACK POINTERS ACTUALLY
POINTING TO VALID DATA.

When traversing down a stack pointer object, you may end up in areas
of the memory which has been reused for other purposes (so it is
still "valid" Lisp data, but not of the right type).

So the simplest thing - IMO - is to silently ignore unrecognized items
while scanning through the stack -- In the present case, there
probably isn't anything wrong anywhere, GC just happens to stumble
into reused memory.

Here is the data pointed to by the offending stack pointer:

((#<marker at 1 in HISTORY> . -37)
 (#<marker at 44 in HISTORY> . -37)
 (#<marker at 44 in HISTORY> . -37)
 (#<misc free cell> . -37)
 (#<misc free cell> . -37)
 (#<misc free cell> . -37)
 (#<misc free cell> . -37)
 (#<misc free cell> . -37)
 (#<misc free cell> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<marker in no buffer> . -37)
 (#<EMACS BUG: INVALID DATATYPE (MISC 0x0d79) Save your buffers immediately and please report this bug> . -37)
 (#<EMACS BUG: INVALID DATATYPE (MISC 0x0d79) Save your buffers immediately and please report this bug> . -37)
 (#<EMACS BUG: INVALID DATATYPE (MISC 0x27fc) Save your buffers immediately and please report this bug> . -37)
 (#<EMACS BUG: INVALID DATATYPE (MISC 0xffffddb9) Save your buffers immediately and please report this bug> . -37)
 (#<EMACS BUG: INVALID DATATYPE (MISC 0xffffddb9) Save your buffers immediately and please report this bug> . -37)
 (#<misc free cell> . -37) (1 . 58) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 38) ("tramp_exit_status 0
(nil 1 500 501 (16555 37172) (15086 40970) (16442 20332) 4705 33204 t (25 . 3297) -1)
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -106) (21 . 107) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -20) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -20) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -20) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -20) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 38) ("tramp_exit_status 0
(nil 1 500 501 (16555 37172) (15086 40970) (16442 20332) 4705 33204 t (25 . 3297) -1)
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -106) (21 . 107) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -20) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58) ("tramp_exit_status 0
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -20) ("
///00db4cb378364b4e7fcfd7777f3c1b1f
" . 21) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -37) (1 . 58)
 ("(nil 1 500 501 (16555 37172) (15086 40970) (16442 20332) 4705 33204 t (25 . 3297) -1)
" . 1) (#<marker at 41 in *tramp/ssh 222.5.4.127*> . -86))


--
Kim F. Storm  http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19  1:26     ` Richard Stallman
  2004-05-19 12:11       ` Kim F. Storm
@ 2004-05-19 12:52       ` Kim F. Storm
  2004-05-19 16:48         ` Stefan Monnier
  2004-05-20  7:08         ` Richard Stallman
  1 sibling, 2 replies; 55+ messages in thread
From: Kim F. Storm @ 2004-05-19 12:52 UTC (permalink / raw)
  Cc: Luc Teirlinck, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> Please try finding out *precisely* which stack slot
> mark_memory is currently examining.  Which stack frame is it in?
> What variable is it?
> 

Found another case where a stack pointer points to bogus data,
this time in Flet, variable *temps:

DEFUN ("let", Flet, Slet, 1, UNEVALLED, 0,
       doc: /* Bind variables according to VARLIST then eval BODY.
The value of the last form in BODY is returned.
Each element of VARLIST is a symbol (which is bound to nil)
or a list (SYMBOL VALUEFORM) (which binds SYMBOL to the value of VALUEFORM).
All the VALUEFORMs are evalled before any symbols are bound.
usage: (let VARLIST BODY...)  */)
     (args)
     Lisp_Object args;
{
  Lisp_Object *temps, tem;
  register Lisp_Object elt, varlist;
  int count = SPECPDL_INDEX ();
  register int argnum;
  struct gcpro gcpro1, gcpro2;

  varlist = Fcar (args);

  /* Make space to hold the values to give the bound variables */
  elt = Flength (varlist);
  temps = (Lisp_Object *) alloca (XFASTINT (elt) * sizeof (Lisp_Object));

  /* Compute the values and store them in `temps' */

  GCPRO2 (args, *temps);
  gcpro2.nvars = 0;

  for (argnum = 0; !NILP (varlist); varlist = Fcdr (varlist))
    {
      QUIT;
      elt = Fcar (varlist);
      if (SYMBOLP (elt))
	temps [argnum++] = Qnil;
      else if (! NILP (Fcdr (Fcdr (elt))))
	Fsignal (Qerror,
		 Fcons (build_string ("`let' bindings can have only one value-form"),
			elt));
      else
	temps [argnum++] = Feval (Fcar (Fcdr (elt)));

Here we call Feval, which -- at some point in time will trigger GC --
and "temps" is filled with random data, some of which are bogus Lisp
object pointers.


There may be MANY places where this happens -- it's just that with
the tramp + global-auto-revert-mode cases seen so far, the number
of frames on the stack is in the range 200-300, thus making the risk
of triggering this kind of error bigger than what we normally see...

Either we have to explicitly ensure that we never have any bogus
pointer on the stack, e.g bzero all alloca'ed memory, or we
must accept (i.e. ignore) bogus objects that we find via the stack.



-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 12:52       ` Kim F. Storm
@ 2004-05-19 16:48         ` Stefan Monnier
  2004-05-19 22:04           ` Kim F. Storm
  2004-05-20  7:08         ` Richard Stallman
  1 sibling, 1 reply; 55+ messages in thread
From: Stefan Monnier @ 2004-05-19 16:48 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

>> Please try finding out *precisely* which stack slot
>> mark_memory is currently examining.  Which stack frame is it in?
>> What variable is it?
>> 

> Found another case where a stack pointer points to bogus data,
> this time in Flet, variable *temps:

What do you mean by "found"?  Did you inspect the code looking for
suspicious things, or did GDB lead you there?

> Here we call Feval, which -- at some point in time will trigger GC --
> and "temps" is filled with random data, some of which are bogus Lisp
> object pointers.

If gcpros are used, it seems safe: the gcpro2.nvars is initially set to 0 so
none of the random values in the uninitialized `temps' array are considered
and then as the array gets filled gcpro2.nvars is incremented accordingly.
Looks fine.

If gcpros are not used (i.e. we use conservative stack scanning), it
shouldn't be a problem either because the conservative scan goes through
some trouble to ensure that it ignores words pointing to non-GC-managed
(or non-live) objects.

So I think w need to look further.


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 12:11       ` Kim F. Storm
@ 2004-05-19 19:32         ` Stefan Monnier
  2004-05-19 22:33           ` Kim F. Storm
  2004-05-20 13:17           ` Richard Stallman
  0 siblings, 2 replies; 55+ messages in thread
From: Stefan Monnier @ 2004-05-19 19:32 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

>> Please try finding out *precisely* which stack slot
>> mark_memory is currently examining.  Which stack frame is it in?
>> What variable is it?

> Got it.

> In Feval, it points to the backtrace.evalargs member.

> #4  0x0816cb3e in mark_maybe_pointer (p=0x8970000) at alloc.c:3803

> (gdb) x/32 &backtrace
> 0xbffe6980:	0xbffe6b10	0xbffe69b4	0xbffe69b0	0xffffffff
> 0xbffe6990:	0x08970000	0xbffe6960	0x00000002	0x08185456
>                 ========== here
> 0xbffe69a0:	0xbffe6ac0	0xbffe69d4	0xbffe69d0	0xffffffff
> 0xbffe69b0:	0x087a3815	0x084458b9	0x08947f55	0x083eb11c
> 0xbffe69c0:	0xbffe6b10	0xbffe69f4	0xbffe6a08	0x0818127d
> 0xbffe69d0:	0x087a382d	0x084458e9	0x088c96f9	0x083eb14c
> 0xbffe69e0:	0x082188bc	0x08947f3d	0xbffe6a28	0x08430d79
> 0xbffe69f0:	0x087a38c5	0x08446a81	0x085e3725	0x08430d91

> struct backtrace
> {
>   struct backtrace *next;
>   Lisp_Object *function;
>   Lisp_Object *args;	/* Points to vector of args. */
>   int nargs;		/* Length of vector.
> 			   If nargs is UNEVALLED, args points to slot holding
> 			   list of unevalled args */
>   char evalargs;
>   /* Nonzero means call value of debugger when done with this operation. */
>   char debug_on_exit;
> };

> So setting evalargs and debug_on_exit clears the lower part of the
> Lisp_Object which happened to be on the stack on entry, but not the rest
> of it, so it points into a legal cons_block cell, but one which probably
> isn't in use anymore.

> I will install a fix for this specific issue shortly.

Please show us the fix first.

> However, this reveals a more serious issue:

> YOU REALLY CANNOT RELY 100% ON STACK POINTERS ACTUALLY
> POINTING TO VALID DATA.

But that's already checked:
mark_memory calls makr_maybe_object and makr_maybe_pointer, but of which
check for every object whether it's currently live (which really means
"allocated" in the sense that it hasn't been reclaimed by the GC yet).
That's what live_consp_, live_stringp_p, ... are for.

> Here is the data pointed to by the offending stack pointer:

> ((#<marker at 1 in HISTORY> . -37)
>  (#<marker at 44 in HISTORY> . -37)
>  (#<marker at 44 in HISTORY> . -37)
>  (#<misc free cell> . -37)
>  (#<misc free cell> . -37)
>  (#<misc free cell> . -37)

I strongly suspect that we're looking at a consequence of another bug.
Either this cons cell (pointed to by evalargs) is indeed live and then it
should not lead to any misc_free object, or it is incorrectly considered as
live (maybe because of a bug in live_cons_p, or because of a bug in the
code that marks cons cells as dead in gc_sweep, or because of some wild
pointer overwriting the corresponding data, ...).


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 16:48         ` Stefan Monnier
@ 2004-05-19 22:04           ` Kim F. Storm
  2004-05-19 22:25             ` Stefan Monnier
  0 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-19 22:04 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> >> Please try finding out *precisely* which stack slot
> >> mark_memory is currently examining.  Which stack frame is it in?
> >> What variable is it?
> >> 
> 
> > Found another case where a stack pointer points to bogus data,
> > this time in Flet, variable *temps:
> 
> What do you mean by "found"?  Did you inspect the code looking for
> suspicious things, or did GDB lead you there?

I removed my hack to accept Lisp_Misc_Free objects (to better
understand where the problem originated), and then emacs crashed
there...  so GDB lead me there.

> 
> > Here we call Feval, which -- at some point in time will trigger GC --
> > and "temps" is filled with random data, some of which are bogus Lisp
> > object pointers.
> 
> If gcpros are used, it seems safe: the gcpro2.nvars is initially set to 0 so
> none of the random values in the uninitialized `temps' array are considered
> and then as the array gets filled gcpro2.nvars is incremented accordingly.
> Looks fine.

Yes, that works fine.

> 
> If gcpros are not used (i.e. we use conservative stack scanning), it
> shouldn't be a problem either because the conservative scan goes through
> some trouble to ensure that it ignores words pointing to non-GC-managed
> (or non-live) objects.

I have now found two different cases where a pointer on the stack
points to GC-managed memory (mem_find finds it), to a cons cell with a
Lisp_Misc_Free in the car and a bogus list in the cdr.  I suspect there
are many more such cases.

I included one such list in my previous mail.  So even if we accepted
the Lisp_Misc_Free object (as I said, I reverted that change locally to
understand the problem), following the cdr would lead to a bad Misc
object anyway.

> 
> So I think w need to look further.

Looking for what? 

More proff that current functionality is broken, or ways to fix it?


-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 22:04           ` Kim F. Storm
@ 2004-05-19 22:25             ` Stefan Monnier
  2004-05-19 22:37               ` Kim F. Storm
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Monnier @ 2004-05-19 22:25 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

> I have now found two different cases where a pointer on the stack
> points to GC-managed memory (mem_find finds it), to a cons cell with a
> Lisp_Misc_Free in the car and a bogus list in the cdr.  I suspect there
> are many more such cases.

Right.  And it should never happen.

> I included one such list in my previous mail.

Yes, I've seen it.  Thank you.

>> So I think w need to look further.
> Looking for what?

For what causes Lisp_Misc_Free cells to appear in the car of cons cells that
are apparently live.  I'm not sure how best to do that.  Maybe we should
write a GC-sanity-check routine (scan all cons blocks looking for live cells
with misc free objects in them) that we could call at GC-entry and GC-exit,
this way we could at least see whether the problem happens during GC or
between GC.


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 19:32         ` Stefan Monnier
@ 2004-05-19 22:33           ` Kim F. Storm
  2004-05-20 13:17           ` Richard Stallman
  1 sibling, 0 replies; 55+ messages in thread
From: Kim F. Storm @ 2004-05-19 22:33 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > So setting evalargs and debug_on_exit clears the lower part of the
> > Lisp_Object which happened to be on the stack on entry, but not the rest
> > of it, so it points into a legal cons_block cell, but one which probably
> > isn't in use anymore.
> 
> > I will install a fix for this specific issue shortly.
> 
> Please show us the fix first.

The fix was to pad/zero-fill the backtrace structure.

Understanding more about what's going on now, that fix is not needed,
so I will not install it.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 22:25             ` Stefan Monnier
@ 2004-05-19 22:37               ` Kim F. Storm
  2004-05-19 22:50                 ` Stefan Monnier
  0 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-19 22:37 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > I have now found two different cases where a pointer on the stack
> > points to GC-managed memory (mem_find finds it), to a cons cell with a
> > Lisp_Misc_Free in the car and a bogus list in the cdr.  I suspect there
> > are many more such cases.
> 
> Right.  And it should never happen.
> 
> > I included one such list in my previous mail.
> 
> Yes, I've seen it.  Thank you.
> 
> >> So I think w need to look further.
> > Looking for what?
> 
> For what causes Lisp_Misc_Free cells to appear in the car of cons cells that
> are apparently live.  I'm not sure how best to do that.  Maybe we should
> write a GC-sanity-check routine (scan all cons blocks looking for live cells
> with misc free objects in them) that we could call at GC-entry and GC-exit,
> this way we could at least see whether the problem happens during GC or
> between GC.

Yesterday, I did a partial test where I wrote a fun to check all
Lisp_Misc blocks for invalid misc types (except Lisp_Misc_Free) for
every call to Fcons -- and no such problems occurred, i.e. the bad
misc cells pointed to from some cons cells were NOT in the marker
block chain...

I will make another test to check the cons blocks in a similar way.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 22:37               ` Kim F. Storm
@ 2004-05-19 22:50                 ` Stefan Monnier
  2004-05-20  0:44                   ` Kim F. Storm
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Monnier @ 2004-05-19 22:50 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

> Yesterday, I did a partial test where I wrote a fun to check all
> Lisp_Misc blocks for invalid misc types (except Lisp_Misc_Free) for
> every call to Fcons -- and no such problems occurred, i.e. the bad
> misc cells pointed to from some cons cells were NOT in the marker
> block chain...

Hmm... was this run before, after, or during GC ?
[ I assume it was before. ]
Can you use the mem_find to try and figure out whether it's pointing at a
GC-managed block or not?
Can you also disable the block-release code (the one that calls xfree on
blocks of cnos cells and lisp_misc objects) to see if it is maybe pointing
to blocks that have been reclaimed?


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 22:50                 ` Stefan Monnier
@ 2004-05-20  0:44                   ` Kim F. Storm
  2004-05-21 23:43                     ` Kim F. Storm
  0 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-20  0:44 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > Yesterday, I did a partial test where I wrote a fun to check all
> > Lisp_Misc blocks for invalid misc types (except Lisp_Misc_Free) for
> > every call to Fcons -- and no such problems occurred, i.e. the bad
> > misc cells pointed to from some cons cells were NOT in the marker
> > block chain...
> 
> Hmm... was this run before, after, or during GC ?

It was on every entry to Fcons -- so pretty often.
I also call'ed the check function directly from gdb after it had
crashed, and there were still no invalid misc objects...

> Can you use the mem_find to try and figure out whether it's pointing at a
> GC-managed block or not?

IIRC, it was pointing to a MEM_CONS block.

> Can you also disable the block-release code (the one that calls xfree on
> blocks of cnos cells and lisp_misc objects) to see if it is maybe pointing
> to blocks that have been reclaimed?

I will try that.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 12:52       ` Kim F. Storm
  2004-05-19 16:48         ` Stefan Monnier
@ 2004-05-20  7:08         ` Richard Stallman
  2004-05-21 22:58           ` Stefan Monnier
  1 sibling, 1 reply; 55+ messages in thread
From: Richard Stallman @ 2004-05-20  7:08 UTC (permalink / raw)
  Cc: teirllm, emacs-devel

    Either we have to explicitly ensure that we never have any bogus
    pointer on the stack, e.g bzero all alloca'ed memory, or we
    must accept (i.e. ignore) bogus objects that we find via the stack.

Since mark_memory cannot distinguish integer variables and C pointers
from Lisp_Object variables, it will always be possible that some
incorrect but apparently meaningful address appears in a stack slot.
The design principle is that this should never cause an error
or crash, it should at worst preserve some data that ought to be garbage.

It looks like our code fails in many ways to follow that principle.

When mark_object is compiled with GC_CHECK_MARKED_OBJECTS, it aborts
when it finds a pointer that isn't meaningful.  That is not right, but
it indicates the wrong basic expectation is followed in marking.

Marking a misc object sets the mark bit in it.  If this supposed misc
object isn't really other data, that will destroy the other data
there.  Likewise for symbols.  With conservative stack marking, it
would seem that using a mark bit inside an object is a bug, unless
mark_object can first verify the object is real.

mark_buffer seems to have a similar problem.

I think GETMARKBIT has a similar problem, in that it assumes
that the cons or float is contained in a real cons block or a real
float block.  It finds the address of that block by address
calculations.  If it found random data that points to a supposed
cons cell in the wrong place, the address calculations will give
an address that doesn't really correspond to a cons block.
When it tries to find the mark bits of that block, it can crash
on an invalid pointer, or even find a valid-looking pointer to
other data and garble it.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-19 19:32         ` Stefan Monnier
  2004-05-19 22:33           ` Kim F. Storm
@ 2004-05-20 13:17           ` Richard Stallman
  1 sibling, 0 replies; 55+ messages in thread
From: Richard Stallman @ 2004-05-20 13:17 UTC (permalink / raw)
  Cc: emacs-devel, teirllm, storm

    But that's already checked:
    mark_memory calls makr_maybe_object and makr_maybe_pointer, but of which
    check for every object whether it's currently live (which really means
    "allocated" in the sense that it hasn't been reclaimed by the GC yet).
    That's what live_consp_, live_stringp_p, ... are for.

I had not noticed the function mark_maybe_object when I wrote
my previous message.  Now I am as puzzled as you are by this
failure.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-20  7:08         ` Richard Stallman
@ 2004-05-21 22:58           ` Stefan Monnier
  0 siblings, 0 replies; 55+ messages in thread
From: Stefan Monnier @ 2004-05-21 22:58 UTC (permalink / raw)
  Cc: emacs-devel, teirllm, Kim F. Storm

> Marking a misc object sets the mark bit in it.  If this supposed misc
> object isn't really other data, that will destroy the other data
> there.

misc_live_p is supposed to make sure this never happens.

> Likewise for symbols.  With conservative stack marking, it
> would seem that using a mark bit inside an object is a bug, unless
> mark_object can first verify the object is real.

mark_object is only called after checking that the presumed pointer does
point to a live object of the expected type.

> mark_buffer seems to have a similar problem.

And the same solution is applied.

> I think GETMARKBIT has a similar problem, in that it assumes
> that the cons or float is contained in a real cons block or a real
> float block.  It finds the address of that block by address
> calculations.  If it found random data that points to a supposed
> cons cell in the wrong place, the address calculations will give
> an address that doesn't really correspond to a cons block.

That's what find_mem is for: to make sure it's indeed one of our cons
blocks or float block (and which of the two).

> When it tries to find the mark bits of that block, it can crash
> on an invalid pointer, or even find a valid-looking pointer to
> other data and garble it.

Only if there's a bug in the mem_find code.


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-20  0:44                   ` Kim F. Storm
@ 2004-05-21 23:43                     ` Kim F. Storm
  2004-05-23  1:14                       ` Stefan Monnier
                                         ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Kim F. Storm @ 2004-05-21 23:43 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel


I have now installed some further changes to the GC code to
fix problems with invalid Lisp_Misc objects.

The main problem was that marker blocks could be freed while
still being pointed to by buffer undo lists, but I made other
minor changes to make the code more robust.

With my fix, marker blocks are never freed (they remain
on the free marker lists) -- this can be improved by keeping
the "all free" blocks on a separate list and freeing them
at the end of GC.  I may do that later...

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-14 21:02 ` Richard Stallman
@ 2004-05-22 18:09   ` Lars Hansen
  2004-05-23 16:33     ` Eli Zaretskii
  0 siblings, 1 reply; 55+ messages in thread
From: Lars Hansen @ 2004-05-22 18:09 UTC (permalink / raw)
  Cc: emacs-devel

Richard Stallman wrote:

>    #0  0x402ec781 in kill () from /lib/libc.so.6
>    #1  0x080dfe0a in abort () at emacs.c:433
>    #2  0x08128037 in mark_object (arg=1217041176) at alloc.c:5034
>    #3  0x0812809a in mark_object (arg=-1463681024) at alloc.c:5051
>    #4  0x08126b43 in mark_memory (start=0xbfffe070, end=0xbffff7e8)
>	at alloc.c:3700
>    #5  0x08126f73 in mark_stack () at alloc.c:4055
>
>Can you figure out what data structure is being traversed by those
>frames?  That's a lot of work, but it is the only way I know of to try
>to debug a problem of this kind.
>
I am afraid I do not know enough about gdb and Emacs memory management 
to do that.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-21 23:43                     ` Kim F. Storm
@ 2004-05-23  1:14                       ` Stefan Monnier
  2004-05-23 18:28                       ` Richard Stallman
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Stefan Monnier @ 2004-05-23  1:14 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

> The main problem was that marker blocks could be freed while
> still being pointed to by buffer undo lists, but I made other
> minor changes to make the code more robust.

I don't have access to the code right now, so I can't see whether you've
added comments that explain the issue.  Could you either explain the issue
(i.e. a sample scenario) here or in a comment in the code?
If the code already has it, please ignore this email.


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-23 16:33     ` Eli Zaretskii
@ 2004-05-23 16:32       ` Luc Teirlinck
  2004-05-23 17:11         ` Lars Hansen
  2004-05-24  5:30         ` Eli Zaretskii
  0 siblings, 2 replies; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-23 16:32 UTC (permalink / raw)
  Cc: larsh, emacs-devel

Eli Zaretskii wrote:

   > >Can you figure out what data structure is being traversed by those
   > >frames?  That's a lot of work, but it is the only way I know of to try
   > >to debug a problem of this kind.
   > >
   > I am afraid I do not know enough about gdb and Emacs memory management 
   > to do that.

   There's some guidance in the file etc/DEBUG (search for "in GC"), and
   you can always ask for more specific help here.

Yes, except that the guidance there seems to assume that the person
doing the debugging has a pretty good knowledge about Emacs memory
management.  Anyway, I believe that the problem has been fixed by
recent changes made by Kim.  If Lars still has his old GDB session
around, then I guess that he might see something very similar to:

(gdb) p last_marked_index
$197 = 18
(gdb) p last_marked[17]
$198 = 145714794
(gdb) pr
#<misc free cell>
(gdb) p last_marked[16]
$199 = 148693005
(gdb) pr
(#<misc free cell> . -37)

That is, the problem was caused by trying to mark a Lisp_Misc_Free
object.  If Lars sees anything else, that might be interesting.  I
guess that Lars will not be able to produce the crash any more with
the latest CVS.  If he still can, then he should definitely report
that, since we believe that these crashes are fixed.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-22 18:09   ` Lars Hansen
@ 2004-05-23 16:33     ` Eli Zaretskii
  2004-05-23 16:32       ` Luc Teirlinck
  0 siblings, 1 reply; 55+ messages in thread
From: Eli Zaretskii @ 2004-05-23 16:33 UTC (permalink / raw)
  Cc: emacs-devel

> Date: Sat, 22 May 2004 20:09:47 +0200
> From: Lars Hansen <larsh@math.ku.dk>
> 
> Richard Stallman wrote:
> 
> >    #0  0x402ec781 in kill () from /lib/libc.so.6
> >    #1  0x080dfe0a in abort () at emacs.c:433
> >    #2  0x08128037 in mark_object (arg=1217041176) at alloc.c:5034
> >    #3  0x0812809a in mark_object (arg=-1463681024) at alloc.c:5051
> >    #4  0x08126b43 in mark_memory (start=0xbfffe070, end=0xbffff7e8)
> >	at alloc.c:3700
> >    #5  0x08126f73 in mark_stack () at alloc.c:4055
> >
> >Can you figure out what data structure is being traversed by those
> >frames?  That's a lot of work, but it is the only way I know of to try
> >to debug a problem of this kind.
> >
> I am afraid I do not know enough about gdb and Emacs memory management 
> to do that.

There's some guidance in the file etc/DEBUG (search for "in GC"), and
you can always ask for more specific help here.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-23 16:32       ` Luc Teirlinck
@ 2004-05-23 17:11         ` Lars Hansen
  2004-05-24  5:30         ` Eli Zaretskii
  1 sibling, 0 replies; 55+ messages in thread
From: Lars Hansen @ 2004-05-23 17:11 UTC (permalink / raw)
  Cc: eliz, emacs-devel

Luc Teirlinck wrote:

>If Lars still has his old GDB session around,
>
I don't.

>I guess that Lars will not be able to produce the crash any more with
>the latest CVS.  If he still can, then he should definitely report
>that, since we believe that these crashes are fixed.
>  
>
I will report it if Emacs still crash.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-21 23:43                     ` Kim F. Storm
  2004-05-23  1:14                       ` Stefan Monnier
@ 2004-05-23 18:28                       ` Richard Stallman
  2004-05-24 11:57                       ` Kim F. Storm
  2004-05-28 21:51                       ` Stefan Monnier
  3 siblings, 0 replies; 55+ messages in thread
From: Richard Stallman @ 2004-05-23 18:28 UTC (permalink / raw)
  Cc: teirllm, monnier, emacs-devel

    With my fix, marker blocks are never freed (they remain
    on the free marker lists) -- this can be improved by keeping
    the "all free" blocks on a separate list and freeing them
    at the end of GC.  I may do that later...

It might be simpler, and not much slower, to scan the list of marker
blocks at the end of GC and check each one to see if it is all free.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-23 16:32       ` Luc Teirlinck
  2004-05-23 17:11         ` Lars Hansen
@ 2004-05-24  5:30         ` Eli Zaretskii
  2004-05-25  3:03           ` Luc Teirlinck
  1 sibling, 1 reply; 55+ messages in thread
From: Eli Zaretskii @ 2004-05-24  5:30 UTC (permalink / raw)
  Cc: larsh, emacs-devel

> Date: Sun, 23 May 2004 11:32:35 -0500 (CDT)
> From: Luc Teirlinck <teirllm@dms.auburn.edu>
> 
>    There's some guidance in the file etc/DEBUG (search for "in GC"), and
>    you can always ask for more specific help here.
> 
> Yes, except that the guidance there seems to assume that the person
> doing the debugging has a pretty good knowledge about Emacs memory
> management.

If you can show where's the text that assumes that, i.e. is not clear
to someone who doesn't have a good knowledge about Emacs memory
management, I'm sure someone will try to improve those parts of the
text.

For the record, that part of etc/DEBUG summarizes what I learned
debugging (with Richard's help) a tricky problem that caused rare
crashes inside GC.  At the time, I knew almost nothing about the GC
machinery.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-21 23:43                     ` Kim F. Storm
  2004-05-23  1:14                       ` Stefan Monnier
  2004-05-23 18:28                       ` Richard Stallman
@ 2004-05-24 11:57                       ` Kim F. Storm
  2004-05-28 21:51                       ` Stefan Monnier
  3 siblings, 0 replies; 55+ messages in thread
From: Kim F. Storm @ 2004-05-24 11:57 UTC (permalink / raw)
  Cc: Stefan Monnier, emacs-devel


RMS wrote:

>     With my fix, marker blocks are never freed (they remain
>     on the free marker lists) -- this can be improved by keeping
>     the "all free" blocks on a separate list and freeing them
>     at the end of GC.  I may do that later...
> 
> It might be simpler, and not much slower, to scan the list of marker
> blocks at the end of GC and check each one to see if it is all free.

Yes, looking for and freeing marker blocks with all Lisp_Misc_Free
objects would work.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-24  5:30         ` Eli Zaretskii
@ 2004-05-25  3:03           ` Luc Teirlinck
  2004-05-25  7:07             ` Eli Zaretskii
  0 siblings, 1 reply; 55+ messages in thread
From: Luc Teirlinck @ 2004-05-25  3:03 UTC (permalink / raw)
  Cc: larsh, emacs-devel

Eli Zaretskii wrote:

   If you can show where's the text that assumes that, i.e. is not clear
   to someone who doesn't have a good knowledge about Emacs memory
   management, I'm sure someone will try to improve those parts of the
   text.

For instance:

    Once you discover the corrupted Lisp object or data structure, it is
    useful to look at it in a fresh Emacs session and compare its contents
    with a session that you are debugging.

Except that to notice that a Lisp object is corrupted you have to
_already_ know how its contents look in a fresh Emacs session.  Many
Elisp programmers do not have a very good knowledge about the very low
level C structure of various Lisp objects.

As an example, when I tried to debug a recent gc crash, the very first
thing I noticed was that the immediate cause of the abort was that the
garbage collector was trying to mark a Lisp_Misc_Free object.  That
_was_ the problem, I did not need to look any further.  Except that at
the time I did not know that this was not supposed to happen.  (I know
now.)  I did not even know what a Lisp_Misc_Free object was.  (I know
now.)  So I went through all of the last_marked array, without any
idea of what to look for, that is: how do you recognize a "corrupted
Lisp object or data structure"?

When you see:

(gdb) p last_marked_index
$1 = 18
(gdb) p last_marked[17]
$2 = 143587538
(gdb) pr
#<EMACS BUG: INVALID DATATYPE (MISC 0x0002) Save your buffers
immediately and please report this bug>

_then_ things are pretty obvious.  But that is not always the case.

Other more concrete ambiguous stuff:

    This is not easy since GC changes the tag bits and relocates strings
    which make it hard to look at Lisp objects with commands such as `pr'.
    It is sometimes necessary to convert Lisp_Object variables into
    pointers to C struct's manually.

It says "It is sometimes necessary...".  When?

When I see:

pr

that is, no output, I can guess it is necessary.

What if I see:

pr
""

I know from experience that I still have to use xstring in that case,
even though the empty string is a perfectly valid return value.  But
xstring often reveals a different real value anyway.  Is this a bug in
pr or is this normal?

What if I see

pr
"dired-find-file"

Can I trust _this_ or should I still use xstring, that is, should the
above have said: "It is always necessary, to be safe,..."?

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-25  3:03           ` Luc Teirlinck
@ 2004-05-25  7:07             ` Eli Zaretskii
  0 siblings, 0 replies; 55+ messages in thread
From: Eli Zaretskii @ 2004-05-25  7:07 UTC (permalink / raw)
  Cc: larsh, emacs-devel

> Date: Mon, 24 May 2004 22:03:34 -0500 (CDT)
> From: Luc Teirlinck <teirllm@dms.auburn.edu>
> 
>     Once you discover the corrupted Lisp object or data structure, it is
>     useful to look at it in a fresh Emacs session and compare its contents
>     with a session that you are debugging.
> 
> Except that to notice that a Lisp object is corrupted you have to
> _already_ know how its contents look in a fresh Emacs session.

No, that's not what DEBUG wants to say.

A corrupted object is _always_ the one that caused the crash.  That's
why we call `abort' at those places: we've discovered something that
cannot happen with valid Lisp objects.

``Discovering the corrupted Lisp object or data structure'' in the
fragment above means that one needs to find the _enclosing_ data
structure of which the corrupted object is a part.  For example, if
the object that was the immideate cause of the crash is a cdr of some
cons cell, one needs to find out what cons cell was that; if it's a
member of a plist, one needs to find out whose property list it was;
etc.  That is when you make use of the last_marked[] array and walk
the marking code backwards guided by its contents.

> Many Elisp programmers do not have a very good knowledge about the
> very low level C structure of various Lisp objects.

Well, that's something that comes with experience.  However, if you
(or someone else) can share some pieces of that knowledge which, if we
add it to DEBUG, could make the learning curve shorter and/or less
steep, we could certainly use that.

> So I went through all of the last_marked array, without any
> idea of what to look for, that is: how do you recognize a "corrupted
> Lisp object or data structure"?

Does what I wrote above help in any way?  It cannot cover every
possible situation, and of course some knowledge about the object that
was the immideate cause of the call to `abort' _is_ needed, but I
don't see how this can be avoided.

> (gdb) p last_marked[17]
> $2 = 143587538
> (gdb) pr
> #<EMACS BUG: INVALID DATATYPE (MISC 0x0002) Save your buffers
> immediately and please report this bug>

Actually, as DEBUG says, it is not recommended to use `pr' in a
crashed session, especially one that crashed during GC.  `pr' invokes
a function inside Emacs code that looks at Lisp data structures; when
those data structures are corrupted, `pr' could well cause another
segfault and ruin your entire debugging session.

>     This is not easy since GC changes the tag bits and relocates strings
>     which make it hard to look at Lisp objects with commands such as `pr'.
>     It is sometimes necessary to convert Lisp_Object variables into
>     pointers to C struct's manually.
> 
> It says "It is sometimes necessary...".  When?

When `pr' and the x* (xstring, xsymbol, etc.) commands fail to print
the Lisp object.

> When I see:
> 
> pr
> 
> that is, no output, I can guess it is necessary.
> 
> What if I see:
> 
> pr
> ""
> 
> I know from experience that I still have to use xstring in that case,
> even though the empty string is a perfectly valid return value.  But
> xstring often reveals a different real value anyway.  Is this a bug in
> pr or is this normal?

Again, don't use `pr' in these cases.  Use xtype and the appropriate
x* command according to the type.

When you use x*, a failure to examine an object generates partial
information and an error message, like this:

  (gdb) xsymbol
  $201 = (struct Lisp_Symbol *) 0xdeadbeef
  Argument to arithmetic operation not a number or boolean.

You then need to examine the Lisp_Symbol structure at the address
shown as a C object:

  (gdb) print *((struct Lisp_Symbol *) 0xdeadbeef)

> What if I see
> 
> pr
> "dired-find-file"
> 
> Can I trust _this_ or should I still use xstring, that is, should the
> above have said: "It is always necessary, to be safe,..."?

In a crashed session, I personally never trust `pr', and only use it
as a secondary means, to view very complex data structures.  The
xsymbol command and its ilk are your friends.

I'll try to add this info to DEBUG when I have time (unless someone
else beats me to that).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-21 23:43                     ` Kim F. Storm
                                         ` (2 preceding siblings ...)
  2004-05-24 11:57                       ` Kim F. Storm
@ 2004-05-28 21:51                       ` Stefan Monnier
  2004-05-28 23:40                         ` Kim F. Storm
  3 siblings, 1 reply; 55+ messages in thread
From: Stefan Monnier @ 2004-05-28 21:51 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 3444 bytes --]

> I have now installed some further changes to the GC code to
> fix problems with invalid Lisp_Misc objects.

> The main problem was that marker blocks could be freed while
> still being pointed to by buffer undo lists, but I made other
> minor changes to make the code more robust.

I still don't understand how this could happen.
Your fix looks rather intricate and introduces lots of fragile and
undocumented invariants, breaking some of the more traditional invariants
that the old code enforced (or at least assumed).

While I still don't know which scenario you were trying to fix, my
understanding is that it had to do with live objects pointing to
unmarked objects.

How about the patch below *instead* of the one you installed?

I.e. instead of sweeping first and then looking for misc_free objects in
undo_list, thus introducing some strange intermediate state where we
actually have misc_free objects inside reachable data,
I suggest we simply don't do the half-assed marking of undo_list in
mark_buffer and instead we do it just after we removed the unwanted entries
from the list.

I think the change below is a good change in any case (it makes the code
simpler and more robust), but my question is: do you think that it fixes
the problem you were trying to fix?


        Stefan


PS: the patch below is against an older version of alloc.c: it's mostly
    meant to be read rather than applied.  For people who want to try the
    patch in vivo, see the attached patch which is against the latest
    alloc.c.


Index: src/alloc.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/alloc.c,v
retrieving revision 1.336
diff -u -r1.336 alloc.c
--- src/alloc.c	18 May 2004 16:22:46 -0000	1.336
+++ src/alloc.c	28 May 2004 21:34:26 -0000
@@ -4476,7 +4454,9 @@
   }
 #endif
 
-  /* Look thru every buffer's undo list
+  /* Everything is now marked, except for the things that require special
+     finalization, i.e. the undo_list.
+     Look thru every buffer's undo list
      for elements that update markers that were not marked,
      and delete them.  */
   {
@@ -4514,6 +4494,9 @@
 		  }
 	      }
 	  }
+	/* Now that we have stripped the elements that need not be in the
+	   undo_list any more, we can finally mark the list.  */
+	mark_object (nextb->undo_list);
 
 	nextb = nextb->next;
       }
@@ -5095,41 +5070,9 @@
 
   MARK_INTERVAL_TREE (BUF_INTERVALS (buffer));
 
-  if (CONSP (buffer->undo_list))
-    {
-      Lisp_Object tail;
-      tail = buffer->undo_list;
-
-      /* We mark the undo list specially because
-	 its pointers to markers should be weak.  */
-
-      while (CONSP (tail))
-	{
-	  register struct Lisp_Cons *ptr = XCONS (tail);
-
-	  if (CONS_MARKED_P (ptr))
-	    break;
-	  CONS_MARK (ptr);
-	  if (GC_CONSP (ptr->car)
-	      && !CONS_MARKED_P (XCONS (ptr->car))
-	      && GC_MARKERP (XCAR (ptr->car)))
-	    {
-	      CONS_MARK (XCONS (ptr->car));
-	      mark_object (XCDR (ptr->car));
-	    }
-	  else
-	    mark_object (ptr->car);
-
-	  if (CONSP (ptr->cdr))
-	    tail = ptr->cdr;
-	  else
-	    break;
-	}
-
-      mark_object (XCDR (tail));
-    }
-  else
-    mark_object (buffer->undo_list);
+  /* For now, we just don't mark the undo_list.  It's done later in
+     a special way just before the sweep phase, and after stripping
+     some of its elements that are not needed any more.  */
 
   if (buffer->overlays_before)
     {


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: alloc.patch --]
[-- Type: text/x-patch, Size: 5557 bytes --]

--- alloc.c	28 mai 2004 16:57:39 -0400	1.340
+++ alloc.c	28 mai 2004 17:33:20 -0400	
@@ -2866,10 +2866,6 @@
 
 union Lisp_Misc *marker_free_list;
 
-/* Marker blocks which should be freed at end of GC.  */
-
-struct marker_block *marker_blocks_pending_free;
-
 /* Total number of marker blocks now in use.  */
 
 int n_marker_blocks;
@@ -2880,7 +2876,6 @@
   marker_block = NULL;
   marker_block_index = MARKER_BLOCK_SIZE;
   marker_free_list = 0;
-  marker_blocks_pending_free = 0;
   n_marker_blocks = 0;
 }
 
@@ -4459,42 +4454,36 @@
   }
 #endif
 
-  gc_sweep ();
-
-  /* Look thru every buffer's undo list for elements that used to
-     contain update markers that were changed to Lisp_Misc_Free
-     objects and delete them.  This may leave a few cons cells
-     unchained, but we will get those on the next sweep.  */
+  /* Everything is now marked, except for the things that require special
+     finalization, i.e. the undo_list.
+     Look thru every buffer's undo list
+     for elements that update markers that were not marked,
+     and delete them.  */
   {
     register struct buffer *nextb = all_buffers;
 
     while (nextb)
       {
 	/* If a buffer's undo list is Qt, that means that undo is
-	   turned off in that buffer.  */
+	   turned off in that buffer.  Calling truncate_undo_list on
+	   Qt tends to return NULL, which effectively turns undo back on.
+	   So don't call truncate_undo_list if undo_list is Qt.  */
 	if (! EQ (nextb->undo_list, Qt))
 	  {
-	    Lisp_Object tail, prev, elt, car;
+	    Lisp_Object tail, prev;
 	    tail = nextb->undo_list;
 	    prev = Qnil;
 	    while (CONSP (tail))
 	      {
-		if ((elt = XCAR (tail), GC_CONSP (elt))
-		    && (car = XCAR (elt), GC_MISCP (car))
-		    && XMISCTYPE (car) == Lisp_Misc_Free)
-		  {
-		    Lisp_Object cdr = XCDR (tail);
-		    /* Do not use free_cons here, as we don't know if
-		       anybody else has a pointer to these conses.  */
-		    XSETCAR (elt, Qnil);
-		    XSETCDR (elt, Qnil);
-		    XSETCAR (tail, Qnil);
-		    XSETCDR (tail, Qnil);
+		if (GC_CONSP (XCAR (tail))
+		    && GC_MARKERP (XCAR (XCAR (tail)))
+		    && !XMARKER (XCAR (XCAR (tail)))->gcmarkbit)
+		  {
 		    if (NILP (prev))
-		      nextb->undo_list = tail = cdr;
+		      nextb->undo_list = tail = XCDR (tail);
 		    else
 		      {
-			tail = cdr;
+			tail = XCDR (tail);
 			XSETCDR (prev, tail);
 		      }
 		  }
@@ -4505,22 +4494,15 @@
 		  }
 	      }
 	  }
+	/* Now that we have stripped the elements that need not be in the
+	   undo_list any more, we can finally mark the list.  */
+	mark_object (nextb->undo_list);
 
 	nextb = nextb->next;
       }
   }
 
-  /* Undo lists have been cleaned up, so we can free marker blocks now.  */
-
-  {
-    struct marker_block *mblk;
-
-    while ((mblk = marker_blocks_pending_free) != 0)
-      {
-	marker_blocks_pending_free = mblk->next;
-	lisp_free (mblk);
-      }
-  }
+  gc_sweep ();
 
   /* Clear the mark bits that we set in certain root slots.  */
 
@@ -5088,41 +5070,9 @@
 
   MARK_INTERVAL_TREE (BUF_INTERVALS (buffer));
 
-  if (CONSP (buffer->undo_list))
-    {
-      Lisp_Object tail;
-      tail = buffer->undo_list;
-
-      /* We mark the undo list specially because
-	 its pointers to markers should be weak.  */
-
-      while (CONSP (tail))
-	{
-	  register struct Lisp_Cons *ptr = XCONS (tail);
-
-	  if (CONS_MARKED_P (ptr))
-	    break;
-	  CONS_MARK (ptr);
-	  if (GC_CONSP (ptr->car)
-	      && !CONS_MARKED_P (XCONS (ptr->car))
-	      && GC_MARKERP (XCAR (ptr->car)))
-	    {
-	      CONS_MARK (XCONS (ptr->car));
-	      mark_object (XCDR (ptr->car));
-	    }
-	  else
-	    mark_object (ptr->car);
-
-	  if (CONSP (ptr->cdr))
-	    tail = ptr->cdr;
-	  else
-	    break;
-	}
-
-      mark_object (XCDR (tail));
-    }
-  else
-    mark_object (buffer->undo_list);
+  /* For now, we just don't mark the undo_list.  It's done later in
+     a special way just before the sweep phase, and after stripping
+     some of its elements that are not needed any more.  */
 
   if (buffer->overlays_before)
     {
@@ -5202,6 +5152,16 @@
 static void
 gc_sweep ()
 {
+  /* Remove or mark entries in weak hash tables.
+     This must be done before any object is unmarked.  */
+  sweep_weak_hash_tables ();
+
+  sweep_strings ();
+#ifdef GC_CHECK_STRING_BYTES
+  if (!noninteractive)
+    check_string_bytes (1);
+#endif
+
   /* Put all unmarked conses on free list */
   {
     register struct cons_block *cblk;
@@ -5252,16 +5212,6 @@
     total_free_conses = num_free;
   }
 
-  /* Remove or mark entries in weak hash tables.
-     This must be done before any object is unmarked.  */
-  sweep_weak_hash_tables ();
-
-  sweep_strings ();
-#ifdef GC_CHECK_STRING_BYTES
-  if (!noninteractive)
-    check_string_bytes (1);
-#endif
-
   /* Put all unmarked floats on free list */
   {
     register struct float_block *fblk;
@@ -5430,7 +5380,6 @@
     register int num_free = 0, num_used = 0;
 
     marker_free_list = 0;
-    marker_blocks_pending_free = 0;
 
     for (mblk = marker_block; mblk; mblk = *mprev)
       {
@@ -5466,13 +5415,8 @@
 	    *mprev = mblk->next;
 	    /* Unhook from the free list.  */
 	    marker_free_list = mblk->markers[0].u_free.chain;
+	    lisp_free (mblk);
 	    n_marker_blocks--;
-
-	    /* It is not safe to free the marker block at this stage,
-	       since there may still be pointers to these markers from
-	       a buffer's undo list.  KFS 2004-05-25.  */
-	    mblk->next = marker_blocks_pending_free;
-	    marker_blocks_pending_free = mblk;
 	  }
 	else
 	  {

[-- Attachment #3: Type: text/plain, Size: 142 bytes --]

_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-devel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-28 21:51                       ` Stefan Monnier
@ 2004-05-28 23:40                         ` Kim F. Storm
  2004-05-28 23:49                           ` Stefan Monnier
  0 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-28 23:40 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > I have now installed some further changes to the GC code to
> > fix problems with invalid Lisp_Misc objects.
> 
> > The main problem was that marker blocks could be freed while
> > still being pointed to by buffer undo lists, but I made other
> > minor changes to make the code more robust.
> 
> I still don't understand how this could happen.

In retrospect, the problem was/is caused by the method used to mark
the undo-list (the step that your patch removes).

The problem with the current undo-list marking is that we still mark
the cons cells which points to the markers that we might later free
and remove from the undo-list.

This meant that we would not free those cons cells which thus ends up
pointing to invalid marker objects.

Now, when we on next gc calls mark_stack, and thus mark_maybe_object /
mark_maybe_pointer, we may accidentally have a pointer address that
leads to a (valid) cons, which is checked with mark_maybe_object...
that's ok, you see that it points to a valid cons block, and it is not
marked Vdead.

So you do mark_object on that cons, meaning that you unconditionally
does mark_object on the car and cdr.

The cdr of that cons is another cons, so you do mark_object (unconditionally)
on the car of that cons -- but it happened to be a cons cell which used to
be on an undo-list, so its car points to a lisp_free_marker -- or worse
into a freed marker block.


My changes focused on fixing the effects of the erroneous marking of
those cons cells, so part of my fix was to expclicitly clear (Qnil)
the car and cdr of the two cons cells that are removed from the
undo-list; that was necessary as the removal was done AFTER gc_sweep.


Now, looking at your alternative patch, it elegantly eliminates the
bogus marking of the cons cells that are removed from the undo-list,
so I think your patch will also fix the problem.

Your change is indeed cleaner than my fixes, so let's install your
patch instead.


BTW, I think that clearing the cons cells before strings is still a
proper sequence in gc_sweep, as the old sequence (that your patch
reinstalls)  implies that after cleaning the strings, there are
cons cells (unmarked though) that points into freed memory.

I.e. I suggest that the following part of the patch is omitted:

@@ -5202,6 +5152,16 @@
 static void
 gc_sweep ()
 {
+  /* Remove or mark entries in weak hash tables.
+     This must be done before any object is unmarked.  */
+  sweep_weak_hash_tables ();
+
+  sweep_strings ();
+#ifdef GC_CHECK_STRING_BYTES
+  if (!noninteractive)
+    check_string_bytes (1);
+#endif
+
   /* Put all unmarked conses on free list */
   {
     register struct cons_block *cblk;
@@ -5252,16 +5212,6 @@
     total_free_conses = num_free;
   }
 
-  /* Remove or mark entries in weak hash tables.
-     This must be done before any object is unmarked.  */
-  sweep_weak_hash_tables ();
-
-  sweep_strings ();
-#ifdef GC_CHECK_STRING_BYTES
-  if (!noninteractive)
-    check_string_bytes (1);
-#endif
-
   /* Put all unmarked floats on free list */
   {
     register struct float_block *fblk;

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-28 23:40                         ` Kim F. Storm
@ 2004-05-28 23:49                           ` Stefan Monnier
  2004-05-29 23:15                             ` Kim F. Storm
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Monnier @ 2004-05-28 23:49 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

> Now, when we on next gc calls mark_stack, and thus mark_maybe_object /
> mark_maybe_pointer, we may accidentally have a pointer address that
> leads to a (valid) cons, which is checked with mark_maybe_object...
> that's ok, you see that it points to a valid cons block, and it is not
> marked Vdead.

Oh, yes, I understand now, thank you.

> BTW, I think that clearing the cons cells before strings is still a
> proper sequence in gc_sweep, as the old sequence (that your patch
> reinstalls)  implies that after cleaning the strings, there are
> cons cells (unmarked though) that points into freed memory.

I think this is a red-herring: doing it your way leads to temporarily having
non-marked strings whose text-properties might point to deallocated
cons-cells.

The sweep step simply has to be atomic, just like most of the GC itself
(afterall, that's why the mark&sweep algorithm is in the category called
"stop-the-world").


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-28 23:49                           ` Stefan Monnier
@ 2004-05-29 23:15                             ` Kim F. Storm
  2004-05-30 20:44                               ` Stefan Monnier
  0 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-29 23:15 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> I think this is a red-herring: doing it your way leads to temporarily having
> non-marked strings whose text-properties might point to deallocated
> cons-cells.

Well, in the tests I performed trying to understand this bug, I
frequently ran into cons cells pointing to de-allocated strings before
I changed the sequence -- but after changing the sequence, I didn't
have a single occurrence of a string pointing to a de-allocated cons
cell.  I'm not saying that it will not happen, but in practice it is
less likely to happen.


> 
> The sweep step simply has to be atomic, just like most of the GC itself
> (afterall, that's why the mark&sweep algorithm is in the category called
> "stop-the-world").

True -- but if you try to debug GC errors, I prefer the code to be as
safe as possible.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-29 23:15                             ` Kim F. Storm
@ 2004-05-30 20:44                               ` Stefan Monnier
  2004-05-31 20:21                                 ` Kim F. Storm
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Monnier @ 2004-05-30 20:44 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

>> The sweep step simply has to be atomic, just like most of the GC itself
>> (afterall, that's why the mark&sweep algorithm is in the category called
>> "stop-the-world").

> True -- but if you try to debug GC errors, I prefer the code to be as
> safe as possible.

If you want it for debugging purposes, feel free to switch the sweeping
of strings w.r.t the sweeping of cons cells.  I've never had to debug in
there, so I wouldn't have any preference either way.  As I said, it
shouldn't make any difference w.r.t to correctness.
The two "reasons" why I undid this part of your change were:
- you also moved the sweeping of cons cells to before the sweeping of weak
  hash tables, which sounds seriously wrong.
- I started from a full undo of your patch.


        Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-30 20:44                               ` Stefan Monnier
@ 2004-05-31 20:21                                 ` Kim F. Storm
  2004-06-08 20:03                                   ` Lars Hansen
  0 siblings, 1 reply; 55+ messages in thread
From: Kim F. Storm @ 2004-05-31 20:21 UTC (permalink / raw)
  Cc: Luc Teirlinck, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> >> The sweep step simply has to be atomic, just like most of the GC itself
> >> (afterall, that's why the mark&sweep algorithm is in the category called
> >> "stop-the-world").
> 
> > True -- but if you try to debug GC errors, I prefer the code to be as
> > safe as possible.
> 
> If you want it for debugging purposes, feel free to switch the sweeping
> of strings w.r.t the sweeping of cons cells.  I've never had to debug in
> there, so I wouldn't have any preference either way.  As I said, it
> shouldn't make any difference w.r.t to correctness.
> The two "reasons" why I undid this part of your change were:
> - you also moved the sweeping of cons cells to before the sweeping of weak
>   hash tables, which sounds seriously wrong.
> - I started from a full undo of your patch.
> 

Ok, I'll leave things as they are -- and remember the issue if I have
to look into GC again later.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Fix to long-standing crashes in GC
  2004-05-31 20:21                                 ` Kim F. Storm
@ 2004-06-08 20:03                                   ` Lars Hansen
  0 siblings, 0 replies; 55+ messages in thread
From: Lars Hansen @ 2004-06-08 20:03 UTC (permalink / raw)


I haven't been able to crash Emacs for two weeks now. Cool!
Thanks a lot!

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2004-06-08 20:03 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-05-13 18:19 Fix to long-standing crashes in GC Lars Hansen
2004-05-13 19:09 ` Luc Teirlinck
2004-05-13 19:29   ` Luc Teirlinck
2004-05-13 19:30   ` Lars Hansen
2004-05-13 19:19 ` Stefan Monnier
2004-05-13 22:16   ` Luc Teirlinck
2004-05-13 23:04     ` Stefan Monnier
2004-05-14 11:42     ` Kai Grossjohann
2004-05-14 14:53       ` Luc Teirlinck
2004-05-14 20:48         ` Kai Grossjohann
2004-05-16  9:27         ` Kai Grossjohann
2004-05-14 18:39       ` Luc Teirlinck
2004-05-14 20:54         ` Kim F. Storm
2004-05-14 21:02 ` Richard Stallman
2004-05-22 18:09   ` Lars Hansen
2004-05-23 16:33     ` Eli Zaretskii
2004-05-23 16:32       ` Luc Teirlinck
2004-05-23 17:11         ` Lars Hansen
2004-05-24  5:30         ` Eli Zaretskii
2004-05-25  3:03           ` Luc Teirlinck
2004-05-25  7:07             ` Eli Zaretskii
2004-05-15  4:39 ` Robert Marshall
2004-05-17 14:39   ` Kim F. Storm
2004-05-17 17:42     ` Robert Marshall
2004-05-17 14:43 ` Kim F. Storm
2004-05-18  0:13   ` Luc Teirlinck
2004-05-19  1:26     ` Richard Stallman
2004-05-19 12:11       ` Kim F. Storm
2004-05-19 19:32         ` Stefan Monnier
2004-05-19 22:33           ` Kim F. Storm
2004-05-20 13:17           ` Richard Stallman
2004-05-19 12:52       ` Kim F. Storm
2004-05-19 16:48         ` Stefan Monnier
2004-05-19 22:04           ` Kim F. Storm
2004-05-19 22:25             ` Stefan Monnier
2004-05-19 22:37               ` Kim F. Storm
2004-05-19 22:50                 ` Stefan Monnier
2004-05-20  0:44                   ` Kim F. Storm
2004-05-21 23:43                     ` Kim F. Storm
2004-05-23  1:14                       ` Stefan Monnier
2004-05-23 18:28                       ` Richard Stallman
2004-05-24 11:57                       ` Kim F. Storm
2004-05-28 21:51                       ` Stefan Monnier
2004-05-28 23:40                         ` Kim F. Storm
2004-05-28 23:49                           ` Stefan Monnier
2004-05-29 23:15                             ` Kim F. Storm
2004-05-30 20:44                               ` Stefan Monnier
2004-05-31 20:21                                 ` Kim F. Storm
2004-06-08 20:03                                   ` Lars Hansen
2004-05-20  7:08         ` Richard Stallman
2004-05-21 22:58           ` Stefan Monnier
  -- strict thread matches above, loose matches on Subject: below --
2004-05-13 23:34 Robert Anderson
2004-05-12 13:19 Kim F. Storm
2004-05-13 13:06 ` Kenichi Handa
2004-05-13 15:45 ` Richard Stallman

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.