On Sun, Sep 30, 2018 at 10:14 AM Eli Zaretskii <eliz@gnu.org> wrote:
> Then please step with a debugger through the code starting from the
> call to getrlimit, and please show the values of related variables,
> such as newlim, all the way until the call to setrlimit and the
> computed value of emacs_re_safe_alloca.  Please do that once with the
> current code and then once again with the code before the offending
> commit.  I'd like to see the differences, because I meanwhile see
> nothing wrong with using rlim_t here.

One change from my past reports: after compiling Emacs with -g, I have
now managed to reproduce the crash under lldb, including attaching to
the forked process that eats CPU after the crash. A backtrace from
that process is attached.

Here are my results from stepping through the code. Note this all runs
at Emacs startup, long before anything forks.

The highlights (as far as I noticed) are:
- emacs_re_max_failures and the older re_max_failures are not
initialized at this point
- in the working branch, newlim is reset to rlim.rlim_max; in the
broken branch, it is not
- in the working branch, setrlimit does not get called; in the broken
branch, it does

I'm guessing the problem is with the uninitialized values for
*_re_max_failures and the resulting values being assigned to lim and
newlim. It seems to only work on the working branch by accident
because, for whatever reason, newlim always gets reset to
rlim.rlim_max and setrlimit doesn't get called.

-----
master branch (commit 3eedabaef37e), use of rlim_t:

- immediately after getrlimit call, lim is assigned, value: 0
- lim is then assigned rlim.rlim_cur, value: 67104768
- min_ratio is initialized, value: 160
- ratio is initialized, value: 213
- try_to_grow_stack ends up assigned, value: true

The code proceeds into the try_to_grow_stack condition:

- newlim is assigned, value: 10020000
- BUT: emacs_re_max_failures, which is defined at that point and used
to calculate newlim, has a very large size_t value: 6500256977556508423
- it looks like newlim has overflowed here to fit unsigned long long
- pagesize is assigned, value 4096
- newlim is incremented by pagesize - 1, value: 10024095
- condition checking if rlim.rlim_max < newlim; rlim.rlim_max is
67104768 so the condition evaluates to false (emacs.c:880)
- condition checking if pagesize <= (newlim - lim) evaluates to true:
this happens because newlim < lim, so the unsigned subtraction wraps
around (newlim - lim returns an unsigned long long with value
18446744073652469760); consequently, setrlimit is called and succeeds
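
To illustrate the wraparound in the last step, here is a small
standalone sketch. It is not the emacs.c code: the function names are
made up, and I am assuming the limit type behaves like an unsigned
64-bit integer on this machine, as the lldb output suggests. The value
10022912 is 10024095 rounded down to a page boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the check with an unsigned limit type:
   when newlim < lim the subtraction wraps around to a huge value,
   so the comparison succeeds.  */
bool
check_unsigned (uint64_t newlim, uint64_t lim, uint64_t pagesize)
{
  assert (pagesize > 0);
  return pagesize <= newlim - lim;
}

/* The same check with a signed type: the difference is negative,
   so the comparison fails.  */
bool
check_signed (int64_t newlim, int64_t lim, int64_t pagesize)
{
  assert (pagesize > 0);
  return pagesize <= newlim - lim;
}
```

With the values from the run above (newlim 10022912, lim 67104768,
pagesize 4096), check_unsigned returns true and check_signed returns
false.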

The try_to_grow_stack condition ends.

- emacs_re_safe_alloca is assigned, value: 4435280473597425792. That
fits in a 64-bit ptrdiff_t, but it is far larger than the stack limit
above, so I'm not sure it's a reasonable value.
-----

-----
last working revision (commit 6cdd1c333034b), use of long:

Please note that this code predates the introduction of emacs_re_safe_alloca.

- immediately after getrlimit call, lim is assigned, value: 0
- lim is then assigned rlim.rlim_cur, value: 67104768
- ratio is then initialized: 160
- and subsequently incremented, value: 213
- try_to_grow_stack ends up assigned, value: true

The code proceeds into the try_to_grow_stack condition:

- newlim is assigned, value: 67104578
- BUT: re_max_failures defined at that point and used to calculate
newlim has a very large size_t value: 16107485546189635934
- newlim has obviously overflowed here to fit a signed long
- pagesize is assigned, value 4096
- newlim is incremented by pagesize - 1, value: 67108673
- condition checking if rlim.rlim_max < newlim; rlim.rlim_max is
67104768 so the condition evaluates to true and newlim is set to
rlim.rlim_max (emacs.c:862)
- decrementing newlim by newlim % pagesize is a no-op
- condition checking if pagesize <= (newlim - lim) evaluates to false,
skipping the setrlimit call
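
The adjustment steps above can be sketched as a small standalone
function. This is only my reading of what the debugger shows, not the
actual emacs.c code; adjust_newlim is a made-up name, and the real
code may contain additional checks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the observed steps: bump NEWLIM by
   PAGESIZE - 1, clamp it to RLIM_MAX, then round it down to a page
   boundary.  */
int64_t
adjust_newlim (int64_t newlim, int64_t rlim_max, int64_t pagesize)
{
  assert (pagesize > 0);
  newlim += pagesize - 1;
  if (rlim_max < newlim)
    newlim = rlim_max;
  newlim -= newlim % pagesize;
  return newlim;
}
```

With the values from this run (newlim 67104578, rlim_max 67104768,
pagesize 4096) the result is 67104768, which equals lim, so the
difference is 0 and setrlimit is skipped. With the initial newlim from
the master run (10020000) the result is 10022912, which is below lim.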
-----

I am attaching lldb session transcripts for both runs in case you want
to look more closely at what's going on.