From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: newline cache
Date: Wed, 23 Apr 2014 18:23:42 +0300
Message-ID: <837g6g9iip.fsf@gnu.org>
References: <E1WcH4d-0004ok-KH@fencepost.gnu.org> <837g6id3mi.fsf@gnu.org>
	<E1WcTOw-0004HM-Fa@fencepost.gnu.org> <838uqxb6o6.fsf@gnu.org>
	<E1WcpmH-0000VT-KK@fencepost.gnu.org>
Reply-To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
To: rms@gnu.org
In-reply-to: <E1WcpmH-0000VT-KK@fencepost.gnu.org>
Xref: news.gmane.org gmane.emacs.devel:171609
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/171609>

> Date: Wed, 23 Apr 2014 01:31:25 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: emacs-devel@gnu.org
> 
>     The function I added returns an array with 2 sub-arrays, one with
>     newline positions according to the cache, the other with newline
>     positions according to the actual buffer contents.
> 
> That may take a very long time for my RMAIL buffer, which is 4 meg.  I
> don't think I could tolerate having that run after each Rmail command.

I suggest trying it; you might be surprised.  I tried it on a 7MB mbox
file and didn't see any significant slowdown.  The reason is simple:
the mbox buffer is almost always narrowed, and find_newline, which is
the workhorse of the function I wrote (and also the main suspect),
looks only within the restriction.
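
To illustrate why narrowing bounds the cost, here is a minimal
standalone sketch.  It is not Emacs code: struct toy_buffer and
count_newlines_in_restriction are names made up for this example, and
the offsets are 0-based byte offsets rather than real buffer
positions.  It only shows that a scan limited to the accessible
portion does work proportional to that portion, not to the whole
buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a buffer with a restriction: TEXT holds
   the whole buffer, and [BEGV, ZV) is the accessible (narrowed) part,
   as 0-based byte offsets.  Not the real Emacs data structures.  */
struct toy_buffer
{
  const char *text;
  size_t size;
  size_t begv, zv;
};

/* Count newlines in the accessible portion only.  The cost is
   proportional to ZV - BEGV, not to the total buffer size.  */
static size_t
count_newlines_in_restriction (const struct toy_buffer *b)
{
  size_t count = 0;
  const char *p = b->text + b->begv;
  const char *end = b->text + b->zv;

  while (p < end)
    {
      const char *nl = memchr (p, '\n', (size_t) (end - p));

      if (!nl)
        break;
      count++;
      p = nl + 1;
    }
  return count;
}
```

With a 4MB buffer narrowed to one message of a few KB, such a loop
examines only those few KB.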

In any case, if you decide to use the debugging code, please use the
patch below, which fixes a stupid thinko in the previous version.

> To make this fast enough for me to use it to localize the bug, it
> needs to operate only on a specified part of the buffer.

Narrowing already does that.  Anyway, with my current level of
understanding of this bug, I cannot predict in advance in which
portions of the buffer the corruption will happen.

FWIW, I played with this post-command-hook in a large mbox buffer, and
haven't been able to reproduce any problems yet.  So something else
might be at work here.
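
For reference, the comparison that any consumer of the two sub-arrays
has to make boils down to finding the first slot where they disagree.
A minimal sketch in plain C, assuming the positions have already been
copied into ordinary arrays (first_newline_mismatch is a made-up name,
not part of the patch):

```c
#include <stddef.h>

/* Return the index of the first slot where the two position lists
   disagree, or -1 if they are identical.  A length difference counts
   as a mismatch at the index where the shorter list ends.  */
static ptrdiff_t
first_newline_mismatch (const ptrdiff_t *from_cache, size_t n_cache,
                        const ptrdiff_t *from_scan, size_t n_scan)
{
  size_t n = n_cache < n_scan ? n_cache : n_scan;
  size_t i;

  for (i = 0; i < n; i++)
    if (from_cache[i] != from_scan[i])
      return (ptrdiff_t) i;

  return n_cache == n_scan ? -1 : (ptrdiff_t) n;
}
```

The position stored at the returned index tells where in the buffer
the cache first diverges from the text, which is where I would start
looking.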

Here's an updated patch:

=== modified file 'src/search.c'
--- src/search.c	2014-03-16 16:28:34 +0000
+++ src/search.c	2014-04-23 15:21:25 +0000
@@ -3108,6 +3108,187 @@ DEFUN ("regexp-quote", Fregexp_quote, Sr
 				out - temp,
 				STRING_MULTIBYTE (string));
 }
+
+/* Like find_newline, but doesn't use the cache, and only searches forward.  */
+static ptrdiff_t
+find_newline1 (ptrdiff_t start, ptrdiff_t start_byte, ptrdiff_t end,
+	       ptrdiff_t end_byte, ptrdiff_t count, ptrdiff_t *shortage,
+	       ptrdiff_t *bytepos, bool allow_quit)
+{
+  if (count > 0)
+    {
+      if (!end)
+	end = ZV, end_byte = ZV_BYTE;
+    }
+  else
+    {
+      if (!end)
+	end = BEGV, end_byte = BEGV_BYTE;
+    }
+  if (end_byte == -1)
+    end_byte = CHAR_TO_BYTE (end);
+
+  if (shortage != 0)
+    *shortage = 0;
+
+  immediate_quit = allow_quit;
+
+  if (count > 0)
+    while (start != end)
+      {
+        /* Our innermost scanning loop is very simple; it doesn't know
+           about gaps, buffer ends, or the newline cache.  ceiling is
+           the position of the last character before the next such
+           obstacle --- the last character the dumb search loop should
+           examine.  */
+	ptrdiff_t tem, ceiling_byte = end_byte - 1;
+
+	if (start_byte == -1)
+	  start_byte = CHAR_TO_BYTE (start);
+
+        /* The dumb loop can only scan text stored in contiguous
+           bytes. BUFFER_CEILING_OF returns the last character
+           position that is contiguous, so the ceiling is the
+           position after that.  */
+	tem = BUFFER_CEILING_OF (start_byte);
+	ceiling_byte = min (tem, ceiling_byte);
+
+        {
+          /* The termination address of the dumb loop.  */
+	  unsigned char *lim_addr = BYTE_POS_ADDR (ceiling_byte) + 1;
+	  ptrdiff_t lim_byte = ceiling_byte + 1;
+
+	  /* Nonpositive offsets (relative to LIM_ADDR and LIM_BYTE)
+	     of the base, the cursor, and the next line.  */
+	  ptrdiff_t base = start_byte - lim_byte;
+	  ptrdiff_t cursor, next;
+
+	  for (cursor = base; cursor < 0; cursor = next)
+	    {
+              /* The dumb loop.  */
+	      unsigned char *nl = memchr (lim_addr + cursor, '\n', - cursor);
+	      next = nl ? nl - lim_addr : 0;
+
+              if (! nl)
+		break;
+	      next++;
+
+	      if (--count == 0)
+		{
+		  immediate_quit = 0;
+		  if (bytepos)
+		    *bytepos = lim_byte + next;
+		  return BYTE_TO_CHAR (lim_byte + next);
+		}
+            }
+
+	  start_byte = lim_byte;
+	  start = BYTE_TO_CHAR (start_byte);
+        }
+      }
+
+  immediate_quit = 0;
+  if (shortage)
+    *shortage = count;
+  if (bytepos)
+    {
+      *bytepos = start_byte == -1 ? CHAR_TO_BYTE (start) : start_byte;
+      eassert (*bytepos == CHAR_TO_BYTE (start));
+    }
+  return start;
+}
+
+DEFUN ("newline-cache-check", Fnewline_cache_check, Snewline_cache_check,
+       0, 1, 0,
+       doc: /* Check the newline cache of BUFFER against buffer contents.
+
+BUFFER defaults to the current buffer.
+
+Value is an array of 2 sub-arrays of buffer positions for newlines,
+the first based on the cache, the second based on actually scanning
+the buffer.  If the buffer doesn't have a cache, the value is nil.  */)
+  (Lisp_Object buffer)
+{
+  struct buffer *buf, *old = NULL;
+  ptrdiff_t shortage, nl_count_cache, nl_count_buf;
+  Lisp_Object cache_newlines, buf_newlines, val;
+  ptrdiff_t from, found, i;
+
+  if (NILP (buffer))
+    buf = current_buffer;
+  else
+    {
+      CHECK_BUFFER (buffer);
+      buf = XBUFFER (buffer);
+      old = current_buffer;
+    }
+  if (buf->base_buffer)
+    buf = buf->base_buffer;
+
+  /* If the buffer doesn't have a newline cache, return nil.  */
+  if (NILP (BVAR (buf, cache_long_scans))
+      || buf->newline_cache == NULL)
+    return Qnil;
+
+  /* find_newline can only work on the current buffer.  */
+  if (old != NULL)
+    set_buffer_internal_1 (buf);
+
+  /* How many newlines are there according to the cache?  */
+  find_newline (BEGV, BEGV_BYTE, ZV, ZV_BYTE,
+		TYPE_MAXIMUM (ptrdiff_t), &shortage, NULL, true);
+  nl_count_cache = TYPE_MAXIMUM (ptrdiff_t) - shortage;
+
+  /* Create vector and populate it.  */
+  cache_newlines = make_uninit_vector (nl_count_cache);
+
+  if (nl_count_cache)
+    {
+      for (from = BEGV, found = from, i = 0; from < ZV; from = found, i++)
+	{
+	  ptrdiff_t from_byte = CHAR_TO_BYTE (from);
+
+	  found = find_newline (from, from_byte, 0, -1, 1, &shortage,
+				NULL, true);
+	  if (shortage != 0 || i >= nl_count_cache)
+	    break;
+	  ASET (cache_newlines, i, make_number (found - 1));
+	}
+      /* Fill the rest of slots with an invalid position.  */
+      for ( ; i < nl_count_cache; i++)
+	ASET (cache_newlines, i, make_number (-1));
+    }
+
+  /* Now do the same, but without using the cache.  */
+  find_newline1 (BEGV, BEGV_BYTE, ZV, ZV_BYTE,
+		 TYPE_MAXIMUM (ptrdiff_t), &shortage, NULL, true);
+  nl_count_buf = TYPE_MAXIMUM (ptrdiff_t) - shortage;
+  buf_newlines = make_uninit_vector (nl_count_buf);
+  if (nl_count_buf)
+    {
+      for (from = BEGV, found = from, i = 0; from < ZV; from = found, i++)
+	{
+	  ptrdiff_t from_byte = CHAR_TO_BYTE (from);
+
+	  found = find_newline1 (from, from_byte, 0, -1, 1, &shortage,
+				 NULL, true);
+	  if (shortage != 0 || i >= nl_count_buf)
+	    break;
+	  ASET (buf_newlines, i, make_number (found - 1));
+	}
+      for ( ; i < nl_count_buf; i++)
+	ASET (buf_newlines, i, make_number (-1));
+    }
+
+  /* Construct the value and return it.  */
+  val = make_uninit_vector (2);
+  ASET (val, 0, cache_newlines);
+  ASET (val, 1, buf_newlines);
+
+  if (old != NULL)
+    set_buffer_internal_1 (old);
+  return val;
+}
 
 void
 syms_of_search (void)
@@ -3180,4 +3361,5 @@ is to bind it with `let' around a small 
   defsubr (&Smatch_data);
   defsubr (&Sset_match_data);
   defsubr (&Sregexp_quote);
+  defsubr (&Snewline_cache_check);
 }
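
In case the "ceiling" logic in find_newline1 is not obvious: the
memchr loop can only scan contiguous bytes, so the text has to be
scanned one contiguous run at a time, stopping at the gap.  Below is
a simplified standalone sketch of that idea.  It uses a toy gap
buffer with 0-based logical positions; struct toy_gap_buffer and
toy_find_newline are names invented for this example, and none of the
Emacs macros (BUFFER_CEILING_OF, BYTE_POS_ADDR, etc.) are involved.

```c
#include <stddef.h>
#include <string.h>

/* Toy gap buffer (hypothetical, not the Emacs struct): the text is
   DATA[0..GAP_START) followed by DATA[GAP_END..SIZE); the bytes in
   between are unused.  */
struct toy_gap_buffer
{
  const char *data;
  size_t size;
  size_t gap_start, gap_end;
};

/* Return the logical position just after the COUNT-th newline at or
   after logical position START, or (size_t) -1 if there are fewer
   than COUNT newlines.  COUNT must be positive.  */
static size_t
toy_find_newline (const struct toy_gap_buffer *b, size_t start, size_t count)
{
  size_t gap_len = b->gap_end - b->gap_start;
  size_t text_len = b->size - gap_len;
  size_t pos = start;

  while (pos < text_len && count > 0)
    {
      /* The contiguous run containing POS ends either at the gap or
         at the end of the text.  */
      size_t ceiling = pos < b->gap_start ? b->gap_start : text_len;
      /* Logical positions past the gap are shifted by its length.  */
      const char *base = b->data + (pos < b->gap_start ? 0 : gap_len);

      while (pos < ceiling)
        {
          /* The dumb loop: knows nothing about the gap.  */
          const char *nl = memchr (base + pos, '\n', ceiling - pos);

          if (!nl)
            {
              pos = ceiling;
              break;
            }
          pos = (size_t) (nl - base) + 1;
          if (--count == 0)
            return pos;
        }
    }
  return (size_t) -1;
}
```

The real function additionally deals with character vs byte
positions, the shortage count, and quitting, which this sketch leaves
out.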