bug#8525: Lisp reader and string-to-number bugs and inconsistencies

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

From: Paul Eggert <eggert@cs.ucla.edu>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: 8525@debbugs.gnu.org
Subject: bug#8525: Lisp reader and string-to-number bugs and inconsistencies
Date: Thu, 21 Apr 2011 13:05:51 -0700	[thread overview]
Message-ID: <4DB08E1F.9090206@cs.ucla.edu> (raw)
In-Reply-To: <jwvfwpdrs0d.fsf-monnier+emacs@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 15540 bytes --]

On 04/20/11 06:08, Stefan Monnier wrote:
> We want to use floats rather than signal an overflow (this is evident
> from the history of the code since the conversion to floats was added
> somewhat recently).

OK I came up with a revised patch (below) which does that.  I want to
test it a bit more, but thought I'd publish it now.  A couple of things:

* It's easy to use floats for huge base-10 numbers, but there's no
  portable and accurate way to convert huge non-base-10 values to
  floats.  This patch still signals overflow for cases like
  (string-to-number "ffffffffffffffffffffffffff" 16) where the integer
  cannot be represented as a 64-bit number.  This is the best we can
  do easily, without going to arbitrary-precision arithmetic, or
  without having some rounding errors that I'd rather avoid.

* This patch uses strtoumax to get the widest available conversions
  for non-base-10 integers, as the result can be converted to float easily
  and accurately.  strtoumax is a C99 function and works on standard
  modern platforms; for the oddball exception we can use the gnulib
  substitute.  This part is a one-line change to Makefile.in that
  drags in a bunch of gnulib code, but this code is well-tested on old
  platforms and isn't needed on new ones.  I've attached a gzipped
  patch for this porting change.

For Windows, the implications for strtoumax are that <inttypes.h>
should exist, and should define uintmax_t and strtoumax.  If Windows
doesn't already do that, a simple substitute inttypes.h like the
following should do the trick:

   #define uintmax_t unsigned long long
   #define strtoumax strtoull

or, if Windows doesn't support long long, then use "unsigned long"
and "strtoul".

Here's the patch.  I'm afraid that it has to be applied by hand,
as I hand-mangled it out of a bunch of other patches.  If you
want something that applies automatically please let me know and
I'll generate one.

2011-04-21  Paul Eggert  <eggert@cs.ucla.edu>

	Make the Lisp reader and string-to-float more consistent.
	* data.c (atof): Remove decl; no longer used or needed.
	(digit_to_number): Move to lread.c.
	(Fstring_to_number): Use new string_to_number function, to be
	consistent with how the Lisp reader treats infinities and NaNs.
	Do not assume that floating-point numbers represent EMACS_INT
	without losing information; this is not true on most 64-bit hosts.
	Avoid double-rounding errors, by insisting on integers when
	parsing non-base-10 numbers, as the documentation specifies.
	* lisp.h (string_to_number): New decl, replacing ...
	(isfloat_string): Remove.
	* lread.c: Include <inttypes.h>, for uintmax_t and strtoumax.
	(read1): Do not accept +. and -. as integers; this
	appears to have been a coding error.  Similarly, do not accept
	strings like +-1e0 as floating point numbers.  Do not report
	overflow for integer overflows unless the base is not 10 which
	means we have no simple and reliable way to continue.
	Break out the floating-point parsing into a new
	function string_to_number, so that Fstring_to_number parses
	floating point numbers consistently with the Lisp reader.
	(digit_to_number): Moved here from data.c.  Make it static inline.
	(E_CHAR, EXP_INT): Remove, replacing with ...
	(E_EXP): New macro, to solve the "1.0e+" problem mentioned below.
	(string_to_number): New function, replacing isfloat_string.
	This function checks for valid syntax and produces the resulting
	Lisp float number too.  Rework it so that string-to-number
	no longer mishandles examples like "1.0e+".  Use strtoumax,
	so that overflow for non-base-10 numbers is reported only when
	there's no portable and simple way to convert to floating point.

diff -pu trunk/src/data.c atest/src/data.c
--- trunk/src/data.c	2011-04-16 16:07:57.605482000 -0700
+++ atest/src/data.c	2011-04-20 14:48:29.597646000 -0700
@@ -48,10 +48,6 @@ along with GNU Emacs.  If not, see <http
 
 #include <math.h>
 
-#if !defined (atof)
-extern double atof (const char *);
-#endif /* !atof */
-
 Lisp_Object Qnil, Qt, Qquote, Qlambda, Qunbound;
 static Lisp_Object Qsubr;
 Lisp_Object Qerror_conditions, Qerror_message, Qtop_level;
@@ -2374,35 +2370,10 @@ NUMBER may be an integer or a floating p
   return build_string (buffer);
 }
 
-INLINE static int
-digit_to_number (int character, int base)
-{
-  int digit;
-
-  if (character >= '0' && character <= '9')
-    digit = character - '0';
-  else if (character >= 'a' && character <= 'z')
-    digit = character - 'a' + 10;
-  else if (character >= 'A' && character <= 'Z')
-    digit = character - 'A' + 10;
-  else
-    return -1;
-
-  if (digit >= base)
-    return -1;
-  else
-    return digit;
-}
-
 DEFUN ("string-to-number", Fstring_to_number, Sstring_to_number, 1, 2, 0,
        doc: /* Parse STRING as a decimal number and return the number.
 This parses both integers and floating point numbers.
@@ -2415,7 +2386,6 @@ If the base used is not 10, STRING is al
 {
   register char *p;
   register int b;
-  int sign = 1;
   Lisp_Object val;
 
   CHECK_STRING (string);
@@ -2430,40 +2400,13 @@ If the base used is not 10, STRING is al
 	xsignal1 (Qargs_out_of_range, base);
     }
 
-  /* Skip any whitespace at the front of the number.  Some versions of
-     atoi do this anyway, so we might as well make Emacs lisp consistent.  */
   p = SSDATA (string);
   while (*p == ' ' || *p == '\t')
     p++;
 
-  if (*p == '-')
-    {
-      sign = -1;
-      p++;
-    }
-  else if (*p == '+')
-    p++;
-
-  if (isfloat_string (p, 1) && b == 10)
-    val = make_float (sign * atof (p));
-  else
-    {
-      double v = 0;
-
-      while (1)
-	{
-	  int digit = digit_to_number (*p++, b);
-	  if (digit < 0)
-	    break;
-	  v = v * b + digit;
-	}
-
-      val = make_fixnum_or_float (sign * v);
-    }
-
-  return val;
+  val = string_to_number (p, b, 1);
+  return NILP (val) ? make_number (0) : val;
 }
-
 \f
 enum arithop
   {
diff -pu trunk/src/lisp.h atest/src/lisp.h
--- trunk/src/lisp.h	2011-04-15 01:47:55.574976000 -0700
+++ atest/src/lisp.h	2011-04-21 10:25:15.716995000 -0700
@@ -2782,7 +2782,7 @@ extern Lisp_Object oblookup (Lisp_Object
   } while (0)
 extern int openp (Lisp_Object, Lisp_Object, Lisp_Object,
                   Lisp_Object *, Lisp_Object);
-extern int isfloat_string (const char *, int);
+Lisp_Object string_to_number (char const *, int, int);
 extern void map_obarray (Lisp_Object, void (*) (Lisp_Object, Lisp_Object),
                          Lisp_Object);
 extern void dir_warning (const char *, Lisp_Object);
diff -pu trunk/src/lread.c atest/src/lread.c
--- trunk/src/lread.c	2011-04-14 16:10:48.006381000 -0700
+++ atest/src/lread.c	2011-04-21 12:15:13.973449000 -0700
@@ -19,6 +19,7 @@ along with GNU Emacs.  If not, see <http
 
 
 #include <config.h>
+#include <inttypes.h>
 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -3005,86 +3006,9 @@ read1 (register Lisp_Object readcharfun,
 
 	if (!quoted && !uninterned_symbol)
 	  {
-	    register char *p1;
-	    p1 = read_buffer;
-	    if (*p1 == '+' || *p1 == '-') p1++;
-	    /* Is it an integer? */
-	    if (p1 != p)
-	      {
-		while (p1 != p && (c = *p1) >= '0' && c <= '9') p1++;
-		/* Integers can have trailing decimal points.  */
-		if (p1 > read_buffer && p1 < p && *p1 == '.') p1++;
-		if (p1 == p)
-		  /* It is an integer. */
-		  {
-		    if (p1[-1] == '.')
-		      p1[-1] = '\0';
-		    {
-		      /* EMACS_INT n = atol (read_buffer); */
-		      char *endptr = NULL;
-		      EMACS_INT n = (errno = 0,
-				     strtol (read_buffer, &endptr, 10));
-		      if (errno == ERANGE && endptr)
-			{
-			  Lisp_Object args
-			    = Fcons (make_string (read_buffer,
-						  endptr - read_buffer),
-				     Qnil);
-			  xsignal (Qoverflow_error, args);
-			}
-		      return make_fixnum_or_float (n);
-		    }
-		  }
-	      }
-	    if (isfloat_string (read_buffer, 0))
-	      {
-		/* Compute NaN and infinities using 0.0 in a variable,
-		   to cope with compilers that think they are smarter
-		   than we are.  */
-		double zero = 0.0;
-
-		double value;
-
-		/* Negate the value ourselves.  This treats 0, NaNs,
-		   and infinity properly on IEEE floating point hosts,
-		   and works around a common bug where atof ("-0.0")
-		   drops the sign.  */
-		int negative = read_buffer[0] == '-';
-
-		/* The only way p[-1] can be 'F' or 'N', after isfloat_string
-		   returns 1, is if the input ends in e+INF or e+NaN.  */
-		switch (p[-1])
-		  {
-		  case 'F':
-		    value = 1.0 / zero;
-		    break;
-		  case 'N':
-		    value = zero / zero;
-
-		    /* If that made a "negative" NaN, negate it.  */
-
-		    {
-		      int i;
-		      union { double d; char c[sizeof (double)]; } u_data, u_minus_zero;
-
-		      u_data.d = value;
-		      u_minus_zero.d = - 0.0;
-		      for (i = 0; i < sizeof (double); i++)
-			if (u_data.c[i] & u_minus_zero.c[i])
-			  {
-			    value = - value;
-			    break;
-			  }
-		    }
-		    /* Now VALUE is a positive NaN.  */
-		    break;
-		  default:
-		    value = atof (read_buffer + negative);
-		    break;
-		  }
-
-		return make_float (negative ? - value : value);
-	      }
+	    Lisp_Object result = string_to_number (read_buffer, 10, 0);
+	    if (! NILP (result))
+	      return result;
 	  }
 	{
 	  Lisp_Object name, result;
@@ -3242,74 +3166,179 @@ substitute_in_interval (INTERVAL interva
 }
 
 \f
+static inline int
+digit_to_number (int character, int base)
+{
+  int digit;
+
+  if ('0' <= character && character <= '9')
+    digit = character - '0';
+  else if ('a' <= character && character <= 'z')
+    digit = character - 'a' + 10;
+  else if ('A' <= character && character <= 'Z')
+    digit = character - 'A' + 10;
+  else
+    return -1;
+
+  return digit < base ? digit : -1;
+}
+
 #define LEAD_INT 1
 #define DOT_CHAR 2
 #define TRAIL_INT 4
-#define E_CHAR 8
-#define EXP_INT 16
+#define E_EXP 16
 
-int
-isfloat_string (const char *cp, int ignore_trailing)
+
+/* Convert STRING to a number, assuming base BASE.  Return a fixnum if CP has
+   integer syntax and fits in a fixnum, else return the nearest float if CP has
+   either floating point or integer syntax and BASE is 10, else return nil.  If
+   IGNORE_TRAILING is nonzero, consider just the longest prefix of CP that has
+   valid floating point syntax.  Signal an overflow if BASE is not 10 and the
+   number has integer syntax but does not fit.  */
+
+Lisp_Object
+string_to_number (char const *string, int base, int ignore_trailing)
 {
   int state;
-  const char *start = cp;
+  char const *cp = string;
+  int leading_digit;
+  int float_syntax = 0;
+  double value = 0;
+
+  /* Compute NaN and infinities using a variable, to cope with compilers that
+     think they are smarter than we are.  */
+  double zero = 0;
+
+  /* Negate the value ourselves.  This treats 0, NaNs, and infinity properly on
+     IEEE floating point hosts, and works around a formerly-common bug where
+     atof ("-0.0") drops the sign.  */
+  int negative = *cp == '-';
+
+  int signedp = negative || *cp == '+';
+  cp += signedp;
 
   state = 0;
-  if (*cp == '+' || *cp == '-')
-    cp++;
 
-  if (*cp >= '0' && *cp <= '9')
+  leading_digit = digit_to_number (*cp, base);
+  if (0 <= leading_digit)
     {
       state |= LEAD_INT;
-      while (*cp >= '0' && *cp <= '9')
-	cp++;
+      do
+	++cp;
+      while (0 <= digit_to_number (*cp, base));
     }
   if (*cp == '.')
     {
       state |= DOT_CHAR;
       cp++;
     }
-  if (*cp >= '0' && *cp <= '9')
-    {
-      state |= TRAIL_INT;
-      while (*cp >= '0' && *cp <= '9')
-	cp++;
-    }
-  if (*cp == 'e' || *cp == 'E')
-    {
-      state |= E_CHAR;
-      cp++;
-      if (*cp == '+' || *cp == '-')
-	cp++;
-    }
 
-  if (*cp >= '0' && *cp <= '9')
-    {
-      state |= EXP_INT;
-      while (*cp >= '0' && *cp <= '9')
-	cp++;
-    }
-  else if (cp == start)
-    ;
-  else if (cp[-1] == '+' && cp[0] == 'I' && cp[1] == 'N' && cp[2] == 'F')
+  if (base == 10)
     {
-      state |= EXP_INT;
-      cp += 3;
+      if ('0' <= *cp && *cp <= '9')
+	{
+	  state |= TRAIL_INT;
+	  do
+	    cp++;
+	  while ('0' <= *cp && *cp <= '9');
+	}
+      if (*cp == 'e' || *cp == 'E')
+	{
+	  char const *ecp = cp;
+	  cp++;
+	  if (*cp == '+' || *cp == '-')
+	    cp++;
+	  if ('0' <= *cp && *cp <= '9')
+	    {
+	      state |= E_EXP;
+	      do
+		cp++;
+	      while ('0' <= *cp && *cp <= '9');
+	    }
+	  else if (cp[-1] == '+'
+		   && cp[0] == 'I' && cp[1] == 'N' && cp[2] == 'F')
+	    {
+	      state |= E_EXP;
+	      cp += 3;
+	      value = 1.0 / zero;
+	    }
+	  else if (cp[-1] == '+'
+		   && cp[0] == 'N' && cp[1] == 'a' && cp[2] == 'N')
+	    {
+	      state |= E_EXP;
+	      cp += 3;
+	      value = zero / zero;
+
+	      /* If that made a "negative" NaN, negate it.  */
+	      {
+		int i;
+		union { double d; char c[sizeof (double)]; }
+		  u_data, u_minus_zero;
+		u_data.d = value;
+		u_minus_zero.d = -0.0;
+		for (i = 0; i < sizeof (double); i++)
+		  if (u_data.c[i] & u_minus_zero.c[i])
+		    {
+		      value = -value;
+		      break;
+		    }
+	      }
+	      /* Now VALUE is a positive NaN.  */
+	    }
+	  else
+	    cp = ecp;
+	}
+
+      float_syntax = ((state & (DOT_CHAR|TRAIL_INT)) == (DOT_CHAR|TRAIL_INT)
+		      || state == (LEAD_INT|E_EXP));
     }
-  else if (cp[-1] == '+' && cp[0] == 'N' && cp[1] == 'a' && cp[2] == 'N')
+
+  /* Return nil if the number uses invalid syntax.  If IGNORE_TRAILING, accept
+     any prefix that matches.  Otherwise, the entire string must match.  */
+  if (! (ignore_trailing
+	 ? ((state & LEAD_INT) != 0 || float_syntax)
+	 : (!*cp && ((state & ~DOT_CHAR) == LEAD_INT || float_syntax))))
+    return Qnil;
+
+  /* If the number uses integer and not float syntax, and is in C-language
+     range, use its value, preferably as a fixnum.  */
+  if (0 <= leading_digit && ! float_syntax)
     {
-      state |= EXP_INT;
-      cp += 3;
+      uintmax_t n;
+
+      /* Fast special case for single-digit integers.  This also avoids a
+	 glitch when BASE is 16 and IGNORE_TRAILING is nonzero, because in that
+	 case some versions of strtoumax accept numbers like "0x1" that Emacs
+	 does not allow.  */
+      if (digit_to_number (string[signedp + 1], base) < 0)
+	return make_number (negative ? -leading_digit : leading_digit);
+
+      errno = 0;
+      n = strtoumax (string + signedp, NULL, base);
+      if (errno == ERANGE)
+	{
+	  /* Unfortunately there's no simple and accurate way to convert
+	     non-base-10 numbers that are out of C-language range.  */
+	  if (base != 10)
+	    xsignal (Qoverflow_error, list1 (build_string (string)));
+	}
+      else if (n <= (negative ? -MOST_NEGATIVE_FIXNUM : MOST_POSITIVE_FIXNUM))
+	{
+	  EMACS_INT signed_n = n;
+	  return make_number (negative ? -signed_n : signed_n);
+	}
+      else
+	value = n;
     }
 
-  return ((ignore_trailing
-	   || *cp == 0 || *cp == ' ' || *cp == '\t' || *cp == '\n'
-	   || *cp == '\r' || *cp == '\f')
-	  && (state == (LEAD_INT|DOT_CHAR|TRAIL_INT)
-	      || state == (DOT_CHAR|TRAIL_INT)
-	      || state == (LEAD_INT|E_CHAR|EXP_INT)
-	      || state == (LEAD_INT|DOT_CHAR|TRAIL_INT|E_CHAR|EXP_INT)
-	      || state == (DOT_CHAR|TRAIL_INT|E_CHAR|EXP_INT)));
+  /* Either the number uses float syntax, or it does not fit into a fixnum.
+     Convert it from string to floating point, unless the value is already
+     known because it is an infinity, a NAN, or its absolute value fits in
+     uintmax_t.  */
+  if (! value)
+    value = atof (string + signedp);
+
+  return make_float (negative ? -value : value);
 }
 
 \f


[-- Attachment #2: strtoumax-patch.txt.gz --]
[-- Type: application/x-gzip, Size: 32229 bytes --]

next prev parent reply	other threads:[~2011-04-21 20:05 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-20  9:27 bug#8525: Lisp reader and string-to-number bugs and inconsistencies Paul Eggert
2011-04-20 10:45 ` Eli Zaretskii
2011-04-20 13:08 ` Stefan Monnier
2011-04-21 20:05   ` Paul Eggert [this message]
2011-04-24  4:43     ` Stefan Monnier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4DB08E1F.9090206@cs.ucla.edu \
    --to=eggert@cs.ucla.edu \
    --cc=8525@debbugs.gnu.org \
    --cc=monnier@iro.umontreal.ca \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).