unofficial mirror of bug-guile@gnu.org 
 help / color / mirror / Atom feed
* bug#72086: scm_i_utf8_string_hash overruns buffer when len is zero
@ 2024-07-13  0:43 Rob Browning
  0 siblings, 0 replies; only message in thread
From: Rob Browning @ 2024-07-13  0:43 UTC (permalink / raw)
  To: 72086

[-- Attachment #1: Type: text/plain, Size: 163 bytes --]


The first patch attempts to fix that, and the second is an optimization
when then input is ASCII (since we already have the information we need
to detect that):


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-scm_i_utf8_string_hash-don-t-overrun-when-len-is-zer.patch --]
[-- Type: text/x-diff, Size: 2304 bytes --]

From 619e3d3afec2c116007d9cb2ad32a500fb32a7dd Mon Sep 17 00:00:00 2001
From: Rob Browning <rlb@defaultvalue.org>
Date: Sun, 30 Jun 2024 22:41:40 -0500
Subject: [PATCH 1/2] scm_i_utf8_string_hash: don't overrun when len is zero

When the length is zero, the previous code would include the byte after
the end of the string in the hash.  Fix that (the wide and narrow
hashers also guard against it via "case 0"), and while we're there,
switch to u8_mbtouc since the unsafe variant is now the same (see the
info pages), and don't bother mutating length for the trailing bytes,
since we don't need to.

libguile/hash.c (scm_i_utf8_string_hash): switch to u8_mbtouc, and avoid
overrun when len == 0.
---
 libguile/hash.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index a038a11bf..d92f60df8 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -195,32 +195,34 @@ scm_i_utf8_string_hash (const char *str, size_t len)
   /* Handle most of the key.  */
   while (length > 3)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       a += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       b += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       c += u32;
       mix (a, b, c);
       length -= 3;
     }
 
   /* Handle the last 3 elements's.  */
-  ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-  a += u32;
-  if (--length)
+  if (length)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-      b += u32;
-      if (--length)
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
+      a += u32;
+      if (length > 1)
         {
-          ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-          c += u32;
+          ustr += u8_mbtouc (&u32, ustr, end - ustr);
+          b += u32;
+          if (length > 2)
+            {
+              ustr += u8_mbtouc (&u32, ustr, end - ustr);
+              c += u32;
+            }
         }
+      final (a, b, c);
     }
 
-  final (a, b, c);
-
   if (sizeof (unsigned long) == 8)
     ret = (((unsigned long) c) << 32) | b;
   else
-- 
2.43.0


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-scm_i_utf8_string_hash-optimize-ASCII.patch --]
[-- Type: text/x-diff, Size: 1975 bytes --]

From c6a888888101a893820f38561898e7c0390dd9d2 Mon Sep 17 00:00:00 2001
From: Rob Browning <rlb@defaultvalue.org>
Date: Mon, 1 Jul 2024 20:56:57 -0500
Subject: [PATCH 2/2] scm_i_utf8_string_hash: optimize ASCII

Since we already compute the char length, use that to detect all ASCII
strings and handle those the same way we handle latin-1.

libguile/hash.c (scm_i_utf8_string_hash): when byte_len == char_len,
(i.e. fixed-width ASCII) optimize hashing via existing narrow path.
---
 libguile/hash.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index d92f60df8..fe950e358 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -169,25 +169,29 @@ scm_i_latin1_string_hash (const char *str, size_t len)
 unsigned long 
 scm_i_utf8_string_hash (const char *str, size_t len)
 {
-  const uint8_t *end, *ustr = (const uint8_t *) str;
-  unsigned long ret;
-
-  /* The length of the string in characters.  This name corresponds to
-     Jenkins' original name.  */
-  size_t length;
-
-  uint32_t a, b, c, u32;
-
   if (len == (size_t) -1)
     len = strlen (str);
 
-  end = ustr + len;
-
+  const uint8_t *ustr = (const uint8_t *) str;
   if (u8_check (ustr, len) != NULL)
     /* Invalid UTF-8; punt.  */
     return scm_i_string_hash (scm_from_utf8_stringn (str, len));
 
-  length = u8_mbsnlen (ustr, len);
+  /* The length of the string in characters.  This name corresponds to
+     Jenkins' original name.  */
+  size_t length = u8_mbsnlen (ustr, len);
+
+  if (len == length) // ascii, same as narrow_string_hash above
+    {
+      unsigned long ret;
+      JENKINS_LOOKUP3_HASHWORD2 (str, len, ret);
+      ret >>= 2; /* Ensure that it fits in a fixnum.  */
+      return ret;
+    }
+
+  const uint8_t * const end = ustr + len;
+  uint32_t a, b, c, u32;
+  unsigned long ret;
 
   /* Set up the internal state.  */
   a = b = c = 0xdeadbeef + ((uint32_t)(length<<2)) + 47;
-- 
2.43.0


[-- Attachment #4: Type: text/plain, Size: 205 bytes --]


Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

^ permalink raw reply related	[flat|nested] only message in thread

only message in thread, other threads:[~2024-07-13  0:43 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-13  0:43 bug#72086: scm_i_utf8_string_hash overruns buffer when len is zero Rob Browning

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).