From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Rob Browning
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#72086: scm_i_utf8_string_hash overruns buffer when len is zero
Date: Fri, 12 Jul 2024 19:43:18 -0500
Message-ID: <87h6cuw28p.fsf@trouble.defaultvalue.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
To: 72086@debbugs.gnu.org
X-GNU-PR-Message: report 72086
X-GNU-PR-Package: guile
List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language"
Xref: news.gmane.io gmane.lisp.guile.bugs:10895

--=-=-=
Content-Type: text/plain

The first patch attempts to fix the overrun described in the subject,
and the second is an optimization for when the input is ASCII (since we
already have the information we need to detect that):

--=-=-=
Content-Type: text/x-diff
Content-Disposition: inline; filename=0001-scm_i_utf8_string_hash-don-t-overrun-when-len-is-zer.patch
Content-Description: 0001-scm_i_utf8_string_hash-don-t-overrun-when-len-is-zer.patch

>From 619e3d3afec2c116007d9cb2ad32a500fb32a7dd Mon Sep 17 00:00:00 2001
From: Rob Browning
Date: Sun, 30 Jun 2024 22:41:40 -0500
Subject: [PATCH 1/2] scm_i_utf8_string_hash: don't overrun when len is zero

When the length is zero, the previous code would include the byte after
the end of the string in the hash.  Fix that (the wide and narrow
hashers also guard against it via "case 0"), and while we're there,
switch to u8_mbtouc since the unsafe variant is now the same (see the
info pages), and don't bother mutating length for the trailing bytes,
since we don't need to.

libguile/hash.c (scm_i_utf8_string_hash): switch to u8_mbtouc, and avoid
overrun when len == 0.
---
 libguile/hash.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index a038a11bf..d92f60df8 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -195,32 +195,34 @@ scm_i_utf8_string_hash (const char *str, size_t len)
   /* Handle most of the key. */
   while (length > 3)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       a += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       b += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       c += u32;
       mix (a, b, c);
       length -= 3;
     }
 
   /* Handle the last 3 elements's. */
-  ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-  a += u32;
-  if (--length)
+  if (length)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-      b += u32;
-      if (--length)
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
+      a += u32;
+      if (length > 1)
         {
-          ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-          c += u32;
+          ustr += u8_mbtouc (&u32, ustr, end - ustr);
+          b += u32;
+          if (length > 2)
+            {
+              ustr += u8_mbtouc (&u32, ustr, end - ustr);
+              c += u32;
+            }
         }
+      final (a, b, c);
     }
 
-  final (a, b, c);
-
   if (sizeof (unsigned long) == 8)
     ret = (((unsigned long) c) << 32) | b;
   else
-- 
2.43.0

--=-=-=
Content-Type: text/x-diff
Content-Disposition: inline; filename=0002-scm_i_utf8_string_hash-optimize-ASCII.patch
Content-Description: 0002-scm_i_utf8_string_hash-optimize-ASCII.patch

>From c6a888888101a893820f38561898e7c0390dd9d2 Mon Sep 17 00:00:00 2001
From: Rob Browning
Date: Mon, 1 Jul 2024 20:56:57 -0500
Subject: [PATCH 2/2] scm_i_utf8_string_hash: optimize ASCII

Since we already compute the char length, use that to detect all ASCII
strings and handle those the same way we handle latin-1.

libguile/hash.c (scm_i_utf8_string_hash): when byte_len == char_len,
(i.e. fixed-width ASCII) optimize hashing via existing narrow path.
---
 libguile/hash.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index d92f60df8..fe950e358 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -169,25 +169,29 @@ scm_i_latin1_string_hash (const char *str, size_t len)
 unsigned long
 scm_i_utf8_string_hash (const char *str, size_t len)
 {
-  const uint8_t *end, *ustr = (const uint8_t *) str;
-  unsigned long ret;
-
-  /* The length of the string in characters. This name corresponds to
-     Jenkins' original name. */
-  size_t length;
-
-  uint32_t a, b, c, u32;
-
   if (len == (size_t) -1)
     len = strlen (str);
 
-  end = ustr + len;
-
+  const uint8_t *ustr = (const uint8_t *) str;
   if (u8_check (ustr, len) != NULL)
     /* Invalid UTF-8; punt. */
     return scm_i_string_hash (scm_from_utf8_stringn (str, len));
 
-  length = u8_mbsnlen (ustr, len);
+  /* The length of the string in characters. This name corresponds to
+     Jenkins' original name. */
+  size_t length = u8_mbsnlen (ustr, len);
+
+  if (len == length) // ascii, same as narrow_string_hash above
+    {
+      unsigned long ret;
+      JENKINS_LOOKUP3_HASHWORD2 (str, len, ret);
+      ret >>= 2; /* Ensure that it fits in a fixnum. */
+      return ret;
+    }
+
+  const uint8_t * const end = ustr + len;
+  uint32_t a, b, c, u32;
+  unsigned long ret;
 
   /* Set up the internal state. */
   a = b = c = 0xdeadbeef + ((uint32_t)(length<<2)) + 47;
-- 
2.43.0

--=-=-=
Content-Type: text/plain

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

--=-=-=--
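
For readers who want to see the zero-length guard from patch 1 in isolation, here is a minimal sketch. It is not the Guile code: it assumes the input has already been decoded into an array of code points (so no libunistring calls are needed), it omits Jenkins' mix () and final () steps, and count_reads is a hypothetical helper invented only for this illustration. It keeps the patched control flow and reports how many elements were read, which should always equal the requested length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Guile: walk LENGTH already-decoded
   code points using the same control flow as the patched tail handling
   and return the number of elements read.  SUM receives a + b + c as a
   stand-in for the real hash state.  */
static size_t
count_reads (const uint32_t *cp, size_t length, uint32_t *sum)
{
  size_t i = 0;
  uint32_t a = 0, b = 0, c = 0;

  assert (cp != NULL || length == 0);
  assert (sum != NULL);

  /* Handle most of the key, three elements at a time.  */
  while (length > 3)
    {
      a += cp[i++];
      b += cp[i++];
      c += cp[i++];
      length -= 3;
    }

  /* Handle the remaining 0 to 3 elements.  Without the outer guard the
     first read below would happen even when length == 0.  */
  if (length)
    {
      a += cp[i++];
      if (length > 1)
        {
          b += cp[i++];
          if (length > 2)
            c += cp[i++];
        }
    }

  *sum = a + b + c;
  return i;
}
```

Calling count_reads with a length of zero should read nothing and leave the sum at zero, which is the behaviour the first patch is aiming for in scm_i_utf8_string_hash.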