From: Rob Browning
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
Date: Sun, 06 Nov 2022 13:46:36 -0600
Message-ID: <87zgd3vpb7.fsf@trouble.defaultvalue.org>
References: <20220706012323.1024763-1-rlb@defaultvalue.org> <87zgd5gi4t.fsf@gnu.org>
In-Reply-To: <87zgd5gi4t.fsf@gnu.org>
To: Ludovic Courtès
Cc: 56413@debbugs.gnu.org

Ludovic Courtès writes:

> Rob Browning skribis:

>> +  // Make sure a utf-8 symbol has the expected hash.  In addition to
>> +  // catching algorithmic regressions, this would have caught a
>> +  // long-standing buffer overflow.
>> +
>> +  // περί
>> +  char about_u8[] = {0xce, 0xa0, 0xce, 0xb5, 0xcf, 0x81, 0xce, 0xaf, 0};
>> +  SCM sym = scm_from_utf8_symbol (about_u8);
>> +
>> +  const unsigned long expect = 4029223418961680680;
>> +  const unsigned long actual = scm_to_ulong (scm_symbol_hash (sym));
>
> Is this a documented example of Jenkins?  Or did you use a reference
> implementation?

OK, so unfortunately I don't actually recall how I came up with that
number, but I can start over with some canonical approach to compute the
value if we like.

...if I didn't get it from somewhere more authoritative, I might also
have just been trying to at least prevent undetected regressions.

> AFAICS this will only change the hash of UTF-8 symbols and won’t have
> any effect on the output of ‘string-hash’, right?  If not that would be
> an incompatibility.

The u8_mbsnlen() change should strictly fix bugs I think?  i.e. if the
length is supposed to be in characters, which it looks like from all the
other uses in the function (and from the comment), then the old code was
returning the wrong values (which prompted the original crashes).

So this change *could* alter results, but only for non-ASCII strings,
and those results would have been wrong (i.e. relying on uninitialized
memory).  Of course if that memory was *always* the same for a given
symbol somehow (everywhere in memory), then the result would be stable,
if incorrect.

That leaves the size_t -> long change in scm_i_str2symbol(), and I don't
think that has anything to do with UTF-8, but it could cause mangling of
the value on any platform where the data types differ sufficiently, and
then of course if we're not using the same type consistently, then we
could give different answers for the same symbol in different contexts
(for different code paths).

And indeed, looks like I missed another case; just below in
scm_i_str2uninterned_symbol() we also use size_t.
For now, I suspect we should change both or neither, and definitely
change them all to match "eventually".

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
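
For anyone following the bytes-versus-characters point, here is a minimal sketch of the difference, using the same bytes as the quoted test. It is only an illustration: count_utf8_chars is a hypothetical helper written for this note, not Guile code and not libunistring's u8_mbsnlen(), and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not Guile or
   libunistring code): count the UTF-8 characters in the first n bytes
   of s by counting every byte that is not a continuation byte
   (bit pattern 10xxxxxx).  Assumes s holds valid UTF-8.  */
size_t
count_utf8_chars (const uint8_t *s, size_t n)
{
  size_t chars = 0;
  for (size_t i = 0; i < n; i++)
    if ((s[i] & 0xC0) != 0x80)
      chars++;
  return chars;
}

/* The bytes from the quoted test, without the trailing NUL:
   8 bytes that encode 4 two-byte characters.  */
const uint8_t about_u8[] =
  { 0xce, 0xa0, 0xce, 0xb5, 0xcf, 0x81, 0xce, 0xaf };
```

Calling count_utf8_chars (about_u8, sizeof about_u8) gives a character count that is half the byte count here. Code that sizes a per-character buffer from the character count but then loops up to the byte count would walk past the initialized part of that buffer, which is consistent with the uninitialized-memory reads described above.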