unofficial mirror of bug-guile@gnu.org 
* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Rob Browning @ 2022-07-06  1:23 UTC
  To: 56413

Noticed while investigating a migration to utf-8 strings.  After making
changes that routed non-ascii symbol hashing through this function,
encoding-iso88597.test began intermittently failing because it would
traverse trailing garbage when u8_strnlen reported 8 chars instead of 4.

Change the scm_i_str2symbol internal hash type to unsigned long to
explicitly match the hashing result type.
---

 Proposed for at least main.

 libguile/hash.c                      |  2 +-
 libguile/symbols.c                   |  2 +-
 test-suite/standalone/Makefile.am    |  7 ++++
 test-suite/standalone/test-hashing.c | 61 ++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+), 2 deletions(-)
 create mode 100644 test-suite/standalone/test-hashing.c

diff --git a/libguile/hash.c b/libguile/hash.c
index 93431102f..0740b2645 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -188,7 +188,7 @@ scm_i_utf8_string_hash (const char *str, size_t len)
     /* Invalid UTF-8; punt.  */
     return scm_i_string_hash (scm_from_utf8_stringn (str, len));
 
-  length = u8_strnlen (ustr, len);
+  length = u8_mbsnlen (ustr, len);
 
   /* Set up the internal state.  */
   a = b = c = 0xdeadbeef + ((uint32_t)(length<<2)) + 47;
diff --git a/libguile/symbols.c b/libguile/symbols.c
index ad5f22f57..cd9cda3de 100644
--- a/libguile/symbols.c
+++ b/libguile/symbols.c
@@ -239,7 +239,7 @@ static SCM
 scm_i_str2symbol (SCM str)
 {
   SCM symbol;
-  size_t raw_hash = scm_i_string_hash (str);
+  unsigned long raw_hash = scm_i_string_hash (str);
 
   symbol = lookup_interned_symbol (str, raw_hash);
   if (scm_is_true (symbol))
diff --git a/test-suite/standalone/Makefile.am b/test-suite/standalone/Makefile.am
index e87100c96..ca1b3131b 100644
--- a/test-suite/standalone/Makefile.am
+++ b/test-suite/standalone/Makefile.am
@@ -167,6 +167,13 @@ test_conversion_LDADD = $(LIBGUILE_LDADD) $(top_builddir)/lib/libgnu.la
 check_PROGRAMS += test-conversion
 TESTS += test-conversion
 
+# test-hashing
+test_hashing_SOURCES = test-hashing.c
+test_hashing_CFLAGS = ${test_cflags}
+test_hashing_LDADD = $(LIBGUILE_LDADD) $(top_builddir)/lib/libgnu.la
+check_PROGRAMS += test-hashing
+TESTS += test-hashing
+
 # test-loose-ends
 test_loose_ends_SOURCES = test-loose-ends.c
 test_loose_ends_CFLAGS = ${test_cflags}
diff --git a/test-suite/standalone/test-hashing.c b/test-suite/standalone/test-hashing.c
new file mode 100644
index 000000000..476181fe2
--- /dev/null
+++ b/test-suite/standalone/test-hashing.c
@@ -0,0 +1,61 @@
+/* Copyright 2022
+     Free Software Foundation, Inc.
+
+   This file is part of Guile.
+
+   Guile is free software: you can redistribute it and/or modify it
+   under the terms of the GNU Lesser General Public License as published
+   by the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   Guile is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
+   License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with Guile.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <libguile.h>
+
+#include <stdio.h>
+
+static void
+test_hashing ()
+{
+  // Make sure a utf-8 symbol has the expected hash.  In addition to
+  // catching algorithmic regressions, this would have caught a
+  // long-standing buffer overflow.
+
+  // Περί
+  char about_u8[] = {0xce, 0xa0, 0xce, 0xb5, 0xcf, 0x81, 0xce, 0xaf, 0};
+  SCM sym = scm_from_utf8_symbol (about_u8);
+
+  const unsigned long expect = 4029223418961680680;
+  const unsigned long actual = scm_to_ulong (scm_symbol_hash (sym));
+
+  if (actual != expect)
+    {
+      fprintf (stderr, "fail: unexpected utf-8 symbol hash (%lu != %lu)\n",
+               actual, expect);
+      exit (EXIT_FAILURE);
+    }
+}
+
+static void
+tests (void *data, int argc, char **argv)
+{
+  test_hashing ();
+}
+
+int
+main (int argc, char *argv[])
+{
+  scm_boot_guile (argc, argv, tests, NULL);
+  return 0;
+}
-- 
2.30.2







* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Rob Browning @ 2022-07-06  3:04 UTC
  To: 56413

Rob Browning <rlb@defaultvalue.org> writes:

> Noticed while investigating a migration to utf-8 strings.  After making
> changes that routed non-ascii symbol hashing through this function,
> encoding-iso88597.test began intermittently failing because it would
> traverse trailing garbage when u8_strnlen reported 8 chars instead of 4.
>
> Change the scm_i_str2symbol internal hash type to unsigned long to
> explicitly match the hashing result type.

Hmm.  I suppose the current test could be handled on the scheme side
instead.  (I'd started off attempting some more direct, elaborate tests
that didn't pan out.)  Happy to rework that if desired.

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Ludovic Courtès @ 2022-11-05 22:18 UTC
  To: Rob Browning; +Cc: 56413

Hi,

Rob Browning <rlb@defaultvalue.org> skribis:

> Noticed while investigating a migration to utf-8 strings.  After making
> changes that routed non-ascii symbol hashing through this function,
> encoding-iso88597.test began intermittently failing because it would
> traverse trailing garbage when u8_strnlen reported 8 chars instead of 4.
>
> Change the scm_i_str2symbol internal hash type to unsigned long to
> explicitly match the hashing result type.

Oh, good catch.

For the final patch please add a ChangeLog-style entry.

> +  // Make sure a utf-8 symbol has the expected hash.  In addition to
> +  // catching algorithmic regressions, this would have caught a
> +  // long-standing buffer overflow.
> +
> +  // Περί
> +  char about_u8[] = {0xce, 0xa0, 0xce, 0xb5, 0xcf, 0x81, 0xce, 0xaf, 0};
> +  SCM sym = scm_from_utf8_symbol (about_u8);
> +
> +  const unsigned long expect = 4029223418961680680;
> +  const unsigned long actual = scm_to_ulong (scm_symbol_hash (sym));

Is this a documented example of Jenkins?  Or did you use a reference
implementation?

> Hmm.  I suppose the current test could be handled on the scheme side
> instead.  (I'd started off attempting some more direct, elaborate tests
> that didn't pan out.)  Happy to rework that if desired.

Yes, it may be nicer to have it in ‘test-suite/tests/hash.test’.

AFAICS this will only change the hash of UTF-8 symbols and won’t have
any effect on the output of ‘string-hash’, right?  If not that would be
an incompatibility.

Thanks and sorry for the delay!

Ludo’.






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Rob Browning @ 2022-11-06 16:44 UTC
  To: Ludovic Courtès; +Cc: 56413

Ludovic Courtès <ludo@gnu.org> writes:

> For the final patch please add a ChangeLog-style entry.

Will do.

> Is this a documented example of Jenkins?  Or did you use a reference
> implementation?

Jenkins?

> Yes, it may be nicer to have it in ‘test-suite/tests/hash.test’.
>
> AFAICS this will only change the hash of UTF-8 symbols and won’t have
> any effect on the output of ‘string-hash’, right?  If not that would be
> an incompatibility.

I think that's right, but I'll have to refresh my memory regarding the
changes.  (Haven't gotten back to the utf-8 work for a bit so it's not
top of mind, though I hope to soon.)

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Rob Browning @ 2022-11-06 17:45 UTC
  To: Ludovic Courtès; +Cc: 56413

Rob Browning <rlb@defaultvalue.org> writes:

> Jenkins?

Oh, right (after looking back at the code).

I'll get back to you regarding this and the other questions after I
finish reviewing/remembering.

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Rob Browning @ 2022-11-06 19:46 UTC
  To: Ludovic Courtès; +Cc: 56413

Ludovic Courtès <ludo@gnu.org> writes:

> Rob Browning <rlb@defaultvalue.org> skribis:

>> +  // Make sure a utf-8 symbol has the expected hash.  In addition to
>> +  // catching algorithmic regressions, this would have caught a
>> +  // long-standing buffer overflow.
>> +
>> +  // Περί
>> +  char about_u8[] = {0xce, 0xa0, 0xce, 0xb5, 0xcf, 0x81, 0xce, 0xaf, 0};
>> +  SCM sym = scm_from_utf8_symbol (about_u8);
>> +
>> +  const unsigned long expect = 4029223418961680680;
>> +  const unsigned long actual = scm_to_ulong (scm_symbol_hash (sym));
>
> Is this a documented example of Jenkins?  Or did you use a reference
> implementation?

OK, so unfortunately I don't actually recall how I came up with that
number, but I can start over with some canonical approach to compute the
value if we like.

...if I didn't get it from somewhere more authoritative, I might also
have just been trying to at least prevent undetected regressions.

> AFAICS this will only change the hash of UTF-8 symbols and won’t have
> any effect on the output of ‘string-hash’, right?  If not that would be
> an incompatibility.

The u8_mbsnlen() change should strictly fix bugs I think?  i.e. if the
length is supposed to be in characters, which it looks like from all the
other uses in the function (and from the comment), then the old code
was returning the wrong values (which prompted the original crashes).

So this change *could* alter results, but only for non-ASCII strings,
and those results would have been wrong (i.e. relying on uninitialized
memory).  Of course if that memory was *always* the same for a given
symbol somehow (everywhere in memory), then the result would be stable,
if incorrect.


That leaves the size_t -> long change in scm_i_str2symbol(), and I don't
think that has anything to do with UTF-8, but it could cause mangling of
the value on any platform where the data types differ sufficiently, and
then of course if we're not using the same type consistently, then we
could give different answers for the same symbol in different contexts
(for different code paths).

And indeed, looks like I missed another case; just below in
scm_i_str2uninterned_symbol() we also use size_t.  For now, I suspect we
should change both or neither, and definitely change them all to match
"eventually".

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Ludovic Courtès @ 2022-11-07 13:06 UTC
  To: Rob Browning; +Cc: 56413

Rob Browning <rlb@defaultvalue.org> skribis:

>> Is this a documented example of Jenkins?  Or did you use a reference
>> implementation?
>
> Jenkins?

That’s the name of the hash function in question.

If not, where did you get that example from?  :-)

Thanks,
Ludo’.






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Ludovic Courtès @ 2022-11-07 13:07 UTC
  To: Rob Browning; +Cc: 56413

Rob Browning <rlb@defaultvalue.org> skribis:

> So this change *could* alter results, but only for non-ASCII strings,
> and those results would have been wrong (i.e. relying on uninitialized
> memory).

OK, that was my understanding too.

> That leaves the size_t -> long change in scm_i_str2symbol(), and I don't
> think that has anything to do with UTF-8, but it could cause mangling of
> the value on any platform where the data types differ sufficiently, and
> then of course if we're not using the same type consistently, then we
> could give different answers for the same symbol in different contexts
> (for different code paths).

Right.  This one looks safe to me.

> And indeed, looks like I missed another case; just below in
> scm_i_str2uninterned_symbol() we also use size_t.  For now, I suspect we
> should change both or neither, and definitely change them all to match
> "eventually".

Sure.

Thanks!

Ludo’.






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Rob Browning @ 2022-11-08  5:05 UTC
  To: Ludovic Courtès; +Cc: 56413

Rob Browning <rlb@defaultvalue.org> writes:

> OK, so unfortunately I don't actually recall how I came up with that
> number, but I can start over with some canonical approach to compute the
> value if we like.

I hacked up hash.c to let me call wide_string_hash() directly and
printed the hash for wchar_t {0x3A0, 0x3B5, 0x3C1, 0x3AF}, which should
be what the optimized utf-8 code is consuming.

I saw 4029223418961680680.  I double-checked via (symbol-hash
'Περί) from the terminal, and that returned the same value.

Oh, and unless I'm missing something, I remembered why we may need to
keep the standalone C test program -- there's no straightforward way to
call scm_from_utf8_symbol() from scheme?

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4






* bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
From: Ludovic Courtès @ 2022-11-08 10:09 UTC
  To: Rob Browning; +Cc: 56413

Hi,

Rob Browning <rlb@defaultvalue.org> skribis:

> Oh, and unless I'm missing something, I remembered why we may need to
> keep the standalone C test program -- there's no straightforward way to
> call scm_from_utf8_symbol() from scheme?

Ah yes, you’re probably right!

Ludo’.





Thread overview: 10+ messages
2022-07-06  1:23 bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes Rob Browning
2022-07-06  3:04 ` Rob Browning
2022-11-05 22:18 ` Ludovic Courtès
2022-11-06 16:44   ` Rob Browning
2022-11-06 17:45     ` Rob Browning
2022-11-07 13:06     ` Ludovic Courtès
2022-11-06 19:46   ` Rob Browning
2022-11-07 13:07     ` Ludovic Courtès
2022-11-08  5:05     ` Rob Browning
2022-11-08 10:09       ` Ludovic Courtès
