unofficial mirror of bug-guile@gnu.org 
From: Rob Browning <rlb@defaultvalue.org>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 56413@debbugs.gnu.org
Subject: bug#56413: [PATCH 1/1] scm_i_utf8_string_hash: compute u8 chars not bytes
Date: Sun, 06 Nov 2022 13:46:36 -0600
Message-ID: <87zgd3vpb7.fsf@trouble.defaultvalue.org>
In-Reply-To: <87zgd5gi4t.fsf@gnu.org>

Ludovic Courtès <ludo@gnu.org> writes:

> Rob Browning <rlb@defaultvalue.org> skribis:

>> +  // Make sure a utf-8 symbol has the expected hash.  In addition to
>> +  // catching algorithmic regressions, this would have caught a
>> +  // long-standing buffer overflow.
>> +
>> +  // περί
>> +  char about_u8[] = {0xce, 0xa0, 0xce, 0xb5, 0xcf, 0x81, 0xce, 0xaf, 0};
>> +  SCM sym = scm_from_utf8_symbol (about_u8);
>> +
>> +  const unsigned long expect = 4029223418961680680;
>> +  const unsigned long actual = scm_to_ulong (scm_symbol_hash (sym));
>
> Is this a documented example of Jenkins?  Or did you use a reference
> implementation?

OK, so unfortunately I don't actually recall how I came up with that
number, but I can start over with some canonical approach to compute the
value if we like.

...if I didn't get it from somewhere more authoritative, I might also
have just been trying to at least prevent undetected regressions.

> AFAICS this will only change the hash of UTF-8 symbols and won’t have
> any effect on the output of ‘string-hash’, right?  If not that would be
> an incompatibility.

I think the u8_mbsnlen() change strictly fixes bugs: if the length is
supposed to be in characters, which is how all the other uses in the
function (and the comment) treat it, then the old code was returning
the wrong values (which prompted the original crashes).
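
To make the bytes-versus-characters distinction concrete, here is a
minimal sketch.  The helper utf8_char_count() is a hypothetical stand-in
for u8_mbsnlen(), not the libunistring implementation, and it assumes
well-formed UTF-8 input.  For the eight bytes in the test above it
reports four characters, so any code that treats the byte count as a
character count overshoots by four.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for u8_mbsnlen(): count the characters encoded
   in the first N bytes of S.  Every byte that is not a continuation
   byte (10xxxxxx) starts a new character.  Assumes well-formed UTF-8. */
static size_t
utf8_char_count (const uint8_t *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if ((s[i] & 0xC0) != 0x80)
      count++;
  return count;
}
```

For pure ASCII input the two counts are equal, which is why only
non-ASCII strings could be affected.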

So this change *could* alter results, but only for non-ASCII strings,
and those results would have been wrong (i.e. relying on uninitialized
memory).  Of course if that memory was *always* the same for a given
symbol somehow (everywhere in memory), then the result would be stable,
if incorrect.


That leaves the size_t -> long change in scm_i_str2symbol().  I don't
think it has anything to do with UTF-8, but it could mangle the value on
any platform where the two types differ sufficiently.  And if we're not
using the same type consistently, we could give different answers for
the same symbol in different contexts (i.e. on different code paths).
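
As a rough illustration of the mangling concern, the sketch below
simulates a platform where the two types have different widths by using
fixed-width types.  Both function names are hypothetical and this is not
Guile code; it only shows that a hash wider than 32 bits no longer
compares equal once one code path has stored it in a narrower variable.

```c
#include <stdint.h>

/* Hypothetical code path that keeps the hash at full 64-bit width. */
static uint64_t
store_wide (uint64_t hash)
{
  return hash;
}

/* Hypothetical code path that stores the hash in a 32-bit variable,
   silently dropping the upper 32 bits, before handing it back. */
static uint64_t
store_narrow (uint64_t hash)
{
  uint32_t narrowed = (uint32_t) hash;
  return narrowed;
}
```

With the expected value from the test, store_wide() and store_narrow()
disagree, while a small value such as 42 survives both paths unchanged,
so a mismatch like this could go unnoticed until a large hash shows up.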

And indeed, looks like I missed another case; just below in
scm_i_str2uninterned_symbol() we also use size_t.  For now, I suspect we
should change both or neither, and definitely change them all to match
"eventually".

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4





