unofficial mirror of bug-guix@gnu.org 
From: Timothy Sample <samplet@ngyro.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 48114@debbugs.gnu.org
Subject: bug#48114: Disarchive occasionally fails tests
Date: Mon, 03 May 2021 00:02:09 -0400
Message-ID: <8735v4ea7y.fsf@ngyro.com>
In-Reply-To: <87a6pceerf.fsf@ngyro.com> (Timothy Sample's message of "Sun, 02 May 2021 22:24:04 -0400")

Timothy Sample <samplet@ngyro.com> writes:

> I’m still looking into this, but I wanted to quickly post this
> reproducer for the Guile bug:
>
>     (use-modules (ice-9 regex))
>     (define str
> "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>     (match:substring (string-match "[0-8]+" str))
>
> This triggers the out-of-range error when run with “LC_ALL=C”.

It turns out that all that’s needed is the last code point, which is
“Number Eleven Full Stop”, or ‘⒒’.  When Guile converts this to an ASCII
C string using ‘u32_conv_to_encoding’, it becomes “11.”.  The regex
(“[0-8]+”) matches the “11” part with start index 0 and end index 2.
The ‘fixup_multibyte_match’ function does nothing (it only matters when
the locale encoding is multibyte) [1].  Guile then builds the match
vector with the original string but keeps the ASCII offsets.  In other
words, it thinks the match substring goes from 0 to 2 in a single code
point string:

    ,use (ice-9 regex)
    (string-match "11" "\u2492")
    => #("\u2492" (0 . 2))
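 
As a stopgap, a caller could check the reported offsets against the
original string before extracting the substring.  This is only a
minimal sketch: ‘safe-match-substring’ is a hypothetical helper, not
something Guile or Disarchive provides, and it only catches offsets
that run past the end of the string.  Offsets that are wrong but still
in range would go unnoticed.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper (an assumption for illustration, not a Guile API).
;; Returns the matched substring, or #f when there is no match or when
;; the reported end offset does not fit inside the original string.
(define (safe-match-substring pattern str)
  (let ((m (string-match pattern str)))
    (and m
         (<= (match:end m) (string-length str))
         (match:substring m))))

;; Plain ASCII input behaves as usual.
(safe-match-substring "[0-8]+" "abc 123")   ; => "123"
```

With “LC_ALL=C”, the same call on "\u2492" should return #f instead of
raising the out-of-range error, because the end offset (2) exceeds the
string length (1).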

I’m not sure there’s any way to solve this nicely in Guile.  It would be
clearer if the match vector included the string as libc matched it, but
it’s still surprising that the match happens with a different string.

In Disarchive, I can rewrite the generator without regex.  I’ll do that
and see what I can do about the “Gave up!” issue.
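 
For illustration, here is one way a “[0-8]+”-style search can be done
without regex, using a character set so that the string is never
converted to the locale encoding.  This is a sketch under assumptions,
not Disarchive’s actual generator code: the names ‘digits-0-8’ and
‘first-run’ are made up for the example.

```scheme
;; Character set equivalent to the regex class [0-8].
;; (Name chosen for this example only.)
(define digits-0-8 (string->char-set "012345678"))

;; Hypothetical helper: return the first maximal run of characters in
;; STR that belong to the character set CS, or #f if there is none.
;; It works on code points directly, so the locale does not matter.
(define (first-run str cs)
  (let ((start (string-index str cs)))
    (and start
         (let ((end (or (string-skip str cs start)
                        (string-length str))))
           (substring str start end)))))

(first-run "abc 123" digits-0-8)   ; => "123"
(first-run "\u2492" digits-0-8)    ; => #f
```

Because ‘⒒’ is not a member of the character set, it is never treated
as “11”, whatever the locale encoding is.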

[1] It works on the converted-to-ASCII C string, which means that the
byte offsets and code point offsets are the same.  Hence, it has nothing
to do.


-- Tim




