unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: Mark H Weaver <mhw@netris.org>
To: Danny Milosavljevic <dannym@scratchpost.org>
Cc: Pierre Neidhardt <mail@ambrevar.xyz>, 33848@debbugs.gnu.org
Subject: bug#33848: Store references in SBCL-compiled code are "invisible"
Date: Thu, 27 Dec 2018 09:29:42 -0500	[thread overview]
Message-ID: <87pntnvy8e.fsf@netris.org> (raw)
In-Reply-To: <20181227145258.0c420eac@scratchpost.org> (Danny Milosavljevic's message of "Thu, 27 Dec 2018 14:52:58 +0100")

Hi Danny,

Danny Milosavljevic <dannym@scratchpost.org> writes:

> On Mon, 24 Dec 2018 13:12:23 -0500
> Mark H Weaver <mhw@netris.org> wrote:
>
>> Of course, the usual reason to choose UTF-32 is to support non-ASCII
>> characters while retaining fixed-width code points, so that string
>> lookups are straightforward and efficient.
>
> This kind of lookup is almost never what is necessary.  There are many
> people who assume character is the same as codepoint and to those people
> UTF-32 brings something to the table, but it's really not useful if people
> do text processing correctly, see below.
>
> (Of course whether packages actually do this remains to be seen)
>
>>  That extra
>> complexity is what I guess we would need to add to each program that
>> currently uses UTF-32.
>
> Yes, but they usually have to do stream processing even with UTF-32 (because
> a character can be composed of possibly infinite number of codepoints),

I agree with you.  However, as silly as it might be, the fact remains
that almost every modern programming language and string library uses
code points as the base units by which to index strings.

> so the infrastructure should be already there and the effort should be
> minimal.

The infrastructure might or might not be there, depending on the
sophistication of the program's unicode support, but even if it _is_
there, it will most likely be a layer that expects to iterate over
strings indexed by code point to find graphemes, etc.

Anyway, if you truly believe the effort should be minimal, feel free to
investigate and propose patches to fix our 5 common lisp compilers and
Fish to avoid storing UTF-32 in the object code.

> Also, if both UTF-32 and UTF-8 are used on disk, care needs to not misdetect
> an UTF-8 sequence as an UTF-32 sequence of different text - or the other way
> around -, but that's unlikely for ASCII strings.

This is not an issue because the substrings that the reference scanner
and grafter are looking for are ASCII-only, even if they are part of a
larger non-ASCII string.  Specifically, they only need to look for the
nix hashes.

>> I really think it would be a mistake to try to force every program and
>> language implementation to use our preferred string representation.  I
>> suspect it would be vastly easier to compromise and support a few other
>> popular string representations in Guix, namely UTF-16 and UTF-32.
>
> In 1992, UTF-8 was invented.  Subsequently, most of the Internet,
> all new GNU Linux distributions etc, all UNIX GUI frameworks, Subversion
> etc standardized on UTF-8, with the eventual goal of standardizing all
> network transfer and storage to UTF-8.  I think that by now the outliers
> are the ones who need to change,

I agree that we need to standardize on Unicode.  However, given the
perhaps unfortunate fact that almost everyone has standardized on code
points as the units by which to index strings, choosing UTF-32 as an
internal representation is a very reasonable choice, IMO.

Anyway, feel free to engage with the developers of the Common Lisp
implementations that use UTF-32 and try to convince them to change.

The remaining question is: what to do if upstream refuses to change?  Do
we exclude that software in Guix, or do we maintain our own patches to
override upstream's decision?

>> If you don't want to change the daemon, it could be worked around in our
>> build-side code as follows: we could add a new phase to certain build
>> systems (or possibly gnu-build-system) that scans each output for
>> UTF-16/32 encoded store references that are never referenced in UTF-8.
>> If such references exist, a file with an unobtrusive name would be added
>> to that output containing those references encoded in UTF-8.  This would
>> enable our daemon's existing reference scanner to find all of the
>> references.
>
> I agree that that would be nice.  As a first step, even just detecting
> problems like that and erroring out would be okay - in order to find them
> in the first place.  Right now, it's difficult to detect and so also difficult
> to say how wide-spread the problem is.  If the problem is wide-spread enough
> my tune could change very quickly.

Sure, it would be useful to have more data on what packages are
currently affected by this issue.

      Regards,
        Mark

      reply	other threads:[~2018-12-27 14:31 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-23 14:19 bug#33848: Store references in SBCL-compiled code are "invisible" Ludovic Courtès
2018-12-23 15:05 ` Pierre Neidhardt
2018-12-24 14:57   ` Ludovic Courtès
2018-12-23 16:45 ` Mark H Weaver
2018-12-23 17:32   ` Ludovic Courtès
2018-12-23 22:01     ` Pierre Neidhardt
2018-12-24 15:06       ` Ludovic Courtès
2018-12-24 17:08         ` Pierre Neidhardt
2018-12-26 16:07           ` Ludovic Courtès
2018-12-24 18:12         ` Mark H Weaver
2018-12-24 23:58           ` Pierre Neidhardt
2018-12-26 16:14           ` Ludovic Courtès
2018-12-27 10:37             ` Pierre Neidhardt
2018-12-27 14:03               ` Mark H Weaver
2018-12-27 14:45                 ` Ludovic Courtès
2018-12-27 15:02                   ` Pierre Neidhardt
2018-12-27 16:15                     ` Pierre Neidhardt
2018-12-27 17:03                     ` Ludovic Courtès
2018-12-27 18:57                       ` Pierre Neidhardt
2018-12-27 21:54                         ` Ludovic Courtès
2018-12-27 22:05                           ` Pierre Neidhardt
2018-12-27 22:59                             ` Ludovic Courtès
2018-12-28  7:47                               ` Pierre Neidhardt
2021-03-30 10:15                                 ` Pierre Neidhardt
2021-03-30 20:09                                   ` Ludovic Courtès
2021-03-31  7:10                                     ` Pierre Neidhardt
2021-03-31 16:12                                     ` Pierre Neidhardt
2021-03-31 20:42                                       ` Ludovic Courtès
2021-03-31 20:57                                         ` Pierre Neidhardt
2021-04-01 17:23                                         ` Mark H Weaver
2021-04-02 15:13                                           ` Ludovic Courtès
2021-04-01  6:03                                       ` Mark H Weaver
2021-04-01  7:13                                         ` Pierre Neidhardt
2021-04-01  7:57                                         ` Ludovic Courtès
2021-04-01  8:48                                           ` Pierre Neidhardt
2021-04-01  9:07                                           ` Guillaume Le Vaillant
2021-04-01  9:13                                             ` Pierre Neidhardt
2021-04-01  9:52                                               ` Guillaume Le Vaillant
2021-04-01 10:06                                                 ` Pierre Neidhardt
2021-04-01 10:07                                                 ` Pierre Neidhardt
2021-04-01 15:24                                                   ` Ludovic Courtès
2021-04-01 17:33                                                   ` Mark H Weaver
2021-04-02 22:46                                                     ` Mark H Weaver
2021-04-03  6:51                                                       ` Pierre Neidhardt
2021-04-03 20:10                                                         ` Mark H Weaver
2021-04-05 19:45                                                           ` Ludovic Courtès
2021-04-05 23:04                                                             ` Mark H Weaver
2021-04-06  8:16                                                               ` Ludovic Courtès
2021-04-06  8:23                                                                 ` Pierre Neidhardt
2021-04-30 20:03                                                                   ` Mark H Weaver
2021-05-01  9:22                                                                     ` Pierre Neidhardt
2021-05-11  8:46                                                                     ` Ludovic Courtès
2021-04-06 17:23                                                             ` Leo Famulari
2021-04-06 23:21                                                               ` Mark H Weaver
2021-04-06 11:19                                                       ` Mark H Weaver
2021-04-08 10:13                                                         ` Ludovic Courtès
2021-04-13 20:06                                                           ` Mark H Weaver
2021-04-14 10:55                                                             ` Ludovic Courtès
2021-04-14 22:37                                                               ` Leo Famulari
2021-04-15  7:26                                                               ` Mark H Weaver
2021-04-16  9:44                                                                 ` Pierre Neidhardt
2018-12-27 13:52           ` Danny Milosavljevic
2018-12-27 14:29             ` Mark H Weaver [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87pntnvy8e.fsf@netris.org \
    --to=mhw@netris.org \
    --cc=33848@debbugs.gnu.org \
    --cc=dannym@scratchpost.org \
    --cc=mail@ambrevar.xyz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).