all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Danny Milosavljevic <dannym@scratchpost.org>
To: Mark H Weaver <mhw@netris.org>
Cc: Pierre Neidhardt <mail@ambrevar.xyz>, 33848@debbugs.gnu.org
Subject: bug#33848: Store references in SBCL-compiled code are "invisible"
Date: Thu, 27 Dec 2018 14:52:58 +0100	[thread overview]
Message-ID: <20181227145258.0c420eac@scratchpost.org> (raw)
In-Reply-To: <87tvj2yesd.fsf@netris.org>

[-- Attachment #1: Type: text/plain, Size: 4639 bytes --]

Hi Mark,

On Mon, 24 Dec 2018 13:12:23 -0500
Mark H Weaver <mhw@netris.org> wrote:

> Of course, the usual reason to choose UTF-32 is to support non-ASCII
> characters while retaining fixed-width code points, so that string
> lookups are straightforward and efficient.

This kind of lookup is almost never what is necessary.  There are many
people who assume character is the same as codepoint and to those people
UTF-32 brings something to the table, but it's really not useful if people
do text processing correctly, see below.

(Of course whether packages actually do this remains to be seen)

>  Using UTF-8 improves space efficiency, but at the cost of extra code
>complexity.

I agree.

>  That extra
> complexity is what I guess we would need to add to each program that
> currently uses UTF-32.

Yes, but they usually have to do stream processing even with UTF-32 (because
a character can be composed of possibly infinite number of codepoints),
so the infrastructure should be already there and the effort should be
minimal.

>  Alternatively, we could extend the on-disk
> format to support UTF-8 and then add some kind of "load hook" that
> converts the string to UTF-32 at load time.  Either way, it's likely to
> be a can of worms.

If it ever came to that, a pluggable reference scanner would be 
preferrable.  But really, it would irk me to have so much complexity
in something so basic (the reference scanner) for no end-user gain
(as a distribution we could just mandate UTF-8 for references and the
problem would be gone for the user with no loss of functionality).

It's always easy to add special cases - but more code means more bugs
and I think if possible it's best to have only the simple case implemented
in the core - because it's less complicated which means more likely
to be correct (for the case it does handle).  In the end it depends on
what would be more code, and more widely used.

Also, if we wanted to debug reference errors, we couldn't use grep anymore
because it can't handle utf-32 either (neither can any of the other UNIX tools).

Also, I really don't want to return to the time where I had to call iconv
once every three commands to be able to do anything useful on UNIX.

Also, the build daemon is written in C++ and C++ strings are widely
known to have very very bad codepoint awareness (to say nothing about
the horrible conversion facilities).

Also, if both UTF-32 and UTF-8 are used on disk, care needs to not misdetect
an UTF-8 sequence as an UTF-32 sequence of different text - or the other way
around -, but that's unlikely for ASCII strings.

> I really think it would be a mistake to try to force every program and
> language implementation to use our preferred string representation.  I
> suspect it would be vastly easier to compromise and support a few other
> popular string representations in Guix, namely UTF-16 and UTF-32.

In 1992, UTF-8 was invented.  Subsequently, most of the Internet,
all new GNU Linux distributions etc, all UNIX GUI frameworks, Subversion
etc standardized on UTF-8, with the eventual goal of standardizing all
network transfer and storage to UTF-8.  I think that by now the outliers
are the ones who need to change, otherwise these senseless encoding
conversions will never cease.  It's not like different encodings allow for
better expression of writings or anything useful to the end user.

As a distribution we can't force upstream to change, but just filing
bug reports upstream would make us see where they stand on this.

> If you don't want to change the daemon, it could be worked around in our
> build-side code as follows: we could add a new phase to certain build
> systems (or possibly gnu-build-system) that scans each output for
> UTF-16/32 encoded store references that are never referenced in UTF-8.
> If such references exist, a file with an unobtrusive name would be added
> to that output containing those references encoded in UTF-8.  This would
> enable our daemon's existing reference scanner to find all of the
> references.

I agree that that would be nice.  As a first step, even just detecting
problems like that and erroring out would be okay - in order to find them
in the first place.  Right now, it's difficult to detect and so also difficult
to say how wide-spread the problem is.  If the problem is wide-spread enough
my tune could change very quickly.

What you propose is similar to what I did in Java in Guix, only it gives
us even more advantages in the Java case (faster class loading and
eventual non-propagated inputs).

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  parent reply	other threads:[~2018-12-27 13:54 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-23 14:19 bug#33848: Store references in SBCL-compiled code are "invisible" Ludovic Courtès
2018-12-23 15:05 ` Pierre Neidhardt
2018-12-24 14:57   ` Ludovic Courtès
2018-12-23 16:45 ` Mark H Weaver
2018-12-23 17:32   ` Ludovic Courtès
2018-12-23 22:01     ` Pierre Neidhardt
2018-12-24 15:06       ` Ludovic Courtès
2018-12-24 17:08         ` Pierre Neidhardt
2018-12-26 16:07           ` Ludovic Courtès
2018-12-24 18:12         ` Mark H Weaver
2018-12-24 23:58           ` Pierre Neidhardt
2018-12-26 16:14           ` Ludovic Courtès
2018-12-27 10:37             ` Pierre Neidhardt
2018-12-27 14:03               ` Mark H Weaver
2018-12-27 14:45                 ` Ludovic Courtès
2018-12-27 15:02                   ` Pierre Neidhardt
2018-12-27 16:15                     ` Pierre Neidhardt
2018-12-27 17:03                     ` Ludovic Courtès
2018-12-27 18:57                       ` Pierre Neidhardt
2018-12-27 21:54                         ` Ludovic Courtès
2018-12-27 22:05                           ` Pierre Neidhardt
2018-12-27 22:59                             ` Ludovic Courtès
2018-12-28  7:47                               ` Pierre Neidhardt
2021-03-30 10:15                                 ` Pierre Neidhardt
2021-03-30 20:09                                   ` Ludovic Courtès
2021-03-31  7:10                                     ` Pierre Neidhardt
2021-03-31 16:12                                     ` Pierre Neidhardt
2021-03-31 20:42                                       ` Ludovic Courtès
2021-03-31 20:57                                         ` Pierre Neidhardt
2021-04-01 17:23                                         ` Mark H Weaver
2021-04-02 15:13                                           ` Ludovic Courtès
2021-04-01  6:03                                       ` Mark H Weaver
2021-04-01  7:13                                         ` Pierre Neidhardt
2021-04-01  7:57                                         ` Ludovic Courtès
2021-04-01  8:48                                           ` Pierre Neidhardt
2021-04-01  9:07                                           ` Guillaume Le Vaillant
2021-04-01  9:13                                             ` Pierre Neidhardt
2021-04-01  9:52                                               ` Guillaume Le Vaillant
2021-04-01 10:06                                                 ` Pierre Neidhardt
2021-04-01 10:07                                                 ` Pierre Neidhardt
2021-04-01 15:24                                                   ` Ludovic Courtès
2021-04-01 17:33                                                   ` Mark H Weaver
2021-04-02 22:46                                                     ` Mark H Weaver
2021-04-03  6:51                                                       ` Pierre Neidhardt
2021-04-03 20:10                                                         ` Mark H Weaver
2021-04-05 19:45                                                           ` Ludovic Courtès
2021-04-05 23:04                                                             ` Mark H Weaver
2021-04-06  8:16                                                               ` Ludovic Courtès
2021-04-06  8:23                                                                 ` Pierre Neidhardt
2021-04-30 20:03                                                                   ` Mark H Weaver
2021-05-01  9:22                                                                     ` Pierre Neidhardt
2021-05-11  8:46                                                                     ` Ludovic Courtès
2021-04-06 17:23                                                             ` Leo Famulari
2021-04-06 23:21                                                               ` Mark H Weaver
2021-04-06 11:19                                                       ` Mark H Weaver
2021-04-08 10:13                                                         ` Ludovic Courtès
2021-04-13 20:06                                                           ` Mark H Weaver
2021-04-14 10:55                                                             ` Ludovic Courtès
2021-04-14 22:37                                                               ` Leo Famulari
2021-04-15  7:26                                                               ` Mark H Weaver
2021-04-16  9:44                                                                 ` Pierre Neidhardt
2018-12-27 13:52           ` Danny Milosavljevic [this message]
2018-12-27 14:29             ` Mark H Weaver

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181227145258.0c420eac@scratchpost.org \
    --to=dannym@scratchpost.org \
    --cc=33848@debbugs.gnu.org \
    --cc=mail@ambrevar.xyz \
    --cc=mhw@netris.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.