From: Mark H Weaver
Subject: bug#33848: Store references in SBCL-compiled code are "invisible"
Date: Thu, 27 Dec 2018 09:29:42 -0500
Message-ID: <87pntnvy8e.fsf@netris.org>
In-Reply-To: <20181227145258.0c420eac@scratchpost.org> (Danny Milosavljevic's message of "Thu, 27 Dec 2018 14:52:58 +0100")
To: Danny Milosavljevic
Cc: Pierre Neidhardt, 33848@debbugs.gnu.org

Hi Danny,

Danny Milosavljevic writes:

> On Mon, 24 Dec 2018 13:12:23 -0500
> Mark H Weaver wrote:
>
>> Of course, the usual reason to choose UTF-32 is to support non-ASCII
>> characters while retaining fixed-width code points, so that string
>> lookups are straightforward and efficient.
>
> This kind of lookup is almost never what is necessary.
> There are many
> people who assume character is the same as codepoint and to those people
> UTF-32 brings something to the table, but it's really not useful if people
> do text processing correctly, see below.
>
> (Of course whether packages actually do this remains to be seen)
>
>> That extra
>> complexity is what I guess we would need to add to each program that
>> currently uses UTF-32.
>
> Yes, but they usually have to do stream processing even with UTF-32 (because
> a character can be composed of a possibly infinite number of codepoints),

I agree with you.  However, as silly as it might be, the fact remains
that almost every modern programming language and string library uses
code points as the base units by which to index strings.

> so the infrastructure should be already there and the effort should be
> minimal.

The infrastructure might or might not be there, depending on the
sophistication of the program's Unicode support, but even if it _is_
there, it will most likely be a layer that expects to iterate over
strings indexed by code point to find graphemes, etc.

Anyway, if you truly believe the effort should be minimal, feel free to
investigate and propose patches to fix our five Common Lisp compilers
and Fish to avoid storing UTF-32 in the object code.

> Also, if both UTF-32 and UTF-8 are used on disk, care is needed not to
> misdetect a UTF-8 sequence as a UTF-32 sequence of different text - or the
> other way around -, but that's unlikely for ASCII strings.

This is not an issue because the substrings that the reference scanner
and grafter are looking for are ASCII-only, even if they are part of a
larger non-ASCII string.  Specifically, they only need to look for the
Nix hashes.

>> I really think it would be a mistake to try to force every program and
>> language implementation to use our preferred string representation.  I
>> suspect it would be vastly easier to compromise and support a few other
>> popular string representations in Guix, namely UTF-16 and UTF-32.
>
> In 1992, UTF-8 was invented.  Subsequently, most of the Internet,
> all new GNU/Linux distributions etc, all UNIX GUI frameworks, Subversion
> etc standardized on UTF-8, with the eventual goal of standardizing all
> network transfer and storage to UTF-8.  I think that by now the outliers
> are the ones who need to change,

I agree that we need to standardize on Unicode.  However, given the
perhaps unfortunate fact that almost everyone has standardized on code
points as the units by which to index strings, choosing UTF-32 as an
internal representation is a very reasonable choice, IMO.

Anyway, feel free to engage with the developers of the Common Lisp
implementations that use UTF-32 and try to convince them to change.

The remaining question is: what to do if upstream refuses to change?
Do we exclude that software from Guix, or do we maintain our own patches
to override upstream's decision?

>> If you don't want to change the daemon, it could be worked around in our
>> build-side code as follows: we could add a new phase to certain build
>> systems (or possibly gnu-build-system) that scans each output for
>> UTF-16/32 encoded store references that are never referenced in UTF-8.
>> If such references exist, a file with an unobtrusive name would be added
>> to that output containing those references encoded in UTF-8.  This would
>> enable our daemon's existing reference scanner to find all of the
>> references.
>
> I agree that that would be nice.  As a first step, even just detecting
> problems like that and erroring out would be okay - in order to find them
> in the first place.  Right now, it's difficult to detect and so also difficult
> to say how widespread the problem is.  If the problem is widespread enough
> my tune could change very quickly.

Sure, it would be useful to have more data on what packages are
currently affected by this issue.

Regards,
Mark
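
For illustration only, the detection step of the build-side workaround
discussed above could be sketched roughly as follows.  This is Python
rather than Guix build-side code, and it is a sketch under stated
assumptions, not a proposed implementation: the names
hidden_references and utf8_manifest are hypothetical, and the caller
is assumed to already know the candidate hash strings to look for.
Because the hashes are ASCII-only, each one has exactly one byte
sequence per encoding and byte order, so a plain substring search is
enough.

```python
# Wide encodings to check.  The explicit -le/-be codec names emit no
# byte-order mark, so the result is just the widened characters.
WIDE_ENCODINGS = ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def hidden_references(data, candidate_hashes):
    """Return the candidate hashes that occur in DATA only in a wide
    encoding, i.e. never as plain ASCII/UTF-8 bytes.

    DATA is the raw content of one output file (bytes).
    CANDIDATE_HASHES is an iterable of ASCII hash strings (assumption:
    supplied by the caller, e.g. taken from the build inputs).
    """
    hidden = set()
    for h in candidate_hashes:
        if h.encode("ascii") in data:
            # Already visible to a scanner that looks for ASCII bytes.
            continue
        if any(h.encode(enc) in data for enc in WIDE_ENCODINGS):
            hidden.add(h)
    return hidden


def utf8_manifest(hidden):
    """Build the content of the extra file: one hash per line, UTF-8,
    sorted so that the result is deterministic."""
    return "".join(h + "\n" for h in sorted(hidden)).encode("utf-8")
```

A build phase along these lines would run hidden_references over every
file of an output and, if the combined result is non-empty, either
write utf8_manifest(...) to an extra file in that output or, as a first
step, simply report the affected files and fail the build.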