Hi Ludovic,

Ludovic Courtès writes:

> Mark H Weaver skribis:
>
>> Ludovic Courtès writes:
>
> [...]
>
>>> So there are two things.  To fix the issue you reported (build output
>>> that goes through), I think we must simply turn off UTF-8 decoding from
>>> ‘process-stderr’ and leave that entirely to ‘build-event-output-port’.
>>
>> Can we assume that UTF-8 is the appropriate encoding for
>> (current-build-output-port)?  My interpretation of the Guix manual entry
>> for 'current-build-output-port' suggests that the answer should be "no".
>
> What goes to ‘current-build-output-port’ comes from builds processes.
> It’s usually UTF-8 but it can be anything, including binary garbage,
> which should be gracefully handled.
>
> That’s why ‘process-stderr’ currently uses ‘read-maybe-utf8-string’.

I agree that we should (permissively) interpret the build process output
as UTF-8, regardless of locale settings.

However, the encoding of 'current-build-output-port' is orthogonal, and
I see no reason to assume that it's UTF-8.

As 'process-stderr' is currently implemented, it makes no assumptions
about the encoding of 'current-build-output-port'.  That's because it
uses only textual I/O on it.  The end result is that the UTF-8 build
output is effectively converted into the port encoding of
'current-build-output-port', whatever it might be.  I think that's how
it should be, no?

>> Also, in your previous message you wrote:
>>
>>   The problem is the first layer of UTF-8 decoding that happens in
>>   ‘process-stderr’, in the ‘%stderr-next’ case.  We would need to
>>   disable it, but only if the build output port is
>>   ‘build-event-output-port’ (i.e., it’s capable of interpreting
>>   “multiplexed build output” correctly.)
>>
>> It sounds like you're suggesting that 'process-stderr' should look to
>> see if (current-build-output-port) is a 'build-event-output-port', and
>> in that case it should use binary I/O primitives to write raw binary
>> data to it, otherwise it should use text I/O primitives and write
>> characters to it.  Do I understand correctly?
>
> Yes.  (Actually, rather than guessing if (current-build-output-port) is
> a ‘build-event-output-port’, there could be a fluid to ask for the use
> of raw binary primitives.)
>
>> IMO, it would be cleaner to treat 'build-event-output-port' uniformly,
>> and specifically as a textual port of unknown encoding.
>
> (You mean ‘current-build-output-port’, right?)

Yes, indeed.

> I think you’re right.  I’m not yet entirely sure what the implications
> are.  There’s a couple of tests in tests/store.scm for UTF-8
> interpretation that describe behavior that I think we should preserve.

I certainly agree that we should preserve those tests.  I would go
further and add two more tests that bind 'current-build-output-port' to
a port with a non-UTF-8 encoding (e.g. UTF-16) and verify that the λ
gets converted correctly.  The test build process would output the λ as
UTF-8, but it should be written to 'current-build-output-port' as
e.g. UTF-16.

What do you think?

>> I would suggest changing 'build-event-output-port' to create an R6RS
>> custom *textual* output port, so that it wouldn't have to worry about
>> encodings at all, and it would only be given whole characters.
>> Internally, it would be doing exactly what you suggest above, but those
>> details would be encapsulated within the custom textual port.
>>
>> However, I don't think we can use Guile's current implementation of R6RS
>> custom textual output ports, which are currently built on Guile's legacy
>> soft ports, which I suspect have a similar bug with multibyte characters
>> sometimes being split (see 'soft_port_write' in vports.c).
>>
>> Having said all of this, my suggestions would ultimately entail having
>> two separate places along the stderr pipeline where 'utf8->string!'
>> would be used, and maybe that's too much until we have a more optimized
>> C implementation of it.
>
> Yeah it looks like we don’t yet have custom textual output ports that we
> could rely on, do we?
>
> I support your work to add that in Guile proper!

For now, I can offer a new implementation of custom textual output ports
built upon custom binary ports and the 'utf8->string!' that I previously
sent.  See attached.

      Thanks,
        Mark

--8<---------------cut here---------------start------------->8---
GNU Guile 2.2.4
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (load "utf8-decoder.scm")
scheme@(guile-user)> (load "guile-new-custom-textual-ports.scm")
scheme@(guile-user)> (define (my-write! str start count)
                       (pk 'my-write! (substring str start (+ start count)))
                       count)
scheme@(guile-user)> (define port (make-custom-textual-output-port "test1" my-write! #f #f #f))
scheme@(guile-user)> (display "Hello λ world!" port)
scheme@(guile-user)> (force-output port)

;;; (my-write! "Hello λ world!")
scheme@(guile-user)> (string->utf8 "λ")
$2 = #vu8(206 187)
scheme@(guile-user)> (string->utf8 "Hello λ world!")
$3 = #vu8(72 101 108 108 111 32 206 187 32 119 111 114 108 100 33)
scheme@(guile-user)> (put-bytevector port #vu8(72 101 108 108 111 32 206))
scheme@(guile-user)> (force-output port)

;;; (my-write! "Hello ")
scheme@(guile-user)> (put-bytevector port #vu8(187 32 119 111 114 108 100 33))
scheme@(guile-user)> (force-output port)

;;; (my-write! "λ world!")
scheme@(guile-user)>
--8<---------------cut here---------------end--------------->8---
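
P.S. To make the proposed extra tests a bit more concrete, here is a
minimal sketch of the conversion they would check: UTF-8 bytes from a
build process are decoded to characters and then written, using textual
I/O only, to a port with a different encoding.  The helper name
'utf8-bytes->port-encoding' is hypothetical and not part of Guix or
Guile.  The sketch also assumes plain 'utf8->string', which is strict,
rather than the permissive decoding discussed above, and it uses
UTF-16BE rather than UTF-16 so that, as far as I know, no byte-order
mark is written.

```scheme
(use-modules (rnrs bytevectors)
             (rnrs io ports))

;; Hypothetical helper, for illustration only.  Decode BV as UTF-8,
;; write the resulting characters to a port whose encoding is ENCODING
;; using textual I/O, and return the bytes that reached the port.
(define (utf8-bytes->port-encoding bv encoding)
  (call-with-values open-bytevector-output-port
    (lambda (port get-bytes)
      ;; Stand-in for a 'current-build-output-port' with a non-UTF-8
      ;; encoding.
      (set-port-encoding! port encoding)
      ;; Strict decoding; the real code path would be permissive.
      (display (utf8->string bv) port)
      (force-output port)
      (get-bytes))))

;; λ is U+03BB: two bytes (206 187) in UTF-8, and the single
;; code unit #x03BB in UTF-16BE.
(utf8-bytes->port-encoding (string->utf8 "λ") "UTF-16BE")
```

A test along these lines would compare the returned bytevector with the
expected encoding of λ in the port's encoding.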