From: Sarah Morgensen <iskarian@mgsn.dev>
To: Mark H Weaver <mhw@netris.org>
Cc: 35350@debbugs.gnu.org
Subject: bug#35350: Some compile output still leaks through with --verbosity=1
Date: Sun, 19 Sep 2021 22:44:55 -0700 [thread overview]
Message-ID: <864kafhkbs.fsf@mgsn.dev> (raw)
In-Reply-To: <87tveauk2u.fsf@netris.org> (Mark H. Weaver's message of "Sat, 04 May 2019 14:53:50 -0400 (2 years, 19 weeks, 5 days ago)")
Hello,
I encountered this issue today. This looks like a pretty complete
solution ready to go. Did this ever make it into Guile/Guix?
(Ironically I was also reaching for a "make-custom-textual-output-port"
the other day!)
--
Sarah
Mark H Weaver <mhw@netris.org> writes:
> Hi Ludovic,
>
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Mark H Weaver <mhw@netris.org> skribis:
>>
>>> Ludovic Courtès <ludo@gnu.org> writes:
>>
>> [...]
>>
>>>> So there are two things. To fix the issue you reported (build output
>>>> that goes through), I think we must simply turn off UTF-8 decoding from
>>>> ‘process-stderr’ and leave that entirely to ‘build-event-output-port’.
>>>
>>> Can we assume that UTF-8 is the appropriate encoding for
>>> (current-build-output-port)? My interpretation of the Guix manual entry
>>> for 'current-build-output-port' suggests that the answer should be "no".
>>
>> What goes to ‘current-build-output-port’ comes from builds processes.
>> It’s usually UTF-8 but it can be anything, including binary garbage,
>> which should be gracefully handled.
>>
>> That’s why ‘process-stderr’ currently uses ‘read-maybe-utf8-string’.
>
> I agree that we should (permissively) interpret the build process output
> as UTF-8, regardless of locale settings. However, the encoding of
> 'current-build-output-port' is orthogonal, and I see no reason to assume
> that it's UTF-8.
>
> As 'process-stderr' is currently implemented, it makes no assumptions
> about the encoding of 'current-build-output-port'. That's because it
> uses only textual I/O on it. The end result is that the UTF-8 build
> output is effectively converted into the port encoding of
> 'current-build-output-port', whatever it might be. I think that's how
> it should be, no?
>
>>> Also, in your previous message you wrote:
>>>
>>> The problem is the first layer of UTF-8 decoding that happens in
>>> ‘process-stderr’, in the ‘%stderr-next’ case. We would need to
>>> disable it, but only if the build output port is
>>> ‘build-event-output-port’ (i.e., it’s capable of interpreting
>>> “multiplexed build output” correctly.)
>>>
>>> It sounds like you're suggesting that 'process-stderr' should look to
>>> see if (current-build-output-port) is a 'build-event-output-port', and
>>> in that case it should use binary I/O primitives to write raw binary
>>> data to it, otherwise it should use text I/O primitives and write
>>> characters to it. Do I understand correctly?
>>
>> Yes. (Actually, rather than guessing if (current-build-output-port) is
>> a ‘build-event-output-port’, there could be a fluid to ask for the use
>> of raw binary primitives.)
>>
>>> IMO, it would be cleaner to treat 'build-event-output-port' uniformly,
>>> and specifically as a textual port of unknown encoding.
>>
>> (You mean ‘current-build-output-port’, right?)
>
> Yes, indeed.
>
>> I think you’re right. I’m not yet entirely sure what the implications
>> are. There’s a couple of tests in tests/store.scm for UTF-8
>> interpretation that describe behavior that I think we should preserve.
>
> I certainly agree that we should preserve those tests. I would go
> further and add two more tests that bind 'current-build-output-port' to
> a port with a non-UTF-8 encoding (e.g. UTF-16) and verify that the λ
> gets converted correctly. The test build process would output the λ as
> UTF-8, but it should be written to 'current-build-output-port' as
> e.g. UTF-16.
>
> What do you think?
>
>>> I would suggest changing 'build-event-output-port' to create an R6RS
>>> custom *textual* output port, so that it wouldn't have to worry about
>>> encodings at all, and it would only be given whole characters.
>>> Internally, it would be doing exactly what you suggest above, but those
>>> details would be encapsulated within the custom textual port.
>>>
>>> However, I don't think we can use Guile's current implementation of R6RS
>>> custom textual output ports, which are currently built on Guile's legacy
>>> soft ports, which I suspect have a similar bug with multibyte characters
>>> sometimes being split (see 'soft_port_write' in vports.c).
>>>
>>> Having said all of this, my suggestions would ultimately entail having
>>> two separate places along the stderr pipeline where 'utf8->string!'
>>> would be used, and maybe that's too much until we have a more optimized
>>> C implementation of it.
>>
>> Yeah it looks like we don’t yet have custom textual output ports that we
>> could rely on, do we?
>>
>> I support your work to add that in Guile proper!
>
> For now, I can offer a new implementation of custom textual output ports
> built upon custom binary ports and the 'utf8->string!' that I previously
> sent. See attached.
>
> Thanks,
> Mark
>
> GNU Guile 2.2.4
> Copyright (C) 1995-2017 Free Software Foundation, Inc.
>
> Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
> This program is free software, and you are welcome to redistribute it
> under certain conditions; type `,show c' for details.
>
> Enter `,help' for help.
> scheme@(guile-user)> (load "utf8-decoder.scm")
> scheme@(guile-user)> (load "guile-new-custom-textual-ports.scm")
> scheme@(guile-user)> (define (my-write! str start count)
> (pk 'my-write! (substring str start (+ start count)))
> count)
> scheme@(guile-user)> (define port (make-custom-textual-output-port "test1" my-write! #f #f #f))
> scheme@(guile-user)> (display "Hello λ world!" port)
> scheme@(guile-user)> (force-output port)
>
> ;;; (my-write! "Hello λ world!")
> scheme@(guile-user)> (string->utf8 "λ")
> $2 = #vu8(206 187)
> scheme@(guile-user)> (string->utf8 "Hello λ world!")
> $3 = #vu8(72 101 108 108 111 32 206 187 32 119 111 114 108 100 33)
> scheme@(guile-user)> (put-bytevector port #vu8(72 101 108 108 111 32 206))
> scheme@(guile-user)> (force-output port)
>
> ;;; (my-write! "Hello ")
> scheme@(guile-user)> (put-bytevector port #vu8(187 32 119 111 114 108 100 33))
> scheme@(guile-user)> (force-output port)
>
> ;;; (my-write! "λ world!")
> scheme@(guile-user)>
>
> ;;; Copyright © 2019 Mark H Weaver <mhw@netris.org>
> ;;;
> ;;; This program is free software: you can redistribute it and/or modify
> ;;; it under the terms of the GNU General Public License as published by
> ;;; the Free Software Foundation, either version 3 of the License, or
> ;;; (at your option) any later version.
> ;;;
> ;;; This program is distributed in the hope that it will be useful,
> ;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
> ;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> ;;; GNU General Public License for more details.
> ;;;
> ;;; You should have received a copy of the GNU General Public License
> ;;; along with this program. If not, see <http://www.gnu.org/licenses/>.
>
> (use-modules (rnrs io ports))
>
> (define (make-custom-textual-output-port id
> write!
> get-position
> set-position!
> close)
> (let (;; Allocate a per-port string buffer which will be used as a
> ;; temporary buffer for decoding, to avoid heap allocation
> ;; during normal operation.
> (buffer (make-string 4096))
> ;; 'state' is the UTF-8 decoder state, which represents a
> ;; proper prefix of a well-formed UTF-8 byte sequence. These
> ;; are bytes that 'binary-write!' has accepted and reported as
> ;; having been written, although we are not able to decode
> ;; them into a character to pass to (textual) 'write!' until
> ;; more bytes arrive.
> (state 0))
> (define (binary-write! bv start count)
> (call-with-values (lambda ()
> ;; XXX FIXME: Consider performing this
> ;; decoding strictly.
> (utf8->string! state bv start (+ start count)
> buffer 0 (string-length buffer)))
> (lambda (new-state bv-pos char-count)
> (let* (;; Avoid calling write! with (char-count = 0) unless
> ;; (count = 0) was passed to us, because calling
> ;; 'write!' with count=0 has a special meaning: it
> ;; means to pass an EOF object to the byte/character
> ;; sink.
> (chars-accepted (if (and (zero? char-count)
> (not (zero? count)))
> 0
> (write! buffer 0 char-count)))
> ;; Compute 'bytes-accepted' in such a way that the
> ;; bytes from STATE are not included, because they
> ;; were passed to us in previous calls, and are not
> ;; part of the bytevector range that we are now being
> ;; asked to write. However, it's important to note
> ;; that if 'write!' did not accept the bytes from
> ;; STATE, 'bytes-accepted' will be negative. We must
> ;; handle that case specially below.
> (bytes-accepted (- count (string-utf8-length
> (substring buffer
> chars-accepted
> char-count)))))
> ;; If 'bytes-accepted' is negative, that means the bytes
> ;; from STATE were not written. This can only happen if
> ;; 'chars-accepted' is 0, because 'write!' can only accept
> ;; whole code points, and the bytes from STATE are part of
> ;; at most a single code point. In this case, we must
> ;; leave STATE unchanged and return 0.
> (if (negative? bytes-accepted)
> 0
> (begin
> (set! state new-state)
> bytes-accepted))))))
> (define (binary-close)
> (set! buffer #f)
> (when close (close)))
> (define port
> (make-custom-binary-output-port id
> binary-write!
> get-position
> set-position!
> binary-close))
> ;; Always use UTF-8 as the encoding for custom textual ports, as
> ;; an internal implementation detail, to ensure that all Unicode
> ;; characters will pass through regardless of the current locale.
> (set-port-encoding! port "UTF-8")
> port))
prev parent reply other threads:[~2021-09-20 5:46 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-20 23:53 bug#35350: Some compile output still leaks through with --verbosity=1 Mark H Weaver
2019-04-21 20:15 ` Ludovic Courtès
2019-04-22 23:52 ` Mark H Weaver
2019-04-23 8:45 ` Mark H Weaver
2019-04-23 10:12 ` Ludovic Courtès
2019-04-26 19:09 ` Mark H Weaver
2019-04-27 0:45 ` Mark H Weaver
2019-04-27 7:56 ` Mark H Weaver
2019-04-27 16:36 ` Ludovic Courtès
2019-04-30 20:26 ` Mark H Weaver
2019-05-04 9:33 ` Ludovic Courtès
2019-05-04 18:53 ` Mark H Weaver
2021-09-20 5:44 ` Sarah Morgensen [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=864kafhkbs.fsf@mgsn.dev \
--to=iskarian@mgsn.dev \
--cc=35350@debbugs.gnu.org \
--cc=mhw@netris.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/guix.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.