From: Mark H Weaver <mhw@netris.org>
To: taylanbayirli@gmail.com (Taylan Ulrich "Bayırlı/Kammer")
Cc: 26058@debbugs.gnu.org
Subject: bug#26058: utf16->string and utf32->string don't conform to R6RS
Date: Mon, 15 Oct 2018 00:57:41 -0400 [thread overview]
Message-ID: <878t2zst6i.fsf@netris.org> (raw)
In-Reply-To: <87r31xdnih.fsf@gmail.com> ("Taylan Ulrich \=\?utf-8\?Q\?\=5C\=22Ba\?\= \=\?utf-8\?Q\?y\=C4\=B1rl\=C4\=B1\=2FKammer\=5C\=22\=22's\?\= message of "Thu, 16 Mar 2017 20:34:14 +0100")
Hi Taylan,
taylanbayirli@gmail.com (Taylan Ulrich "Bayırlı/Kammer") writes:
> Andy Wingo <wingo@pobox.com> writes:
>
>> Adopting the behavior is more or less fine. If it can be done while
>> relying on the existing behavior, that is better than something ad-hoc
>> in a module.
In general, I agree with Andy's sentiment that it would be better to
avoid redundant BOM handling code, and moreover, I appreciate his
reluctance to apply a fix without careful consideration of our existing
BOM semantics.
However, as Taylan discovered, Guile does not provide a mechanism to
specify a default endianness of a UTF-16 or UTF-32 port in case a BOM is
not found. I see no straightforward way to implement these R6RS
interfaces using ports.
We could certainly add such a mechanism if needed, but I see another
problem with this approach: the expense of creating and later collecting
a bytevector port object would be a very heavy burden to place on these
otherwise fairly lightweight operations. Therefore, I would prefer to
avoid that implementation strategy for these operations.
Although BOM handling for ports is quite complex with many subtle points
to consider, detecting a BOM at the beginning of a bytevector is so
trivial that I personally have no objection to this tiny duplication of
logic.
Therefore, my preference would be to adopt code similar to that proposed
by Taylan, although I believe it can, and should, be further simplified:
> diff --git a/module/rnrs/bytevectors.scm b/module/rnrs/bytevectors.scm
> index 9744359f0..997a8c9cb 100644
> --- a/module/rnrs/bytevectors.scm
> +++ b/module/rnrs/bytevectors.scm
> @@ -69,7 +69,9 @@
> bytevector-ieee-double-native-set!
>
> string->utf8 string->utf16 string->utf32
> - utf8->string utf16->string utf32->string))
> + utf8->string
> + (r6rs-utf16->string . utf16->string)
> + (r6rs-utf32->string . utf32->string)))
>
>
> (load-extension (string-append "libguile-" (effective-version))
> @@ -80,4 +82,52 @@
> `(quote ,sym)
> (error "unsupported endianness" sym)))
>
> +(define (read-bom16 bv)
> + (let ((c0 (bytevector-u8-ref bv 0))
> + (c1 (bytevector-u8-ref bv 1)))
> + (cond
> + ((and (= c0 #xFE) (= c1 #xFF))
> + 'big)
> + ((and (= c0 #xFF) (= c1 #xFE))
> + 'little)
> + (else
> + #f))))
We should gracefully handle the case of an empty bytevector, returning
an empty string without error in that case.
Also, we should use a single 'bytevector-u16-ref' operation to check for
the BOM. Pick an arbitrary endianness for the operation (big-endian?),
and compare the resulting integer with both #xFEFF and #xFFFE. That
way, the code will be simpler and more efficient. Note that our VM has
dedicated instructions for these multi-byte bytevector accessors, and
there will be fewer comparison operations as well. Similarly for the
utf32 case.
What do you think?
> +(define r6rs-utf16->string
> + (case-lambda
> + ((bv default-endianness)
> + (let ((bom-endianness (read-bom16 bv)))
> + (if (not bom-endianness)
> + (utf16->string bv default-endianness)
> + (substring/shared (utf16->string bv bom-endianness) 1))))
Better to use plain 'substring' here, I think. The machinery of shared
substrings is more expensive, and unnecessary in this case.
Otherwise, it looks good to me. Would you like to propose a revised
patch?
Andy, what do you think?
Mark
prev parent reply other threads:[~2018-10-15 4:57 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-11 12:19 bug#26058: utf16->string and utf32->string don't conform to R6RS "Taylan Ulrich Bayırlı/Kammer"
2017-03-13 13:03 ` Andy Wingo
2017-03-13 18:10 ` Taylan Ulrich Bayırlı/Kammer
2017-03-13 21:24 ` Andy Wingo
2017-03-14 15:03 ` Taylan Ulrich Bayırlı/Kammer
2017-03-14 15:44 ` Andy Wingo
2017-03-16 19:34 ` Taylan Ulrich Bayırlı/Kammer
2018-10-15 4:57 ` Mark H Weaver [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=878t2zst6i.fsf@netris.org \
--to=mhw@netris.org \
--cc=26058@debbugs.gnu.org \
--cc=taylanbayirli@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).