From: Mark H Weaver
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#26058: utf16->string and utf32->string don't conform to R6RS
Date: Mon, 15 Oct 2018 00:57:41 -0400
Message-ID: <878t2zst6i.fsf@netris.org>
References: <87o9x83t0f.fsf@gmail.com> <87shmhqqgd.fsf@pobox.com> <87h92xyrmr.fsf@gmail.com> <87bmt4rht1.fsf@pobox.com> <87d1djzysb.fsf@gmail.com> <877f3r7ti2.fsf@pobox.com> <87r31xdnih.fsf@gmail.com>
In-Reply-To: <87r31xdnih.fsf@gmail.com>
To: taylanbayirli@gmail.com (Taylan Ulrich "Bayırlı/Kammer")
Cc: 26058@debbugs.gnu.org

Hi Taylan,

taylanbayirli@gmail.com (Taylan Ulrich "Bayırlı/Kammer") writes:

> Andy Wingo writes:
>
>> Adopting the behavior is more or less fine. If it can be done while
>> relying on the existing behavior, that is better than something ad-hoc
>> in a module.

In general, I agree with Andy's sentiment that it would be better to
avoid redundant BOM handling code, and moreover, I appreciate his
reluctance to apply a fix without careful consideration of our existing
BOM semantics.

However, as Taylan discovered, Guile does not provide a mechanism to
specify a default endianness of a UTF-16 or UTF-32 port in case a BOM is
not found. I see no straightforward way to implement these R6RS
interfaces using ports.

We could certainly add such a mechanism if needed, but I see another
problem with this approach: the expense of creating and later collecting
a bytevector port object would be a very heavy burden to place on these
otherwise fairly lightweight operations. Therefore, I would prefer to
avoid that implementation strategy for these operations.

Although BOM handling for ports is quite complex with many subtle points
to consider, detecting a BOM at the beginning of a bytevector is so
trivial that I personally have no objection to this tiny duplication of
logic.

Therefore, my preference would be to adopt code similar to that proposed
by Taylan, although I believe it can, and should, be further simplified:

> diff --git a/module/rnrs/bytevectors.scm b/module/rnrs/bytevectors.scm
> index 9744359f0..997a8c9cb 100644
> --- a/module/rnrs/bytevectors.scm
> +++ b/module/rnrs/bytevectors.scm
> @@ -69,7 +69,9 @@
>             bytevector-ieee-double-native-set!
>  
>             string->utf8 string->utf16 string->utf32
> -           utf8->string utf16->string utf32->string))
> +           utf8->string
> +           (r6rs-utf16->string . utf16->string)
> +           (r6rs-utf32->string . utf32->string)))
>  
>  
>  (load-extension (string-append "libguile-" (effective-version))
> @@ -80,4 +82,52 @@
>          `(quote ,sym)
>          (error "unsupported endianness" sym)))
>  
> +(define (read-bom16 bv)
> +  (let ((c0 (bytevector-u8-ref bv 0))
> +        (c1 (bytevector-u8-ref bv 1)))
> +    (cond
> +     ((and (= c0 #xFE) (= c1 #xFF))
> +      'big)
> +     ((and (= c0 #xFF) (= c1 #xFE))
> +      'little)
> +     (else
> +      #f))))

We should gracefully handle the case of an empty bytevector, returning
an empty string without error in that case.

Also, we should use a single 'bytevector-u16-ref' operation to check for
the BOM. Pick an arbitrary endianness for the operation (big-endian?),
and compare the resulting integer with both #xFEFF and #xFFFE. That
way, the code will be simpler and more efficient. Note that our VM has
dedicated instructions for these multi-byte bytevector accessors, and
there will be fewer comparison operations as well. Similarly for the
utf32 case.

What do you think?

> +(define r6rs-utf16->string
> +  (case-lambda
> +    ((bv default-endianness)
> +     (let ((bom-endianness (read-bom16 bv)))
> +       (if (not bom-endianness)
> +           (utf16->string bv default-endianness)
> +           (substring/shared (utf16->string bv bom-endianness) 1))))

Better to use plain 'substring' here, I think. The machinery of shared
substrings is more expensive, and unnecessary in this case.

Otherwise, it looks good to me. Would you like to propose a revised
patch?

Andy, what do you think?

      Mark
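
For illustration, the single-read BOM check suggested in the message
might look roughly like the sketch below. It is not taken from the
proposed patch or from Guile itself: the helper names
'bom16-endianness' and 'bom32-endianness' are hypothetical, and the
length guard is one assumed way of covering the empty-bytevector case.

```scheme
;; Illustrative sketch only; the helper names below are hypothetical
;; and do not appear in the patch under discussion.
(use-modules (rnrs bytevectors))

;; Return 'big or 'little if BV begins with a UTF-16 BOM, else #f.
;; A bytevector shorter than two bytes (including an empty one)
;; yields #f instead of signalling an out-of-range error.
(define (bom16-endianness bv)
  (and (>= (bytevector-length bv) 2)
       ;; Read the first two bytes once, as a big-endian integer.
       (let ((n (bytevector-u16-ref bv 0 (endianness big))))
         (cond ((= n #xFEFF) 'big)      ; bytes FE FF
               ((= n #xFFFE) 'little)   ; bytes FF FE
               (else #f)))))

;; Same idea for UTF-32: one four-byte big-endian read.
(define (bom32-endianness bv)
  (and (>= (bytevector-length bv) 4)
       (let ((n (bytevector-u32-ref bv 0 (endianness big))))
         (cond ((= n #x0000FEFF) 'big)      ; bytes 00 00 FE FF
               ((= n #xFFFE0000) 'little)   ; bytes FF FE 00 00
               (else #f)))))
```

With helpers of this shape, a caller would decode with the detected
endianness and drop the leading BOM character, or fall back to the
caller-supplied default endianness when the result is #f. An empty
bytevector takes the fallback path and should simply decode to an
empty string.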