From: Mark H Weaver
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#18520: string ports should not have an encoding
Date: Wed, 24 Sep 2014 01:30:59 -0400
Message-ID: <87oau5h9f0.fsf@yeeloong.lan>
References: <87iokgmttc.fsf@fencepost.gnu.org>
In-Reply-To: <87iokgmttc.fsf@fencepost.gnu.org> (David Kastrup's message of "Mon, 22 Sep 2014 01:34:39 +0200")
To: David Kastrup
Cc: 18520@debbugs.gnu.org

David Kastrup writes:

> In Guile 2.0, at the time a string port is opened, the value of the
> fluid %default-port-encoding is used for deciding how to encode the
> string into a byte stream, [...]

I agree that this was a mistake.  The issue is fixed on the master
branch.

> Ports fundamentally deliver characters, and so reading and writing from
> a string source/sink should not involve _any_ coding system.

David, you know as well as I do that internally, there is always a
coding system.  Strings have a coding system too, even if it's UCS-4.
Emacs uses something based on UTF-8, and I'd like Guile to do something
similar in the future.

I guess you don't like the fact that it is possible to expose the
internal representation via 'set-port-encoding!', 'ftell' or 'seek'.
I don't see this as a problem, and arguably it's a benefit.

First I'll address the non-standard 'set-port-encoding!'.  As you say,
it doesn't even make sense on string ports, and arguably should be an
error.  So why do you care if some internal details leak out when you
do this nonsensical thing?  Admittedly, we're missing an opportunity to
report a possible bug to the user, but that's the only problem I see
here.

Regarding 'ftell' and 'seek', it's not entirely clear to me what the
best representation of those positions is.  In some situations, I guess
it would be convenient for them to count Unicode code points or string
indices.  In other situations, I could imagine it being more convenient
for them to count grapheme clusters or UTF-8 bytes.

R6RS, the only Scheme standard that supports getting or setting file
positions, gives us complete freedom to choose our representation of
positions on textual ports.  The R6RS is explicit that they don't even
have to be integers, and if they are, they don't have to correspond to
bytes or characters.

For better or for worse, Guile's ports are fundamentally based on
bytes, and allow mixed binary and textual operations on all ports.
Sometimes this is very helpful, for example when implementing HTTP.

I can think of one other case where it's very helpful: I don't know how
deeply you've looked at UTF-8, but it has some unusual properties that
allow many (most?)
string algorithms to be most naturally (and efficiently) implemented by
operating on bytes rather than code points.  Much of the time, you
don't even have to be aware of the code point boundaries, which is a
great savings.  Efficient lookup tables based on bytes are also much
cheaper than ones based on code points, etc.

In fact, I intend to propose that in a future version of Guile, strings
will not only be based on UTF-8 internally, but that this fact should be
exposed in the API, allowing users to implement UTF-8 string operations
that operate on bytes not code points.  I'd also like lightweight, fast
string ports that allow access to these bytes when desired.

This leads me to believe that it's a feature, not a bug, that string
ports use UTF-8 internally, and that it's possible (via non-standard
extensions) to get access to the underlying bytes.

      Mark
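
A minimal sketch of the UTF-8 property described above, written in
Python purely for illustration (it is not Guile code, and the helper
name find_utf8 is hypothetical).  It performs a substring search
directly on the encoded bytes, without decoding or tracking code point
boundaries during the search, and only afterwards converts the byte
offset into a code point index:

```python
def find_utf8(haystack: str, needle: str) -> int:
    """Return the code point index of needle in haystack, or -1.

    The search itself runs on raw UTF-8 bytes and never decodes them.
    (Illustrative helper; the name and signature are assumptions.)
    """
    hay = haystack.encode("utf-8")
    pat = needle.encode("utf-8")

    # Plain byte-wise substring search, unaware of code points.
    pos = hay.find(pat)
    if pos < 0:
        return -1

    # UTF-8 is self-synchronizing: a validly encoded needle begins with
    # an ASCII or lead byte, and those never occur as continuation
    # bytes, so a match can only start on a code point boundary.
    # Continuation bytes have the form 0b10xxxxxx; counting every other
    # byte before pos turns the byte offset into a code point index.
    return sum(1 for b in hay[:pos] if (b & 0xC0) != 0x80)
```

For example, find_utf8("naïve café", "café") returns 6, the same index
that a search over code points would give, even though the match
starts at byte offset 7.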