From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Kastrup
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#18520: string ports should not have an encoding
Date: Wed, 24 Sep 2014 14:00:38 +0200
Message-ID: <87k34ti5y1.fsf@fencepost.gnu.org>
References: <87iokgmttc.fsf@fencepost.gnu.org> <87oau5h9f0.fsf@yeeloong.lan>
To: Mark H Weaver
Cc: 18520@debbugs.gnu.org
In-Reply-To: <87oau5h9f0.fsf@yeeloong.lan> (Mark H.
 Weaver's message of "Wed, 24 Sep 2014 01:30:59 -0400")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)
List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language"

Mark H Weaver writes:

> David Kastrup writes:
>
>> In Guile 2.0, at the time a string port is opened, the value of the
>> fluid %default-port-encoding is used for deciding how to encode the
>> string into a byte stream, [...]
>
> I agree that this was a mistake. The issue is fixed on the master
> branch.

The mistake is having a string port use a different
sequence-of-character encoding than a string.

>> Ports fundamentally deliver characters, and so reading and writing
>> from a string source/sink should not involve _any_ coding system.
>
> David, you know as well as I that internally, there is always a coding
> system. Strings have a coding system too, even if it's UCS-4. Emacs
> uses something based on UTF-8, and I'd like to Guile to do something
> similar in the future.
>
> I guess you don't like the fact that it is possible to expose the
> internal representation via 'set-port-encoding!', 'ftell' or 'seek'.
> I don't see this as a problem, and arguably it's a benefit.

Shrug. That arguable benefit went down in flames in Emacs 20. It
triggered the last great migration of Emacs users to XEmacs. It took
until Emacs 20.4 for the horrible mistake of exposing byte offsets to
the user in either strings or buffers to be corrected.
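To make the byte-offset complaint concrete, here is a minimal sketch in
Python (chosen only because it is easy to run; it is not Guile or Elisp,
and the sample text is made up for illustration). It shows that the
position of the same character differs depending on whether you count
characters or UTF-8 bytes, and that a character count misused as a byte
offset can land in the middle of a multibyte sequence.

```python
# Illustration only: Python's str is indexed per character, while its
# UTF-8 encoded form (a bytes object) is indexed per byte.
text = "na\u00efve caf\u00e9"        # "naive cafe" with two accented letters (hypothetical sample)
encoded = text.encode("utf-8")

char_index = text.index("c")        # position counted in characters
byte_index = encoded.index(b"c")    # position counted in UTF-8 bytes

print(len(text), len(encoded))      # 10 characters, 12 bytes
print(char_index, byte_index)       # 6 versus 7

# Using a character count as a byte offset can cut a character in half:
# the first 3 characters are "n", "a", i-diaeresis, but the first 3 bytes
# end inside the two-byte sequence for the i-diaeresis.
try:
    encoded[:3].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)                     # False
```

Code that only ever sees character indices never runs into the last
case, which is the property argued for here.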
You write above "Emacs uses something based on UTF-8", and it's worth
pointing out that it does so starting with Emacs 23. Previously Emacs
used its own peculiar multibyte encoding that existed long before
UTF-8. The important thing to note is that it was _completely_ hidden
from sight from Elisp users when the Emacs 20 tribulations were over.
Emacs was able to swap out this multibyte encoding for the Emacs 23
coding rather transparently, and the main reason to do that was to make
UTF-8 a favored encoding regarding performance of encoding/decoding and
processing of Elisp source files.

Emacs' internal encoding is not proper UTF-8. You can take a random
byte string, tell Emacs that it is encoded in UTF-8, and decode it into
Emacs' internal representation. All passages that happen to be proper
uniquely represented UTF-8 will pass the transcoding unchanged, but
everything else will be transcoded into a UTF-8-like representation of
"unencodable byte". I think Emacs uses the UTF-8 forbidden code points
from 0xd800 to 0xd880 for encoding stray bytes, or something like that.

So if you reencode the unchanged "UTF-8" Emacs uses internally, the
result will again faithfully reproduce the random byte stream. Garbage
in, _same_ garbage out. A very important property that many of Emacs'
supported file encodings share. Notable exceptions are various Japanese
encodings based on escape characters.

At any rate, unless you are using explicit conversions like
string-as-unibyte or _encoding_ to Emacs' internal representation (it is
available as a named coding system), the representation is not exposed.
Strings are indexed per character, and buffers (which are at their heart
random-access string ports) are indexed per character.

Emacs has both unibyte and multibyte strings and unibyte and multibyte
buffers, and unibyte strings and buffers are the source for decoding and
the target for encoding into multibyte strings and buffers.
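The "garbage in, _same_ garbage out" property can be tried out with a
rough analogue in Python. This is only a sketch of the idea, not a
description of how Emacs does it: Python's "surrogateescape" error
handler maps each undecodable byte to a lone surrogate in the range
U+DC80..U+DCFF, which is Python's own convention, and the byte string
below is made up for illustration.

```python
# Illustration only: decode arbitrary bytes as "UTF-8", keeping stray
# bytes in a reversible escaped form, then encode back.
# Contents: "foo", a stray 0xff, a well-formed e-acute (0xc3 0xa9),
# and a stray continuation byte 0x80.
raw = bytes([0x66, 0x6F, 0x6F, 0xFF, 0xC3, 0xA9, 0x80])

decoded = raw.decode("utf-8", errors="surrogateescape")

# Well-formed passages decode to ordinary characters; each stray byte
# becomes exactly one placeholder character, so the result is still
# indexed per character.
print(len(raw), len(decoded))          # 7 bytes, 6 characters
print(decoded[:3], decoded[4] == "\u00e9")

# Re-encoding with the same handler reproduces the original bytes.
roundtrip = decoded.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)                # True
```

The placeholders only exist inside the decoded string. A strict UTF-8
encode of `decoded` would raise an error, which is one way of keeping
the internal representation from leaking out unnoticed.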
XEmacs does not have unibyte strings/buffers, so a lot of string
internals do not need to make the distinction. GUILE could probably get
away without unibyte strings as well because it has bytevectors. This
would imply that if you wanted to do stuff akin to string operations on
unibyte strings, you'd have to first convert bytevectors to multibyte
strings, do your operations, convert back.

XEmacs chose _not_ to have unibyte strings (and the corresponding
complications to support both in the primitives), Emacs chose to have
them. I think both approaches are defensible. Since GUILE presents
itself as an extension language and since strings will need to get
passed in and out of extension languages all the time, the
implementation cost of offering a low-cost unibyte string is probably
even more defensible than with Elisp where Elisp is the main processing
language.

> First I'll address the non-standard 'set-port-encoding!'. As you say,
> it doesn't even make sense on string ports, and arguably should be an
> error. So why do you care if some internal details leak out when you
> do this nonsensical thing? Admittedly, we're missing an opportunity
> to report a possible bug to the user, but that's the only problem I
> see here.
>
> Regarding 'ftell' and 'seek', it's not entirely clear to me what's the
> best representation of those positions. In some situations, I guess
> it would be convenient for them to count unicode code points or string
> indices. In other situations, I could imagine it being more
> convenient for them to count grapheme clusters or UTF-8 bytes.
>
> R6RS, the only Scheme standard that supports getting or setting file
> positions, gives us complete freedom to choose our representation of
> positions on textual ports. The R6RS is explicit that they don't even
> have to be integers, and if they are, they don't have to correspond to
> bytes or characters.

R6RS gives you the freedom to match your semantics to your
implementation.
String ports are strings-in-progress (and Emacs buffers are
strings-in-progress on steroids), so it makes sense to match both the
fseek/ftell semantics and the implementation of string ports to those
of strings. You don't have anything to gain from converting characters
to bytes and back just because you can.

> For better or for worse, Guile's ports are fundamentally based on
> bytes,

Seriously? The whole point of this issue was that fundamentally basing
GUILE's string ports on bytes is for worse.

> and allow mixed binary and textual operations on all ports.

I'll go out on a limb here and state "they don't". They work with bytes
(either located on file or in some internally generated or consumed byte
vector) and they input/output characters on their Scheme side, and you
can change the en/decoding system with which characters are put into the
stream or consumed. Their external side is identical to their internal
side, and the Scheme/character/string side is fundamentally different.
By changing the port encoding, you can change the conversion between
Scheme on the one side and internal/external on the other. All
operations are binary on the internal side, and textual on the Scheme
side. That there are encodings which are less costly does not
fundamentally change this.

> Sometimes this is very helpful, for example when implementing HTTP. I
> can think of one other case where it's very helpful:
>
> I don't know how deeply you've looked at UTF-8,

It is a somewhat safe bet that a person who is the head maintainer of an
application conversing in UTF-8 while using GUILE-1.8 in its internals
has had some basic amount of exposure to UTF-8. In general, the working
assumption "David just has little clue about computing" is rarely
helpful for dismissing matters since David tends to have picked up
tidbits occasionally since he started computing on systems where
lowercase letters already needed a multi-sextet representation in their
60-bit words.
So when David has some problems with matters, it is a reasonably safe
bet that a non-negligible percentage of other users will not fare
significantly better, which makes it a somewhat relevant indicator of
what to avoid.

> but it has some unusual properties that allow many (most?) string
> algorithms to be most naturally (and efficiently) implemented by
> operating on bytes rather than code points. Much of the time, you
> don't even have to be aware of the code point boundaries, which is a
> great savings. Efficient lookup tables based on bytes are also much
> cheaper than ones based on code points, etc.

That's all very nice but totally irrelevant for this issue. If you like
UTF-8, by all means base the internal string representation of GUILE on
it. It comes at a cost since strings in Scheme are writable (and there
are more operations for doing so than in Elisp) and indexed by
character. Emacs has paid this cost: I think the basic speed of Emacs
dropped by a factor of 2 when indexing was moved from bytes to
characters around Emacs 20.2 or similar.

But this issue is about not using different internal coding and exposed
interfaces for strings and string ports. Whatever internal string
representation you choose, it does not make sense to pick a different
representation and indexing for string ports.

> In fact, I intend to propose that in a future version of Guile,
> strings will not only be based on UTF-8 internally, but that this fact
> should be exposed in the API, allowing users to implement UTF-8 string
> operations that operate on bytes not code points.

This experiment has been tried and crashed and burnt with the initial
MULE versions in Emacs 20. Current versions _do_ offer conversion-less
reinterpretations string-as-unibyte and string-as-multibyte and offer
working with either string type. As explained, that comes at the cost
of having to make all primitives able to work with either.
They are actually rarely used by application level programmers, so most
applications do not have this as a porting problem between Emacs and
XEmacs (XEmacs has only multibyte strings).

Personally, I'd consider that worth the cost in the case of GUILE.
While XEmacs gets along without this addition, it seems important for
efficient passing of data in and out of GUILE. It would also make sense
to distinguish between multibyte (internal form of UTF-8, anything may
happen if it is not properly formed) and external UTF-8 (reading/writing
it uses a conversion process turning all illegal UTF-8 bytes into some
reproducible representation).

> I'd also like lightweight, fast string ports that allow access to
> these bytes when desired.

Any string port that does not involve encoding/decoding will be
lightweight and fast, lighter and faster than any implementation having
to code/decode gratuitously. Which is one of the points of this issue,
even though I am more concerned with the conceptual cost than the
runtime cost. But both have an impact.

> This leads me to believe that it's a feature, not a bug, that string
> ports use UTF-8 internally, and that it's possible (via non-standard
> extensions) to get access to the underlying bytes.

Getting confused about bytes and characters and introducing unnecessary
conversions is not a feature. Even if you at one time use a UTF-8 based
string representation, working with external UTF-8 will involve
encoding/decoding processes. Forcing a string port to encode/decode
during operation will remain expensive. Exposing string internals
beyond quite special-purpose functions will be hard to deal with.

All those lessons have already been learnt with Emacs. If you want to
relearn them from scratch, the available developer power will not make
basing Emacs on GUILE realistic in the next 10 years: Emacs
fundamentally operates with texts.
Too many reliability or efficiency problems doing that (or having to
implement them as foreign datatypes altogether) will not make Guilemacs
acceptable.

So even in cases where multiple strategies are feasible, it may make
sense to lean towards Emacs' choices. One choice that has served Emacs
well is to hide its internal encoding system well from the external
ones. That way its switch to an internal coding system based on UTF-8
affected almost no existing Elisp packages, and the programming model
was conceptually clean.

-- 
David Kastrup