From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 19:35:33 +0200
Message-ID: <83io4u2aze.fsf@gnu.org>
To: Philipp Stephani
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org

> From: Philipp Stephani
> Date: Sun, 22 Nov 2015 09:25:08 +0000
> Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
>
>     > Fine with me, but how would we then represent Emacs strings that
>     > are not valid Unicode strings? Just raise an error?
>
>     No need to raise an error. Strings that are returned to modules
>     should be encoded into UTF-8. That encoding already takes care of
>     these situations: it either produces the UTF-8 encoding of the
>     equivalent Unicode characters, or outputs raw bytes.
>
> Then we should document such a situation and give module authors a way to
> detect them.

I already suggested what we should say in the documentation: that
these interfaces accept and produce UTF-8 encoded non-ASCII text.

> For example, what happens if a sequence of such raw bytes happens
> to be a valid UTF-8 sequence?
> Is there a way for module code to detect this
> situation?

How can you detect that if you are only given the byte stream? You
can't. You need some additional information to be able to distinguish
between these two alternatives.

Look, an Emacs module _must_ support non-ASCII text, otherwise it
would be severely limited, to say the least. Having interfaces that
accept and produce UTF-8 encoded strings is the simplest complete
solution to this problem. So we must at least support that much.

Supporting strings of raw bytes is also possible, probably even
desirable, but it's an extension, something that would be required
much more rarely. Such strings cannot be meaningfully treated as
text: you cannot ask whether some byte is an upper-case or lower-case
letter, you cannot display such strings as readable text, you cannot
count characters in them, etc. Such strings are useful for a limited
number of specialized jobs, and handling them in Lisp requires some
caution, because if you treat them as normal text strings, you get
surprises.

So let's solve the more important issues first, and talk about
extensions later. The more important issue is how a module can pass
non-ASCII text to Emacs and get non-ASCII text back. And the answer
to that is to use UTF-8 encoded strings.

>     We are quite capable of quietly accepting such strings, so that is
>     what I would suggest. Doing so would be in line with what Emacs does
>     when such invalid sequences come from other sources, like files.
>
> If we accept such strings, then we should document what the extensions are.
> - Are UTF-8-like sequences encoding surrogate code points accepted?
> - Are UTF-8-like sequences encoding integers outside the Unicode codespace
>   accepted?
> - Are non-shortest forms accepted?
> - Are other invalid code unit sequences accepted?

_Anything_ can be accepted. _Any_ byte sequence. Emacs will cope.
After processing, the perpetrator will probably get back a string
that is not entirely human-readable, or the processing will sometimes
produce surprises, for example if the string is lower-cased. But
nothing bad will happen to Emacs: it won't crash and won't garble its
display. Moreover, just passing such a string to Emacs, then
outputting it back without any changes, will produce an exact copy of
the input, which is quite a feat, considering that the input was
"invalid".

If you want to see what "bad" things can happen, take a Latin-1
encoded FILE and visit it with "C-x RET c utf-8 RET C-x C-f FILE RET".
Then play with the buffer for a while. This is what happens when
Emacs is told the text is in UTF-8, when it really isn't. There's no
catastrophe, but the luser who does that might be amply punished: at
the very least she will not see the letters she expects. However, if
you save such a buffer to a file, using UTF-8, you will get the same
Latin-1 encoded text as was there originally.

Now, given such resilience, why do we need to raise an error?

> If the answer to any of these is "yes", we can't say we accept UTF-8, because
> we don't.

We _expect_ UTF-8, and if given that, will produce known, predictable
results when the string is processed as text. We can _tolerate_
violations, resulting in somewhat surprising behavior, if such a text
is treated as "normal" human-readable text. (If the module knows what
it is doing, and really means to work with raw bytes, then Emacs will
do what the module expects, and produce raw bytes on output, as
expected.)

> Rather we should say what is actually accepted.

Saying that is meaningless in this case, because we can accept
anything. _If_ the module wants the string it passes to be processed
as human-readable text that consists of recognizable characters, then
the module should _only_ pass valid UTF-8 sequences. But raising
errors upon detecting violations was discovered long ago to be a bad
idea that users resented.
So we don't, and neither should the module API.

>     > * If copy_string_contents is passed an Emacs string that is not a
>     > valid Unicode string, what should happen?
>
>     How can that happen? The Emacs string comes from the Emacs bowels, so
>     it must be "valid" string by Emacs standards. Or maybe I don't
>     understand what you mean by "invalid Unicode string".
>
> A sequence of integers where at least one element is not a Unicode scalar
> value.

Emacs doesn't store characters as scalar Unicode values, so this
doesn't really explain to me your concept of a "valid Unicode string".

>     In any case, we already deal with any such problems when we save a
>     buffer to a file, or send it over the network. This isn't some new
>     problem we need to cope with.
>
> Yes, but the module interface is new, it doesn't necessarily have to have the
> same behavior.

Of course, it does! Modules are Emacs extensions, so the interface
should support the same features that core Emacs does. Why? Because
there are no limits to what creative minds can do with this feature,
so we should not artificially impose such limitations where we have
sound, time-proven infrastructure that doesn't need them.

> If we say we emit only UTF-8, then we should do so.

We emit only valid UTF-8, provided that its source (if it came from a
module) was valid UTF-8.
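
To make the four categories Philipp asks about concrete, here is a
minimal sketch of a strict UTF-8 check that a module could run on its
own side if it cares about the distinction. This is only an
illustration: the names utf8_check and enum utf8_status are
hypothetical, they are not part of emacs-module.h, and nothing here
describes what Emacs itself does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification, named after the cases raised in this
   thread.  Not an Emacs API.  */
enum utf8_status
{
  UTF8_VALID,
  UTF8_SURROGATE,     /* encodes a code point in U+D800..U+DFFF */
  UTF8_OUT_OF_RANGE,  /* encodes an integer above U+10FFFF */
  UTF8_OVERLONG,      /* non-shortest form */
  UTF8_MALFORMED      /* stray or missing continuation byte, bad lead byte */
};

/* Return the status of the first offending sequence in S (LEN bytes),
   or UTF8_VALID if every sequence is well-formed UTF-8.  */
enum utf8_status
utf8_check (const unsigned char *s, size_t len)
{
  assert (s != NULL || len == 0);

  size_t i = 0;
  while (i < len)
    {
      unsigned char b = s[i];
      size_t n;      /* number of continuation bytes expected */
      uint32_t cp;   /* decoded value */
      uint32_t min;  /* smallest value that needs this many bytes */

      if (b < 0x80)
        {
          i++;       /* plain ASCII */
          continue;
        }
      else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
      else
        return UTF8_MALFORMED;

      if (len - i - 1 < n)
        return UTF8_MALFORMED;   /* truncated sequence */

      for (size_t k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return UTF8_MALFORMED;
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      if (cp < min)
        return UTF8_OVERLONG;
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return UTF8_SURROGATE;
      if (cp > 0x10FFFF)
        return UTF8_OUT_OF_RANGE;

      i += n + 1;
    }
  return UTF8_VALID;
}
```

Note that the two bytes C3 A9 pass this check whether the sender meant
the character U+00E9 or two raw bytes that merely happen to form a
valid sequence. That is the point made above: the byte stream alone
cannot tell you which was intended.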