From: "Stephen J. Turnbull"
To: David Kastrup
Cc: emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs rewrite in a maintainable language
Date: Mon, 19 Oct 2015 02:46:26 +0900
Message-ID: <22051.56050.174581.446113@turnbull.sk.tsukuba.ac.jp>
In-Reply-To: <87oafwm7xi.fsf@fencepost.gnu.org>

David Kastrup writes:

> Personally I have no problem with an implementation insisting on
> certain properties for its internal encoding.  But that implies
> that "internal encoding" and "external UTF-8" may diverge when
> "external UTF-8" does not exclusively contain valid UTF-8.

Then the external data shouldn't be called "UTF-8" in discussions like
this one.  The problem of data that is not valid for the presumed
encoding is not limited to UTF-8, Unicode, or even to text.  It just
happens that we have good solutions (not limited to ritual suicide)
for the text stream case.

Also, we should remember that Unicode is a wire protocol.
It's very useful to adapt the formats defined by Unicode for
constructing and parsing internal and external data -- that can be
very efficient.  But we also need to have a strict-conformance option
for I/O that is declared to be Unicode, and that should probably be
the default.

> However, if "internal encoding" is not the same as "valid UTF-8"
> throughout, it means that code called with it has to be able to
> deal with the representations for invalid UTF-8.

Emacs certainly can deal, since it has a 'binary' encoding and can
represent that internally.  But that's awfully inconvenient.
Something like Emacs's current implementation, Markus Kuhn's UTF-8b,
or Python's PEP 383 is really required for Emacs implementations.
(Does anybody remember that awful mail format of Win2k beta's version
of Outlook Express, where the HTML tags were encoded in ASCII and the
element content in little-endian UTF-16?)

> [Emacs's internal text representation is] not cast into stone but
> pretty efficient (I think Python uses 3-byte surrogate sequences
> for raw bytes, somewhat worse)

No.  Python uses a wide-char representation.  In Python 2, it's 2
bytes per character on most non-glibc platforms, and 4 bytes per
character on glibc.  In Python 3 with PEP 393 support, valid
ISO-8859-1 text (even if decoded from another external encoding) is
represented in one byte per character, valid BMP text (optionally
with support for invalid "rawbytes", internally encoded as lone
trailing surrogates) in two bytes per character, and text containing
characters from the astral planes in four bytes per character (again
with optional support for invalid rawbytes).

> and straightforward as it keeps the basic UTF-8 coding scheme
> invariants intact.
>
> Of course, all of this can be done simpler using an UCS-32
> representation, but the basic tradeoffs leading to Emacs using a
> variable-size multibyte representation are still valid in my
> opinion.

Seems reasonable to me.
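
For readers who have not seen PEP 383 in action, here is a minimal
Python 3 sketch of the two behaviours discussed above: strict decoding
as the default for data declared to be UTF-8, and the
"surrogateescape" error handler that maps each undecodable byte to a
lone trailing surrogate so it can be written back unchanged.  The
sample bytes and variable names are made up for illustration; this
shows Python's mechanism only and says nothing about how Emacs
implements its own raw-byte representation.

```python
# Mostly valid UTF-8 ("caf\u00e9") followed by one byte, 0xFF, that can
# never appear in UTF-8.  The sample data is hypothetical.
raw = b"caf\xc3\xa9 \xff end"

# Strict conformance: decoding data declared as UTF-8 rejects the input.
try:
    raw.decode("utf-8")  # errors="strict" is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# PEP 383: an undecodable byte 0xNN is decoded to the lone low
# (trailing) surrogate U+DCNN instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

# A strict encoder still refuses to emit the escaped byte as UTF-8.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

print(strict_ok, escaped, round_trip == raw, strict_encode_ok)
```

The point of the design is that the invalid data survives a
decode/encode round trip, while any code path that insists on strict
UTF-8 still gets an error rather than silently passing garbage along.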
So far Python with PEP 393 has been pretty successful, but since emoticons live in the astral planes, I suspect it may not be the best representation for the web and phones -- one smiley in ASCII text will quadruple the needed string storage. I don't see a good reason to change Emacs's representation at this point.
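
The storage effect is easy to observe.  The following is a rough
sketch that assumes CPython 3.3 or later (other Python
implementations may store strings differently) and uses
sys.getsizeof, which reports the whole string object including a
small fixed header, so the ratios only approach 2 and 4 for long
strings.  The length n and the sample characters are arbitrary
choices for illustration.

```python
import sys

n = 10_000  # arbitrary length, large enough to dwarf the object header

ascii_text = "a" * n                      # 1 byte per character
bmp_text = ascii_text + "\u3042"          # one BMP character: 2 bytes each
astral_text = ascii_text + "\U0001F600"   # one astral character: 4 bytes each

ascii_size = sys.getsizeof(ascii_text)
bmp_ratio = sys.getsizeof(bmp_text) / ascii_size
astral_ratio = sys.getsizeof(astral_text) / ascii_size

print(f"BMP/ASCII size ratio:    {bmp_ratio:.2f}")
print(f"astral/ASCII size ratio: {astral_ratio:.2f}")
```

Appending a single astral character widens every character in the
string, which is the "one smiley quadruples the storage" effect
described above.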