From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 20:04:28 +0200
Message-ID: <83egfh3o7n.fsf@gnu.org>
References: <83k2ptq5t3.fsf@gnu.org> <87h9kxx60e.fsf@lifelogs.com>
 <877flswse5.fsf@lifelogs.com> <8737wgw7kf.fsf@lifelogs.com>
 <87io5bv1it.fsf@lifelogs.com> <87egfzuwca.fsf@lifelogs.com>
 <876118u6f2.fsf@lifelogs.com> <8737w3qero.fsf@lifelogs.com>
 <831tbn9g9j.fsf@gnu.org> <878u5upw7o.fsf@lifelogs.com>
 <83ziya8xph.fsf@gnu.org> <83y4du80xo.fsf@gnu.org>
 <837fld6lps.fsf@gnu.org> <83si3z4s5n.fsf@gnu.org>
 <83mvu74nhm.fsf@gnu.org> <83d1v34hba.fsf@gnu.org>
Reply-To: Eli Zaretskii
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
To: Philipp Stephani
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:195044

> From: Philipp Stephani
> Date: Sun, 22 Nov 2015 14:56:12 +0000
> Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
>
> - The multibyte API should use an extension of UTF-8 to encode Emacs strings.
>   The extension is the obvious one already in use in multiple places.

It is only used in one place: the internal representation of
characters in buffers and strings. Emacs _never_ lets this internal
representation leak outside.

In practice the last sentence means that text that Emacs encoded in
UTF-8 will only include either valid UTF-8 sequences of characters
whose codepoints are below #x200000 or single bytes that don't belong
to any UTF-8 sequence.

You are suggesting to expose the internal representation to outside
application code, which predictably will cause that representation to
leak into Lisp. That'd be a disaster.
We had something like that back in the Emacs 20 era, and it took many
years to plug those leaks. We would be making a grave mistake to go
back there.

What you suggest is also impossible without deep changes in how we
decode and encode text: that process maps codepoints above #x1FFFFF to
either codepoints below that mark or to raw bytes. So it's impossible
to produce these high codes in UTF-8 compatible form while handling
UTF-8 text. To say nothing of the simple fact that no library
function in any C library will ever be able to do anything useful with
such codepoints, because they are our own invention.

> - There should be a one-to-one mapping between Emacs multibyte strings and
>   encoded module API strings.

UTF-8 encoded strings satisfy that requirement.

>   Therefore non-shortest forms, illegal code unit sequences, and code
>   unit sequences that would encode values outside the range of Emacs
>   characters are illegal and raise a signal.

Once again, this was tried in the past and was found to be a bad
idea. Emacs provides features to test the result of converting
invalid sequences, for the purposes of detecting such problems, but it
leaves that to the application.

>   Likewise, such sequences will never be returned from Emacs.

Emacs doesn't return invalid sequences, if the original text didn't
include raw bytes. If there were raw bytes in the original text,
Emacs has no choice but to return them, or else it will violate a
basic expectation of a text-processing program: that it shall never
change the portions of text that were not affected by the processing.

> I think this is a relatively simple and unsurprising approach. It allows
> encoding the documented Emacs character space while still being fully
> compatible with UTF-8 and not resorting to undocumented Emacs internals.

So does the approach I suggested.
The advantage of my suggestion is that it follows a long-standing
Emacs tradition in every aspect of encoding and decoding text, and
doesn't require any changes in the existing infrastructure.
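
To make the "valid sequences below #x200000 or single raw bytes" description above concrete, here is a minimal sketch in C of how application code might split a byte buffer into those two kinds of units. This is not Emacs code and not part of any module API: the names `seq_len` and `classify_bytes` are hypothetical, and the sketch assumes that "valid" means only a well-formed, shortest-form sequence of at most four bytes (it makes no special case for surrogates or for values above #x10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the length (1..4) of a well-formed,
   shortest-form UTF-8 style sequence starting at s, or 0 if the byte
   at s does not start one.  A 4-byte sequence carries 21 bits, so any
   value accepted here is below 0x200000 by construction.  */
static size_t seq_len(const unsigned char *s, size_t n)
{
    unsigned long cp, min;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                   /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;               /* stray continuation byte or 0xF8..0xFF */

    if (n < len)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* missing continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min)
        return 0;               /* non-shortest form */
    return len;
}

/* Hypothetical helper: count how many units of the buffer are
   well-formed sequences and how many are single raw bytes.  A byte
   that cannot start a sequence is counted alone and scanning resumes
   at the next byte, so no input byte is dropped or altered.  */
static void classify_bytes(const unsigned char *s, size_t n,
                           size_t *chars, size_t *raw)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    *chars = 0;
    *raw = 0;
    while (i < n) {
        size_t len = seq_len(s + i, n - i);
        if (len == 0) {
            (*raw)++;
            i++;
        } else {
            (*chars)++;
            i += len;
        }
    }
}
```

A caller that wants the strict behavior proposed in the quoted text could treat any nonzero `raw` count as an error, while a caller that must preserve the input unchanged would pass the raw bytes through untouched; the sketch leaves that choice to the application.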