From: David Kastrup
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs rewrite in a maintainable language
Date: Sun, 18 Oct 2015 18:56:57 +0200
Message-ID: <87oafwm7xi.fsf@fencepost.gnu.org>
To: emacs-devel@gnu.org
In-Reply-To: (John Wiegley's message of "Sun, 18 Oct 2015 09:40:20 -0700")

"John Wiegley" writes:

>>>>>> Eli Zaretskii writes:
>
>> One of the major lessons Emacs development learned since Emacs 20.1
>> is that raw bytes happen as part of text (a.k.a. "strings"), and
>> therefore there's a need to support a mixture of these two in the
>> same buffer/string. I think that's something Guile should support as
>> well, as that will make it a more powerful and flexible extension
>> language, able to deal with a wider range of real-life situations.
>
> I'd like to second Eli's recommendation. In real life, encoding and
> decoding of bytes to and from characters (codepoints) is never a
> simple problem. We do need good flexibility here.

Personally I have no problem with an implementation insisting on
certain properties for its internal encoding.
But that implies that "internal encoding" and "external UTF-8" may
diverge when "external UTF-8" does not exclusively contain valid
UTF-8. Maintaining that distinction for GUILE should not be hard:
currently its internal encoding is either Latin-1 or UTF-32, so it is
not like it currently _has_ an internal UTF-8 for strings, even though
it has a number of functions taking UTF-8 input.

However, if "internal encoding" is not the same as "valid UTF-8"
throughout, it means that code called with it has to be able to deal
with the representations for invalid UTF-8.

Currently Emacs uses code points above the Unicode range for
representing non-Unicode characters from different encodings, and it
uses the 2-byte overlong sequences that would otherwise encode 0-127
to represent raw bytes 128-255. That's not cast in stone, but it is
pretty efficient (I think Python uses 3-byte surrogate sequences for
raw bytes, which is somewhat worse) and straightforward, as it keeps
the basic UTF-8 coding scheme invariants intact.

Of course, all of this can be done more simply using a UTF-32
representation, but the basic tradeoffs leading to Emacs using a
variable-size multibyte representation are still valid in my opinion.

-- 
David Kastrup
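
The two raw-byte representations compared in the message can be
sketched roughly as follows. This is only an illustration under stated
assumptions: the function names are hypothetical, the 2-byte layout
(lead byte 0xC0 or 0xC1 followed by one continuation byte) is an
assumption that matches the "overlong sequences for 0-127" description
above, and the 3-byte form is obtained from Python's "surrogateescape"
and "surrogatepass" error handlers rather than from any Emacs or Guile
code.

```python
# Sketch only: names are hypothetical; the layout is an assumption
# based on the description of overlong 2-byte sequences in the message.

def emacs_style_raw_byte(b: int) -> bytes:
    """Encode raw byte 128-255 as a 2-byte overlong sequence."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 128-255 need the raw-byte form")
    # Lead byte is 0xC0 or 0xC1 (never used by valid UTF-8);
    # the trailing byte is an ordinary continuation byte.
    return bytes([0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])


def emacs_style_decode(pair: bytes) -> int:
    """Recover the raw byte from its 2-byte overlong form."""
    lead, trail = pair
    if lead not in (0xC0, 0xC1) or not 0x80 <= trail <= 0xBF:
        raise ValueError("not an overlong raw-byte sequence")
    return 0x80 | ((lead & 0x01) << 6) | (trail & 0x3F)


def surrogate_style_raw_byte(b: int) -> bytes:
    """Encode raw byte 128-255 the way surrogateescape would store it."""
    # A lone byte >= 0x80 is invalid UTF-8, so surrogateescape maps it
    # to the lone surrogate U+DC00 + b; surrogatepass then serializes
    # that surrogate with the generic 3-byte UTF-8 pattern.
    text = bytes([b]).decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogatepass")


if __name__ == "__main__":
    for b in range(0x80, 0x100):
        # Every raw byte survives the 2-byte round trip.
        assert emacs_style_decode(emacs_style_raw_byte(b)) == b
    print(emacs_style_raw_byte(0xFF).hex())      # 2-byte form
    print(surrogate_style_raw_byte(0xFF).hex())  # 3-byte form
```

Both forms keep the usual lead-byte/continuation-byte structure, so
code that merely steps over characters does not need a special case;
the difference is one byte of storage per raw byte.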