From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alan Mackenzie
Newsgroups: gmane.emacs.devel
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 12:21:19 +0000
Message-ID: <20091119122119.GB1720@muc.de>
References: <20091118191258.GA2676@muc.de> <20091119082040.GA1720@muc.de>
To: Andreas Schwab
Cc: Stefan Monnier, emacs-devel@gnu.org

Hi, Andreas,

On Thu, Nov 19, 2009 at 11:16:03AM +0100, Andreas Schwab wrote:
> Alan Mackenzie writes:

> > So my `aset' invocation is trying to write a multibyte ?ñ into a
> > unibyte ?\n, and gets truncated from #x8f1 to #xf1 in the process.

> Nothing gets truncated.  In Emacs 23 ?ñ is simply the number 241,
> whereas in Emacs 22 is it the number 2289.  You can put 2289 in a string
> in Emacs 23, but there is no defined unicode character with that value.

Ah, thanks!  So when I do

    M-: (setq nl "\n")
    M-: (aset nl 0 ?ñ)
    M-: (insert nl)

then, after the `aset', the string nl correctly contains one character,
which is the single byte #xf1.  The bug happens in `insert', where
something is interpreting the byte #xf1 as the signed integer
#xfffff.....ffff1.

Delving into the bowels of Emacs, I find this in character.h:

    #define STRING_CHAR_AND_LENGTH(p, len, actual_len)              \
      (!((p)[0] & 0x80)                                             \
       ? ((actual_len) = 1, (p)[0])                                 \
       : ! ((p)[0] & 0x20)                                          \
       ? ((actual_len) = 2,                                         \
          (((((p)[0] & 0x1F) << 6)                                  \
            | ((p)[1] & 0x3F))                                      \
           + (((unsigned char) (p)[0]) < 0xC2 ? 0x3FFF80 : 0)))     \
       : ! ((p)[0] & 0x10)                                          \
       ? ((actual_len) = 3,                                         \
          ((((p)[0] & 0x0F) << 12)                                  \
           | (((p)[1] & 0x3F) << 6)                                 \
           | ((p)[2] & 0x3F)))                                      \
       : string_char ((p), NULL, &actual_len))

#xf1 has the #x80, #x20 and #x10 bits all set, so it drops through all
this nonsense to string_char (in character.c).  There its #x08 bit is
clear, so it lands in this case:

    else if (! (*p & 0x08))
      {
        c = ((((p)[0] & 0xF) << 18)
             | (((p)[1] & 0x3F) << 12)
             | (((p)[2] & 0x3F) << 6)
             | ((p)[3] & 0x3F));
        p += 4;
      }

where it obviously becomes silly: the lone byte gets combined with the
three bytes that happen to follow it.  At least, I think that's where it
ends up.  This isn't the most maintainable piece of code in Emacs.

So, if ISO-8859-1 characters are now represented as single bytes in
Emacs, what test for multibyticity should STRING_CHAR_AND_LENGTH be
using?

> Andreas.

-- 
Alan Mackenzie (Nuremberg, Germany).
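
For anyone who wants to check the fall-through without building Emacs,
here is a minimal standalone sketch of the bit tests quoted above.  The
function names lead_byte_length, decode_two_bytes and decode_four_bytes
are made up for this illustration and exist in neither character.h nor
character.c; only the masks, shifts and the 0xC2 / 0x3FFF80 constants are
taken from the quoted code.  It assumes the pointer is at a lead byte,
as the macro itself does.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by a lead byte, applying the same tests in
   the same order as the quoted macro and the quoted string_char case.
   The value 5 just means "longer than four bytes"; that path is not
   modelled here.  */
int
lead_byte_length (unsigned char b)
{
  if (!(b & 0x80))
    return 1;
  if (!(b & 0x20))
    return 2;
  if (!(b & 0x10))
    return 3;
  if (!(b & 0x08))
    return 4;
  return 5;
}

/* Arithmetic of the macro's two-byte branch.  */
int
decode_two_bytes (const unsigned char *p)
{
  assert (p != NULL);
  return ((((p[0] & 0x1F) << 6) | (p[1] & 0x3F))
          + (p[0] < 0xC2 ? 0x3FFF80 : 0));
}

/* Arithmetic of the quoted four-byte case in string_char.  The caller
   must supply four readable bytes, which is exactly what a lone #xf1
   in a one-byte string cannot guarantee.  */
int
decode_four_bytes (const unsigned char *p)
{
  assert (p != NULL);
  return (((p[0] & 0xF) << 18)
          | ((p[1] & 0x3F) << 12)
          | ((p[2] & 0x3F) << 6)
          | (p[3] & 0x3F));
}
```

Calling lead_byte_length (0xF1) gives 4, which matches the fall-through
described above, while the two bytes 0xC3 0xB1 go through the two-byte
branch and come out as 241, the value Andreas gives for ?ñ in Emacs 23.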