From: Alan Mackenzie
Newsgroups: gmane.emacs.devel
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 08:20:40 +0000
Message-ID: <20091119082040.GA1720@muc.de>
References: <20091118191258.GA2676@muc.de>
To: Stefan Monnier
Cc: emacs-devel@gnu.org

Morning, Stefan!

On Wed, Nov 18, 2009 at 08:27:24PM -0500, Stefan Monnier wrote:

> The integer 241 is used to represent the char ?ñ, but it's also used for
> many other things, one of them being to represent the byte 241 (tho such
> a byte can also be represented as the integer 4194289).

> Now strings come in two flavors: multibyte (i.e. sequences of chars) and
> unibyte (i.e. sequences of bytes).  So when you do:

>    M-: (setq nl "\n")
>    M-: (aset nl 0 ?ñ)
>    M-: (insert nl)

> The `aset' part may do two different things depending on whether `nl' is
> unibyte or multibyte: it will either insert the char ?ñ or the byte 241.
> In the above code the "\n" is taken as a unibyte string, tho I'm not
> sure why we made this arbitrary choice.

The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets
displayed - when I do M-: (aset nl 0 ?ñ), I get

    "2289 (#o4361, #x8f1)"  (Emacs 22.3)
    "241 (#o361, #xf1)"     (Emacs 23.1)

displayed in the echo area.  So my `aset' invocation is trying to write a
multibyte ?ñ into a unibyte ?\n, and gets truncated from #x8f1 to #xf1 in
the process.

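
The two outcomes of `aset' that Stefan describes can be checked
directly.  The following is a minimal sketch, assuming Emacs 23 or later.
The function name `aset-flavour-demo' and the variables `uni' and `multi'
are made up for illustration, and #xf1 is written in place of ?ñ so that
the example does not depend on the encoding of the file it is pasted into:

```elisp
;; Sketch: the same `aset' call stores a byte or a char depending on
;; the flavour of the target string.  All names here are illustrative.
(defun aset-flavour-demo ()
  "Return flavour and byte length of two one-element strings after `aset'."
  (let ((uni (copy-sequence "\n"))      ; ASCII-only literal: unibyte
        (multi (string-to-multibyte (copy-sequence "\n")))) ; forced multibyte
    (aset uni 0 #xf1)                   ; stores the byte 241
    (aset multi 0 #xf1)                 ; stores the char ?ñ
    (list (multibyte-string-p uni)
          (multibyte-string-p multi)
          (string-bytes uni)
          (string-bytes multi))))

(aset-flavour-demo)                     ; => (nil t 1 2)
```

The copies are taken with `copy-sequence' so that no string literal is
modified in place.
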
Surely this behaviour in Emacs 23.1 is a bug?  Shouldn't we fix it before
the pretest?  How about interpreting "\n" and friends as multibyte or
unibyte according to the prevailing flavour?

> If you give us more context (i.e. more of the real code where the
> problem shows up), maybe we can tell you how to avoid it.

OK.  I have my own routine to display regexps.  As a first step, I
translate \n -> ñ (and \t, \r, \f similarly).  This is how:

    (defun translate-rnt (regexp)
      "REGEXP is a string.  Translate any \t \n \r and \f characters
    to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
    to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
    The original string is modified."
      (let (ch pos)
        (while (setq pos (string-match "[\t\n\r\f]" regexp))
          (setq ch (aref regexp pos))
          (aset regexp pos                       ; <===================
                (cond ((eq ch ?\t) ?Î)
                      ((eq ch ?\n) ?ñ)
                      ((eq ch ?\r) ?®)
                      (t ?£))))
        regexp))

> Usually, I recommend to stay away from `aset' on strings for various
> reasons, and it seems that it also helps avoid those tricky issues (tho
> it doesn't protect you from them completely).

Again, surely this is a bug?  These tricky issues should be dealt with in
the lisp interpreter in a way that lisp hackers don't have to worry
about.  Why do we have both unibyte and multibyte?  Is there any reason
not to remove unibyte altogether (though obviously not for 23.2)?

What was the change between 22.3 and 23.1 that broke my code?  Would it,
perhaps, be a good idea to reconsider that change?

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).
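
Stefan's advice to stay away from `aset' on strings can be followed by
building a new string instead of modifying REGEXP in place.  The
following is only a sketch under assumptions: Emacs 23 or later, an
input that holds text rather than raw bytes, and a hypothetical function
name `translate-rnt-copy' that does not appear in the thread.  The
replacement characters are written as hex codes so that the result does
not depend on the file's encoding:

```elisp
;; Sketch of a non-destructive variant; `translate-rnt-copy' is a
;; hypothetical name.  Assumes REGEXP holds text, not raw bytes.
(defun translate-rnt-copy (regexp)
  "Return a copy of REGEXP with tab, newline, CR and formfeed made visible.
Tab becomes #xCE, newline #xF1, carriage return #xAE and formfeed
#xA3.  REGEXP itself is left unmodified."
  (mapconcat (lambda (ch)
               ;; `string' turns a single character into a string.
               (string (cond ((eq ch ?\t) #xce)
                             ((eq ch ?\n) #xf1)
                             ((eq ch ?\r) #xae)
                             ((eq ch ?\f) #xa3)
                             (t ch))))
             regexp
             ""))

;; Usage: inspect the result as a list of character codes.
(append (translate-rnt-copy "a\tb\nc") nil) ; => (97 206 98 241 99)
```

Because each replacement is created as a character, the pieces containing
it are multibyte and so is the concatenated result, which sidesteps the
question of which flavour the original literal had.
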