From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 09:08:29 -0500
References: <20091118191258.GA2676@muc.de> <20091119082040.GA1720@muc.de>
In-Reply-To: <20091119082040.GA1720@muc.de> (Alan Mackenzie's message of "Thu, 19 Nov 2009 08:20:40 +0000")
To: Alan Mackenzie
Cc: emacs-devel@gnu.org
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)
Xref: news.gmane.org gmane.emacs.devel:117248

> The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets
> displayed

There are many differences that cause it to work completely differently:

> - when I do M-: (aset nl 0 ?ñ), I get
>   "2289 (#o4361, #x8f1)" (Emacs 22.3)
>   "241 (#o361, #xf1)" (Emacs 23.1)

   ?ñ = 2289 in Emacs-22
   ?ñ = 241 in Emacs-23

So in Emacs-22, there is no possible confusion for this char with a byte.
So when you do the `aset', Emacs-22 converts the unibyte string nl to
multibyte, whereas Emacs-23 doesn't.  From then on, in Emacs-22 your
example is all multibyte, so there's no surprise.

Now if in Emacs-22 you do instead (aset nl 0 241), where 241 in Emacs-22
is not a valid char and can hence only be a byte, then aset leaves the
string as unibyte and we end up with the same nl as in Emacs-23.
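
To make the char/byte ambiguity of 241 concrete, here is a minimal
sketch, assuming Emacs 23 or later.  `my-describe-string' is
a hypothetical helper name used only for illustration, not part of Emacs.
It builds one string from the character 241 and one from the byte 241
and reports how each is stored:

```elisp
;; Sketch only; assumes Emacs 23 or later.
;; `my-describe-string' is a hypothetical helper, not an Emacs function.
(defun my-describe-string (s)
  "Return a list (MULTIBYTE-P CHARS BYTES) describing string S."
  (list (multibyte-string-p s) (length s) (string-bytes s)))

;; `string' treats 241 as a character: one char, stored as two bytes.
(my-describe-string (string 241))          ; => (t 1 2)

;; `unibyte-string' treats 241 as a byte: one raw byte.
(my-describe-string (unibyte-string 241))  ; => (nil 1 1)
```

Both strings come from the same number, yet they are not `equal',
which is why it matters whether a given operation takes 241 as a char or
as a byte.
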

But if you then (insert nl), Emacs-22 will probably end up inserting
a ñ in your buffer, because Emacs-22 performs a decoding step using your
language environment when inserting a unibyte string into a multibyte
buffer (this used to be helpful for code that didn't know enough about
Mule to set up coding systems properly, which is why it was done, but
nowadays it was just hiding bugs and encouraging sloppiness in coding,
so we removed it).

> fix it before the pretest?  How about interpreting "\n" and friends as
> multibyte or unibyte according to the prevailing flavour?

I'm not sure what that means.  But maybe "\n" should be multibyte, yes.

>> If you give us more context (i.e. more of the real code where the
>> problem show up), maybe we can tell you how to avoid it.

> OK.  I have my own routine to display regexps.  As a first step, I
> translate \n -> ñ, (and \t, \r, \f similarly).  This is how:

> (defun translate-rnt (regexp)
>   "REGEXP is a string.  Translate any \t \n \r and \f characters
>   to wierd non-ASCII printable characters: \t to Î (206, \xCE), \n
>   to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
>   The original string is modified."
>   (let (ch pos)
>     (while (setq pos (string-match "[\t\n\r\f]" regexp))
>       (setq ch (aref regexp pos))
>       (aset regexp pos                       ; <===================
>             (cond ((eq ch ?\t) ?Î)
>                   ((eq ch ?\n) ?ñ)
>                   ((eq ch ?\r) ?®)
>                   (t ?£))))
>     regexp))

Each one of those `aset' (when performed according to your wishes) would
change the byte-size of the string, so it would internally require
copying the whole string each time: aset on (multibyte) strings is very
inefficient (compared to what most people expect, not necessarily
compared to other operations).  I'd recommend you use higher-level
operations since they'll work just as well and are less susceptible to
such problems:

   (replace-regexp-in-string "[\t\n\r\f]"
                             (lambda (s)
                               (or (cdr (assoc s '(("\t" . "Î")
                                                   ("\n" . "ñ")
                                                   ("\r" . "®"))))
                                   "£"))
                             regexp)

> Why do we have both unibyte and multibyte?  Is there any reason
> not to remove unibyte altogether (though obviously not for 23.2).

Because bytes and chars are different, so we have strings of bytes and
strings of chars.  The problem is not that both exist, but that they are
not different enough.

Many people don't understand the difference between chars and bytes, but
even more people can't figure out which Elisp operation returns
a unibyte string and which a multibyte string, and that for a "good"
reason: it's very difficult to predict.

Emacs-23 tries to help in this in the following ways:
- `string' always builds a multibyte string now, so if you want
  a unibyte string, you need to use the new `unibyte-string' function.
- we don't automatically perform encoding/decoding conversions between
  the two forms, so we hide the difference a bit less.

We should probably move towards making all string immediates multibyte
and add a new syntax for unibyte immediates.

> What was the change between 22.3 and 23.1 that broke my code?

Mostly: the change to a Unicode internal representation, which made 241
(and other byte values) ambiguous since it can now also be interpreted
as a character value.

> Would it, perhaps, be a good idea to reconsider that change?

I think you'll understand that reverting to the emacs-mule
(iso-2022-based) internal representation is not really on the table ;-)


        Stefan