From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: gmane.emacs.devel
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 09:08:29 -0500
Message-ID: <jwvmy2ieix7.fsf-monnier+emacs@gnu.org>
References: <20091118191258.GA2676@muc.de>
	<jwvlji3fgzi.fsf-monnier+emacs@gnu.org> <20091119082040.GA1720@muc.de>
Cc: emacs-devel@gnu.org
To: Alan Mackenzie <acm@muc.de>
In-Reply-To: <20091119082040.GA1720@muc.de> (Alan Mackenzie's message of "Thu, 
	19 Nov 2009 08:20:40 +0000")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)

> The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets
> displayed

There are many differences that cause it to work completely differently:

> - when I do M-: (aset nl 0 ?ñ), I get

>    "2289 (#o4361, #x8f1)" (Emacs 22.3)
>    "241 (#o361, #xf1)"    (Emacs 23.1)

?ñ = 2289 in Emacs-22
?ñ = 241  in Emacs-23

So in Emacs-22, there is no possible confusion for this char with
a byte.
So when you do the `aset', Emacs-22 converts the unibyte string nl to
multibyte, whereas Emacs-23 doesn't.  From then on, in Emacs-22 your
example is all multibyte, so there's no surprise.

Now if in Emacs-22 you do instead (aset nl 0 241), where 241 in Emacs-22
is not a valid char and can hence only be a byte, then aset leaves the
string as unibyte and we end up with the same nl as in Emacs-23.  But if
you then (insert nl), Emacs-22 will probably end up inserting a ñ in
your buffer, because Emacs-22 performs a decoding step using your
language environment when inserting a unibyte string into a multibyte
buffer (this used to be helpful for code that didn't know enough about
Mule to set up coding systems properly, which is why it was done, but
by now it was just hiding bugs and encouraging sloppiness in coding, so
we removed it).
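
To check which flavour a string has before and after such an `aset',
here is a minimal sketch, assuming Emacs 23 or later.  The helper name
`my-aset-flavour' is hypothetical, not something that exists in Emacs:

```elisp
(defun my-aset-flavour (elt)
  "Return (BEFORE AFTER): is a fresh \"\\n\" string multibyte around an aset of ELT?"
  (let* ((nl (copy-sequence "\n"))        ; copy, so the literal is not modified
         (before (multibyte-string-p nl)))
    (aset nl 0 elt)
    ;; Whether AFTER is t or nil depends on ELT and on the Emacs version.
    (list before (multibyte-string-p nl))))

(my-aset-flavour 241)
```

The first element is nil because an ASCII-only literal such as "\n" is
read as a unibyte string; the second element tells you whether `aset'
converted it.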

> fix it before the pretest?  How about interpreting "\n" and friends as
> multibyte or unibyte according to the prevailing flavour?

I'm not sure what that means.  But maybe "\n" should be multibyte, yes.

>> If you give us more context (i.e. more of the real code where the
>> problem shows up), maybe we can tell you how to avoid it.

> OK.  I have my own routine to display regexps.  As a first step, I
> translate \n -> ñ, (and \t, \r, \f similarly).  This is how:

>     (defun translate-rnt (regexp)
>       "REGEXP is a string.  Translate any \t \n \r and \f characters
>     to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
>     to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
>     The original string is modified."
>       (let (ch pos)
>         (while (setq pos (string-match "[\t\n\r\f]" regexp))
>           (setq ch (aref regexp pos))
>           (aset regexp pos                        ; <===================
>                 (cond ((eq ch ?\t) ?Î)
>                       ((eq ch ?\n) ?ñ)
>                       ((eq ch ?\r) ?®)
>                       (t           ?£))))
>         regexp))

Each one of those `aset' (when performed according to your wishes) would
change the byte-size of the string, so it would internally require
copying the whole string each time: aset on (multibyte) strings is very
inefficient (compared to what most people expect, not necessarily
compared to other operations).  I'd recommend you use higher-level
operations since they'll work just as well and are less susceptible to
such problems:

  (replace-regexp-in-string "[\t\n\r\f]"
                            (lambda (s)
                              (or (cdr (assoc s '(("\t" . "Î")
                                                  ("\n" . "ñ")
                                                  ("\r" . "®"))))
                                  "£"))
                            regexp)
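
Wrapped up as a function, this could look like the sketch below.  The
name `my-translate-rnt' is made up for illustration, and the two `t'
arguments (FIXEDCASE and LITERAL) are an addition so that the
replacement text is inserted verbatim.  Unlike the `aset' version, it
returns a new string and leaves the original untouched:

```elisp
(defun my-translate-rnt (regexp)
  "Return a copy of REGEXP with tab, newline, CR and formfeed made visible."
  (replace-regexp-in-string
   "[\t\n\r\f]"
   (lambda (s)
     ;; S is the matched one-character string; look up its replacement,
     ;; falling back to "£" for the remaining case, formfeed.
     (or (cdr (assoc s '(("\t" . "Î")
                         ("\n" . "ñ")
                         ("\r" . "®"))))
         "£"))
   regexp t t))

(my-translate-rnt "a\tb\nc")   ; => "aÎbñc"
```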

> Why do we have both unibyte and multibyte?  Is there any reason
> not to remove unibyte altogether (though obviously not for 23.2).

Because bytes and chars are different, so we have strings of bytes and
strings of chars.  The problem with it is not their combined existence,
but the fact that they are not different enough.  Many people don't
understand the difference between chars and bytes, but even more people
can't figure out which Elisp operation returns a unibyte string and
which a multibyte string, and that for a "good" reason: it's very
difficult to predict.

Emacs-23 tries to help in this in the following ways:
- `string' always builds a multibyte string now, so if you want
  a unibyte string, you need to use the new `unibyte-string' function.
- we don't automatically perform encoding/decoding conversions between
  the two forms, so we hide the difference a bit less.
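
A small sketch of the difference between the two kinds of string,
assuming Emacs 23 or later (`my-string-flavours' is a hypothetical
name used only for this illustration):

```elisp
(defun my-string-flavours ()
  "Return facts about a string of chars and a string of bytes."
  (let ((chars (string ?\xF1))           ; one character, ñ
        (bytes (unibyte-string 241)))    ; one byte, #xF1
    (list (multibyte-string-p chars)     ; t
          (multibyte-string-p bytes)     ; nil
          (string-bytes chars)           ; 2, the internal representation
          (string-bytes bytes)           ; 1
          ;; Going from bytes to chars takes an explicit decoding step.
          (equal chars (decode-coding-string bytes 'latin-1)))))

(my-string-flavours)   ; => (t nil 2 1 t)
```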

We should probably move towards making all string immediates multibyte
and add a new syntax for unibyte immediates.

> What was the change between 22.3 and 23.1 that broke my code?

Mostly: the change to a Unicode-based internal representation, which
made 241 (and other byte values) ambiguous since it can now also be
interpreted as a character value.

> Would it, perhaps, be a good idea to reconsider that change?

I think you'll understand that reverting to the emacs-mule
(iso-2022-based) internal representation is not really on the table ;-)


        Stefan