unofficial mirror of emacs-devel@gnu.org 
* Inadequate documentation of silly characters on screen.
@ 2009-11-18  9:37 Alan Mackenzie
  2009-11-18  9:40 ` Miles Bader
  0 siblings, 1 reply; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-18  9:37 UTC (permalink / raw)
  To: emacs-devel

Hi, Emacs,

Once again, I'm getting silly characters on the screen.  In *scratch*,
where I've written "ñ", what gets displayed is "\361".  It may have
happened when I upgraded to Emacs 23.

This keeps happening to me, I don't know why, but most importantly it
seems unusually poorly documented.  Goodness knows how an ordinary user
manages this, but I cannot easily track down the proper bit in the
manual to sort this out.

Things like character sets and their display aren't my area.  Why won't
it just work?

I go to the coding systems page.  There is no @def{coding system}, just
vague references from which you're supposed to gain understanding by
osmosis.  What I DON'T get from this osmosis is whether or not "coding
systems" deal with the garbage "\361" on my screen.  There is definitely
a missing "For the appearance of the text on your screen @ref{...}".

So, once again, I've got between half an hour and an hour of wasted time
trying to debug, yet again, this problem.  Why can I not easily find the
answer in the Emacs manual?

Of secondary importance, why does this problem keep happening in the
first place?

-- 
Alan Mackenzie (Nuremberg, Germany).




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Inadequate documentation of silly characters on screen.
  2009-11-18  9:37 Alan Mackenzie
@ 2009-11-18  9:40 ` Miles Bader
  2009-11-18 10:15   ` Alan Mackenzie
  0 siblings, 1 reply; 101+ messages in thread
From: Miles Bader @ 2009-11-18  9:40 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

Alan Mackenzie <acm@muc.de> writes:
> Once again, I'm getting silly characters on the screen.  In *scratch*,
> where I've written "ñ", what gets displayed is "\361".  It may have
> happened when I upgraded to Emacs 23.

Does it happen with "emacs -Q"?

How do you "write" ñ (do you use an input method?  Type it on your keyboard...?)?

What language environment do you use (if you don't set it explicitly, it
will be set automatically from the LANG environment variable)?

Do you use X emacs, emacs in a tty, etc.?  If tty emacs, which type of
terminal do you use?

-Miles

-- 
Defenceless, adj. Unable to attack.





* Re: Inadequate documentation of silly characters on screen.
  2009-11-18  9:40 ` Miles Bader
@ 2009-11-18 10:15   ` Alan Mackenzie
  2009-11-18 12:03     ` Jason Rumney
  2009-11-18 15:02     ` Stefan Monnier
  0 siblings, 2 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-18 10:15 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-devel

Hi, Miles,

thanks for such a quick answer.

On Wed, Nov 18, 2009 at 06:40:53PM +0900, Miles Bader wrote:
> Alan Mackenzie <acm@muc.de> writes:
> > Once again, I'm getting silly characters on the screen.  In *scratch*,
> > where I've written "ñ", what gets displayed is "\361".  It may have
> > happened when I upgraded to Emacs 23.

> Does it happen with "emacs -Q"?

Yes.

> How do you "write" ñ (do you use an input method?  Type it on your keyboard...?)?

When I hold the <alt> key and type "241" on the numeric keypad, the "ñ"
appears correctly on the screen.  My program does (insert pr-line), where
pr-line is a string containing the ñ - this puts \361 up.

> What language environment do you use (if you don't set it explicitly, it
> will be set automatically from the LANG environment variable)?

I've tried M-x set-language-environment <CR> latin-1.  The mode line of
my *scratch* looks like this:

    -111:**--F1  *scratch*      All L10   C184  (Lisp Interaction)----P678/678

> Do you use X emacs, emacs in a tty, etc.?  If tty emacs, which type of
> terminal do you use?

A Linux tty.

> -Miles

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Inadequate documentation of silly characters on screen.
  2009-11-18 10:15   ` Alan Mackenzie
@ 2009-11-18 12:03     ` Jason Rumney
  2009-11-18 15:02     ` Stefan Monnier
  1 sibling, 0 replies; 101+ messages in thread
From: Jason Rumney @ 2009-11-18 12:03 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, Miles Bader

Alan Mackenzie wrote:
> I've tried M-x set-language-environment <CR> latin-1.  The mode line of
> my *scratch* looks like this:
>   
If you DON'T do that, after starting emacs -Q, what does C-h L <CR> report?

It should be initialised to the same as your terminal language 
environment, which is what you need when running in a tty.

>> Do you use X emacs, emacs in a tty, etc.?  If tty emacs, which type of
>> terminal do you use?
>>     
>
> A Linux tty.
>   

More specific, please. Console? Serial? xterm? GNU screen?






* Re: Inadequate documentation of silly characters on screen.
  2009-11-18 10:15   ` Alan Mackenzie
  2009-11-18 12:03     ` Jason Rumney
@ 2009-11-18 15:02     ` Stefan Monnier
  1 sibling, 0 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-18 15:02 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, Miles Bader

> When I hold the <alt> key and type "241" on the numeric keypad, the "ñ"
> appears correctly on the screen.

So assuming you did that in a "normal" buffer, that means that A-241 is
properly interpreted by self-insert-command and you get the char "ñ"
inserted in your buffer (and then properly displayed as well).

> My program does (insert pr-line), where
> pr-line is a string containing the ñ - this puts \361 up.

How does pr-line contain this char?  I.e. how do you construct it?


        Stefan





* [acm@muc.de: Re: Inadequate documentation of silly characters on screen.]
@ 2009-11-18 19:12 Alan Mackenzie
  2009-11-19  1:27 ` Fwd: Re: Inadequate documentation of silly characters on screen Stefan Monnier
  0 siblings, 1 reply; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-18 19:12 UTC (permalink / raw)
  To: emacs-devel

Hi, Emacs!

This is the message I meant to CC: to emacs-devel.  It looks serious.

----- Forwarded message from Alan Mackenzie <acm@muc.de> -----

Date: Wed, 18 Nov 2009 11:04:53 +0000
From: Alan Mackenzie <acm@muc.de>
To: Miles Bader <miles@gnu.org>
Subject: Re: Inadequate documentation of silly characters on screen.

Hi, again, Miles!

On Wed, Nov 18, 2009 at 06:40:53PM +0900, Miles Bader wrote:
> Alan Mackenzie <acm@muc.de> writes:
> > Once again, I'm getting silly characters on the screen.  In *scratch*,
> > where I've written "ñ", what gets displayed is "\361".  It may have
> > happened when I upgraded to Emacs 23.

> Does it happen with "emacs -Q"?

> How do you "write" ñ (do you use an input method?  Type it on your keyboard...?)?

Of the good and the bad representations, if I do "C-x =" on each, I get
this:

Char: ñ (241, #o361, #xf1, file #xF1)
Char: \361 (4194289, #o17777761, #x3ffff1, raw-byte)

This sequence reproduces the bug:
M-: (setq nl "\n")
M-: (aset nl 0 ?ñ)
M-: (insert nl)

So it looks a bit like the `aset' invocation is doing damage, by doing
sign extension rather than zero filling.

> Do you use X emacs, emacs in a tty, etc.?  If tty emacs, which type of
> terminal do you use?

Linux tty.

> -Miles

-- 
Alan Mackenzie (Nuremberg, Germany).


----- End forwarded message -----





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-18 19:12 [acm@muc.de: Re: Inadequate documentation of silly characters on screen.] Alan Mackenzie
@ 2009-11-19  1:27 ` Stefan Monnier
  2009-11-19  8:20   ` Alan Mackenzie
  0 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19  1:27 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

> This is the message I meant to CC: to emacs-devel.  It looks serious.

The integer 241 is used to represent the char ?ñ, but it's also used for
many other things, one of them being to represent the byte 241 (tho such
a byte can also be represented as the integer 4194289).

Now strings come in two flavors: multibyte (i.e. sequences of chars) and
unibyte (i.e. sequences of bytes).  So when you do:

   M-: (setq nl "\n")
   M-: (aset nl 0 ?ñ)
   M-: (insert nl)

The `aset' part may do two different things depending on whether `nl' is
unibyte or multibyte: it will either insert the char ?ñ or the byte 241.
In the above code the "\n" is taken as a unibyte string, tho I'm not
sure why we made this arbitrary choice.
If you give us more context (i.e. more of the real code where the
problem shows up), maybe we can tell you how to avoid it.  Usually,
I recommend staying away from `aset' on strings for various reasons, and
it seems that this also helps avoid those tricky issues (tho it
doesn't protect you from them completely).
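
For concreteness, here is a minimal sketch of the two behaviours
(assuming the Emacs 23 semantics described above; evaluate in a
multibyte buffer such as *scratch*):

```elisp
;; A pure-ASCII literal is unibyte; `aset' with ?ñ (= 241 in Emacs 23)
;; then stores the *byte* 241, and the string stays unibyte.
(let ((nl (copy-sequence "\n")))
  (aset nl 0 ?ñ)
  (multibyte-string-p nl))        ; nil: still a sequence of bytes

;; Converting the string to multibyte first makes `aset' store the
;; *character* ñ, which then displays correctly when inserted.
(let ((nl (string-to-multibyte "\n")))
  (aset nl 0 ?ñ)
  (insert nl)                     ; inserts ñ, not \361
  (multibyte-string-p nl))        ; t: a sequence of chars
```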


        Stefan





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19  1:27 ` Fwd: Re: Inadequate documentation of silly characters on screen Stefan Monnier
@ 2009-11-19  8:20   ` Alan Mackenzie
  2009-11-19  8:50     ` Miles Bader
                       ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19  8:20 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Morning, Stefan!

On Wed, Nov 18, 2009 at 08:27:24PM -0500, Stefan Monnier wrote:

> The integer 241 is used to represent the char ?ñ, but it's also used for
> many other things, one of them being to represent the byte 241 (tho such
> a byte can also be represented as the integer 4194289).

> Now strings come in two flavors: multibyte (i.e. sequences of chars) and
> unibyte (i.e. sequences of bytes).  So when you do:

>    M-: (setq nl "\n")
>    M-: (aset nl 0 ?ñ)
>    M-: (insert nl)

> The `aset' part may do two different things depending on whether `nl' is
> unibyte or multibyte: it will either insert the char ?ñ or the byte 241.
> In the above code the "\n" is taken as a unibyte string, tho I'm not
> sure why we made this arbitrary choice.

The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets
displayed - when I do M-: (aset nl 0 ?ñ), I get

   "2289 (#o4361, #x8f1)" (Emacs 22.3)
   "241 (#o361, #xf1)"    (Emacs 23.1)

displayed in the echo area.  So my `aset' invocation is trying to write a
multibyte ?ñ into a unibyte ?\n, and gets truncated from #x8f1 to #xf1 in
the process.  Surely this behaviour in Emacs 23.1 is a bug?  Shouldn't we
fix it before the pretest?  How about interpreting "\n" and friends as
multibyte or unibyte according to the prevailing flavour?

> If you give us more context (i.e. more of the real code where the
> problem shows up), maybe we can tell you how to avoid it.

OK.  I have my own routine to display regexps.  As a first step, I
translate \n -> ñ (and \t, \r, \f similarly).  This is how:

    (defun translate-rnt (regexp)
      "REGEXP is a string.  Translate any \t \n \r and \f characters
    to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
    to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
    The original string is modified."
      (let (ch pos)
        (while (setq pos (string-match "[\t\n\r\f]" regexp))
          (setq ch (aref regexp pos))
          (aset regexp pos                        ; <===================
                (cond ((eq ch ?\t) ?Î)
                      ((eq ch ?\n) ?ñ)
                      ((eq ch ?\r) ?®)
                      (t           ?£))))
        regexp))



> Usually, I recommend to stay away from `aset' on strings for various
> reasons, and it seems that it also helps avoid those tricky issues (tho
> it doesn't protect you from them completely).

Again, surely this is a bug?  These tricky issues should be dealt with in
the lisp interpreter in a way that lisp hackers don't have to worry
about.  Why do we have both unibyte and multibyte?  Is there any reason
not to remove unibyte altogether (though obviously not for 23.2)?

What was the change between 22.3 and 23.1 that broke my code?  Would it,
perhaps, be a good idea to reconsider that change?

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Inadequate documentation of silly characters on screen.
  2009-11-19  8:20   ` Alan Mackenzie
@ 2009-11-19  8:50     ` Miles Bader
  2009-11-19 10:16     ` Fwd: " Andreas Schwab
  2009-11-19 14:08     ` Stefan Monnier
  2 siblings, 0 replies; 101+ messages in thread
From: Miles Bader @ 2009-11-19  8:50 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Stefan Monnier, emacs-devel

Alan Mackenzie <acm@muc.de> writes:
> Why do we have both unibyte and multibyte?  Is there any reason
> not to remove unibyte altogether (though obviously not for 23.2)?

For certain rare cases, it's useful for efficiency reasons, but maybe it
should never be the default.

-Miles

-- 
Opposition, n. In politics the party that prevents the Government from running
amok by hamstringing it.





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19  8:20   ` Alan Mackenzie
  2009-11-19  8:50     ` Miles Bader
@ 2009-11-19 10:16     ` Andreas Schwab
  2009-11-19 12:21       ` Alan Mackenzie
  2009-11-19 13:21       ` Jason Rumney
  2009-11-19 14:08     ` Stefan Monnier
  2 siblings, 2 replies; 101+ messages in thread
From: Andreas Schwab @ 2009-11-19 10:16 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Stefan Monnier, emacs-devel

Alan Mackenzie <acm@muc.de> writes:

> So my `aset' invocation is trying to write a multibyte ?ñ into a
> unibyte ?\n, and gets truncated from #x8f1 to #xf1 in the process.

Nothing gets truncated.  In Emacs 23 ?ñ is simply the number 241,
whereas in Emacs 22 it is the number 2289.  You can put 2289 in a string
in Emacs 23, but there is no defined unicode character with that value.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 10:16     ` Fwd: " Andreas Schwab
@ 2009-11-19 12:21       ` Alan Mackenzie
  2009-11-19 13:21       ` Jason Rumney
  1 sibling, 0 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 12:21 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Stefan Monnier, emacs-devel

Hi, Andreas,

On Thu, Nov 19, 2009 at 11:16:03AM +0100, Andreas Schwab wrote:
> Alan Mackenzie <acm@muc.de> writes:

> > So my `aset' invocation is trying to write a multibyte ?ñ into a
> > unibyte ?\n, and gets truncated from #x8f1 to #xf1 in the process.

> Nothing gets truncated.  In Emacs 23 ?ñ is simply the number 241,
> whereas in Emacs 22 it is the number 2289.  You can put 2289 in a string
> in Emacs 23, but there is no defined unicode character with that value.

Ah, thanks!  So when I do 

   M-: (setq nl "\n")
   M-: (aset nl 0 ?ñ)
   M-: (insert nl)

, after the `aset', the string nl correctly contains one character, which
is the single byte #xf1.  The bug happens in `insert', where something is
interpreting the byte #xf1 as the signed integer #xfffff.....ffff1.

Delving into the bowels of Emacs, I find this in character.h:

  #define STRING_CHAR_AND_LENGTH(p, len, actual_len)              \
    (!((p)[0] & 0x80)                                             \
     ? ((actual_len) = 1, (p)[0])                                 \
     : ! ((p)[0] & 0x20)                                          \
       ? ((actual_len) = 2,                                       \
          (((((p)[0] & 0x1F) << 6)                                \
            | ((p)[1] & 0x3F))                                    \
           + (((unsigned char) (p)[0]) < 0xC2 ? 0x3FFF80 : 0)))   \
       : ! ((p)[0] & 0x10)                                        \
         ? ((actual_len) = 3,                                     \
            ((((p)[0] & 0x0F) << 12)                              \
             | (((p)[1] & 0x3F) << 6)                             \
             | ((p)[2] & 0x3F)))                                  \
         : string_char ((p), NULL, &actual_len))

#xf1 drops through all this nonsense to string_char (in character.c).
There it falls through to this case:

  else if (! (*p & 0x08))
    {
      c = ((((p)[0] & 0xF) << 18)
           | (((p)[1] & 0x3F) << 12)
           | (((p)[2] & 0x3F) << 6)
           | ((p)[3] & 0x3F));
      p += 4;
    }

, where it obviously becomes silly.  At least, I think that's where it
ends up.  This isn't the most maintainable piece of code in Emacs.

So, if ISO-8859-1 characters are now represented as single bytes in
Emacs, what test for multibyticity should STRING_CHAR_AND_LENGTH be using?

> Andreas.

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 10:16     ` Fwd: " Andreas Schwab
  2009-11-19 12:21       ` Alan Mackenzie
@ 2009-11-19 13:21       ` Jason Rumney
  2009-11-19 13:35         ` Stefan Monnier
  2009-11-19 14:18         ` Alan Mackenzie
  1 sibling, 2 replies; 101+ messages in thread
From: Jason Rumney @ 2009-11-19 13:21 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Alan Mackenzie, Stefan Monnier, emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> Nothing gets truncated.  In Emacs 23 ?ñ is simply the number 241,
> whereas in Emacs 22 it is the number 2289.  You can put 2289 in a string
> in Emacs 23, but there is no defined unicode character with that value.

The bug here is likely that setting a character in a unibyte string to a
value between 160 and 255 does not result in an automatic conversion to
multibyte.  That was correct in 22.3, since values in that range were
raw binary bytes outside of any character set, but in 23.1 they correspond
to valid Latin-1 codepoints.





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 13:21       ` Jason Rumney
@ 2009-11-19 13:35         ` Stefan Monnier
  2009-11-19 14:18         ` Alan Mackenzie
  1 sibling, 0 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 13:35 UTC (permalink / raw)
  To: Jason Rumney; +Cc: Alan Mackenzie, Andreas Schwab, emacs-devel

>> Nothing gets truncated.  In Emacs 23 ?ñ is simply the number 241,
>> whereas in Emacs 22 it is the number 2289.  You can put 2289 in a string
>> in Emacs 23, but there is no defined unicode character with that value.

> The bug here is likely that setting a character in a unibyte string to a
> value between 160 and 255 does not result in an automatic conversion to
> multibyte.  That was correct in 22.3, since values in that range were
> raw binary bytes outside of any character set, but in 23.1 they correspond
> to valid Latin-1 codepoints.

If you think of unibyte strings as sequences of bytes, it makes perfect
sense to not automatically convert them to multibyte strings, since
a sequence of bytes cannot hold the character ñ, only the byte 241.
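
To make that concrete, a sketch (the use of `decode-coding-string' with
`latin-1' here assumes the byte really is Latin-1-encoded text):

```elisp
;; The byte 241 and the character ñ are different objects; going from
;; one to the other is a *decoding* step that must name an encoding.
(decode-coding-string (unibyte-string 241) 'latin-1)
;; a multibyte string containing the character ñ

;; `string-to-multibyte' does NOT decode: the byte stays a raw byte,
;; the char of code 4194289 that displays as \361.
(string-to-multibyte (unibyte-string 241))
```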


        Stefan







* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19  8:20   ` Alan Mackenzie
  2009-11-19  8:50     ` Miles Bader
  2009-11-19 10:16     ` Fwd: " Andreas Schwab
@ 2009-11-19 14:08     ` Stefan Monnier
  2009-11-19 14:50       ` Jason Rumney
  2009-11-19 17:08       ` Fwd: " Alan Mackenzie
  2 siblings, 2 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 14:08 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

> The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets
> displayed

There are many differences that cause it to work completely differently:

> - when I do M-: (aset nl 0 ?ñ), I get

>    "2289 (#o4361, #x8f1)" (Emacs 22.3)
>    "241 (#o361, #xf1)"    (Emacs 23.1)

?ñ = 2289 in Emacs-22
?ñ = 241  in Emacs-23

So in Emacs-22, there is no possible confusion for this char with
a byte.
So when you do the `aset', Emacs-22 converts the unibyte string nl to
multibyte, whereas Emacs-23 doesn't.  From then on, in Emacs-22 your
example is all multibyte, so there's no surprise.

Now if in Emacs-22 you do instead (aset nl 0 241), where 241 in Emacs-22
is not a valid char and can hence only be a byte, then aset leaves the
string as unibyte and we end up with the same nl as in Emacs-23.  But if
you then (insert nl), Emacs-22 will probably end up inserting a ñ in
your buffer, because Emacs-22 performs a decoding step using your
language environment when inserting a unibyte string into a multibyte
buffer (this used to be helpful for code that didn't know enough about
Mule to set up coding systems properly, which is why it was done, but
nowadays it just hid bugs and encouraged sloppiness in coding, so we
removed it).

> fix it before the pretest?  How about interpreting "\n" and friends as
> multibyte or unibyte according to the prevailing flavour?

I'm not sure what that means.  But maybe "\n" should be multibyte, yes.

>> If you give us more context (i.e. more of the real code where the
>> problem shows up), maybe we can tell you how to avoid it.

> OK.  I have my own routine to display regexps.  As a first step, I
> translate \n -> ñ (and \t, \r, \f similarly).  This is how:

>     (defun translate-rnt (regexp)
>       "REGEXP is a string.  Translate any \t \n \r and \f characters
>     to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
>     to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
>     The original string is modified."
>       (let (ch pos)
>         (while (setq pos (string-match "[\t\n\r\f]" regexp))
>           (setq ch (aref regexp pos))
>           (aset regexp pos                        ; <===================
>                 (cond ((eq ch ?\t) ?Î)
>                       ((eq ch ?\n) ?ñ)
>                       ((eq ch ?\r) ?®)
>                       (t           ?£))))
>         regexp))

Each one of those `aset' calls (when performed according to your wishes) would
change the byte-size of the string, so it would internally require
copying the whole string each time: aset on (multibyte) strings is very
inefficient (compared to what most people expect, not necessarily
compared to other operations).  I'd recommend you use higher-level
operations since they'll work just as well and are less susceptible to
such problems:

  (replace-regexp-in-string "[\t\n\r\f]"
                            (lambda (s)
                              (or (cdr (assoc s '(("\t" . "Î")
                                                  ("\n" . "ñ")
                                                  ("\r" . "®"))))
                                  "£"))
                            regexp)

> Why do we have both unibyte and multibyte?  Is there any reason
> not to remove unibyte altogether (though obviously not for 23.2).

Because bytes and chars are different, so we have strings of bytes and
strings of chars.  The problem with it is not their combined existence,
but the fact that they are not different enough.  Many people don't
understand the difference between chars and bytes, but even more people
can't figure out which Elisp operation returns a unibyte string and
which a multibyte string, and that for a "good" reason: it's very
difficult to predict.

Emacs-23 tries to help with this in the following ways:
- `string' always builds a multibyte string now, so if you want
  a unibyte string, you need to use the new `unibyte-string' function.
- we don't automatically perform encoding/decoding conversions between
  the two forms, so we hide the difference a bit less.
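
Both points can be checked directly (a sketch, assuming the Emacs 23
behaviour just described):

```elisp
(multibyte-string-p (string ?ñ))          ; t   — `string' now builds multibyte
(multibyte-string-p (unibyte-string 241)) ; nil — explicitly a byte string
```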

We should probably move towards making all string immediates multibyte
and add a new syntax for unibyte immediates.

> What was the change between 22.3 and 23.1 that broke my code?

Mostly: the change to the Unicode-based internal representation, which
made 241 (and other byte values) ambiguous, since each can now also be
interpreted as a character value.

> Would it, perhaps, be a good idea to reconsider that change?

I think you'll understand that reverting to the emacs-mule
(iso-2022-based) internal representation is not really on the table ;-)


        Stefan





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 13:21       ` Jason Rumney
  2009-11-19 13:35         ` Stefan Monnier
@ 2009-11-19 14:18         ` Alan Mackenzie
  2009-11-19 14:58           ` Jason Rumney
  2009-11-19 15:30           ` Stefan Monnier
  1 sibling, 2 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 14:18 UTC (permalink / raw)
  To: Jason Rumney; +Cc: Andreas Schwab, Stefan Monnier, emacs-devel

On Thu, Nov 19, 2009 at 09:21:41PM +0800, Jason Rumney wrote:
> Andreas Schwab <schwab@linux-m68k.org> writes:

> > Nothing gets truncated.  In Emacs 23 ?ñ is simply the number 241,
> > whereas in Emacs 22 it is the number 2289.  You can put 2289 in a
> > string in Emacs 23, but there is no defined unicode character with
> > that value.

> The bug here is likely that setting a character in a unibyte string to
> a value between 160 and 255 does not result in an automatic conversion
> to multibyte.  That was correct in 22.3, since values in that range
> were raw binary bytes outside of any character set, but in 23.1 they
> correspond to valid Latin-1 codepoints.

Putting point over the \361 and doing C-x = shows the character is 

    Char: \361 (4194289, #o17777761, #x3ffff1, raw-byte)

The actual character in the string is ñ (#xf1).

Going through all the motions, here is what I think is happening: the
\361 is put there by `insert'.

insert calls
  general_insert_function, calls
    insert_from_string (via a function pointer), calls
      insert_from_string_1, calls
        copy_text

	at this stage, I'm assuming to_multibyte (the screen buffer, in
	some form) is TRUE, and from_multibyte (a string holding the
	single character #xf1) is FALSE.  We thus execute this code in
        copy_text:

  else
    {
      unsigned char *initial_to_addr = to_addr;

      /* Convert single-byte to multibyte.  */
      while (nbytes > 0)
        {
          int c = *from_addr++;        <==============================

          if (c >= 0200)
            {
              c = unibyte_char_to_multibyte (c);
              to_addr += CHAR_STRING (c, to_addr);
              nbytes--;
            }
          else
            /* Special case for speed.  */
            *to_addr++ = c, nbytes--;
        }
      return to_addr - initial_to_addr;
    }

        At the indicated line, c is a SIGNED integer, therefore will get
	the value 0xfffffff1, not 0xf1.

	copy_text then invokes the macro
	  unibyte_char_to_multibyte (-15),

	  at which point there's no point going any further.

At least, that's my guess as to what's happening.  A fix would be to
change the declaration of "int c" to "unsigned int c".  I'm going to try
that now.

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 14:08     ` Stefan Monnier
@ 2009-11-19 14:50       ` Jason Rumney
  2009-11-19 15:27         ` Stefan Monnier
  2009-11-19 17:08       ` Fwd: " Alan Mackenzie
  1 sibling, 1 reply; 101+ messages in thread
From: Jason Rumney @ 2009-11-19 14:50 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> We should probably move towards making all string immediates multibyte
> and add a new syntax for unibyte immediates.

Also, make it an error to try to put a multibyte character in a unibyte
string rather than automatically converting the string to multibyte or
silently truncating to 8 bit or whatever Emacs does now.





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 14:18         ` Alan Mackenzie
@ 2009-11-19 14:58           ` Jason Rumney
  2009-11-19 15:42             ` Alan Mackenzie
  2009-11-19 15:30           ` Stefan Monnier
  1 sibling, 1 reply; 101+ messages in thread
From: Jason Rumney @ 2009-11-19 14:58 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Andreas Schwab, Stefan Monnier, emacs-devel

Alan Mackenzie <acm@muc.de> writes:

>         At the indicated line, c is a SIGNED integer, therefore will get
> 	the value 0xfffffff1, not 0xf1.

Surely 0xf1 is the same, regardless of whether the integer is signed
or unsigned.  

Since \361 == \xf1, I don't think this is a bug where the value is
accidentally being corrupted, but one where the character is
deliberately being assigned to its corresponding raw-byte codepoint.





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 14:50       ` Jason Rumney
@ 2009-11-19 15:27         ` Stefan Monnier
  2009-11-19 23:12           ` Miles Bader
  0 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 15:27 UTC (permalink / raw)
  To: Jason Rumney; +Cc: Alan Mackenzie, emacs-devel

>> We should probably move towards making all string immediates multibyte
>> and add a new syntax for unibyte immediates.
> Also, make it an error to try to put a multibyte character in a unibyte
> string rather than automatically converting the string to multibyte or

Yes.  Currently, we need this conversion specifically because many
strings start as unibyte even though they really should start right away
as multibyte.  That said, `aset' on multibyte strings is still evil
and unnecessary.

> silently truncating to 8 bit or whatever Emacs does now.

I don't think Emacs-23 does such silent truncations any more, tho there
might be some such checks that we still haven't installed.


        Stefan





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 14:18         ` Alan Mackenzie
  2009-11-19 14:58           ` Jason Rumney
@ 2009-11-19 15:30           ` Stefan Monnier
  2009-11-19 15:58             ` Alan Mackenzie
  1 sibling, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 15:30 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, Andreas Schwab, Jason Rumney

> The actual character in the string is ñ (#xf1).

No: the string does not contain any characters, only bytes, because it's
a unibyte string.  So it contains the byte 241, not the character ñ.
The byte 241 can be inserted in multibyte strings and buffers because it
is also a char of code 4194289 (which gets displayed as \361).
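
In other words (a sketch, using the Emacs 23 numbering from this
thread; evaluate in a multibyte buffer):

```elisp
;; In a multibyte buffer, inserting the integer 241 inserts the
;; character ñ; inserting 4194289 inserts the raw byte, displayed \361.
(insert 241)        ; ñ
(insert #x3ffff1)   ; \361
```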


        Stefan





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 14:58           ` Jason Rumney
@ 2009-11-19 15:42             ` Alan Mackenzie
  2009-11-19 19:39               ` Eli Zaretskii
  0 siblings, 1 reply; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 15:42 UTC (permalink / raw)
  To: Jason Rumney; +Cc: Andreas Schwab, Stefan Monnier, emacs-devel

On Thu, Nov 19, 2009 at 10:58:36PM +0800, Jason Rumney wrote:
> Alan Mackenzie <acm@muc.de> writes:

> >         At the indicated line, c is a SIGNED integer, therefore will get
> > 	the value 0xfffffff1, not 0xf1.

> Surely 0xf1 is the same, regardless of whether the integer is signed
> or unsigned.  

Yes it is.  Sorry - I just tried it out.  It depends only on the
signedness of the char on the RHS of the assignment.

Nevertheless, I think the bug is caused by something along these lines.

> Since \361 == \xf1, I don't think this is a bug where the value is
> accidentally being corrupted, but one where the character is
> deliberately being assigned to its corresponding raw-byte codepoint.

It's getting the value -15, at least in 32-bit two's complement.
In the sequence

    (aset nl 0 ?ñ)
    (insert nl)

, the character that comes out isn't the one that went in.  That is a
bug.

-- 
Alan Mackenzie (Nuremberg, Germany).




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:30           ` Stefan Monnier
@ 2009-11-19 15:58             ` Alan Mackenzie
  2009-11-19 16:06               ` Andreas Schwab
                                 ` (4 more replies)
  0 siblings, 5 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 15:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, Andreas Schwab, Jason Rumney

Hi, Stefan,

On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
> > The actual character in the string is ñ (#xf1).

> No: the string does not contain any characters, only bytes, because it's
> a unibyte string.

I'm thinking from the lisp viewpoint.  The string is a data structure
which contains characters.  I really don't want to have to think about
the difference between "chars" and "bytes" when I'm hacking lisp.  If I
do, then the abstraction "string" is broken.

> So it contains the byte 241, not the character ñ.

That is then a bug.  I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

> The byte 241 can be inserted in multibyte strings and buffers because
> it is also a char of code 4194289 (which gets displayed as \361).

Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?  This
is some strange usage of the word "be" that I wasn't previously aware
of.  ;-)

At this point, would you please just agree with me that when I do

   (setq nl "\n")
   (aset nl 0 ?ñ)
   (insert nl)

, what should appear on the screen should be "ñ", NOT "\361"?  Thanks!

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:58             ` Alan Mackenzie
@ 2009-11-19 16:06               ` Andreas Schwab
  2009-11-19 16:47               ` Aidan Kehoe
                                 ` (3 subsequent siblings)
  4 siblings, 0 replies; 101+ messages in thread
From: Andreas Schwab @ 2009-11-19 16:06 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, Stefan Monnier, Jason Rumney

Alan Mackenzie <acm@muc.de> writes:

> I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

Those expressions are entirely identical, indistinguishable.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:58             ` Alan Mackenzie
  2009-11-19 16:06               ` Andreas Schwab
@ 2009-11-19 16:47               ` Aidan Kehoe
  2009-11-19 17:29                 ` Alan Mackenzie
                                   ` (2 more replies)
  2009-11-19 16:55               ` David Kastrup
                                 ` (2 subsequent siblings)
  4 siblings, 3 replies; 101+ messages in thread
From: Aidan Kehoe @ 2009-11-19 16:47 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Jason Rumney, Andreas Schwab, Stefan Monnier, emacs-devel


 On the nineteenth day of November, Alan Mackenzie wrote:

 > Hi, Stefan,
 > 
 > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
 > > > The actual character in the string is ñ (#xf1).
 > 
 > > No: the string does not contain any characters, only bytes, because it's
 > > a unibyte string.
 > 
 > I'm thinking from the lisp viewpoint.  The string is a data structure
 > I really don't want to have to think about
 > the difference between "chars" and "bytes" when I'm hacking lisp.  If I
 > do, then the abstraction "string" is broken.

For some context on this, that’s how it works in XEmacs; we’ve never had
problems with it, and we seem to avoid an entire class of programming
errors that GNU Emacs developers deal with on a regular basis.

Tangentially, for those who like the unibyte/multibyte distinction, to my
knowledge the editor does not have any way of representing “an octet with
numeric value <= #x7f to be treated with byte semantics, not character
semantics”, which seems arbitrary to me. For example:

;; Both the decoded sequences are illegal in UTF-16:
(split-char
 (car (append (decode-coding-string "\xd8\x00\x00\x7f" 'utf-16-be) nil)))
=> (ascii 127)

(split-char
 (car (append (decode-coding-string "\xd8\x00\x00\x80" 'utf-16-be) nil)))
=> (eight-bit-control 128)

-- 
“Apart from the nine-banded armadillo, man is the only natural host of
Mycobacterium leprae, although it can be grown in the footpads of mice.”
  -- Kumar & Clark, Clinical Medicine, summarising improbable leprosy research




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:58             ` Alan Mackenzie
  2009-11-19 16:06               ` Andreas Schwab
  2009-11-19 16:47               ` Aidan Kehoe
@ 2009-11-19 16:55               ` David Kastrup
  2009-11-19 18:08                 ` Alan Mackenzie
  2009-11-19 19:43               ` Eli Zaretskii
  2009-11-19 20:02               ` Stefan Monnier
  4 siblings, 1 reply; 101+ messages in thread
From: David Kastrup @ 2009-11-19 16:55 UTC (permalink / raw)
  To: emacs-devel

Alan Mackenzie <acm@muc.de> writes:

> On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
>> > The actual character in the string is ñ (#xf1).
>
>> No: the string does not contain any characters, only bytes, because
>> it's a unibyte string.
>
> I'm thinking from the lisp viewpoint.  The string is a data structure
> which contains characters.  I really don't want to have to think about
> the difference between "chars" and "bytes" when I'm hacking lisp.  If
> I do, then the abstraction "string" is broken.
>
>> So it contains the byte 241, not the character ñ.
>
> That is then a bug.  I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

Huh?  ?ñ is the Emacs code point of ñ.  Which is pretty much identical
to the Unicode code point in Emacs 23.

>> The byte 241 can be inserted in multibyte strings and buffers because
>> it is also a char of code 4194289 (which gets displayed as \361).
>
> Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?
> This is some strange usage of the word "be" that I wasn't previously
> aware of.  ;-)

Emacs encodes most of its things in utf-8.  A Unicode code point is an
integer.  You can encode it in different encodings, resulting in
different byte streams.  Inside of a byte stream encoded in utf-8, the
isolated byte 241 does not correspond to a Unicode character.  It is not
valid utf-8.  When Emacs reads a file supposedly in utf-8, it wants to
represent _all_ possible byte streams in order to be able to save
unchanged data unmolested.

So it encodes the entity "illegal isolated byte 241 in an utf-8
document" with the character code 4194289 which has a representation in
Emacs' internal variant of utf-8, but is outside of the range of
Unicode.
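This round-trip can be sketched as a hypothetical session (assuming the Emacs 23 behaviour of `decode-coding-string' and `encode-coding-string' described above):

```elisp
;; The isolated byte #xF1 is not valid UTF-8, so decoding keeps it
;; as a raw-byte character outside the Unicode range.
(setq ch (aref (decode-coding-string (unibyte-string #xf1) 'utf-8) 0))
;; ch => 4194289

;; Re-encoding reproduces the original byte unmolested.
(encode-coding-string (string ch) 'utf-8)
;; => a unibyte string containing the single byte 241 ("\361")
```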

> At this point, would you please just agree with me that when I do
>
>    (setq nl "\n")
>    (aset nl 0 ?ñ)
>    (insert nl)
>
> , what should appear on the screen should be "ñ", NOT "\361"?  Thanks!

You assume that ?ñ is a character.  But in Emacs, it is an integer, a
Unicode code point in Emacs 23.  As long as there is something like a
unibyte string, there is no way to distinguish the character 241 and the
byte 241 except when Emacs is told explicitly.

Because Emacs has no separate "character" data type.

-- 
David Kastrup





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 14:08     ` Stefan Monnier
  2009-11-19 14:50       ` Jason Rumney
@ 2009-11-19 17:08       ` Alan Mackenzie
  1 sibling, 0 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 17:08 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Hi, Stefan,

On Thu, Nov 19, 2009 at 09:08:29AM -0500, Stefan Monnier wrote:

> >> If you give us more context (i.e. more of the real code where the
> >> problem show up), maybe we can tell you how to avoid it.

> > OK.  I have my own routine to display regexps.  As a first step, I
> > translate \n -> ñ, (and \t, \r, \f similarly).  This is how:

> >     (defun translate-rnt (regexp)
> >       "REGEXP is a string.  Translate any \t \n \r and \f characters
> >     to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
> >     to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
> >     The original string is modified."
> >       (let (ch pos)
> >         (while (setq pos (string-match "[\t\n\r\f]" regexp))
> >           (setq ch (aref regexp pos))
> >           (aset regexp pos                        ; <===================
> >                 (cond ((eq ch ?\t) ?Î)
> >                       ((eq ch ?\n) ?ñ)
> >                       ((eq ch ?\r) ?®)
> >                       (t           ?£))))
> >         regexp))

> Each one of those `aset' (when performed according to your wishes) would
> change the byte-size of the string, so it would internally require
> copying the whole string each time: aset on (multibyte) strings is very
> inefficient (compared to what most people expect, not necessarily
> compared to other operations).  I'd recommend you use higher-level
> operations since they'll work just as well and are less susceptible to
> such problems:

>   (replace-regexp-in-string "[\t\n\r\f]"
>                             (lambda (s)
>                               (or (cdr (assoc s '(("\t" . "Î")
>                                                   ("\n" . "ñ")
>                                                   ("\r" . "®"))))
>                                   "£"))
>                             regexp)

That works 100%.  Even in Emacs 23 ;-).  Thanks!

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 16:47               ` Aidan Kehoe
@ 2009-11-19 17:29                 ` Alan Mackenzie
  2009-11-19 18:21                   ` Aidan Kehoe
  2009-11-20  2:43                   ` Stephen J. Turnbull
  2009-11-19 19:45                 ` Eli Zaretskii
  2009-11-19 19:55                 ` Stefan Monnier
  2 siblings, 2 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 17:29 UTC (permalink / raw)
  To: Aidan Kehoe; +Cc: Jason Rumney, Andreas Schwab, Stefan Monnier, emacs-devel

On Thu, Nov 19, 2009 at 04:47:09PM +0000, Aidan Kehoe wrote:

>  On the nineteenth day of November, Alan Mackenzie wrote:

>  > Hi, Stefan,

>  > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
>  > > > The actual character in the string is ñ (#xf1).

>  > > No: the string does not contain any characters, only bytes, because it's
>  > > a unibyte string.

>  > I'm thinking from the lisp viewpoint.  The string is a data structure
>  > I really don't want to have to think about
>  > the difference between "chars" and "bytes" when I'm hacking lisp.  If I
>  > do, then the abstraction "string" is broken.

> For some context on this, that’s how it works in XEmacs; we’ve never had
> problems with it, and we seem to avoid an entire class of programming
> errors that GNU Emacs developers deal with on a regular basis.

In XEmacs, characters and integers are distinct types.  That causes
extra work having to convert between them, both mentally and in writing
code.  It is not that the GNU Emacs way is wrong, it just has a bug at
the moment.

-- 
Alan Mackenzie (Nuremberg, Germany).




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 16:55               ` David Kastrup
@ 2009-11-19 18:08                 ` Alan Mackenzie
  2009-11-19 19:25                   ` Davis Herring
                                     ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 18:08 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

Hi, David!

On Thu, Nov 19, 2009 at 05:55:10PM +0100, David Kastrup wrote:
> Alan Mackenzie <acm@muc.de> writes:

> > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
> >> > The actual character in the string is ñ (#xf1).

> >> No: the string does not contain any characters, only bytes, because
> >> it's a unibyte string.

> > I'm thinking from the lisp viewpoint.  The string is a data
> > structure which contains characters.  I really don't want to have to
> > think about the difference between "chars" and "bytes" when I'm
> > hacking lisp.  If I do, then the abstraction "string" is broken.

> >> So it contains the byte 241, not the character ñ.

> > That is then a bug.  I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

> Huh?  ?ñ is the Emacs code point of ñ.  Which is pretty much identical
> to the Unicode code point in Emacs 23.

No, you (all of you) are missing the point.  That point is that if an
Emacs Lisp hacker writes "?ñ", it should work, regardless of
what "codepoint" it has, what "bytes" represent it, whether those
"bytes" are coded with a different codepoint, or what have you.  All of
that stuff is uninteresting.  If it gets interesting, like now, it is
because it is buggy.

> >> The byte 241 can be inserted in multibyte strings and buffers
> >> because it is also a char of code 4194289 (which gets displayed as
> >> \361).

OK.  Surely displaying it as "\361" is a bug?  Should it not display as
"\17777761"?  If it did, it would have saved half of my ranting.

> > Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?
> > This is some strange usage of the word "be" that I wasn't previously
> > aware of.  ;-)

> Emacs encodes most of its things in utf-8.  A Unicode code point is an
> integer.  You can encode it in different encodings, resulting in
> different byte streams.  Inside of a byte stream encoded in utf-8, the
> isolated byte 241 does not correspond to a Unicode character.  It is not
> valid utf-8.  When Emacs reads a file supposedly in utf-8, it wants to
> represent _all_ possible byte streams in order to be able to save
> unchanged data unmolested.

That's a good explanation - it's sort of like &lt; in html.  Thanks.

> So it encodes the entity "illegal isolated byte 241 in an utf-8
> document" with the character code 4194289 which has a representation in
> Emacs' internal variant of utf-8, but is outside of the range of
> Unicode.

So, how did the character "ñ" get turned into the illegal byte #xf1?  Is
that the bug?

> > At this point, would you please just agree with me that when I do

> >    (setq nl "\n")
> >    (aset nl 0 ?ñ)
> >    (insert nl)

> > , what should appear on the screen should be "ñ", NOT "\361"?  Thanks!

> You assume that ?ñ is a character.

I do indeed.  It is self evident.

Now, would you too please just agree that when I execute the three forms
above, "ñ" should appear?

The identical argument applies to "ä".  They are characters used in
writing weird European languages like Spanish and German.  Emacs should
not have difficulty with them.  It is a standard Emacs idiom that ?x (or
?\x) is the integer representing the character x.  Indeed (unlike in
XEmacs), characters ARE integers.  Why does this not work for, e.g.,
ISO-8859-1?

> But in Emacs, it is an integer, a Unicode code point in Emacs 23.

That sounds like the sort of argument one might read on
gnu-misc-discuss.  ;-)  Sorry.  Are you saying that Emacs is converting
"?ñ" and "?ä" into the wrong integers? 

> As long as there is something like a unibyte string, there is no way
> to distinguish the character 241 and the byte 241 except when Emacs is
> told explicitly.

What is the correct Emacs internal representation for "ñ" and "ä"?  They
surely cannot share internal representations with other
(non-)characters?

> Because Emacs has no separate "character" data type.

For which I am thankful.

> -- 
> David Kastrup

-- 
Alan Mackenzie (Nuremberg, Germany).




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 17:29                 ` Alan Mackenzie
@ 2009-11-19 18:21                   ` Aidan Kehoe
  2009-11-20  2:43                   ` Stephen J. Turnbull
  1 sibling, 0 replies; 101+ messages in thread
From: Aidan Kehoe @ 2009-11-19 18:21 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Jason Rumney, Andreas Schwab, Stefan Monnier, emacs-devel


 On the nineteenth day of November, Alan Mackenzie wrote:

 > On Thu, Nov 19, 2009 at 04:47:09PM +0000, Aidan Kehoe wrote:
 > 
 > >  On the nineteenth day of November, Alan Mackenzie wrote:
 > 
 > >  > Hi, Stefan,
 > 
 > >  > [...] I really don't want to have to think about the difference
 > >  > between "chars" and "bytes" when I'm hacking lisp. If I do, then the
 > >  > abstraction "string" is broken.
 > 
 > > For some context on this, that’s how it works in XEmacs; we’ve
 > > never had problems with it, we seem to avoid an entire class of
 > > programming errors that GNU Emacs developers deal with on a regular
 > > basis.
 > 
 > In XEmacs, characters and integers are distinct types.  That causes
 > extra work having to convert between them, both mentally and in writing
 > code. 

Certainly--that’s orthogonal to the issue at hand, though; it involves some
of the same things but is distinct. XEmacs could have implemented the
unibyte-string/multibyte-string Lisp distinction and kept the type
distinction between characters and integers; we didn’t, though. (Or maybe it
was just that the Mule version that we based our code on didn’t have it.)

 > It is not that the GNU Emacs way is wrong, it just has a bug at
 > the moment.

As far as I can see it’s an old design decision. 

-- 
“Apart from the nine-banded armadillo, man is the only natural host of
Mycobacterium leprae, although it can be grown in the footpads of mice.”
  -- Kumar & Clark, Clinical Medicine, summarising improbable leprosy research




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on  screen.
  2009-11-19 18:08                 ` Alan Mackenzie
@ 2009-11-19 19:25                   ` Davis Herring
  2009-11-19 21:25                     ` Alan Mackenzie
  2009-11-19 19:52                   ` Eli Zaretskii
  2009-11-19 20:05                   ` Stefan Monnier
  2 siblings, 1 reply; 101+ messages in thread
From: Davis Herring @ 2009-11-19 19:25 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: David Kastrup, emacs-devel

[I end up having to say the same thing several times here; I thought it
preferable to omitting any of Alan's questions or any aspect of the
problem.  It's not meant to be a rant.]

> No, you (all of you) are missing the point.  That point is that if an
> Emacs Lisp hacker writes "?ñ", it should work, regardless of
> what "codepoint" it has, what "bytes" represent it, whether those
> "bytes" are coded with a different codepoint, or what have you.  All of
> that stuff is uninteresting.  If it gets interesting, like now, it is
> because it is buggy.

When you wrote ?ñ, it did work -- that character has the Unicode (and
Emacs 23) code point 241, so that two-character token is entirely
equivalent to the token "241" in Emacs source.  (This is independent of
the encoding of the source file: the same two characters might be
represented by many different octet sequences in the source file, but you
always get 241 as the value (which is a code point and is distinct from
octet sequences anyway).)

But you didn't insert that object!  You forced it into a (perhaps
surprisingly: unibyte) string, which interpreted its argument (the integer
241) as a raw byte value, because that's what unibyte strings contain. 
When you then inserted the string, Emacs transformed it into a (somewhat
artificial) character whose meaning is "this was really the byte 241,
which, since it corresponds to no UTF-8 character, must merely be
reproduced literally on disk" and whose Emacs code point is 4194289. 
(That integer looks like it could be derived from 241 by sign-extension
for the convenience of Emacs hackers; the connection is unimportant to the
user.)
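The sequence described above can be sketched as a session (hedged; the printed values assume Emacs 23's text representation):

```elisp
(setq nl "\n")           ; an ASCII-only literal is a unibyte string
(multibyte-string-p nl)  ; => nil
(aset nl 0 ?ñ)           ; ?ñ reads as 241; the unibyte string stores
                         ; it as the raw byte 241
(aref nl 0)              ; => 241
;; Inserting NL into a multibyte buffer promotes the byte 241 to the
;; raw-byte character 4194289, which is displayed as \361.
```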

> OK.  Surely displaying it as "\361" is a bug?  Should it not display as
> "\17777761"?  If it did, it would have saved half of my ranting.

No: characters are displayed according to their meaning, not their
internal code point.  As it happens, this character's whole meaning is
"the byte #o361", so that's what's displayed.

> So, how did the character "ñ" get turned into the illegal byte #xf1?  Is
> that the bug?

By its use in `aset' in a unibyte context (determined entirely by the
target string).

>> You assume that ?ñ is a character.
>
> I do indeed.  It is self evident.

Its characterness is determined by context, because (as you know) Emacs
has no distinct character type.  So, in the isolation of English prose, we
have no way of telling whether ?ñ "is" a character or an integer, any more
than we can guess about 241.  (We can guess about the writer's desires,
but not about the real effects.)

> Now, would you too please just agree that when I execute the three forms
> above, "ñ" should appear?

That's Stefan's point: should common string literals generate multibyte
strings (so as to change the meaning, not of the string, but of `aset', to
what you want)?  Maybe: one could also address the issue by disallowing
`aset' on unibyte strings (or strings entirely) and introducing
`aset-unibyte' (and perhaps `aset-multibyte') so that the argument
interpretation (and the O(n) nature of the latter) would be made clear to
the programmer.  Maybe the doc-string for `aset' should just bear a really
loud warning.

It bears more consideration than merely "yes" to your question, as
reasonable as it seems.

> What is the correct Emacs internal representation for "ñ" and "ä"?  They
> surely cannot share internal representations with other
> (non-)characters?

They have the unique internal representation as (mostly) Unicode code
points (integers) 241 and 228, which happen to be identical to the
representations of bytes of those values (which interpretation prevails in
a unibyte context).

Davis

-- 
This product is sold by volume, not by mass.  If it appears too dense or
too sparse, it is because mass-energy conversion has occurred during
shipping.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:42             ` Alan Mackenzie
@ 2009-11-19 19:39               ` Eli Zaretskii
  0 siblings, 0 replies; 101+ messages in thread
From: Eli Zaretskii @ 2009-11-19 19:39 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, schwab, monnier, jasonr

> Date: Thu, 19 Nov 2009 15:42:31 +0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: Andreas Schwab <schwab@linux-m68k.org>,
> 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> In the sequence
> 
>     (aset nl 0 ?ñ)
>     (insert nl)
> 
> , the character that comes out isn't the one that went in.  That is a
> bug.

No, it isn't.  You inserted 241 and got it back.





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:58             ` Alan Mackenzie
                                 ` (2 preceding siblings ...)
  2009-11-19 16:55               ` David Kastrup
@ 2009-11-19 19:43               ` Eli Zaretskii
  2009-11-19 21:57                 ` Alan Mackenzie
  2009-11-19 20:02               ` Stefan Monnier
  4 siblings, 1 reply; 101+ messages in thread
From: Eli Zaretskii @ 2009-11-19 19:43 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: jasonr, schwab, monnier, emacs-devel

> Date: Thu, 19 Nov 2009 15:58:48 +0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@linux-m68k.org>,
> 	Jason Rumney <jasonr@gnu.org>
> 
> > No: the string does not contain any characters, only bytes, because it's
> > a unibyte string.
> 
> I'm thinking from the lisp viewpoint.  The string is a data structure
> which contains characters.  I really don't want to have to think about
> the difference between "chars" and "bytes" when I'm hacking lisp.  If I
> do, then the abstraction "string" is broken.

No, it isn't.  Emacs supports unibyte strings and multibyte strings.
The latter hold characters, but the former hold raw bytes.  See
"(elisp) Text Representations".

> > The byte 241 can be inserted in multibyte strings and buffers because
> > it is also a char of code 4194289 (which gets displayed as \361).
> 
> Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?  This
> is some strange usage of the word "be" that I wasn't previously aware
> of.  ;-)

That's how Emacs 23 represents raw bytes in multibyte buffers and
strings.

> At this point, would you please just agree with me that when I do
> 
>    (setq nl "\n")
>    (aset nl 0 ?ñ)
>    (insert nl)
> 
> , what should appear on the screen should be "ñ", NOT "\361"?

No, I don't agree.  If you want to get a human-readable text string,
don't use aset; use string operations instead.
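One such string operation, as a sketch (illustrative, not taken from the thread):

```elisp
;; Build a fresh multibyte string instead of mutating a unibyte one.
(setq nl (string ?ñ))  ; a multibyte string of the one character ñ
(insert nl)            ; displays ñ, not \361
```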





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 16:47               ` Aidan Kehoe
  2009-11-19 17:29                 ` Alan Mackenzie
@ 2009-11-19 19:45                 ` Eli Zaretskii
  2009-11-19 20:07                   ` Eli Zaretskii
  2009-11-19 19:55                 ` Stefan Monnier
  2 siblings, 1 reply; 101+ messages in thread
From: Eli Zaretskii @ 2009-11-19 19:45 UTC (permalink / raw)
  To: Aidan Kehoe; +Cc: acm, emacs-devel, schwab, monnier, jasonr

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Thu, 19 Nov 2009 16:47:09 +0000
> Cc: Jason Rumney <jasonr@gnu.org>, Andreas Schwab <schwab@linux-m68k.org>,
> 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> Tangentially, for those who like the unibyte/multibyte distinction, to my
> knowledge the editor does not have any way of representing “an octet with
> numeric value <= #x7f to be treated with byte semantics, not character
> semantics”

Emacs 23 does have a way of representing raw bytes, and it
distinguishes between them and characters.  See the ELisp manual (I
mentioned the node earlier in this thread).





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 18:08                 ` Alan Mackenzie
  2009-11-19 19:25                   ` Davis Herring
@ 2009-11-19 19:52                   ` Eli Zaretskii
  2009-11-19 20:53                     ` Alan Mackenzie
  2009-11-19 20:05                   ` Stefan Monnier
  2 siblings, 1 reply; 101+ messages in thread
From: Eli Zaretskii @ 2009-11-19 19:52 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: dak, emacs-devel

> Date: Thu, 19 Nov 2009 18:08:48 +0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: emacs-devel@gnu.org
> 
> No, you (all of you) are missing the point.  That point is that if an
> Emacs Lisp hacker writes "?ñ", it should work, regardless of
> what "codepoint" it has, what "bytes" represent it, whether those
> "bytes" are coded with a different codepoint, or what have you.

No can do, as long as we support both unibyte and multibyte buffers
and strings.

> OK.  Surely displaying it as "\361" is a bug?

It's no more a bug than this:

   M-: ?a RET => 97

If `a' can be represented as 97, then why cannot \361 be represented
as 4194289?

> So, how did the character "ñ" get turned into the illegal byte #xf1?

It did so because you used aset to put it into a unibyte string.

> Are you saying that Emacs is converting "?ñ" and "?ä" into the wrong
> integers?

Emacs can convert it into 2 distinct integer representations.  It
decides which one by the context.  And you just happened to give it
the wrong context.
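A sketch of the two representations (assuming `unibyte-char-to-multibyte', which Emacs 23 provides):

```elisp
?ñ                               ; => 241 (the character code point)
(unibyte-char-to-multibyte 241)  ; => 4194289 (the raw-byte character)
;; Which one a destination receives depends on context: a multibyte
;; string or buffer keeps 241; a unibyte one stores the raw byte.
```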

> What is the correct Emacs internal representation for "ñ" and "ä"?

That depends on whether they will be put into a multibyte
string/buffer or a unibyte one.

> > Because Emacs has no separate "character" data type.
> 
> For which I am thankful.

Then please understand that there's no bug here.





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 16:47               ` Aidan Kehoe
  2009-11-19 17:29                 ` Alan Mackenzie
  2009-11-19 19:45                 ` Eli Zaretskii
@ 2009-11-19 19:55                 ` Stefan Monnier
  2009-11-20  3:13                   ` Stephen J. Turnbull
  2 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 19:55 UTC (permalink / raw)
  To: Aidan Kehoe; +Cc: Alan Mackenzie, Jason Rumney, Andreas Schwab, emacs-devel

>> I'm thinking from the lisp viewpoint.  The string is a data structure
>> I really don't want to have to think about
>> the difference between "chars" and "bytes" when I'm hacking lisp.  If I
>> do, then the abstraction "string" is broken.

> For some context on this, that’s how it works in XEmacs; we’ve never had
> problems with it, we seem to avoid an entire class of programming errors
> that GNU Emacs developers deal with on a regular basis.

Indeed XEmacs does not represent chars as integers, and that can
eliminate several sources of problems.  Note that this problem is new in
Emacs-23, since in Emacs-22 (and in XEmacs, IIUC), there was no
character whose integer value was between 127 and 256, so there was no
ambiguity.

AFAIK most of the programming errors we've had to deal with over the
years (i.e. in Emacs-20, 21, 22) had to do with incorrect (or missing)
encoding/decoding and most of those errors existed just as much on
XEmacs because there's no way to fix them right in the
infrastructure code (tho XEmacs may have managed to hide them better by
detecting the lack of encoding/decoding and guessing an appropriate
coding-system instead).

> Tangentially, for those who like the unibyte/multibyte distinction, to my
> knowledge the editor does not have any way of representing “an octet with
> numeric value <= #x7f to be treated with byte semantics, not character
> semantics”, which seems arbitrary to me. For example:

Indeed.  It hasn't bitten us hard yet, mostly because (luckily) there
are very few coding systems which use chars 0-127 in ways incompatible
with ascii.


        Stefan




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:58             ` Alan Mackenzie
                                 ` (3 preceding siblings ...)
  2009-11-19 19:43               ` Eli Zaretskii
@ 2009-11-19 20:02               ` Stefan Monnier
  4 siblings, 0 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 20:02 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, Andreas Schwab, Jason Rumney

>> No: the string does not contain any characters, only bytes, because it's
>> a unibyte string.
> I'm thinking from the lisp viewpoint.

So am I.  Lisp also manipulates bytes sometimes.  What happens is that
you're working mostly on a major mode, so you mostly never deal with
processes and files, so basically your whole world is (or should be)
multibyte and you never want to bump into a byte.

> I really don't want to have to think about the difference between
> "chars" and "bytes" when I'm hacking lisp.

When you write code that gets an email message via a connection to an
IMAP server, you have no choice but to care about the distinction
between the sequence of bytes you receive and the sequence of
chars&images you want to turn it into.  That's true for any language,
Elisp included.

> If I do, then the abstraction "string" is broken.

Not sure in which way.

>> So it contains the byte 241, not the character ñ.
> That is then a bug.  I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

?ñ = 241 = #xf1 = #o361

There is absolutely no difference between the two expressions once
they've been read: the reader turns ?ñ into the integer 241.
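
To make that concrete, here is the kind of check anyone can run in
*scratch* (a sketch; the results assume Emacs 23's Unicode-based
internal encoding):

    (eq ?ñ 241)          ; => t   -- the reader produced a plain integer
    (eq ?ñ #xf1)         ; => t   -- the same value written in hex
    (char-to-string 241) ; => "ñ" -- and back to a (multibyte) string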

>> The byte 241 can be inserted in multibyte strings and buffers because
>> it is also a char of code 4194289 (which gets displayed as \361).

> Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?  This
> is some strange usage of the word "be" that I wasn't previously aware
> of.  ;-)

Agreed.

> At this point, would you please just agree with me that when I do

>    (setq nl "\n")
>    (aset nl 0 ?ñ)
>    (insert nl)

> , what should appear on the screen should be "ñ", NOT "\361"?  Thanks!

I have already agreed.


        Stefan





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 18:08                 ` Alan Mackenzie
  2009-11-19 19:25                   ` Davis Herring
  2009-11-19 19:52                   ` Eli Zaretskii
@ 2009-11-19 20:05                   ` Stefan Monnier
  2009-11-19 21:27                     ` Alan Mackenzie
  2 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 20:05 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: David Kastrup, emacs-devel

> OK.  Surely displaying it as "\361" is a bug?  Should it not display as
> "\17777761".  If it did, it would have saved half of my ranting.

Hmm.. I lost you here.  How would it have helped you?


        Stefan





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 19:45                 ` Eli Zaretskii
@ 2009-11-19 20:07                   ` Eli Zaretskii
  0 siblings, 0 replies; 101+ messages in thread
From: Eli Zaretskii @ 2009-11-19 20:07 UTC (permalink / raw)
  To: kehoea, acm, emacs-devel, schwab, monnier, jasonr

> Date: Thu, 19 Nov 2009 21:45:02 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: acm@muc.de, emacs-devel@gnu.org, schwab@linux-m68k.org,
> 	monnier@iro.umontreal.ca, jasonr@gnu.org
> 
> > From: Aidan Kehoe <kehoea@parhasard.net>
> > Date: Thu, 19 Nov 2009 16:47:09 +0000
> > Cc: Jason Rumney <jasonr@gnu.org>, Andreas Schwab <schwab@linux-m68k.org>,
> > 	Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> > 
> > Tangentially, for those who like the unibyte/multibyte distinction, to my
> > knowledge the editor does not have any way of representing “an octet with
> > numeric value < #x7f to be treated with byte semantics, not character
> > semantics”
> 
> Emacs 23 does have a way of representing raw bytes, and it
> distinguishes between them and characters.  See the ELisp manual (I
> mentioned the node earlier in this thread).

This is of course true, but for bytes > #x7f, not < #x7f.  Sorry.






* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 19:52                   ` Eli Zaretskii
@ 2009-11-19 20:53                     ` Alan Mackenzie
  2009-11-19 22:16                       ` David Kastrup
  0 siblings, 1 reply; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 20:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

Hi, Eli!

On Thu, Nov 19, 2009 at 09:52:20PM +0200, Eli Zaretskii wrote:
> > Date: Thu, 19 Nov 2009 18:08:48 +0000
> > From: Alan Mackenzie <acm@muc.de>
> > Cc: emacs-devel@gnu.org

> > No, you (all of you) are missing the point.  That point is that if an
> > Emacs Lisp hacker writes "?ñ", it should work, regardless of what
> > "codepoint" it has, what "bytes" represent it, whether those "bytes"
> > are coded with a different codepoint, or what have you.

> No can do, as long as we support both unibyte and multibyte buffers
> and strings.

This seems to be the big thing.  That ?ñ has no unique meaning.  The
current situation violates the description on the elisp page "Basic Char
Syntax", which describes the situation as I understood it up until half
an hour ago.

> > OK.  Surely displaying it as "\361" is a bug?

> If `a' can be represented as 97, then why cannot \361 be represented
> as 4194289?

ROFLMAO.  If this weren't true, you couldn't invent it.  ;-)

> > So, how did the character "ñ" get turned into the illegal byte #xf1?

> It did so because you used aset to put it into a unibyte string.

So, what should I have done to achieve the desired effect?  How should I
modify "(aset nl 0 ?ü)" so that it does the Right Thing?

> > Are you saying that Emacs is converting "?ñ" and "?ä" into the wrong
> > integers?

> Emacs can convert it into 2 distinct integer representations.  It
> decides which one by the context.  And you just happened to give it
> the wrong context.

OK, I understand that now, thanks.

> > > Because Emacs has no separate "character" data type.

> > For which I am thankful.

> Then please understand that there's no bug here.

Oh, I disagree with that.  But, whatever....

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 19:25                   ` Davis Herring
@ 2009-11-19 21:25                     ` Alan Mackenzie
  2009-11-19 22:31                       ` David Kastrup
  2009-11-20  8:48                       ` Fwd: Re: Inadequate documentation of silly characters on screen Eli Zaretskii
  0 siblings, 2 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 21:25 UTC (permalink / raw)
  To: Davis Herring; +Cc: David Kastrup, emacs-devel

Hi, Davis, always good to hear from you!

On Thu, Nov 19, 2009 at 11:25:05AM -0800, Davis Herring wrote:
> [I end up having to say the same thing several times here; I thought it
> preferable to omitting any of Alan's questions or any aspect of the
> problem.  It's not meant to be a rant.]

> > No, you (all of you) are missing the point.  That point is that if an
> > Emacs Lisp hacker writes "?ñ", it should work, regardless of
> > what "codepoint" it has, what "bytes" represent it, whether those
> > "bytes" are coded with a different codepoint, or what have you.  All of
> > that stuff is uninteresting.  If it gets interesting, like now, it is
> > because it is buggy.

> When you wrote ?ñ, it did work -- that character has the Unicode (and
> Emacs 23) code point 241, so that two-character token is entirely
> equivalent to the token "241" in Emacs source.  (This is independent of
> the encoding of the source file: the same two characters might be
> represented by many different octet sequences in the source file, but you
> always get 241 as the value (which is a code point and is distinct from
> octet sequences anyway).)

OK - so what's happening is that ?ñ is unambiguously 241.  But Emacs
cannot say whether that is unibyte 241 or multibyte 241, which it encodes
as 4194289.  Despite not knowing, Emacs is determined never to confuse a
4194289 type of 241 with a 241 type of 241.  So, despite the fact that
the character 4194289 probably originated as a unibyte ?ñ, it prints it
on the screen in the ugly form "\361".

> But you didn't insert that object!  You forced it into a (perhaps
> surprisingly: unibyte) string, which interpreted its argument (the integer
> 241) as a raw byte value, because that's what unibyte strings contain. 
> When you then inserted the string, Emacs transformed it into a (somewhat
> artificial) character whose meaning is "this was really the byte 241,
> which, since it corresponds to no UTF-8 character, must merely be
> reproduced literally on disk" and whose Emacs code point is 4194289. 
> (That integer looks like it could be derived from 241 by sign-extension
> for the convenience of Emacs hackers; the connection is unimportant to the
> user.)

Why couldn't Emacs have simply displayed the character as "ñ"?  Why does
it have to enforce its internal dirty linen on an unsuspecting hacker?

> > OK.  Surely displaying it as "\361" is a bug?  Should it not display
> > as "\17777761".  If it did, it would have saved half of my ranting.

> No: characters are displayed according to their meaning, not their
> internal code point.  As it happens, this character's whole meaning is
> "the byte #o361", so that's what's displayed.

That meaning is an artificial one imposed by Emacs itself.  Is there any
pressing reason to distinguish 4194289 from 241 when displaying them as
characters on a screen?

> > So, how did the character "ñ" get turned into the illegal byte #xf1?
> > Is that the bug?

> By its use in `aset' in a unibyte context (determined entirely by the
> target string).

> >> You assume that ?ñ is a character.

> > I do indeed.  It is self evident.

> Its characterness is determined by context, because (as you know) Emacs
> has no distinct character type.  So, in the isolation of English prose, we
> have no way of telling whether ?ñ "is" a character or an integer, any more
> than we can guess about 241.  (We can guess about the writer's desires,
> but not about the real effects.)

> > Now, would you too please just agree that when I execute the three
> > forms above, and "ñ" should appear?

> That's Stefan's point: should common string literals generate multibyte
> strings (so as to change the meaning, not of the string, but of `aset',
> to what you want)?

Lisp is a high level language.  It should do the Right Thing in its
representation of low level concepts, and shouldn't bug its users with
these things.

The situation is like having a text document with some characters in
ISO-8859-1 and some in UTF-8.  Chaos.  I stick with one of these
character sets for my personal stuff.

> Maybe: one could also address the issue by disallowing `aset' on
> unibyte strings (or strings entirely) and introducing `aset-unibyte'
> (and perhaps `aset-multibyte') so that the argument interpretation (and
> the O(n) nature of the latter) would be made clear to the programmer.

No.  The problem should be solved by deciding on one single character
set visible to lisp hackers, and sticking to it rigidly.  At least,
that's my humble opinion as one of the Emacs hackers least well informed
on the matter.  ;-(

> Maybe the doc-string for `aset' should just bear a really loud warning.

Yes.  But it's not really `aset' which is the liability.  It's "?".

> It bears more consideration than merely "yes" to your question, as
> reasonable as it seems.

> > What is the correct Emacs internal representation for "ñ" and "ä"?  They
> > surely cannot share internal representations with other
> > (non-)characters?

> They have the unique internal representation as (mostly) Unicode code
> points (integers) 241 and 228, which happen to be identical to the
> representations of bytes of those values (which interpretation prevails in
> a unibyte context).

Sorry, what the heck is "the byte with value 241"?  Does this concept
have any meaning, any utility beyond the machiavellian one of confusing
me?  How would one use "the byte with value 241", and why does it need to
be kept distinct from "ñ"?

> Davis

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 20:05                   ` Stefan Monnier
@ 2009-11-19 21:27                     ` Alan Mackenzie
  0 siblings, 0 replies; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 21:27 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: David Kastrup, emacs-devel

Hi, Stefan!

On Thu, Nov 19, 2009 at 03:05:59PM -0500, Stefan Monnier wrote:
> > OK.  Surely displaying it as "\361" is a bug?  Should it not display as
> > "\17777761".  If it did, it would have saved half of my ranting.

> Hmm.. I lost you here.  How would it have helped you?

I wouldn't have wasted an hour trying to sort out what was apparently
wrong with the coding systems.

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 19:43               ` Eli Zaretskii
@ 2009-11-19 21:57                 ` Alan Mackenzie
  2009-11-19 23:10                   ` Stefan Monnier
  0 siblings, 1 reply; 101+ messages in thread
From: Alan Mackenzie @ 2009-11-19 21:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, schwab, monnier, jasonr

Hi, Eli!

On Thu, Nov 19, 2009 at 09:43:29PM +0200, Eli Zaretskii wrote:
> > Date: Thu, 19 Nov 2009 15:58:48 +0000
> > From: Alan Mackenzie <acm@muc.de>
> > Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@linux-m68k.org>,
> > 	Jason Rumney <jasonr@gnu.org>

> > > No: the string does not contain any characters, only bytes, because
> > > it's a unibyte string.

> > I'm thinking from the lisp viewpoint.  The string is a data structure
> > which contains characters.  I really don't want to have to think
> > about the difference between "chars" and "bytes" when I'm hacking
> > lisp.  If I do, then the abstraction "string" is broken.

> No, it isn't.  Emacs supports unibyte strings and multibyte strings.
> The latter hold characters, but the former hold raw bytes.  See
> "(elisp) Text Representations".

The abstraction is broken.  It is broken because it isn't abstract - its
users have to think about the way characters are represented.  In an
effective abstraction, a user could just write "ñ" or ?ñ and rely on the
underlying mechanisms to work.

Instead of the abstraction "string", we have two grossly inferior
abstractions, "unibyte string" and "multibyte string".

Please suggest to me the correct elisp to "replace the zeroth character
of an existing string with Spanish n-twiddle".  If this is impossible to
write, or it's grossly larger than the buggy "(aset nl 0 ?ñ)", that's a
demonstration of the breakage.

> > > The byte 241 can be inserted in multibyte strings and buffers
> > > because it is also a char of code 4194289 (which gets displayed as
> > > \361).

> > Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?  This
> > is some strange usage of the word "be" that I wasn't previously aware
> > of.  ;-)

> That's how Emacs 23 represents raw bytes in multibyte buffers and
> strings.

Why is it necessary to distinguish between 'A' and 65?  Surely they're
both just 0x41?  I'm missing something here.

> > At this point, would you please just agree with me that when I do

> >    (setq nl "\n")
> >    (aset nl 0 ?ñ)
> >    (insert nl)

> > , what should appear on the screen should be "ñ", NOT "\361"?

> No, I don't agree.  If you want to get a human-readable text string,
> don't use aset; use string operations instead.

There aren't any.  `store-substring' will fail if the bits-and-bytes
representation of the new bit differs in size from the old bit, thus
surely isn't any better than `aset'.  At least `aset' tries to convert to
multibyte.

I don't imagine anybody here would hold that the current state of strings
is ideal.  I'm still trying to piece together what the essence of the
problem is.

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 20:53                     ` Alan Mackenzie
@ 2009-11-19 22:16                       ` David Kastrup
  2009-11-20  8:55                         ` Eli Zaretskii
  0 siblings, 1 reply; 101+ messages in thread
From: David Kastrup @ 2009-11-19 22:16 UTC (permalink / raw)
  To: emacs-devel

Alan Mackenzie <acm@muc.de> writes:

> Hi, Eli!
>
> On Thu, Nov 19, 2009 at 09:52:20PM +0200, Eli Zaretskii wrote:
>> > Date: Thu, 19 Nov 2009 18:08:48 +0000
>> > From: Alan Mackenzie <acm@muc.de>
>> > Cc: emacs-devel@gnu.org
>
>> > No, you (all of you) are missing the point.  That point is that if an
>> > Emacs Lisp hacker writes "?ñ", it should work, regardless of what
>> > "codepoint" it has, what "bytes" represent it, whether those "bytes"
>> > are coded with a different codepoint, or what have you.
>
>> No can do, as long as we support both unibyte and multibyte buffers
>> and strings.
>
> This seems to be the big thing.  That ?ñ has no unique meaning.

Wrong.  It means the character code of the character ñ in Emacs'
internal encoding.

> The current situation violates the description on the elisp page
> "Basic Char Syntax", which describes the situation as I understood it
> up until half an hour ago.

Hm?


    2.3.3.1 Basic Char Syntax
    .........................

    Since characters are really integers, the printed representation of
    a character is a decimal number.  This is also a possible read
    syntax for a character, but writing characters that way in Lisp
    programs is not clear programming.  You should _always_ use the
    special read syntax formats that Emacs Lisp provides for characters.
    These syntax formats start with a question mark.

This makes very very very clear that we are talking about an integer
here.

Not that the higher node does not also mention this:

    2.3.3 Character Type
    --------------------

    A "character" in Emacs Lisp is nothing more than an integer.  In
    other words, characters are represented by their character codes.
    For example, the character `A' is represented as the integer 65.

>> > OK.  Surely displaying it as "\361" is a bug?
>
>> If `a' can be represented as 97, then why cannot \361 be represented
>> as 4194289?
>
> ROFLMAO.  If this weren't true, you couldn't invent it.  ;-)

Since raw bytes above 127 are not legal utf-8 sequences and we want some
character representation for them, and since character codes 128 to 255
are already valid Unicode codepoints, the obvious solution is to use
numbers that aren't valid Unicode codepoints.  One could have chosen
-128 to -255 for example.  Except that we don't have a natural algorithm
for encoding those in a superset of utf-8.

>> > So, how did the character "ñ" get turned into the illegal byte
>> > #xf1?
>
>> It did so because you used aset to put it into a unibyte string.
>
> So, what should I have done to achieve the desired effect?  How should
> I modify "(aset nl 0 ?ü)" so that it does the Right Thing?

Using aset on strings is crude.  If it were up to me, I would not allow
this operation at all.

>> > Are you saying that Emacs is converting "?ñ" and "?ä" into the
>> > wrong integers?
>
>> Emacs can convert it into 2 distinct integer representations.  It
>> decides which one by the context.  And you just happened to give it
>> the wrong context.
>
> OK, I understand that now, thanks.

Too bad that it's wrong.  ?ñ is the integer that is Emacs' internal
character code for ñ.  A single integer representation, only different
on Emacsen with different internal character codes.  If you want to
produce an actual string from it, use char-to-string.
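
For instance (a sketch, assuming Emacs 23 semantics):

    (char-to-string ?ñ)          ; => "ñ", a one-char multibyte string
    (insert (char-to-string ?ñ)) ; displays ñ, not \361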

-- 
David Kastrup






* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 21:25                     ` Alan Mackenzie
@ 2009-11-19 22:31                       ` David Kastrup
  2009-11-21 22:52                         ` Richard Stallman
  2009-11-20  8:48                       ` Fwd: Re: Inadequate documentation of silly characters on screen Eli Zaretskii
  1 sibling, 1 reply; 101+ messages in thread
From: David Kastrup @ 2009-11-19 22:31 UTC (permalink / raw)
  To: emacs-devel

Alan Mackenzie <acm@muc.de> writes:

> OK - so what's happening is that ?ñ is unambiguously 241.  But Emacs
> cannot say whether that is unibyte 241 or multibyte 241, which it
> encodes as 4194289.  Despite not knowing, Emacs is determined never to
> confuse a 4194289 type of 241 with a 241 type of 241.  So, despite the
> fact that the character 4194289 probably originated as a unibyte ?ñ,

?ñ is the code point of a character.  Unibyte strings contain bytes, not
characters.  ?ñ is a confusing way of writing 241 in the context of
unibyte, just like '\n' may be a confusing way of writing 10 in the
context of number bases.

> Why couldn't Emacs have simply displayed the character as "ñ"?

Because there is no character with a byte representation of 241.  You
are apparently demanding that Emacs display this "wild byte" as if it
were really encoded in latin-1.  What is so special about latin-1?
Latin-1 characters have a byte representation in utf-8, but it is not
241.

> Why does it have to enforce its internal dirty linen on an
> unsuspecting hacker?

It doesn't.  And since we are talking about a non-character isolated
byte, Emacs displays it as a non-character isolated byte rather than
throwing it out on the terminal and confusing the user with whatever the
terminal may make of it.

> That meaning is an artificial one imposed by Emacs itself.  Is there
> any pressing reason to distinguish 4194289 from 241 when displaying
> them as characters on a screen?

4194289 is the Emacs code point for "invalid raw byte with value 241",
241 is the Emacs code point for "Unicode character 241, part of latin-1
plane".  If you throw them to encode-region, the resulting unibyte
string will contain 241 for the first, but whatever external
representation is proper for the specified encoding for the second.  If
you encode to latin-1, the distinction will get lost.  If you encode to
other encodings, it won't.
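
Roughly (the exact escape sequences here are from memory, so treat
them as illustrative):

    (encode-coding-string (string ?ñ) 'latin-1) ; => "\361"     (byte F1)
    (encode-coding-string (string ?ñ) 'utf-8)   ; => "\303\261" (bytes C3 B1)

while the raw-byte char 4194289 should encode to the single byte \361
under either coding system, which is what preserves it on round trips.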

> Sorry, what the heck is "the byte with value 241"?  Does this concept
> have any meaning, any utility beyond the machiavellian one of
> confusing me?  How would one use "the byte with value 241", and why
> does it need to be kept distinct from "ñ"?

You can use Emacs to load an executable, change some string inside of it
(make sure that it contains the same number of bytes afterwards!) and
save, and everything you did not edit is the same.

That's a very fine thing.  To have this work, Emacs needs an internal
representation for "byte with code x that is not valid as part of a
character".

-- 
David Kastrup






* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 21:57                 ` Alan Mackenzie
@ 2009-11-19 23:10                   ` Stefan Monnier
  0 siblings, 0 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-19 23:10 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Eli Zaretskii, emacs-devel, schwab, jasonr

> The abstraction is broken.  It is broken because it isn't abstract - its
> users have to think about the way characters are represented.  In an
> effective abstraction, a user could just write "ñ" or ?ñ and rely on the
> underlying mechanisms to work.

> Instead of the abstraction "string", we have two grossly inferior
> abstractions, "unibyte string" and "multibyte string".

No: the abstraction "multibyte string" is what you call "a string", it's
absolutely identical.  The only problem is that there's one tiny but
significant unsupported spot: when you write a string constant you may
think it's a multibyte string, but Emacs may disagree.

The abstraction "unibyte string" is what you might call "a byte array".
It doesn't have much to do with your idea of a string.

> Please suggest to me the correct elisp to "replace the zeroth character
> of an existing string with Spanish n-twiddle".

For a unibyte string, it's impossible since "Spanish n-twiddle" is not
a byte.  For multibyte strings, `aset' will work dandy (tho
inefficiently of course because we're talking about a string, not an
array).

> If this is impossible to write, or it's grossly larger than the buggy
> "(aset nl 0 ?ñ)", that's a demonstration of the breakage.

Except the breakage is elsewhere: you expect `nl' to be a multibyte
string (i.e. "a string" in your mind), whereas Emacs tricked you earlier
and `nl' is really a byte array.

> Why is it necessary to distinguish between 'A' and 65?

It's not usually.  Because in almost all coding systems, the character
A is represented by the byte 65.

>> No, I don't agree.  If you want to get a human-readable text string,
>> don't use aset; use string operations instead.
> There aren't any.

Of course there are: substring+concat.
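
E.g. (an untested sketch) to put ñ at position 0 without mutating
a unibyte array:

    (setq nl "\n")
    (setq nl (concat "ñ" (substring nl 1)))
    (insert nl) ; displays ñ

concat allocates a fresh string and makes it multibyte when any
argument needs it, so no unibyte aset is involved.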

> I don't imagine anybody here would hold that the current state of strings
> is ideal.  I'm still trying to piece together what the essence of the
> problem is.

The essence is that "\n" is not what you think of as a string: it's
a byte array instead.  And Emacs managed to do enough magic to trick you
into thinking until now that it's just like a string.


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:27         ` Stefan Monnier
@ 2009-11-19 23:12           ` Miles Bader
  2009-11-20  2:16             ` Stefan Monnier
  2009-11-20  3:37             ` Stephen J. Turnbull
  0 siblings, 2 replies; 101+ messages in thread
From: Miles Bader @ 2009-11-19 23:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel, Jason Rumney

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> many strings start as unibyte even though they really should start
> right away as multibyte.

That seems the fundamental problem here.

It seems better to make unibyte strings something that can only be
created with some explicit operation.

-Miles

-- 
"Suppose we've chosen the wrong god. Every time we go to church we're
just making him madder and madder." -- Homer Simpson





* Re: Inadequate documentation of silly characters on screen.
  2009-11-19 23:12           ` Miles Bader
@ 2009-11-20  2:16             ` Stefan Monnier
  2009-11-20  3:37             ` Stephen J. Turnbull
  1 sibling, 0 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-20  2:16 UTC (permalink / raw)
  To: Miles Bader; +Cc: Alan Mackenzie, emacs-devel, Jason Rumney

>> many strings start as unibyte even though they really should start
>> right away as multibyte.
> That seems the fundamental problem here.
> It seems better to make unibyte strings something that can only be
> created with some explicit operation.

Agreed.  As I said earlier in this thread:

   We should probably move towards making all string immediates
   multibyte and add a new syntax for unibyte immediates.
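
Until then, the explicit predicates and conversions are the
workaround, e.g. (illustrative; whether a pure-ASCII literal like
"\n" reads as unibyte depends on the Emacs version):

    (multibyte-string-p "\n")                       ; => nil (read as unibyte)
    (multibyte-string-p (string-to-multibyte "\n")) ; => t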


-- Stefan





* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 17:29                 ` Alan Mackenzie
  2009-11-19 18:21                   ` Aidan Kehoe
@ 2009-11-20  2:43                   ` Stephen J. Turnbull
  1 sibling, 0 replies; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  2:43 UTC (permalink / raw)
  To: Alan Mackenzie
  Cc: Aidan Kehoe, emacs-devel, Andreas Schwab, Stefan Monnier,
	Jason Rumney

Alan Mackenzie writes:

 > In XEmacs, characters and integers are distinct types.  That causes
 > extra work having to convert between them, both mentally and in writing
 > code.

Why do you have to convert?  The only time you need to worry about the
integer values of characters is (1) when implementing a coding system
and (2) when dealing with control characters which do not have
consistent names or graphic representations (mostly the C1 set, but
there are areas in C0 as well -- quick, what's the name of \034?)
When do you need to do either?

 > It is not that the GNU Emacs way is wrong, it just has a bug at the
 > moment.

I agree that equating the character type to the integer type is not
"wrong".  It's a tradeoff which we make differently from Emacs: Emacs
prefers code that is shorter and easier to write, XEmacs prefers code
that may be longer (ie, uses explicit conversions where necessary) but
is easier to debug because it signals errors earlier (ie, when a
function receives an object of the wrong type rather than when a user
observes incorrect display).

However, I think that allowing a given array of bytes to change type
from unibyte to multibyte and back is just insane.  Either the types
should be different and immutable (as in Python) or there should be
only one representation (multibyte) as in XEmacs.






* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 19:55                 ` Stefan Monnier
@ 2009-11-20  3:13                   ` Stephen J. Turnbull
  0 siblings, 0 replies; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  3:13 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Aidan Kehoe, Alan Mackenzie, emacs-devel, Andreas Schwab,
	Jason Rumney

Stefan Monnier writes:

 > Indeed XEmacs does not represent chars as integers, and that can
 > eliminate several sources of problems.  Note that this problem is new in
 > Emacs-23, since in Emacs-22 (and in XEmacs, IIUC), there was no
 > character whose integer value was between 127 and 256, so there was no
 > ambiguity.

In XEmacs:

(char-int-p 241) => t
(int-char 241) => ?ñ

No problems with this that I can recall, except a few people with code
that did

    (set-face-font 'default "-*-*-*-*-*-*-*-*-*-*-*-*-iso8859-2")

and expected `(insert (int-char 241))' to display `ń' instead of
`ñ'.
(For the non-Mule-implementers, this hack works without Mule but won't
work in Mule because Mule matches those two trailing fields to the
character's charset, and 241 corresponds to a Latin-1 character, so a
"-*-*-*-*-*-*-*-*-*-*-*-*-iso8859-1" font from the set associated with
the default face will be used.)

For this reason, using char-int and int-char in XEmacs is generally a
bug unless you want to examine the internal coding system; you almost
always want to use make-char.  (Of course for ASCII values it's an
accepted idiom, but still a bad habit.)

 > AFAIK most of the programming errors we've had to deal with over the
 > years (i.e. in Emacs-20, 21, 22) had to do with incorrect (or missing)
 > encoding/decoding and most of those errors existed just as much on
 > XEmacs

I don't think that's true; AFAIK we have *no* recorded instances of
the \201 bug, while that regression persisted in GNU Emacs (albeit a
patched version, at first) from at the latest 1992 until just a few
years ago.  I think it got fixed in Mule (ie, all paths into or out of
a text object got a coding stage) before that was integrated into
XEmacs or Emacs, and the regression when Mule was integrated into
Emacs was caused by the performance hack, "text object as unibyte".

 > because there's no way to fix them right in the infrastructure code
 > (tho XEmacs may have managed to hide them better by detecting the
 > lack of encoding/decoding and guessing an appropriate coding-system
 > instead).

I don't know of any such guessing.  When the user asks us to, we guess
on input, just as you do, but once we've got text in internal format,
there is no more guessing to be done.  Emacs will encounter the need
to guess because you support "text object as unibyte".

Vive la difference technical!

;-)


* Re: Inadequate documentation of silly characters on screen.
  2009-11-19 23:12           ` Miles Bader
  2009-11-20  2:16             ` Stefan Monnier
@ 2009-11-20  3:37             ` Stephen J. Turnbull
  2009-11-20  4:30               ` Stefan Monnier
  1 sibling, 1 reply; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  3:37 UTC (permalink / raw)
  To: Miles Bader; +Cc: Alan Mackenzie, Jason Rumney, Stefan Monnier, emacs-devel

Miles Bader writes:
 > Stefan Monnier <monnier@iro.umontreal.ca> writes:
 > > many strings start as unibyte even though they really should start
 > > right away as multibyte.
 > 
 > That seems the fundamental problem here.
 > 
 > It seems better to make unibyte strings something that can only be
 > created with some explicit operation.

I don't see why you *need* them at all.  Both pre-Emacs-integration
Mule and XEmacs do fine with a multibyte representation for binary.
Nobody has complained about performance of stream operations since
Kyle Jones and Hrvoje Niksic bitched and we did some measurements in
1998 or so.  It turns out that (as you'd expect) multibyte stream
operations (except Boyer-Moore, which takes no performance hit :-) are
about 50% slower because the representation is about 50% bigger.  But
this is rarely noticeable to users.  The noticeable performance problems
turned out to be a problem with Unix interfaces, not multibyte.

The performance problem is in array operations, since (without
caching) finding a particular character position is O(position).
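The array-indexing cost described above is easy to demonstrate outside Emacs. Here is a rough Python sketch (the function name and sample string are mine, purely for illustration) of locating the Nth character in a UTF-8 byte buffer by linear scanning — the O(position) walk a variable-width representation forces when there is no caching:

```python
def char_index_to_byte_offset(buf: bytes, n: int) -> int:
    """Return the byte offset of the n-th character (0-based) in a
    UTF-8 buffer by scanning lead bytes -- O(n), like an uncached
    multibyte buffer."""
    offset = 0
    seen = 0
    while seen < n:
        b = buf[offset]
        if b < 0x80:          # 1-byte sequence (ASCII)
            offset += 1
        elif b < 0xE0:        # 2-byte sequence (lead 110xxxxx)
            offset += 2
        elif b < 0xF0:        # 3-byte sequence
            offset += 3
        else:                 # 4-byte sequence
            offset += 4
        seen += 1
    return offset

buf = "añb€c".encode("utf-8")
print(char_index_to_byte_offset(buf, 2))  # byte offset of 'b' -> 3
```

A fixed-width (unibyte or UTF-32) buffer makes the same lookup O(1), which is exactly the trade-off being debated here.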

If you want to turn Emacs into an engine for general network
programming and the like, yes, it would be good to have a separate
unibyte type.  This is what Python does, but Emacs would not have to
go through the agony of switching from a unibyte representation for
human-readable text to a multibyte representation the way Python does
for Python 3.  In that case, Emacs should not create them without an
explicit operation, and there should be a separate notation such as
#b"this is a unibyte string" (although #b may already be taken?) for
literals.
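Python 3 ended up with exactly this design: a distinct bytes type that can only be created by an explicit literal or operation, with no silent coercion to or from text. A minimal illustration (Python, for comparison only):

```python
s = "ñ"        # str: a sequence of characters (code points)
b = b"\xf1"    # bytes: created only via an explicit literal or operation

print(ord(s))              # 241, the code point of ñ
print(s.encode("utf-8"))   # b'\xc3\xb1' -- encoding is always explicit
# Mixing the two types is an error, never an implicit conversion:
try:
    s + b
except TypeError:
    print("no implicit unibyte/multibyte mixing")
```

This is the "explicit operation" discipline proposed above, enforced by the type system rather than by convention.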






* Re: Inadequate documentation of silly characters on screen.
  2009-11-20  3:37             ` Stephen J. Turnbull
@ 2009-11-20  4:30               ` Stefan Monnier
  2009-11-20  7:18                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-20  4:30 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Alan Mackenzie, emacs-devel, Jason Rumney, Miles Bader

> I don't see why you *need* them at all.

We don't need the unibyte representation.  But we do need to distinguish
bytes and chars, encoded strings from non-encoded strings, etc...
What representation is used for them is secondary, but using different
representations for the two cases doesn't seem to be a source
of problems.  The source of problems is that inherited history where we
mixed the unibyte and multibyte objects and tried to pretend they were
just one and the same thing and that conversion between them can be
done automatically.


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-20  4:30               ` Stefan Monnier
@ 2009-11-20  7:18                 ` Stephen J. Turnbull
  2009-11-20 14:16                   ` Stefan Monnier
  0 siblings, 1 reply; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  7:18 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Miles Bader, Alan Mackenzie, Jason Rumney, emacs-devel

Stefan Monnier writes:

 > What representation is used for them is secondary, but using different
 > representations for the two cases doesn't seem to be a source
 > of problems.  The source of problems is that inherited history where we
 > mixed the unibyte and multibyte objects and tried to pretend they were
 > just one and the same thing and that conversion between them can be
 > done automatically.

Er, they *were* one and the same thing because of string-as-unibyte
and friends.






* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 21:25                     ` Alan Mackenzie
  2009-11-19 22:31                       ` David Kastrup
@ 2009-11-20  8:48                       ` Eli Zaretskii
  1 sibling, 0 replies; 101+ messages in thread
From: Eli Zaretskii @ 2009-11-20  8:48 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: dak, emacs-devel

> Date: Thu, 19 Nov 2009 21:25:50 +0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: David Kastrup <dak@gnu.org>, emacs-devel@gnu.org
> 
> Why couldn't Emacs have simply displayed the character as "ñ"?

Because Emacs does not interpret raw bytes as human-readable
characters, by design.

You could set unibyte-display-via-language-environment to get it
displayed as "ñ", but that's only a display setting, it doesn't change
the basic fact that Emacs is _not_ treating 241 in a unibyte string as
a character.






* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 22:16                       ` David Kastrup
@ 2009-11-20  8:55                         ` Eli Zaretskii
  0 siblings, 0 replies; 101+ messages in thread
From: Eli Zaretskii @ 2009-11-20  8:55 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Thu, 19 Nov 2009 23:16:24 +0100
> 
> >> > Are you saying that Emacs is converting "?ñ" and "?ä" into the
> >> > wrong integers?
> >
> >> Emacs can convert it into 2 distinct integer representations.  It
> >> decides which one by the context.  And you just happened to give it
> >> the wrong context.
> >
> > OK, I understand that now, thanks.
> 
> Too bad that it's wrong.  ?ñ is the integer that is Emacs' internal
> character code for ñ.

What I wrote was not about ?ñ itself (which is indeed just an integer
241 in Emacs 23), but about the two possibilities to convert it to the
internal representation when it is inserted into a string (or a
buffer, for that matter).  One possibility is to convert it to a UTF-8
encoding of the Latin-1 character ñ, the other is to convert to the
(extended) UTF-8 encoding of a character whose codepoint is 4194289.
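The arithmetic behind the second possibility can be checked directly. Emacs 23 maps a raw byte B to an out-of-Unicode codepoint at a fixed base; the constant below is inferred from the 4194289 figure above, so treat this as an illustration rather than a normative statement:

```python
EMACS_RAW_BYTE_BASE = 0x3FFF00   # inferred: BASE + 241 == 4194289

print(EMACS_RAW_BYTE_BASE + 241)  # 4194289, the "raw byte" codepoint
print(ord("ñ"))                   # 241, the Latin-1/Unicode character
```

The same integer 241 thus has two destinations, and which one it gets depends on the unibyte/multibyte context it is inserted into — which is the ambiguity under discussion.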






* Re: Inadequate documentation of silly characters on screen.
  2009-11-20  7:18                 ` Stephen J. Turnbull
@ 2009-11-20 14:16                   ` Stefan Monnier
  2009-11-21  4:13                     ` Stephen J. Turnbull
  0 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-20 14:16 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Miles Bader, Alan Mackenzie, Jason Rumney, emacs-devel

>> What representation is used for them is secondary, but using different
>> representations for the two cases doesn't seem to be a source
>> of problems.  The source of problems is that inherited history where we
>> mixed the unibyte and multibyte objects and tried to pretend they were
>> just one and the same thing and that conversion between them can be
>> done automatically.

> Er, they *were* one and the same thing because of string-as-unibyte
> and friends.

string-as-unibyte returns a new string, so no: they were not the same.


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-20 14:16                   ` Stefan Monnier
@ 2009-11-21  4:13                     ` Stephen J. Turnbull
  2009-11-21  5:24                       ` Stefan Monnier
  0 siblings, 1 reply; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21  4:13 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, Jason Rumney, emacs-devel, Miles Bader

Stefan Monnier writes:

 > string-as-unibyte returns a new string, so no: they were not the same.

Sorry, `toggle-enable-multibyte-characters' was what I had in mind.
So, yes, they *were* *indeed* the same.  YHBT (it wasn't intentional).

I dunno, de gustibus non est disputandum and all that, but this idea
of having an in-band representation for raw bytes in a multibyte
string sounds to me like more trouble than it's worth.  I think it
would be much better to serve (eg) AUCTeX's needs with a special
coding system that grabs some unlikely-to-be-used private code space
and puts the bytes there.  That puts the responsibility for dealing
with such perversity[1] on the people who have some idea what they're
dealing with, not unsuspecting CC Mode maintainers who won't be using
that coding system.

And it should be either an error to (aset string pos 241) (sorry
Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ).  I
favor the former, because what Alan is doing screws Spanish-speaking
users AFAICS.  OTOH, the latter extends naturally if you have plans to
add support for fixed-width Unicode buffers (UTF-16 and UTF-32).

Vive la différence techniquement!

Footnotes: 
[1]  In the sense of "the world is perverse", I'm not blaming AUCTeX
or TeX for this!






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  4:13                     ` Stephen J. Turnbull
@ 2009-11-21  5:24                       ` Stefan Monnier
  2009-11-21  6:42                         ` Stephen J. Turnbull
  0 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-21  5:24 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Alan Mackenzie, Jason Rumney, emacs-devel, Miles Bader

> Sorry, `toggle-enable-multibyte-characters' was what I had in mind.
> So, yes, they *were* *indeed* the same.  YHBT (it wasn't intentional).

Oh, yes, *that* one.  I haven't yet managed to run a useful Emacs
instance with an "assert (BEG == Z);" at the entrance to this nasty
function, but I keep hoping I'll get there.

> I dunno, de gustibus non est disputandum and all that, but this idea
> of having an in-band representation for raw bytes in a multibyte
> string sounds to me like more trouble than it's worth.  I think it
> would be much better to serve (eg) AUCTeX's needs with a special
> coding system that grabs some unlikely-to-be-used private code space
> and puts the bytes there.  That puts the responsibility for dealing
> with such perversity[1] on the people who have some idea what they're
> dealing with, not unsuspecting CC Mode maintainers who won't be using
> that coding system.

I don't know what you mean.  The eight-bit "chars" were introduced to
make sure that decoding+reencoding will always return the exact same
byte-sequence, no matter what coding-system was used (i.e. even if the
byte-sequence is invalid for that coding-system).  Dunno how XEmacs
handles it.

> And it should be either an error to (aset string pos 241) (sorry
> Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ).  I
> favor the former, because what Alan is doing screws Spanish-speaking
> users AFAICS.  OTOH, the latter extends naturally if you have plans to
> add support for fixed-width Unicode buffers (UTF-16 and UTF-32).

I understand this even less.  I think XEmacs's fundamental tradeoffs are
subtly different but lead to very far-reaching consequences, and for
that reason it's difficult for us to take a step back and understand the
other point of view.


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  5:24                       ` Stefan Monnier
@ 2009-11-21  6:42                         ` Stephen J. Turnbull
  2009-11-21  6:49                           ` Stefan Monnier
  2009-11-21 12:33                           ` David Kastrup
  0 siblings, 2 replies; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21  6:42 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Miles Bader, Alan Mackenzie, emacs-devel, Jason Rumney

Stefan Monnier writes:

 > I don't know what you mean.  The eight-bit "chars" were introduced to
 > make sure that decoding+reencoding will always return the exact same
 > byte-sequence, no matter what coding-system was used (i.e. even if the
 > byte-sequence is invalid for that coding-system).  Dunno how XEmacs
 > handles it.

Honestly, it currently doesn't, or doesn't very well, despite some
work by Aidan.

However, I think a well-behaved platform should by default error
(something derived from invalid-state, in XEmacs's error hierarchy) in
such a case; normally this means corruption in the file.  There are
special cases like utf8latex whose error messages give you a certain
number of octets without respecting character boundaries; I agree
there is need to handle this case.  What Python 3 (PEP 383) does is
provide a family of coding system variants which use invalid Unicode
surrogates to encode "raw bytes" for situations where the user asks
you to proceed despite invalid octet sequences for the coding system;
since Emacs's internal code is UTF-8, any Unicode surrogate is invalid
and could be used for this purpose.  This would make non-Emacs apps
barf errors on such Emacs autosaves, but they'll probably barf on the
source file, too.
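The PEP 383 mechanism described above can be sketched in a few lines of Python: bytes that are invalid for the coding system come in as lone surrogates, round-trip losslessly on re-encode, and are refused by a strict encoder — which parallels the point about other applications barfing on such output:

```python
data = b"ok \xf1 ok"   # a lone 0xF1 is not valid UTF-8 here
text = data.decode("utf-8", errors="surrogateescape")

assert "\udcf1" in text                  # the raw byte, smuggled as a surrogate
assert text.encode("utf-8", errors="surrogateescape") == data  # lossless round trip

try:
    text.encode("utf-8")                 # strict: surrogates are invalid Unicode
except UnicodeEncodeError:
    print("strict encode rejects the escaped byte")
```

Emacs's eight-bit chars serve the same round-trip purpose, just parked at different Unicode-invalid codepoints, as noted later in the thread.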

 > > And it should be either an error to (aset string pos 241) (sorry
 > > Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ).  I
 > > favor the former, because what Alan is doing screws Spanish-speaking
 > > users AFAICS.  OTOH, the latter extends naturally if you have plans to
 > > add support for fixed-width Unicode buffers (UTF-16 and UTF-32).
 > 
 > I understand this even less.

There's a typo in the expr above, should be "multibyte-string".  The
proposed treatment of 241 is due to the fact that it is currently
illegal in multibyte strings AIUI.

Re the bit about Spanish-speakers: AIUI, Alan is translating multiline
strings to oneline strings by using an unusual graphic character.  But
it's only unusual in non-Spanish cases; Spanish-speakers may very well
want to include comments like "¡I wanna write this comment in Español!"
which would presumably get unfolded to "¡I wanna write this comment in
Espa\nol!"  Not very nice.
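The collision is mechanical: once newline is folded to a literal ñ, a genuine ñ cannot survive unfolding. A toy Python sketch (the fold character and the helper names are illustrative; this is not CC Mode's actual code):

```python
def fold(s: str) -> str:
    """Turn a multiline string into one line, using ñ as the marker."""
    return s.replace("\n", "ñ")

def unfold(s: str) -> str:
    """Invert fold() -- but it cannot tell a marker from a real ñ."""
    return s.replace("ñ", "\n")

src = "¡comment in Español!\nsecond line"
print(unfold(fold(src)))   # the real ñ comes back as a spurious newline
```

Any in-band marker drawn from the character repertoire has this failure mode; only a character guaranteed absent from the input (or an escaping scheme) avoids it.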

Re widechar buffers: the codes for Latin-1 characters in UTF-16 and
UTF-32 are just zero-padded extensions of the unibyte codes.  I'm
pretty sure it's this kind of thing that Ben had in mind when he
originally designed the XEmacs version of the Mule internal encoding
to make (= (char-int ?ñ) 241) true in all versions of XEmacs.

 > I think XEmacs's fundamental tradeoffs are subtly different but
 > lead to very far-reaching consequences,

Indeed, but I'm not talking about XEmacs, except for comparison of
techniques.






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  6:42                         ` Stephen J. Turnbull
@ 2009-11-21  6:49                           ` Stefan Monnier
  2009-11-21  7:27                             ` Stephen J. Turnbull
  2009-11-21 12:33                           ` David Kastrup
  1 sibling, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-21  6:49 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Miles Bader, Alan Mackenzie, emacs-devel, Jason Rumney

> There's a typo in the expr above, should be "multibyte-string".  The
> proposed treatment of 241 is due to the fact that it is currently
> illegal in multibyte strings AIUI.

241 is perfectly valid in multibyte strings (as well as in
unibyte-strings).


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  6:49                           ` Stefan Monnier
@ 2009-11-21  7:27                             ` Stephen J. Turnbull
  2009-11-23  1:58                               ` Stefan Monnier
  0 siblings, 1 reply; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21  7:27 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel

Stefan Monnier writes:

 > 241 is perfectly valid in multibyte strings (as well as in
 > unibyte-strings).

OK, so "invalid" was up to Emacs 22, then?

So the problem is that because characters are integers and vice versa,
there's no way for the user to let Emacs duck-type multibyte vs
unibyte strings for him.  If he cares, he needs to check.  If he
doesn't care, eventually Emacs will punish him for his lapse.

I suppose subst-char-in-string is similarly useless for Alan's
purpose, then?  What he really needs to use is something like

    (replace-in-string str "\n" "ñ")

right?






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  6:42                         ` Stephen J. Turnbull
  2009-11-21  6:49                           ` Stefan Monnier
@ 2009-11-21 12:33                           ` David Kastrup
  2009-11-21 13:55                             ` Stephen J. Turnbull
  1 sibling, 1 reply; 101+ messages in thread
From: David Kastrup @ 2009-11-21 12:33 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Stefan Monnier writes:
>
>  > I don't know what you mean.  The eight-bit "chars" were introduced
>  > to make sure that decoding+reencoding will always return the exact
>  > same byte-sequence, no matter what coding-system was used
>  > (i.e. even if the byte-sequence is invalid for that coding-system).
>  > Dunno how XEmacs handles it.
>
> Honestly, it currently doesn't, or doesn't very well, despite some
> work by Aidan.

But we don't need to make this a problem for _Emacs_.

> However, I think a well-behaved platform should by default error
> (something derived from invalid-state, in XEmacs's error hierarchy) in
> such a case; normally this means corruption in the file.

We take care that it does not mean corruption.  And more often it means
that you might have been loading with the wrong encoding (people do that
all the time).  If you edit some innocent ASCII part and save again, you
won't appreciate changes all across the file elsewhere in parts you did
not touch or see on-screen.

Sometimes there is no "right encoding".  If I load an executable or an
image file with tag strings and change one string in overwrite mode, I
want to be able to save again.  Compiled Elisp files contain binary
strings as well.  There may be source files with binary blobs in them,
there may be files with parts in different encodings and so on.

> There are special cases like utf8latex whose error messages give you a
> certain number of octets without respecting character boundaries; I
> agree there is need to handle this case.

Forget about the TeX problem: that is a red herring.  It is just one
case where irreversible corruption is not the right answer.  In fact, I
know of no case where irreversible corruption is the right answer.
"Don't touch what you don't understand" is a good rationale.  For
XEmacs, following this rationale would currently require erroring out.
And I actually recommend that you do so: you will learn the hard way
that users prefer the Emacs solution of "don't touch what you don't
understand", namely having artificial code points that losslessly
represent the parts Emacs does not understand in a particular
encoding.

> What Python 3 (PEP 383) does is provide a family of coding system
> variants which use invalid Unicode surrogates to encode "raw bytes"
> for situations where the user asks you to proceed despite invalid
> octet sequences for the coding system; since Emacs's internal code is
> UTF-8, any Unicode surrogate is invalid and could be used for this
> purpose.  This would make non-Emacs apps barf errors on such Emacs
> autosaves, but they'll probably barf on the source file, too.

We currently _have_ such a scheme in place.  We just use different
Unicode-invalid code points.

> There's a typo in the expr above, should be "multibyte-string".  The
> proposed treatment of 241 is due to the fact that it is currently
> illegal in multibyte strings AIUI.

It is a perfectly valid character ñ in multibyte strings, but not
represented by its single-byte/latin-1 equivalent.

> Re widechar buffers: the codes for Latin-1 characters in UTF-16 and
> UTF-32 are just zero-padded extensions of the unibyte codes.

I think you may be muddling characters and their byte sequence
representations.  At least I can't read much sense into this statement
otherwise.

-- 
David Kastrup






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 12:33                           ` David Kastrup
@ 2009-11-21 13:55                             ` Stephen J. Turnbull
  2009-11-21 14:36                               ` David Kastrup
  0 siblings, 1 reply; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21 13:55 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > > However, I think a well-behaved platform should by default error
 > > (something derived from invalid-state, in XEmacs's error hierarchy) in
 > > such a case; normally this means corruption in the file.
 > 
 > We take care that it does not mean corruption.

I meant pre-existing corruption, like your pre-existing disposition to
bash XEmacs.  Please take it elsewhere; it doesn't belong on Emacs
channels.  (Of course I'd prefer not to see it on XEmacs channels
either, but at least it wouldn't be entirely off-topic there.)

 > And more often it means that you might have been loading with the
 > wrong encoding (people do that all the time).  If you edit some
 > innocent ASCII part

You can't do that if the file is not in a buffer because the encoding
error aborted the conversion.  Aborting the conversion is what the
Unicode Consortium requires, too, IIRC: errors in UTF-8 (or any other
UTF for that matter) are considered *fatal* by the standard.  Exactly
what that means is up to the application to decide.  One plausible
approach would be to do what you do now, but make the buffer read-only.

 > Sometimes there is no "right encoding".

So what?  The point is that there certainly are *wrong* encodings,
namely ones that will result in corruption if you try to save the file
in that encoding.  There are usually many "usable" encodings (binary
is always available, for example).  Some will be preferred by users,
and that will be reflected in coding system precedence.

But when faced with ambiguity, it is best to refuse to guess.

 > We currently _have_ [a scheme for encoding invalid sequences of
 > code units] in place.  We just use different Unicode-invalid code
 > points [from Python].

Conceded.  I realized that later; the important difference is that
Python only uses that scheme when explicitly requested.






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 13:55                             ` Stephen J. Turnbull
@ 2009-11-21 14:36                               ` David Kastrup
  2009-11-21 17:53                                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 101+ messages in thread
From: David Kastrup @ 2009-11-21 14:36 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > > However, I think a well-behaved platform should by default error
>  > > (something derived from invalid-state, in XEmacs's error
>  > > hierarchy) in such a case; normally this means corruption in the
>  > > file.
>  > 
>  > We take care that it does not mean corruption.
>
> I meant pre-existing corruption [...]

That interpretation is not the business of the editor.  It may decide to
give a warning, but refusing to work at all does not increase its
usefulness.

>  > And more often it means that you might have been loading with the
>  > wrong encoding (people do that all the time).  If you edit some
>  > innocent ASCII part
>
> You can't do that if the file is not in a buffer because the encoding
> error aborted the conversion.

Not being able to do what I want is not a particularly enticing feature.

> Aborting the conversion is what the Unicode Consortium requires, too,
> IIRC:

An editor is not the same as a validator.  It's not its business to
decide what files I should be allowed to work with.

> errors in UTF-8 (or any other UTF for that matter) are considered
> *fatal* by the standard.  Exactly what that means is up to the
> application to decide.  One plausible approach would be to do what you
> do now, but make the buffer read-only.

Making the buffer read-only is a reasonable thing to do if it can't
possibly be written back unchanged.  For example, if I load a file in
latin-1 and insert a few non-latin-1 characters.  In this case Emacs
should not just silently write the file in utf-8 because that changes
the encoding of some preexisting characters.  The situation is different
if I load a pure ASCII file: in that case, writing it back out as
utf-8 is feasible, provided that is compatible with the environment.

>  > Sometimes there is no "right encoding".
>
> So what?  The point is that there certainly are *wrong* encodings,
> namely ones that will result in corruption if you try to save the file
> in that encoding.

But we have a fair number of encodings (those without escape characters
IIRC) which don't imply corruption when saving.  And that is a good
feature for an editor.  For example, when working with version control
systems, you want minimal diffs.  Encoding systems with escape
characters are not good for that.  I would strongly advise against Emacs
picking any escape-character based encoding (or otherwise
non-byte-stream-preserving) automatically.

Less breakage is always a good thing.

> But when faced with ambiguity, it is best to refuse to guess.

You don't need to guess if you just preserve the byte sequence.  That
makes it somebody else's problem.  The GNU utilities have always made it
a point to work with arbitrary input without insisting on it being
"sensible".  Historically, most Unix utilities just crashed when you fed
them arbitrary garbage.  They have taken a lesson from GNU nowadays.

And I consider it a good lesson.

>  > We currently _have_ [a scheme for encoding invalid sequences of
>  > code units] in place.  We just use different Unicode-invalid code
>  > points [from Python].
>
> Conceded.  I realized that later; the important difference is that
> Python only uses that scheme when explicitly requested.

All in all, it is nobody else's business what encoding Emacs uses for
internal purposes.  Making Emacs preserve byte streams means that the
user has to worry less, not more, about what Emacs might be able to work
with.  The Emacs 23 internal encoding does a better job not getting into
the hair of users with encoding issues than Emacs 22 did, because of a
better correspondence with external encodings.  But ideally, the user
should not have to worry about the difference.

-- 
David Kastrup






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 14:36                               ` David Kastrup
@ 2009-11-21 17:53                                 ` Stephen J. Turnbull
  2009-11-21 23:30                                   ` David Kastrup
  0 siblings, 1 reply; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21 17:53 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > I meant pre-existing corruption [...]
 > 
 > That interpretation is not the business of the editor.

Precisely my point.  The editor has *no* way to interpret at the point
of encountering the invalid sequence, and therefore it should *stop*
and ask the user what to do.  That doesn't mean it should throw away
the data, but it sure does mean that it should not continue as though
there is valid data in the buffer.

Emacs is welcome to do that, but I am sure you will get bug reports
about it.






* Re: Fwd: Re: Inadequate documentation of silly characters on screen.
  2009-11-19 22:31                       ` David Kastrup
@ 2009-11-21 22:52                         ` Richard Stallman
  2009-11-23  2:08                           ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stefan Monnier
  0 siblings, 1 reply; 101+ messages in thread
From: Richard Stallman @ 2009-11-21 22:52 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

    > Why couldn't Emacs have simply displayed the character as "ñ"?

    Because there is no character with a byte representation of 241.  You
    are apparently demanding that Emacs display this "wild byte" as if it
    were really encoded in latin-1.

Latin-1 or Unicode.  The Unicode code point for ñ is 241.  (aref "ñ"
0) returns 241, which is 361 in octal.  So if there is a character
\361, it seems that ought to be the same as ñ.

Basically, it isn't clear that \361 is a byte rather than a character,
and what difference that ought to make, and what you should do
if you want to turn it from a byte into a character.
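The numerics here can be checked directly (Python is used purely to illustrate; octal 361 is decimal 241):

```python
assert 0o361 == 241          # \361 is octal notation for 241
assert ord("ñ") == 241       # the Unicode code point of ñ

# As a *byte*, though, 241 only means ñ under Latin-1; a lone 0xF1
# is not a complete character in UTF-8:
assert bytes([241]).decode("latin-1") == "ñ"
try:
    bytes([241]).decode("utf-8")
except UnicodeDecodeError:
    print("a lone 0xF1 byte is not a UTF-8 character")
```

So whether \361 "is" ñ depends entirely on whether 241 is being read as a character code or as an undecoded byte — which is precisely the distinction the paragraph above says is unclear.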







* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 17:53                                 ` Stephen J. Turnbull
@ 2009-11-21 23:30                                   ` David Kastrup
  2009-11-22  1:27                                     ` Sebastian Rose
  0 siblings, 1 reply; 101+ messages in thread
From: David Kastrup @ 2009-11-21 23:30 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>  > > I meant pre-existing corruption [...]
>  > 
>  > That interpretation is not the business of the editor.
>
> Precisely my point.  The editor has *no* way to interpret at the point
> of encountering the invalid sequence, and therefore it should *stop*
> and ask the user what to do.  That doesn't mean it should throw away
> the data, but it sure does mean that it should not continue as though
> there is valid data in the buffer.
>
> Emacs is welcome to do that, but I am sure you will get bug reports
> about it.

Why would we get a bug report about Emacs saving a file changed only in
the locations that the user actually edited?

People might complain when Emacs does not recognize some encoding
properly, but they certainly will not demand that Emacs should stop
working altogether.

-- 
David Kastrup






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 23:30                                   ` David Kastrup
@ 2009-11-22  1:27                                     ` Sebastian Rose
  2009-11-22  8:06                                       ` David Kastrup
  0 siblings, 1 reply; 101+ messages in thread
From: Sebastian Rose @ 2009-11-22  1:27 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:
> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>> David Kastrup writes:
>>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>
>>  > > I meant pre-existing corruption [...]
>>  > 
>>  > That interpretation is not the business of the editor.
>>
>> Precisely my point.  The editor has *no* way to interpret at the point
>> of encountering the invalid sequence, and therefore it should *stop*
>> and ask the user what to do.  That doesn't mean it should throw away
>> the data, but it sure does mean that it should not continue as though
>> there is valid data in the buffer.
>>
>> Emacs is welcome to do that, but I am sure you will get bug reports
>> about it.
>
> Why would we get a bug report about Emacs saving a file changed only in
> the locations that the user actually edited?
>
> People might complain when Emacs does not recognize some encoding
> properly, but they certainly will not demand that Emacs should stop
> working altogether.


People do indeed complain on the emacs-orgmode mailing list and I can
reproduce their problems.

You may read the details here:

  http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html

`M-x recode-file-name' doesn't work either.


I guess this is related?





Best wishes

      Sebastian




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Inadequate documentation of silly characters on screen.
  2009-11-22  1:27                                     ` Sebastian Rose
@ 2009-11-22  8:06                                       ` David Kastrup
  2009-11-22 23:52                                         ` Sebastian Rose
  0 siblings, 1 reply; 101+ messages in thread
From: David Kastrup @ 2009-11-22  8:06 UTC (permalink / raw)
  To: emacs-devel

Sebastian Rose <sebastian_rose@gmx.de> writes:

> David Kastrup <dak@gnu.org> writes:
>> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>
>>> David Kastrup writes:
>>>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>>
>>>  > > I meant pre-existing corruption [...]
>>>  > 
>>>  > That interpretation is not the business of the editor.
>>>
>>> Precisely my point.  The editor has *no* way to interpret at the point
>>> of encountering the invalid sequence, and therefore it should *stop*
>>> and ask the user what to do.  That doesn't mean it should throw away
>>> the data, but it sure does mean that it should not continue as though
>>> there is valid data in the buffer.
>>>
>>> Emacs is welcome to do that, but I am sure you will get bug reports
>>> about it.
>>
>> Why would we get a bug report about Emacs saving a file changed only in
>> the locations that the user actually edited?
>>
>> People might complain when Emacs does not recognize some encoding
>> properly, but they certainly will not demand that Emacs should stop
>> working altogether.
>
>
> People do indeed complain on the emacs-orgmode mailing list and I can
> reproduce their problems.

What meaning of "indeed" are you using here?  This is a complaint about
Emacs _not_ faithfully replicating a byte pattern that it expects to be
in a particular encoding.

>   http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html
>
> I guess this is related?

It is related, but it bolsters rather than defeats my argument.

People don't _like_ Emacs to cop out altogether.

-- 
David Kastrup





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Inadequate documentation of silly characters on screen.
  2009-11-22  8:06                                       ` David Kastrup
@ 2009-11-22 23:52                                         ` Sebastian Rose
  0 siblings, 0 replies; 101+ messages in thread
From: Sebastian Rose @ 2009-11-22 23:52 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Sebastian Rose <sebastian_rose@gmx.de> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>>
>>>> David Kastrup writes:
>>>>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>>>
>>>>  > > I meant pre-existing corruption [...]
>>>>  > 
>>>>  > That interpretation is not the business of the editor.
>>>>
>>>> Precisely my point.  The editor has *no* way to interpret at the point
>>>> of encountering the invalid sequence, and therefore it should *stop*
>>>> and ask the user what to do.  That doesn't mean it should throw away
>>>> the data, but it sure does mean that it should not continue as though
>>>> there is valid data in the buffer.
>>>>
>>>> Emacs is welcome to do that, but I am sure you will get bug reports
>>>> about it.
>>>
>>> Why would we get a bug report about Emacs saving a file changed only in
>>> the locations that the user actually edited?
>>>
>>> People might complain when Emacs does not recognize some encoding
>>> properly, but they certainly will not demand that Emacs should stop
>>> working altogether.
>>
>>
>> People do indeed complain on the emacs-orgmode mailing list and I can
>> reproduce their problems.
>
> What meaning of "indeed" are you using here?  This is a complaint about
> Emacs _not_ faithfully replicating a byte pattern that it expects to be
> in a particular encoding.
>
>>   http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html
>>
>> I guess this is related?
>
> It is related, but it bolsters rather than defeats my argument.
>
> People don't _like_ Emacs to cop out altogether.


Sorry, David. This was not meant as an argument. It was more a question,
because I was a bit unsure whether this was related (I did not follow the
thread that closely).

And in that case, the OP reported that Emacs indeed refused to work, in
that it didn't want to save the file (which I cannot fully reproduce).

I didn't mean to hijack this thread though.


Thanks for your answer anyway


  Sebastian




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  7:27                             ` Stephen J. Turnbull
@ 2009-11-23  1:58                               ` Stefan Monnier
  0 siblings, 0 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-23  1:58 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Alan Mackenzie, emacs-devel

> So the problem is that because characters are integers and vice versa,
> there's no way for the user to let Emacs duck-type multibyte vs
> unibyte strings for him.  If he cares, he needs to check.  If he
> doesn't care, eventually Emacs will punish him for his lapse.

> I suppose subst-char-in-string is similarly useless for Alan's
> purpose, then?  What he really needs to use is something like

>     (replace-in-string str "\n" "ñ")

> right?

Pretty much yes.  When chars come within strings, the multibyteness of
the string indicates what the string elements are (chars or bytes), so
as long as you only manipulate strings, Emacs is able to DTRT.
As soon as you manipulate actual chars, the ambiguity between chars and
bytes for values [128..255] can bite you unless you're careful about how
you use them (e.g. about the multibyteness of the strings with which
you combine them).

That's where `aset' bites.  I hate `aset' on strings because it has
side-effects (obviously) and because strings aren't vectors, so you can't
guarantee the expected efficiency, but neither of those is the source of
the problem here.  So indeed subst-char-in-string suffers similarly.
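
A minimal Emacs Lisp sketch of the ambiguity being described (illustrative
only; the literals and results assume a recent GNU Emacs):

```elisp
;; Illustrative sketch: the same integer 241 can stand for the
;; character ñ or for the raw byte \361; only a string's multibyteness
;; says which one its elements are.
(multibyte-string-p "ñ")        ; t   -- elements are characters
(multibyte-string-p "\361")     ; nil -- a literal written with an octal
                                ;        escape is a unibyte string of bytes
(string-to-multibyte "\361")    ; the byte 241 becomes a raw-byte char,
                                ; which displays as \361, not as ñ
```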


        Stefan




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Displaying bytes (was: Inadequate documentation of silly characters on screen.)
  2009-11-21 22:52                         ` Richard Stallman
@ 2009-11-23  2:08                           ` Stefan Monnier
  2009-11-23 20:38                             ` Richard Stallman
  0 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-23  2:08 UTC (permalink / raw)
  To: rms; +Cc: David Kastrup, emacs-devel

> Basically, it isn't clear that \361 is a byte rather than a character,
> and what difference that ought to make, and what you should do
> if you want to turn it from a byte into a character.

So how do you suggest we represent the byte 241?


        Stefan




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.)
  2009-11-23  2:08                           ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stefan Monnier
@ 2009-11-23 20:38                             ` Richard Stallman
  2009-11-23 21:34                               ` Per Starbäck
  2009-11-24  1:28                               ` Displaying bytes Stefan Monnier
  0 siblings, 2 replies; 101+ messages in thread
From: Richard Stallman @ 2009-11-23 20:38 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

    > Basically, it isn't clear that \361 is a byte rather than a character,
    > and what difference that ought to make, and what you should do
    > if you want to turn it from a byte into a character.

    So how do you suggest we represent the byte 241?

No better way jumps into my mind.  But maybe we could figure out some
way to make the current way easier to understand.

For instance, C-u C-x = on \224 says

	    character: ” (4194196, #o17777624, #x3fff94)
    preferred charset: tis620-2533 (TIS620.2533)
	   code point: 0x94
	       syntax: w 	which means: word
	  buffer code: #x94
	    file code: #x94 (encoded by coding system no-conversion)
	      display: not encodable for terminal

    Character code properties: customize what to show

    [back]

Perhaps it should say, 

	    character: Stray byte ” (4194196, #o17777624, #x3fff94)

What are the situations where a user is likely to see these stray
bytes?  When visiting a binary file, of course; but in that situation,
nobody will be surprised or disappointed.  So what are the other
cases, and what might the user really want instead?  Does it mean the
user probably wants to do M-x decode-coding-region?  If so, can we find a way
to give the user that hint?
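
For what it's worth, the recovery being hinted at can be sketched like this
(illustrative only, assuming the raw bytes really were latin-1):

```elisp
;; Illustrative: if the buffer shows \361 where ñ was expected, and the
;; underlying bytes really are latin-1, re-decode the text in place:
(decode-coding-region (point-min) (point-max) 'latin-1)
;; or re-read the whole file with an explicit coding system:
;;   M-x revert-buffer-with-coding-system RET latin-1 RET
```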

When I click on tis620-2533 in that output, I get this

    Character set: tis620-2533

    TIS620.2533

    Number of contained characters: 256
    ASCII compatible.
    Code space: [0 255]

    [back]

which is totally unhelpful.  What is this character set's main
purpose?  Does it exist specifically for stray non-ASCII bytes?
If so, saying so here would help.  If not -- if it has some other
purpose -- then it would be good to explain both purposes here.

Also, if it exists for these stray non-ASCII bytes, why does it have
256 chars in it?  There are only 128 possible stray non-ASCII bytes.

(It is also not clear to me what "ASCII compatible" means in this
context.)




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly  characters on screen.)
  2009-11-23 20:38                             ` Richard Stallman
@ 2009-11-23 21:34                               ` Per Starbäck
  2009-11-24 22:47                                 ` Richard Stallman
  2009-11-24  1:28                               ` Displaying bytes Stefan Monnier
  1 sibling, 1 reply; 101+ messages in thread
From: Per Starbäck @ 2009-11-23 21:34 UTC (permalink / raw)
  To: rms; +Cc: dak, Stefan Monnier, emacs-devel

2009/11/23 Richard Stallman <rms@gnu.org>:
> What are the situations where a user is likely to see these stray
> bytes.  When visiting a binary file, of course; but in that situation,
> nobody will be surprised or disappointed.  So what are the other
> cases,

Sometimes when Emacs can't guess the coding system.

$ od -c euro.txt
0000000   T   h   a   t       c   o   s   t   s     200   1   7   .  \n
0000020
$ emacs euro.txt

This is really a windows-1252 file and the strange character is
supposed to be a Euro sign.
For me, with no particular setup to make Emacs expect windows-1252
files, that shows in Emacs as
"That costs \20017." with raw-text-unix.
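
To reproduce (a sketch; assumes a POSIX printf that interprets octal
escapes):

```shell
# Recreate the sample file: byte 0200 (0x80) is the Euro sign in
# windows-1252, but is not a valid character in latin-1 or utf-8.
printf 'That costs \20017.\n' > euro.txt
od -c euro.txt   # the lone 200 byte appears between "costs " and "17."
```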

> and what might the user really want instead?  Does it mean the
> user probably wants to do M-x decode-coding-region?  If so, can we find a way
> to give the user that hint?

In that case, revert-buffer-with-coding-system.  Ideally I'd like Emacs
to ask directly when opening the file in such a case, if it can't
determine anything better than raw-bytes.  At least if the mode (like
text-mode here) indicates that it shouldn't be a binary file.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-23 20:38                             ` Richard Stallman
  2009-11-23 21:34                               ` Per Starbäck
@ 2009-11-24  1:28                               ` Stefan Monnier
  2009-11-24 22:47                                 ` Richard Stallman
  2009-11-24 22:47                                 ` Richard Stallman
  1 sibling, 2 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-24  1:28 UTC (permalink / raw)
  To: rms; +Cc: dak, emacs-devel

> For instance, C-u C-x = on \224 says

> 	    character: ” (4194196, #o17777624, #x3fff94)
>     preferred charset: tis620-2533 (TIS620.2533)
> 	   code point: 0x94
> 	       syntax: w 	which means: word
> 	  buffer code: #x94
> 	    file code: #x94 (encoded by coding system no-conversion)
> 	      display: not encodable for terminal

Here C-u C-x = tells me:


        character: ¡ (4194209, #o17777641, #x3fffa1)
preferred charset: eight-bit (Raw bytes 128-255)
       code point: 0xA1
           syntax: w 	which means: word
      buffer code: #xA1
        file code: not encodable by coding system utf-8-unix
          display: no font available

I don't know why you see this "tis620" stuff.


> Perhaps it should say,
> 	    character: Stray byte ” (4194196, #o17777624, #x3fff94)

We could do that indeed.

> What are the situations where a user is likely to see these stray
> bytes.

There pretty much shouldn't be any in multibyte buffers.

> When visiting a binary file, of course; but in that situation,
> nobody will be surprised or disappointed.

And presumably for binary files, the buffer will be unibyte.

> (It is also not clear to me what "ASCII compatible" means in this
> context.)

It means that the lower 128 chars coincide with those of ASCII.


        Stefan

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-24  1:28                               ` Displaying bytes Stefan Monnier
@ 2009-11-24 22:47                                 ` Richard Stallman
  2009-11-25  2:18                                   ` Stefan Monnier
  2009-11-24 22:47                                 ` Richard Stallman
  1 sibling, 1 reply; 101+ messages in thread
From: Richard Stallman @ 2009-11-24 22:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

    I don't know why you see this "tis620" stuff.

How strange this discrepancy.  I have a few changes that are not
installed, but not in anything relevant here.  I last updated
source code on Nov 18.

Here's what apparently defines that character set, in mule-conf.el:

(define-charset 'tis620-2533
  "TIS620.2533"
  :short-name "TIS620.2533"
  :ascii-compatible-p t
  :code-space [0 255]
  :superset '(ascii eight-bit-control (thai-tis620 . 128)))

I don't entirely understand define-charset, but it seems plausible
that this gives the observed results.  Is this absent in your source?

Anyway, please don't overlook the other suggestions in my message
for how to make things clearer.

    > What are the situations where a user is likely to see these stray
    > bytes.

    There pretty much shouldn't be any in multibyte buffers.

Would it be good to ask people to send bug reports when these
stray byte characters appear in multibyte buffers?




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly  characters on screen.)
  2009-11-23 21:34                               ` Per Starbäck
@ 2009-11-24 22:47                                 ` Richard Stallman
  2009-11-25  1:33                                   ` Kenichi Handa
  0 siblings, 1 reply; 101+ messages in thread
From: Richard Stallman @ 2009-11-24 22:47 UTC (permalink / raw)
  To: Per Starbäck; +Cc: dak, monnier, emacs-devel

    $ od -c euro.txt
    0000000   T   h   a   t       c   o   s   t   s     200   1   7   .  \n
    0000020
    $ emacs euro.txt

    This is really a windows-1252 file and the strange character is
    supposed to be a Euro sign.
    For me, with no particular setup to make Emacs expect windows-1252
    files that shows in emacs as
    "That costs \20017." with raw-text-unix.

Why doesn't Emacs guess right, in this case?  Could we make it guess
right by changing the coding system priorities?  If so, should we
change the default priorities?

It may be that a different set of priorities would cause similar
problems in some other cases and the current defaults are the best.
But if we have not looked at the question in several years, it would
be worth studying it now.

    In that case revert-buffer-with-coding-system. Ideally I'd like Emacs
    to ask directly when opening the file
    in such a case, if it can't determine anything better than raw-bytes.

Maybe so.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-24  1:28                               ` Displaying bytes Stefan Monnier
  2009-11-24 22:47                                 ` Richard Stallman
@ 2009-11-24 22:47                                 ` Richard Stallman
  1 sibling, 0 replies; 101+ messages in thread
From: Richard Stallman @ 2009-11-24 22:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

    It means that the lower 128 chars coincide with those of ASCII.

We could make that more self-explanatory in the buffer.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly  characters on screen.)
  2009-11-24 22:47                                 ` Richard Stallman
@ 2009-11-25  1:33                                   ` Kenichi Handa
  2009-11-25  2:29                                     ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
                                                       ` (3 more replies)
  0 siblings, 4 replies; 101+ messages in thread
From: Kenichi Handa @ 2009-11-25  1:33 UTC (permalink / raw)
  To: rms; +Cc: per.starback, dak, monnier, emacs-devel

In article <E1ND4AD-0003Yg-Cc@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     $ od -c euro.txt
>     0000000   T   h   a   t       c   o   s   t   s     200   1   7   .  \n
>     0000020
>     $ emacs euro.txt

>     This is really a windows-1252 file and the strange character is
>     supposed to be a Euro sign.
>     For me, with no particular setup to make Emacs expect windows-1252
>     files that shows in emacs as
>     "That costs \20017." with raw-text-unix.

> Why doesn't Emacs guess right, in this case?

Because some other coding system of the same coding-category of
windows-1252 (coding-category-charset) has the higher priority and
that coding system doesn't contain code \200.

> Could we make it guess right by changing the coding system
> priorities?

Yes.

> If so, should we change the default priorities?

I'm not sure.  As it seems that windows-1252 is a superset of
iso-8859-1, it may be ok to give windows-1252 the higher priority.
What do iso-8859-1 users think?

The better approach would be to allow registering multiple coding systems
in one coding-category, but I'm not sure I have time to work on it.

> It may be that a different set of priorities would cause similar
> problems in some other cases and the current defaults are the best.
> But if we have not looked at the question in several years, it would
> be worth studying it now.

>     In that case revert-buffer-with-coding-system. Ideally I'd like Emacs
>     to ask directly when opening the file
>     in such a case, if it can't determine anything better than raw-bytes.

> Maybe so.

For that, it seems that adding that facility in
after-insert-file-set-coding is good.   Here's a sample patch.  The
actual change should give more information to a user.

--- mule.el.~1.294.~	2009-11-17 11:42:45.000000000 +0900
+++ mule.el	2009-11-25 10:17:49.000000000 +0900
@@ -1893,7 +1893,18 @@
 	   coding-system-for-read
 	   (not (eq coding-system-for-read 'auto-save-coding)))
       (setq buffer-file-coding-system-explicit
-	    (cons coding-system-for-read nil)))
+	    (cons coding-system-for-read nil))
+    (when (and last-coding-system-used
+	       (eq (coding-system-base last-coding-system-used) 'raw-text))
+      ;; Give a chance of decoding by some coding system.
+      (let ((coding-system (read-coding-system "Actual coding system: ")))
+	(if coding-system
+	    (save-restriction
+	      (narrow-to-region (point) (+ (point) inserted))
+	      (let ((modified (buffer-modified-p)))
+		(decode-coding-region (point-min) (point-max) coding-system)
+		(setq inserted (- (point-max) (point-min)))
+		(set-buffer-modified-p modified)))))))
   (if last-coding-system-used
       (let ((coding-system
 	     (find-new-buffer-file-coding-system last-coding-system-used)))

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-24 22:47                                 ` Richard Stallman
@ 2009-11-25  2:18                                   ` Stefan Monnier
  2009-11-26  6:24                                     ` Richard Stallman
  0 siblings, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-25  2:18 UTC (permalink / raw)
  To: rms; +Cc: dak, emacs-devel

>     I don't know why you see this "tis620" stuff.
> How strange this discrepancy.  I have a few changes that are not
> installed, but not in anything relevant here.  I last updated
> source code on Nov 18.

Oh wait, I now see: you get `tis620' for chars between 128 and 160
(i.e. eight-bit-control), and `eight-bit' for chars between 160 and 256.

> Here's what apparently defines that character set, in mule-conf.el:

> (define-charset 'tis620-2533
>   "TIS620.2533"
>   :short-name "TIS620.2533"
>   :ascii-compatible-p t
>   :code-space [0 255]
>   :superset '(ascii eight-bit-control (thai-tis620 . 128)))

Looks like the eight-bit-control here is part of the problem.

> Anyway, please don't overlook the other suggestions in my message
> for how to make things clearer.

Of course.

>> What are the situations where a user is likely to see these stray
>> bytes.
>     There pretty much shouldn't be any in multibyte buffers.
> Would it be good to ask people to send bug reports when these
> stray byte characters appear in multibyte buffers?

No, these chars can appear in cases where Emacs does the right thing.
I.e. sometimes they reflect bugs, but often they just reflect "pilot
errors" or corrupted data completely outside the control of Emacs.


        Stefan





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-25  1:33                                   ` Kenichi Handa
@ 2009-11-25  2:29                                     ` Stefan Monnier
  2009-11-25  2:50                                       ` Lennart Borgman
  2009-11-25  6:25                                       ` Stephen J. Turnbull
  2009-11-25  5:40                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Ulrich Mueller
                                                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-25  2:29 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: per.starback, dak, rms, emacs-devel

>> If so, should we change the default priorities?
> I'm not sure.  As it seems that windows-1252 is a superset of
> iso-8859-1, it may be ok to give windows-1252 the higher priority.
> How do iso-8859-1 users think?

The problem with windows-1252 is that all files are valid in that
coding-system.  So it's OK if there's a really high chance of
encountering such files, but otherwise it leads to many misdetections.

> For that, it seems that adding that facility in
> after-insert-file-set-coding is good.   Here's a sample patch.  The
> actual change should give more information to a user.

Maybe we could try that.  But I really dislike adding a user-prompt in
the middle of some operation that might be performed as part of
something "unrelated".  And indeed the actual change may need to give
a lot more information, mostly displaying the buffer without which the
user cannot make a good guess.


        Stefan




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-25  2:29                                     ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
@ 2009-11-25  2:50                                       ` Lennart Borgman
  2009-11-25  6:25                                       ` Stephen J. Turnbull
  1 sibling, 0 replies; 101+ messages in thread
From: Lennart Borgman @ 2009-11-25  2:50 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: per.starback, dak, emacs-devel, rms, Kenichi Handa

On Wed, Nov 25, 2009 at 3:29 AM, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>>> If so, should we change the default priorities?
>> I'm not sure.  As it seems that windows-1252 is a superset of
>> iso-8859-1, it may be ok to give windows-1252 the higher priority.
>> How do iso-8859-1 users think?
>
> The problem with windows-1252 is that all files are valid in that
> coding-system.  So it's OK if there's a really high chance of
> encountering such files, but otherwise it leads to many misdetections.
>
>> For that, it seems that adding that facility in
>> after-insert-file-set-coding is good.   Here's a sample patch.  The
>> actual change should give more information to a user.
>
> Maybe we could try that.  But I really dislike adding a user-prompt in
> the middle of some operation that might be performed as part of
> something "unrelated".  And indeed the actual change may need to give
> a lot more information, mostly displaying the buffer without which the
> user cannot make a good guess.


Maybe it is better to read the file into the buffer with a best guess
and add a hook that is run the first time the buffer is shown in a
window, with some notification to the user of the problem?

Then of course also provide enough hints to make it easy to change the
coding system in that situation to one of the relevant alternatives.
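
A rough Emacs Lisp sketch of that idea (hypothetical, untested code, not a
proposed patch; it hooks file visiting rather than first display, for
simplicity):

```elisp
;; Hypothetical sketch: after visiting a file, notify the user if Emacs
;; fell back to raw-text, i.e. it could not detect a real encoding.
(add-hook 'find-file-hook
          (lambda ()
            (when (eq (coding-system-base buffer-file-coding-system)
                      'raw-text)
              (message "No encoding detected for %s; try M-x %s"
                       (buffer-name)
                       "revert-buffer-with-coding-system"))))
```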




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly  characters on screen.)
  2009-11-25  1:33                                   ` Kenichi Handa
  2009-11-25  2:29                                     ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
@ 2009-11-25  5:40                                     ` Ulrich Mueller
  2009-11-26 22:59                                       ` Displaying bytes Reiner Steib
  2009-11-25  5:59                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stephen J. Turnbull
  2009-11-29 16:01                                     ` Richard Stallman
  3 siblings, 1 reply; 101+ messages in thread
From: Ulrich Mueller @ 2009-11-25  5:40 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: per.starback, dak, emacs-devel, rms, monnier

>>>>> On Wed, 25 Nov 2009, Kenichi Handa wrote:

>> If so, should we change the default priorities?

> I'm not sure.  As it seems that windows-1252 is a superset of
> iso-8859-1, it may be ok to give windows-1252 the higher priority.

Please don't.

I wonder why one would even *think* of changing Emacs's default to a
Microsoft proprietary "code page". :-(

> How do iso-8859-1 users think?

Seems to me that use of iso-8859-* is much more widespread on *nix
systems. I think the current default priorities are perfectly fine.

Ulrich




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly  characters on screen.)
  2009-11-25  1:33                                   ` Kenichi Handa
  2009-11-25  2:29                                     ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
  2009-11-25  5:40                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Ulrich Mueller
@ 2009-11-25  5:59                                     ` Stephen J. Turnbull
  2009-11-25  8:16                                       ` Kenichi Handa
  2009-11-29 16:01                                     ` Richard Stallman
  3 siblings, 1 reply; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-25  5:59 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: per.starback, dak, emacs-devel, rms, monnier

Kenichi Handa writes:
 > In article <E1ND4AD-0003Yg-Cc@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

 > > If so, should we change the default priorities?
 > 
 > I'm not sure.  As it seems that windows-1252 is a superset of
 > iso-8859-1, it may be ok to give windows-1252 the higher priority.
 > How do iso-8859-1 users think?

Why not make a Windows-12xx coding-category?  If you don't want to
advertise what it is, you could call it "ascii8" or "pseudo-ascii" or
something like that.  (Wouldn't some of the obsolete Vietnamese
standards fit this too?  Ie, 0-0177 are the same as ISO-646, and
0200-0377 are used for the alternate script?)

If you don't make a separate coding category for that, I don't like
the change, myself.  Windows-12xx character sets are proprietary in
the sense that last I looked, the IANA registry for Windows-12xx coded
character sets pointed to internal Microsoft documents, and made no
promises about changes to those documents.  As far as I know,
Microsoft added the EURO SIGN to Windows-1252 simply by editing that
internal page.  There was no indication of the history of such changes
on the IANA page.





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-25  2:29                                     ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
  2009-11-25  2:50                                       ` Lennart Borgman
@ 2009-11-25  6:25                                       ` Stephen J. Turnbull
  1 sibling, 0 replies; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-25  6:25 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: per.starback, dak, emacs-devel, rms, Kenichi Handa

Stefan Monnier writes:

 > The problem with windows-1252 is that all files are valid in that
 > coding-system.

Well, *pedantically* that's true of any ISO 8859 coding system too,
since ISO 8859 doesn't specify what might appear in C1 at all.  In
practice for 1252

1.  The only C0 controls you'll commonly see are \t, \r, and \n.
2.  The set of C1 controls that are defined is limited IIRC (but
    Microsoft does go around changing it without warning, so I could
    be wrong by now ;-).
3.  It's line-oriented text (even if long-lines): you'll very probably
    see \r and \n only as \r\n, you might see only \n and no \r, and
    you'll not see "random" use of \r or \n.





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly  characters on screen.)
  2009-11-25  5:59                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stephen J. Turnbull
@ 2009-11-25  8:16                                       ` Kenichi Handa
  0 siblings, 0 replies; 101+ messages in thread
From: Kenichi Handa @ 2009-11-25  8:16 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: per.starback, dak, emacs-devel, rms, monnier

In article <87fx835elh.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> I'm not sure.  As it seems that windows-1252 is a superset of
> iso-8859-1, it may be ok to give windows-1252 the higher priority.
> How do iso-8859-1 users think?

> Why not make a Windows-12xx coding-category?  If you don't want to
> advertise what it is, you could call it "ascii8" or "pseudo-ascii" or
> something like that.

Ah!

The coding-category of a coding-system is automatically
determined by the :coding-type argument (and by some other
arguments, depending on :coding-type) of define-coding-system.
And iso-8859-x and windows-12xx are exactly the same in this
respect; i.e. the :coding-type of both is `charset', which
means the coding system decodes/encodes the charsets in
:charset-list.

Perhaps it would be good to add one more coding-category,
`charset8', for coding-systems that handle a single-byte
charset assigning many codes in the 0x80..0x9F area.

> (Wouldn't some of the obsolete Vietnamese
> standards fit this too?  Ie, 0-0177 are the same as ISO-646, and
> 0200-0377 are used for the alternate script?)

Do you mean such coding-systems as vietnamese-tcvn and
vietnamese-viscii?  Although their 0x00-0x1F are not the
same as ASCII, yes, they can be classified into `charset8'
category.

---
Kenichi Handa
handa@m17n.org




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-25  2:18                                   ` Stefan Monnier
@ 2009-11-26  6:24                                     ` Richard Stallman
  2009-11-26  8:59                                       ` David Kastrup
  2009-11-26 14:57                                       ` Stefan Monnier
  0 siblings, 2 replies; 101+ messages in thread
From: Richard Stallman @ 2009-11-26  6:24 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

    >     There pretty much shouldn't be any in multibyte buffers.

    > Would it be good to ask people to send bug reports when these
    > stray byte characters appear in multibyte buffers?

    No, these chars can appear in cases where Emacs does the right thing.
    I.e. sometimes they reflect bugs, but often they just reflect "pilot
    errors" or corrupted data completely outside the control of Emacs.

If it is nearly always due to a bug, a user error, or bad data,
perhaps it would be good to display a diagnostic after file commands
that put them in the buffer.  Perhaps pop up a buffer explaining what
these mean and what to do about them.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-26  6:24                                     ` Richard Stallman
@ 2009-11-26  8:59                                       ` David Kastrup
  2009-11-26 14:57                                       ` Stefan Monnier
  1 sibling, 0 replies; 101+ messages in thread
From: David Kastrup @ 2009-11-26  8:59 UTC (permalink / raw)
  To: emacs-devel

Richard Stallman <rms@gnu.org> writes:

>     >     There pretty much shouldn't be any in multibyte buffers.
>
>     > Would it be good to ask people to send bug reports when these
>     > stray byte characters appear in multibyte buffers?
>
>     No, these chars can appear in cases where Emacs does the right thing.
>     I.e. sometimes they reflect bugs, but often they just reflect "pilot
>     errors" or corrupted data completely outside the control of Emacs.
>
> If it is nearly always due to a bug, a user error, or bad data,
> perhaps it would be good to display a diagnostic after file commands
> that put them in the buffer.  Perhaps pop up a buffer explaining what
> these mean and what to do about them.

The encoding indicator in the mode line could get warning-face, and the
respective pop-up help mention

"buffer contains undecodable bytes."

-- 
David Kastrup





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-26  6:24                                     ` Richard Stallman
  2009-11-26  8:59                                       ` David Kastrup
@ 2009-11-26 14:57                                       ` Stefan Monnier
  2009-11-26 16:28                                         ` Lennart Borgman
  2009-11-27  6:36                                         ` Richard Stallman
  1 sibling, 2 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-26 14:57 UTC (permalink / raw)
  To: rms; +Cc: dak, emacs-devel

> If it is nearly always due to a bug, a user error, or bad data,
> perhaps it would be good to display a diagnostic after file commands
> that put them in the buffer.  Perhaps pop up a buffer explaining what
> these mean and what to do about them.

If someone wants to take a stab at it, that's fine by me, but it looks
way too difficult for me.  The origin of the problem can be so diverse
that it'll be difficult to come up with instructions that will be useful
and will not confuse a significant part of the user population.


        Stefan




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-26 14:57                                       ` Stefan Monnier
@ 2009-11-26 16:28                                         ` Lennart Borgman
  2009-11-27  6:36                                         ` Richard Stallman
  1 sibling, 0 replies; 101+ messages in thread
From: Lennart Borgman @ 2009-11-26 16:28 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, rms, emacs-devel

On Thu, Nov 26, 2009 at 3:57 PM, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>> If it is nearly always due to a bug, a user error, or bad data,
>> perhaps it would be good to display a diagnostic after file commands
>> that put them in the buffer.  Perhaps pop up a buffer explaining what
>> these mean and what to do about them.
>
> If someone wants to take a stab at it, that's fine by me, but it looks
> way too difficult for me.  The origin of the problem can be so diverse
> that it'll be difficult to come up with instructions that will be useful
> and will not confuse a significant part of the user population.


Are those instructions not good enough to put up?




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-25  5:40                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Ulrich Mueller
@ 2009-11-26 22:59                                       ` Reiner Steib
  2009-11-27  0:16                                         ` Ulrich Mueller
                                                           ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Reiner Steib @ 2009-11-26 22:59 UTC (permalink / raw)
  To: Ulrich Mueller; +Cc: emacs-devel, Kenichi Handa

On Wed, Nov 25 2009, Ulrich Mueller wrote:

>>>>>> On Wed, 25 Nov 2009, Kenichi Handa wrote:
>
>>> If so, should we change the default priorities?
>
>> I'm not sure.  As it seems that windows-1252 is a superset of
>> iso-8859-1, 

It is, yes.

>> it may be ok to give windows-1252 the higher priority.
>
> Please don't.
>
> I wonder why one would even *think* of changing Emacs's default to a
> Microsoft proprietary "code page". :-(

Just because it has "windows" in its name?  IIRC it is registered at
IANA.

>> How do iso-8859-1 users think?
>
> Seems to me that use of iso-8859-* is much more widespread on *nix
> systems. 

As far as I understand, an iso-8859-1 user won't notice any
difference; only if the file is _not_ iso-8859-1 but "fits" in
windows-1252 (i.e. it uses one of the few chars that make the
difference) would they see one.

We have done something similar (see `mm-charset-override-alist') in
Gnus for displaying mis-labelled articles.

> I think the current default priorities are perfectly fine.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-26 22:59                                       ` Displaying bytes Reiner Steib
@ 2009-11-27  0:16                                         ` Ulrich Mueller
  2009-11-27  1:41                                         ` Stefan Monnier
  2009-11-27  4:14                                         ` Stephen J. Turnbull
  2 siblings, 0 replies; 101+ messages in thread
From: Ulrich Mueller @ 2009-11-27  0:16 UTC (permalink / raw)
  To: Reiner Steib; +Cc: emacs-devel, Kenichi Handa

>>>>> On Thu, 26 Nov 2009, Reiner Steib wrote:

>>> I'm not sure.  As it seems that windows-1252 is a superset of
>>> iso-8859-1, 

> It is, yes.

They are identical, except for the range from 0x80 to 0x9f, where
ISO-8859-1 assigns control characters [1]. Look into the log file of
an xterm (TERM=xterm-8bit) and you'll see them.
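[Editorial illustration: the relationship is easy to verify with any language whose codecs follow the published mapping tables. A Python sketch; note that Python's windows-1252 codec raises an error for the five bytes the code page leaves undefined, so they are skipped.]

```python
# ISO-8859-1 and windows-1252 agree on every byte outside 0x80-0x9F;
# inside that range ISO-8859-1 yields C1 control characters while
# windows-1252 assigns printable characters (five bytes undefined).
UNDEFINED_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

for b in range(0x100):
    if b in UNDEFINED_CP1252:
        continue  # no windows-1252 mapping for these bytes
    latin1 = bytes([b]).decode("latin-1")
    cp1252 = bytes([b]).decode("windows-1252")
    if latin1 != cp1252:
        # Any disagreement falls inside the C1 range.
        assert 0x80 <= b <= 0x9F
```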

>> I wonder why one would even *think* of changing Emacs's default to
>> a Microsoft proprietary "code page". :-(

> Just because it has "windows" in its name? IIRC it is registered at
> IANA.

Yes, in the "vendor" range [2], together with all variants of EBCDIC
that ever existed. ;-)

Whereas ISO-8859-1 is an official ISO standard.

Ulrich

[1] ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
[2] http://www.iana.org/assignments/character-sets




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-26 22:59                                       ` Displaying bytes Reiner Steib
  2009-11-27  0:16                                         ` Ulrich Mueller
@ 2009-11-27  1:41                                         ` Stefan Monnier
  2009-11-27  4:14                                         ` Stephen J. Turnbull
  2 siblings, 0 replies; 101+ messages in thread
From: Stefan Monnier @ 2009-11-27  1:41 UTC (permalink / raw)
  To: Reiner Steib; +Cc: Ulrich Mueller, Kenichi Handa, emacs-devel

>> Seems to me that use of iso-8859-* is much more widespread on *nix
>> systems.
> As far as I understand, an iso-8859-1 user won't notice any
> difference.

They'll notice a difference when opening a file that's neither latin-1
nor windows-1252 but which happens to fall within the range of
windows-1252 (which is the case for most non-latin1 files).

> We have done something similar (see `mm-charset-override-alist') in
> Gnus for displaying mis-labelled articles.

It's very different: it's perfectly OK to treat a latin-1 message or
file as if it were windows-1252.  It'll almost always DTRT.  The problem
is when we have to guess the coding-system, in which case checking
windows-1252 instead of latin-1 will give you more false positives.


        Stefan




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-26 22:59                                       ` Displaying bytes Reiner Steib
  2009-11-27  0:16                                         ` Ulrich Mueller
  2009-11-27  1:41                                         ` Stefan Monnier
@ 2009-11-27  4:14                                         ` Stephen J. Turnbull
  2 siblings, 0 replies; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-27  4:14 UTC (permalink / raw)
  To: Reiner Steib; +Cc: Ulrich Mueller, Kenichi Handa, emacs-devel

Reiner Steib writes:

 > Just because it has "windows" in its name?  IIRC it is registered at
 > IANA.

Not because of the name.  Because the registration at IANA does not
define it, the last time I looked.  It merely is a placeholder for an
internal Microsoft page that Microsoft updates at its convenience (and
has done, for example when adding the EURO SIGN).






^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes
  2009-11-26 14:57                                       ` Stefan Monnier
  2009-11-26 16:28                                         ` Lennart Borgman
@ 2009-11-27  6:36                                         ` Richard Stallman
  1 sibling, 0 replies; 101+ messages in thread
From: Richard Stallman @ 2009-11-27  6:36 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

    If someone wants to take a stab at it, that's fine by me, but it looks
    way too difficult for me.  The origin of the problem can be so diverse
    that it'll be difficult to come up with instructions that will be useful
    and will not confuse a significant part of the user population.

How about:

It's possible Emacs guessed the wrong coding system to decode the file.
[advice on how to check that, and how to specify a different coding system]

If these strange characters are due to bad data in a file you visited,
just try not to let them worry you.

If you think they appeared due to a bug in Emacs, please send a bug
report using M-x report-emacs-bug.

If they appear for some other reason not mentioned above, please
consider its absence from this message to be a bug in Emacs, and
please send a bug report using M-x report-emacs-bug.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly  characters on screen.)
  2009-11-25  1:33                                   ` Kenichi Handa
                                                       ` (2 preceding siblings ...)
  2009-11-25  5:59                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stephen J. Turnbull
@ 2009-11-29 16:01                                     ` Richard Stallman
  2009-11-29 16:31                                       ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
  2009-11-29 22:19                                       ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Kim F. Storm
  3 siblings, 2 replies; 101+ messages in thread
From: Richard Stallman @ 2009-11-29 16:01 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: per.starback, dak, monnier, emacs-devel

We don't want to raise the priority of windows-1252 because it would
cause many other encodings not to be recognized.

If it turns out that windows-1252 files are the main cause of
8-bit-control characters in the buffer, here's another idea.

If visiting a file gives you some 8-bit-control characters,
ask the user "Is this file encoded in Windows encoding (windows-1252)?"
and decode it as such if she says yes.

Here's another idea.  We could employ some heuristics to see if the
distribution of those characters seems typical for the way those
characters are used.  For instance, some of the punctuation characters
(the ones that represent quotation marks) should always have
whitespace or punctuation on at least one side.  Also, there should be
no ASCII control characters other than whitespace.  Maybe more
specific heuristics can be developed.

These could be used as conditions for recognizing the file as
windows-1252.  If these heuristics are strong enough, they could
reject nearly all false matches, provided the file is long enough.
(A minimum length could be part of the conditions.)  Then we
could increase the priority of windows-1252 without the bad
side effect of using it when it is not intended.
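[Editorial illustration: a rough sketch of what such a check might look like. Purely hypothetical Python; the byte values, whitespace set, and minimum length are assumptions for this example, not a proposal for the actual implementation.]

```python
import string

# Hypothetical heuristics along the lines described above: reject short
# data, data containing non-whitespace ASCII controls, and data where a
# windows-1252 curly-quote byte lacks whitespace or punctuation on both
# sides.
CP1252_QUOTES = {0x91, 0x92, 0x93, 0x94}  # curly quotes in windows-1252

def looks_like_windows_1252(data: bytes, min_length: int = 64) -> bool:
    if len(data) < min_length:        # assumed minimum-length condition
        return False
    for i, b in enumerate(data):
        # No ASCII control characters other than whitespace.
        if b < 0x20 and chr(b) not in "\t\n\r\f":
            return False
        # Quotation marks need whitespace/punctuation on one side.
        if b in CP1252_QUOTES:
            left = chr(data[i - 1]) if i > 0 else " "
            right = chr(data[i + 1]) if i + 1 < len(data) else " "
            if not any(c in string.whitespace or c in string.punctuation
                       for c in (left, right)):
                return False
    return True
```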

This is ad-hoc, and not elegant.  But the problem is important enough
in practice that an ad-hoc solution is justified if it works well.






^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-29 16:01                                     ` Richard Stallman
@ 2009-11-29 16:31                                       ` Stefan Monnier
  2009-11-29 22:01                                         ` Juri Linkov
  2009-11-29 22:19                                       ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Kim F. Storm
  1 sibling, 1 reply; 101+ messages in thread
From: Stefan Monnier @ 2009-11-29 16:31 UTC (permalink / raw)
  To: rms; +Cc: per.starback, dak, emacs-devel, Kenichi Handa

> If it turns out that windows-1252 files are the main cause of
> 8-bit-control characters in the buffer, here's another idea.

It may be the case for some users, but it probably isn't the case
in general.  It's clearly not the case for me (I only/mostly see such
characters in Gnus when I receive email that is improperly labelled,
where I'm happy to see them so that I can complain to their originator).

> Here's another idea.  We could employ some heuristics to see if the
> distribution of those characters seems typical for the way those
> characters are used.  For instance, some of the punctuation characters

Using such heuristics might be a good idea in general to automatically
detect which encoding is used, or which language is used.
As time passes, it becomes less and less important for coding-systems in
my experience (utf-8 and utf-16 seem to slowly take over and we already
auto-detect them well).
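[Editorial illustration of why utf-8 auto-detects so reliably: its multibyte sequences follow strict structural rules, so mis-encoded 8-bit text almost never validates. Plain Python, not Emacs's actual detector; note that the thread's original example, latin-1 "ñ" (octal \361), is itself an invalid utf-8 byte.]

```python
def valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# utf-8-encoded text validates; a stray latin-1 byte does not.
print(valid_utf8("ñ".encode("utf-8")))  # True
print(valid_utf8(b"\xf1"))              # False: lone latin-1 n-tilde
```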


        Stefan




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-29 16:31                                       ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
@ 2009-11-29 22:01                                         ` Juri Linkov
  2009-11-30  6:05                                           ` tomas
  0 siblings, 1 reply; 101+ messages in thread
From: Juri Linkov @ 2009-11-29 22:01 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: per.starback, dak, rms, Kenichi Handa, emacs-devel

>> Here's another idea.  We could employ some heuristics to see if the
>> distribution of those characters seems typical for the way those
>> characters are used.  For instance, some of the punctuation characters
>
> Using such heuristics might be a good idea in general to automatically
> detect which encoding is used, or which language is used.

Unicad (http://www.emacswiki.org/emacs/Unicad) uses statistical models
to auto-detect windows-1252 and many many other coding systems
(auto-detecting windows-1252 is not advertised on the main page,
but actually can be observed in source code).  The theory is described
at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
I hope sometime this will be added to Emacs.

-- 
Juri Linkov
http://www.jurta.org/emacs/




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.)
  2009-11-29 16:01                                     ` Richard Stallman
  2009-11-29 16:31                                       ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
@ 2009-11-29 22:19                                       ` Kim F. Storm
  2009-11-30  1:42                                         ` Stephen J. Turnbull
  1 sibling, 1 reply; 101+ messages in thread
From: Kim F. Storm @ 2009-11-29 22:19 UTC (permalink / raw)
  To: rms; +Cc: per.starback, dak, emacs-devel, monnier, Kenichi Handa

Richard Stallman <rms@gnu.org> writes:

> We don't want to raise the priority of windows-1252 because it would
> cause many other encodings not to be recognized.
>
> If it turns out that windows-1252 files are the main cause of
> 8-bit-control characters in the buffer, here's another idea.

Sorry I haven't followed the entire thread, but here's an idea:

A Windows-1252 file most likely originated on Windoze, so what about
only raising the priority when the file has CRLF line endings?
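[Editorial illustration: a sketch of that precondition, in hypothetical Python, just to make the idea concrete.]

```python
# Treat windows-1252 as a likely candidate only when CRLF line endings
# dominate, suggesting the file originated on Windows.
def crlf_dominant(data: bytes) -> bool:
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf
    return crlf > 0 and crlf >= bare_lf
```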

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.)
  2009-11-29 22:19                                       ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Kim F. Storm
@ 2009-11-30  1:42                                         ` Stephen J. Turnbull
  0 siblings, 0 replies; 101+ messages in thread
From: Stephen J. Turnbull @ 2009-11-30  1:42 UTC (permalink / raw)
  To: Kim F. Storm; +Cc: dak, rms, Kenichi Handa, per.starback, emacs-devel, monnier

Kim F. Storm writes:

 > A Windows-1252 file most likely originated on Windoze, so what about
 > only raising the priority when the file has CRLF line endings?

That turns out not to be true in my experience.  There are a lot of
European people of my acquaintance who started using 1252 when it had
the EURO SIGN (Microsoft put it in well before Euros were in
circulation IIRC) and ISO-8859-15 had not yet been published.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-29 22:01                                         ` Juri Linkov
@ 2009-11-30  6:05                                           ` tomas
  2009-11-30 12:09                                             ` Andreas Schwab
  0 siblings, 1 reply; 101+ messages in thread
From: tomas @ 2009-11-30  6:05 UTC (permalink / raw)
  To: Juri Linkov
  Cc: dak, rms, Kenichi Handa, per.starback, emacs-devel,
	Stefan Monnier

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Nov 30, 2009 at 12:01:29AM +0200, Juri Linkov wrote:
[...]
> Unicad (http://www.emacswiki.org/emacs/Unicad) uses statistic models
> to auto-detect windows-1252 and many many other coding systems
> (auto-detecting windows-1252 is not advertised on the main page,
> but actually can be observed in source code).  The theory is described
> at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
> I hope sometime this will be added to Emacs.

It looks theoretically quite neat. I hope so too -- the current
heuristics are often at a loss.

Ironically, the cited page at mozilla doesn't display correctly in my
browser (of all things mozilla!). Setting to auto-detect guesses UTF-8
whereas it's latin-1 -- as correctly advertised in the headers :-)
(yes, it's off-topic and it's most probably some misconfiguration on my
side, but I thought some might savour the irony).

But I also feel that we need more systematic heuristics. I'll give
Unicad a try.

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLE2CwBcgs9XrR2kYRAsCxAJ0cyKl6hp5jN4+N7ogimn354z9+lgCdHAqW
REqc68ZeDEqG7eXi7d/HFLU=
=efXE
-----END PGP SIGNATURE-----




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-30  6:05                                           ` tomas
@ 2009-11-30 12:09                                             ` Andreas Schwab
  2009-11-30 12:39                                               ` tomas
  0 siblings, 1 reply; 101+ messages in thread
From: Andreas Schwab @ 2009-11-30 12:09 UTC (permalink / raw)
  To: tomas
  Cc: dak, rms, Kenichi Handa, per.starback, emacs-devel, Juri Linkov,
	Stefan Monnier

tomas@tuxteam.de writes:

> Ironically, the cited page at mozilla doesn't display correctly in my
> browser (of all things mozilla!). Setting to auto-detect guesses UTF-8
> whereas it's latin-1 -- as correctly advertised in the headers :-)

The HTML header claims UTF-8, as does the HTTP header.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Displaying bytes (was: Inadequate documentation of silly
  2009-11-30 12:09                                             ` Andreas Schwab
@ 2009-11-30 12:39                                               ` tomas
  0 siblings, 0 replies; 101+ messages in thread
From: tomas @ 2009-11-30 12:39 UTC (permalink / raw)
  To: emacs-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Nov 30, 2009 at 01:09:55PM +0100, Andreas Schwab wrote:
> tomas@tuxteam.de writes:
> 
> > Ironically, the cited page at mozilla doesn't display correctly in my
> > browser (of all things mozilla!). Setting to auto-detect guesses UTF-8
> > whereas it's latin-1 -- as correctly advertised in the headers :-)
> 
> The HTML header claims UTF-8, as does the HTTP header.

I stand corrected. I put too much faith in what Mozilla told me in
the "page info" blurb. Note to self: don't believe what web browsers
tell you. Grumble.

This makes the irony even better ;-)

Thanks

- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLE70eBcgs9XrR2kYRAlUPAJ4n5x+aaGoYGmbANgY/SXlOFF1ETACdFa2j
TZxfwsMyxnzqI7MI/9+HTPM=
=JXqN
-----END PGP SIGNATURE-----




^ permalink raw reply	[flat|nested] 101+ messages in thread

end of thread, other threads:[~2009-11-30 12:39 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-18 19:12 [acm@muc.de: Re: Inadequate documentation of silly characters on screen.] Alan Mackenzie
2009-11-19  1:27 ` Fwd: Re: Inadequate documentation of silly characters on screen Stefan Monnier
2009-11-19  8:20   ` Alan Mackenzie
2009-11-19  8:50     ` Miles Bader
2009-11-19 10:16     ` Fwd: " Andreas Schwab
2009-11-19 12:21       ` Alan Mackenzie
2009-11-19 13:21       ` Jason Rumney
2009-11-19 13:35         ` Stefan Monnier
2009-11-19 14:18         ` Alan Mackenzie
2009-11-19 14:58           ` Jason Rumney
2009-11-19 15:42             ` Alan Mackenzie
2009-11-19 19:39               ` Eli Zaretskii
2009-11-19 15:30           ` Stefan Monnier
2009-11-19 15:58             ` Alan Mackenzie
2009-11-19 16:06               ` Andreas Schwab
2009-11-19 16:47               ` Aidan Kehoe
2009-11-19 17:29                 ` Alan Mackenzie
2009-11-19 18:21                   ` Aidan Kehoe
2009-11-20  2:43                   ` Stephen J. Turnbull
2009-11-19 19:45                 ` Eli Zaretskii
2009-11-19 20:07                   ` Eli Zaretskii
2009-11-19 19:55                 ` Stefan Monnier
2009-11-20  3:13                   ` Stephen J. Turnbull
2009-11-19 16:55               ` David Kastrup
2009-11-19 18:08                 ` Alan Mackenzie
2009-11-19 19:25                   ` Davis Herring
2009-11-19 21:25                     ` Alan Mackenzie
2009-11-19 22:31                       ` David Kastrup
2009-11-21 22:52                         ` Richard Stallman
2009-11-23  2:08                           ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stefan Monnier
2009-11-23 20:38                             ` Richard Stallman
2009-11-23 21:34                               ` Per Starbäck
2009-11-24 22:47                                 ` Richard Stallman
2009-11-25  1:33                                   ` Kenichi Handa
2009-11-25  2:29                                     ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
2009-11-25  2:50                                       ` Lennart Borgman
2009-11-25  6:25                                       ` Stephen J. Turnbull
2009-11-25  5:40                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Ulrich Mueller
2009-11-26 22:59                                       ` Displaying bytes Reiner Steib
2009-11-27  0:16                                         ` Ulrich Mueller
2009-11-27  1:41                                         ` Stefan Monnier
2009-11-27  4:14                                         ` Stephen J. Turnbull
2009-11-25  5:59                                     ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stephen J. Turnbull
2009-11-25  8:16                                       ` Kenichi Handa
2009-11-29 16:01                                     ` Richard Stallman
2009-11-29 16:31                                       ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier
2009-11-29 22:01                                         ` Juri Linkov
2009-11-30  6:05                                           ` tomas
2009-11-30 12:09                                             ` Andreas Schwab
2009-11-30 12:39                                               ` tomas
2009-11-29 22:19                                       ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Kim F. Storm
2009-11-30  1:42                                         ` Stephen J. Turnbull
2009-11-24  1:28                               ` Displaying bytes Stefan Monnier
2009-11-24 22:47                                 ` Richard Stallman
2009-11-25  2:18                                   ` Stefan Monnier
2009-11-26  6:24                                     ` Richard Stallman
2009-11-26  8:59                                       ` David Kastrup
2009-11-26 14:57                                       ` Stefan Monnier
2009-11-26 16:28                                         ` Lennart Borgman
2009-11-27  6:36                                         ` Richard Stallman
2009-11-24 22:47                                 ` Richard Stallman
2009-11-20  8:48                       ` Fwd: Re: Inadequate documentation of silly characters on screen Eli Zaretskii
2009-11-19 19:52                   ` Eli Zaretskii
2009-11-19 20:53                     ` Alan Mackenzie
2009-11-19 22:16                       ` David Kastrup
2009-11-20  8:55                         ` Eli Zaretskii
2009-11-19 20:05                   ` Stefan Monnier
2009-11-19 21:27                     ` Alan Mackenzie
2009-11-19 19:43               ` Eli Zaretskii
2009-11-19 21:57                 ` Alan Mackenzie
2009-11-19 23:10                   ` Stefan Monnier
2009-11-19 20:02               ` Stefan Monnier
2009-11-19 14:08     ` Stefan Monnier
2009-11-19 14:50       ` Jason Rumney
2009-11-19 15:27         ` Stefan Monnier
2009-11-19 23:12           ` Miles Bader
2009-11-20  2:16             ` Stefan Monnier
2009-11-20  3:37             ` Stephen J. Turnbull
2009-11-20  4:30               ` Stefan Monnier
2009-11-20  7:18                 ` Stephen J. Turnbull
2009-11-20 14:16                   ` Stefan Monnier
2009-11-21  4:13                     ` Stephen J. Turnbull
2009-11-21  5:24                       ` Stefan Monnier
2009-11-21  6:42                         ` Stephen J. Turnbull
2009-11-21  6:49                           ` Stefan Monnier
2009-11-21  7:27                             ` Stephen J. Turnbull
2009-11-23  1:58                               ` Stefan Monnier
2009-11-21 12:33                           ` David Kastrup
2009-11-21 13:55                             ` Stephen J. Turnbull
2009-11-21 14:36                               ` David Kastrup
2009-11-21 17:53                                 ` Stephen J. Turnbull
2009-11-21 23:30                                   ` David Kastrup
2009-11-22  1:27                                     ` Sebastian Rose
2009-11-22  8:06                                       ` David Kastrup
2009-11-22 23:52                                         ` Sebastian Rose
2009-11-19 17:08       ` Fwd: " Alan Mackenzie
  -- strict thread matches above, loose matches on Subject: below --
2009-11-18  9:37 Alan Mackenzie
2009-11-18  9:40 ` Miles Bader
2009-11-18 10:15   ` Alan Mackenzie
2009-11-18 12:03     ` Jason Rumney
2009-11-18 15:02     ` Stefan Monnier

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).