unofficial mirror of emacs-devel@gnu.org 
* Inadequate documentation of silly characters on screen.
@ 2009-11-18  9:37 Alan Mackenzie
  2009-11-18  9:40 ` Miles Bader
  0 siblings, 1 reply; 26+ messages in thread
From: Alan Mackenzie @ 2009-11-18  9:37 UTC (permalink / raw)
  To: emacs-devel

Hi, Emacs,

Once again, I'm getting silly characters on the screen.  In *scratch*,
where I've written "ñ", what gets displayed is "\361".  It may have
happened when I upgraded to Emacs 23.

This keeps happening to me, I don't know why, but most importantly it
seems unusually poorly documented.  Goodness knows how an ordinary user
manages this, but I cannot easily track down the proper bit in the
manual to sort this out.

Things like character sets and their display aren't my area.  Why won't
it just work?

I go to the coding systems page.  There is no @def{coding system}, just
vague references with which you're supposed to get the understanding by
osmosis.  What I DON'T get from this osmosis is whether or not "coding
systems" deal with the garbage "\361" on my screen.  There is definitely
a missing "For the appearance of the text on your screen @ref{...}".

So, once again, I've got between half an hour and an hour of wasted time
trying to debug, yet again, this problem.  Why can I not easily find the
answer in the Emacs manual?

Of secondary importance, why does this problem keep happening in the
first place?

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Inadequate documentation of silly characters on screen.
  2009-11-18  9:37 Inadequate documentation of silly characters on screen Alan Mackenzie
@ 2009-11-18  9:40 ` Miles Bader
  2009-11-18 10:15   ` Alan Mackenzie
  0 siblings, 1 reply; 26+ messages in thread
From: Miles Bader @ 2009-11-18  9:40 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

Alan Mackenzie <acm@muc.de> writes:
> Once again, I'm getting silly characters on the screen.  In *scratch*,
> where I've written "ñ", what gets displayed is "\361".  It may have
> happened when I upgraded to Emacs 23.

Does it happen with "emacs -Q"?

How do you "write" ñ (do you use an input method?  Type it on your keyboard...?)?

What language environment do you use (if you don't set it explicitly, it
will be set automatically from the LANG environment variable)?

Do you use X emacs, emacs in a tty, etc.?  If tty emacs, which type of
terminal do you use?

-Miles

-- 
Defenceless, adj. Unable to attack.





* Re: Inadequate documentation of silly characters on screen.
  2009-11-18  9:40 ` Miles Bader
@ 2009-11-18 10:15   ` Alan Mackenzie
  2009-11-18 12:03     ` Jason Rumney
  2009-11-18 15:02     ` Stefan Monnier
  0 siblings, 2 replies; 26+ messages in thread
From: Alan Mackenzie @ 2009-11-18 10:15 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-devel

Hi, Miles,

thanks for such a quick answer.

On Wed, Nov 18, 2009 at 06:40:53PM +0900, Miles Bader wrote:
> Alan Mackenzie <acm@muc.de> writes:
> > Once again, I'm getting silly characters on the screen.  In *scratch*,
> > where I've written "ñ", what gets displayed is "\361".  It may have
> > happened when I upgraded to Emacs 23.

> Does it happen with "emacs -Q"?

Yes.

> How do you "write" ñ (do you use an input method?  Type it on your keyboard...?)?

When I hold the <alt> key and type "241" on the numeric keypad, the "ñ"
appears correctly on the screen.  My program does (insert pr-line), where
pr-line is a string containing the ñ - this puts \361 up.
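
On the notation itself: "\361" is the octal escape for byte value 241 (hex F1), which is the Latin-1 code for ñ. A minimal sketch of that arithmetic follows; Python is used purely for illustration here and the variable names are made up for the example:

```python
# 241 is the decimal code typed on the keypad.
code = 241

# The same value written as a backslash followed by its octal digits.
octal_escape = "\\" + format(code, "o")

# Interpreting that single byte as Latin-1 gives the intended character.
latin1_char = bytes([code]).decode("latin-1")

print(octal_escape, hex(code), latin1_char)
```

So the escape on screen and the character typed are the same number; what differs is whether it is being treated as a raw byte or as a character.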

> What language environment do you use (if you don't set it explicitly, it
> will be set automatically from the LANG environment variable)?

I've tried M-x set-language-environment <CR> latin-1.  The mode line of
my *scratch* looks like this:

    -111:**--F1  *scratch*      All L10   C184  (Lisp Interaction)----P678/678

> Do you use X emacs, emacs in a tty, etc.?  If tty emacs, which type of
> terminal do you use?

A Linux tty.

> -Miles

-- 
Alan Mackenzie (Nuremberg, Germany).





* Re: Inadequate documentation of silly characters on screen.
  2009-11-18 10:15   ` Alan Mackenzie
@ 2009-11-18 12:03     ` Jason Rumney
  2009-11-18 15:02     ` Stefan Monnier
  1 sibling, 0 replies; 26+ messages in thread
From: Jason Rumney @ 2009-11-18 12:03 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, Miles Bader

Alan Mackenzie wrote:
> I've tried M-x set-language-environment <CR> latin-1.  The mode line of
> my *scratch* looks like this:
>   
If you DON'T do that, after starting emacs -Q, what does C-h L <CR> report?

It should be initialised to the same as your terminal language 
environment, which is what you need when running in a tty.

>> Do you use X emacs, emacs in a tty, etc.?  If tty emacs, which type of
>> terminal do you use?
>>     
>
> A Linux tty.
>   

More specific please. Console? Serial? xterm? GNU screen?






* Re: Inadequate documentation of silly characters on screen.
  2009-11-18 10:15   ` Alan Mackenzie
  2009-11-18 12:03     ` Jason Rumney
@ 2009-11-18 15:02     ` Stefan Monnier
  1 sibling, 0 replies; 26+ messages in thread
From: Stefan Monnier @ 2009-11-18 15:02 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel, Miles Bader

> When I hold the <alt> key and type "241" on the numeric keypad, the "ñ"
> appears correctly on the screen.

So assuming you did that in a "normal" buffer, that means that A-241 is
properly interpreted by self-insert-command and you get the char "ñ"
inserted in your buffer (and then properly displayed as well).

> My program does (insert pr-line), where
> pr-line is a string containing the ñ - this puts \361 up.

How does pr-line contain this char?  I.e. how do you construct it?


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-19  8:20   ` Alan Mackenzie
@ 2009-11-19  8:50     ` Miles Bader
  2009-11-19 14:08     ` Fwd: " Stefan Monnier
  1 sibling, 0 replies; 26+ messages in thread
From: Miles Bader @ 2009-11-19  8:50 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Stefan Monnier, emacs-devel

Alan Mackenzie <acm@muc.de> writes:
> Why do we have both unibyte and multibyte?  Is there any reason
> not to remove unibyte altogether (though obviously not for 23.2).

For certain rare cases, it's useful for efficiency reasons, but maybe it
should never be the default.

-Miles

-- 
Opposition, n. In politics the party that prevents the Government from running
amok by hamstringing it.





* Re: Inadequate documentation of silly characters on screen.
  2009-11-19 15:27         ` Stefan Monnier
@ 2009-11-19 23:12           ` Miles Bader
  2009-11-20  2:16             ` Stefan Monnier
  2009-11-20  3:37             ` Stephen J. Turnbull
  0 siblings, 2 replies; 26+ messages in thread
From: Miles Bader @ 2009-11-19 23:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel, Jason Rumney

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> many strings start as unibyte even though they really should start
> right away as multibyte.

That seems the fundamental problem here.

It seems better to make unibyte strings something that can only be
created with some explicit operation.

-Miles

-- 
"Suppose we've chosen the wrong god. Every time we go to church we're
just making him madder and madder." -- Homer Simpson





* Re: Inadequate documentation of silly characters on screen.
  2009-11-19 23:12           ` Miles Bader
@ 2009-11-20  2:16             ` Stefan Monnier
  2009-11-20  3:37             ` Stephen J. Turnbull
  1 sibling, 0 replies; 26+ messages in thread
From: Stefan Monnier @ 2009-11-20  2:16 UTC (permalink / raw)
  To: Miles Bader; +Cc: Alan Mackenzie, emacs-devel, Jason Rumney

>> many strings start as unibyte even though they really should start
>> right away as multibyte.
> That seems the fundamental problem here.
> It seems better to make unibyte strings something that can only be
> created with some explicit operation.

Agreed.  As I said earlier in this thread:

   We should probably move towards making all string immediates
   multibyte and add a new syntax for unibyte immediates.


-- Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-19 23:12           ` Miles Bader
  2009-11-20  2:16             ` Stefan Monnier
@ 2009-11-20  3:37             ` Stephen J. Turnbull
  2009-11-20  4:30               ` Stefan Monnier
  1 sibling, 1 reply; 26+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  3:37 UTC (permalink / raw)
  To: Miles Bader; +Cc: Alan Mackenzie, Jason Rumney, Stefan Monnier, emacs-devel

Miles Bader writes:
 > Stefan Monnier <monnier@iro.umontreal.ca> writes:
 > > many strings start as unibyte even though they really should start
 > > right away as multibyte.
 > 
 > That seems the fundamental problem here.
 > 
 > It seems better to make unibyte strings something that can only be
 > created with some explicit operation.

I don't see why you *need* them at all.  Both pre-Emacs-integration
Mule and XEmacs do fine with a multibyte representation for binary.
Nobody has complained about performance of stream operations since
Kyle Jones and Hrvoje Niksic bitched and we did some measurements in
1998 or so.  It turns out that (as you'd expect) multibyte stream
operations (except Boyer-Moore, which takes no performance hit :-) are
about 50% slower because the representation is about 50% bigger.  But
this is rarely noticeable to users.  The noticeable performance problems
turned out to be a problem with Unix interfaces, not multibyte.

The performance problem is in array operations, since (without
caching) finding a particular character position is O(position).
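
To illustrate why position lookup is O(position) in a variable-width representation, here is a rough Python sketch. It uses plain UTF-8 as a stand-in for a multibyte buffer (an assumption for the example, not Emacs's actual code), and `byte_offset_of_char` is a hypothetical helper:

```python
def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the byte offset where the char_index-th character starts.

    The scan begins at byte 0 every time, so the cost grows with char_index.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character; skip them.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


text = "a\u00f1o nuevo".encode("utf-8")  # "año nuevo"; ñ takes two bytes
print(byte_offset_of_char(text, 2))      # offset of the "o" after ñ
```

With a fixed-width (unibyte) representation the same lookup is a single index operation, which is the trade-off being described.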

If you want to turn Emacs into an engine for general network
programming and the like, yes, it would be good to have a separate
unibyte type.  This is what Python does, but Emacs would not have to
go through the agony of switching from a unibyte representation for
human-readable text to a multibyte representation the way Python does
for Python 3.  In that case, Emacs should not create them without an
explicit operation, and there should be a separate notation such as
#b"this is a unibyte string" (although #b may already be taken?) for
literals.






* Re: Inadequate documentation of silly characters on screen.
  2009-11-20  3:37             ` Stephen J. Turnbull
@ 2009-11-20  4:30               ` Stefan Monnier
  2009-11-20  7:18                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Monnier @ 2009-11-20  4:30 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Alan Mackenzie, emacs-devel, Jason Rumney, Miles Bader

> I don't see why you *need* them at all.

We don't need the unibyte representation.  But we do need to distinguish
bytes and chars, encoded strings from non-encoded strings, etc...
What representation is used for them is secondary, but using different
representations for the two cases doesn't seem to be a source
of problems.  The source of problems is that inherited history where we
mixed the unibyte and multibyte objects and tried to pretend they were
just one and the same thing and that conversion between them can be
done automatically.


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-20  4:30               ` Stefan Monnier
@ 2009-11-20  7:18                 ` Stephen J. Turnbull
  2009-11-20 14:16                   ` Stefan Monnier
  0 siblings, 1 reply; 26+ messages in thread
From: Stephen J. Turnbull @ 2009-11-20  7:18 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Miles Bader, Alan Mackenzie, Jason Rumney, emacs-devel

Stefan Monnier writes:

 > What representation is used for them is secondary, but using different
 > representations for the two cases doesn't seem to be a source
 > of problems.  The source of problems is that inherited history where we
 > mixed the unibyte and multibyte objects and tried to pretend they were
 > just one and the same thing and that conversion between them can be
 > done automatically.

Er, they *were* one and the same thing because of string-as-unibyte
and friends.






* Re: Inadequate documentation of silly characters on screen.
  2009-11-20  7:18                 ` Stephen J. Turnbull
@ 2009-11-20 14:16                   ` Stefan Monnier
  2009-11-21  4:13                     ` Stephen J. Turnbull
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Monnier @ 2009-11-20 14:16 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Miles Bader, Alan Mackenzie, Jason Rumney, emacs-devel

>> What representation is used for them is secondary, but using different
>> representations for the two cases doesn't seem to be a source
>> of problems.  The source of problems is that inherited history where we
>> mixed the unibyte and multibyte objects and tried to pretend they were
>> just one and the same thing and that conversion between them can be
>> done automatically.

> Er, they *were* one and the same thing because of string-as-unibyte
> and friends.

string-as-unibyte returns a new string, so no: they were not the same.


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-20 14:16                   ` Stefan Monnier
@ 2009-11-21  4:13                     ` Stephen J. Turnbull
  2009-11-21  5:24                       ` Stefan Monnier
  0 siblings, 1 reply; 26+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21  4:13 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, Jason Rumney, emacs-devel, Miles Bader

Stefan Monnier writes:

 > string-as-unibyte returns a new string, so no: they were not the same.

Sorry, `toggle-enable-multibyte-characters' was what I had in mind.
So, yes, they *were* *indeed* the same.  YHBT (it wasn't intentional).

I dunno, de gustibus non est disputandum and all that, but this idea
of having an in-band representation for raw bytes in a multibyte
string sounds to me like more trouble than it's worth.  I think it
would be much better to serve (eg) AUCTeX's needs with a special
coding system that grabs some unlikely-to-be-used private code space
and puts the bytes there.  That puts the responsibility for dealing
with such perversity[1] on the people who have some idea what they're
dealing with, not unsuspecting CC Mode maintainers who won't be using
that coding system.

And it should be either an error to (aset string pos 241) (sorry
Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ).  I
favor the former, because what Alan is doing screws Spanish-speaking
users AFAICS.  OTOH, the latter extends naturally if you have plans to
add support for fixed-width Unicode buffers (UTF-16 and UTF-32).

Vive la différence techniquement!

Footnotes: 
[1]  In the sense of "the world is perverse", I'm not blaming AUCTeX
or TeX for this!






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  4:13                     ` Stephen J. Turnbull
@ 2009-11-21  5:24                       ` Stefan Monnier
  2009-11-21  6:42                         ` Stephen J. Turnbull
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Monnier @ 2009-11-21  5:24 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Alan Mackenzie, Jason Rumney, emacs-devel, Miles Bader

> Sorry, `toggle-enable-multibyte-characters' was what I had in mind.
> So, yes, they *were* *indeed* the same.  YHBT (it wasn't intentional).

Oh, yes, *that* one.  I haven't yet managed to run a useful Emacs
instance with an "assert (BEG == Z);" at the entrance to this nasty
function, but I keep hoping I'll get there.

> I dunno, de gustibus non est disputandum and all that, but this idea
> of having an in-band representation for raw bytes in a multibyte
> string sounds to me like more trouble than it's worth.  I think it
> would be much better to serve (eg) AUCTeX's needs with a special
> coding system that grabs some unlikely-to-be-used private code space
> and puts the bytes there.  That puts the responsibility for dealing
> with such perversity[1] on the people who have some idea what they're
> dealing with, not unsuspecting CC Mode maintainers who won't be using
> that coding system.

I don't know what you mean.  The eight-bit "chars" were introduced to
make sure that decoding+reencoding will always return the exact same
byte-sequence, no matter what coding-system was used (i.e. even if the
byte-sequence is invalid for that coding-system).  Dunno how XEmacs
handles it.

> And it should be either an error to (aset string pos 241) (sorry
> Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ).  I
> favor the former, because what Alan is doing screws Spanish-speaking
> users AFAICS.  OTOH, the latter extends naturally if you have plans to
> add support for fixed-width Unicode buffers (UTF-16 and UTF-32).

I understand this even less.  I think XEmacs's fundamental tradeoffs are
subtly different but lead to very far-reaching consequences, and for
that reason it's difficult for us to take a step back and understand the
other point of view.


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  5:24                       ` Stefan Monnier
@ 2009-11-21  6:42                         ` Stephen J. Turnbull
  2009-11-21  6:49                           ` Stefan Monnier
  2009-11-21 12:33                           ` David Kastrup
  0 siblings, 2 replies; 26+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21  6:42 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Miles Bader, Alan Mackenzie, emacs-devel, Jason Rumney

Stefan Monnier writes:

 > I don't know what you mean.  The eight-bit "chars" were introduced to
 > make sure that decoding+reencoding will always return the exact same
 > byte-sequence, no matter what coding-system was used (i.e. even if the
 > byte-sequence is invalid for that coding-system).  Dunno how XEmacs
 > handles it.

Honestly, it currently doesn't, or doesn't very well, despite some
work by Aidan.

However, I think a well-behaved platform should by default error
(something derived from invalid-state, in XEmacs's error hierarchy) in
such a case; normally this means corruption in the file.  There are
special cases like utf8latex whose error messages give you a certain
number of octets without respecting character boundaries; I agree
there is need to handle this case.  What Python 3 (PEP 383) does is
provide a family of coding system variants which use invalid Unicode
surrogates to encode "raw bytes" for situations where the user asks
you to proceed despite invalid octet sequences for the coding system;
since Emacs's internal code is UTF-8, any Unicode surrogate is invalid
and could be used for this purpose.  This would make non-Emacs apps
barf errors on such Emacs autosaves, but they'll probably barf on the
source file, too.
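
For reference, the PEP 383 behaviour mentioned above can be sketched in Python 3 with the standard `surrogateescape` error handler. The sample bytes are invented for the illustration:

```python
raw = b"Espa\xf1ol"  # Latin-1 bytes; the lone F1 byte is invalid as UTF-8

# Strict decoding, the default, refuses the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# On explicit request, the bad byte is smuggled through as a lone surrogate...
text = raw.decode("utf-8", errors="surrogateescape")

# ...and re-encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(text), round_trip == raw)
```

The point of interest is that the lossless path is opt-in: without the `errors` argument the decode simply fails.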

 > > And it should be either an error to (aset string pos 241) (sorry
 > > Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ).  I
 > > favor the former, because what Alan is doing screws Spanish-speaking
 > > users AFAICS.  OTOH, the latter extends naturally if you have plans to
 > > add support for fixed-width Unicode buffers (UTF-16 and UTF-32).
 > 
 > I understand this even less.

There's a typo in the expr above, should be "multibyte-string".  The
proposed treatment of 241 is due to the fact that it is currently
illegal in multibyte strings AIUI.

Re the bit about Spanish-speakers: AIUI, Alan is translating multiline
strings to oneline strings by using an unusual graphic character.  But
it's only unusual in non-Spanish cases; Spanish-speakers may very well
want to include comments like "¡I wanna write this comment in Español!"
which would presumably get unfolded to "¡I wanna write this comment in
Espa\nol!"  Not very nice.

Re widechar buffers: the codes for Latin-1 characters in UTF-16 and
UTF-32 are just zero-padded extensions of the unibyte codes.  I'm
pretty sure it's this kind of thing that Ben had in mind when he
originally designed the XEmacs version of the Mule internal encoding
to make (= (char-int ?ñ) 241) true in all versions of XEmacs.
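
The zero-padding claim is easy to check outside Emacs. A small Python sketch, using big-endian encodings without a byte-order mark so the padding is visible:

```python
ch = "\u00f1"                    # ñ, code point 241 (hex F1)

utf16 = ch.encode("utf-16-be")   # one 16-bit code unit
utf32 = ch.encode("utf-32-be")   # one 32-bit code unit

print(ord(ch), utf16.hex(), utf32.hex())
```

In both cases the final byte is F1 and everything before it is zero, which is the sense in which the fixed-width codes extend the unibyte one.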

 > I think XEmacs's fundamental tradeoffs are subtly different but
 > lead to very far-reaching consequences,

Indeed, but I'm not talking about XEmacs, except for comparison of
techniques.






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  6:42                         ` Stephen J. Turnbull
@ 2009-11-21  6:49                           ` Stefan Monnier
  2009-11-21  7:27                             ` Stephen J. Turnbull
  2009-11-21 12:33                           ` David Kastrup
  1 sibling, 1 reply; 26+ messages in thread
From: Stefan Monnier @ 2009-11-21  6:49 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Miles Bader, Alan Mackenzie, emacs-devel, Jason Rumney

> There's a typo in the expr above, should be "multibyte-string".  The
> proposed treatment of 241 is due to the fact that it is currently
> illegal in multibyte strings AIUI.

241 is perfectly valid in multibyte strings (as well as in
unibyte-strings).


        Stefan





* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  6:49                           ` Stefan Monnier
@ 2009-11-21  7:27                             ` Stephen J. Turnbull
  2009-11-23  1:58                               ` Stefan Monnier
  0 siblings, 1 reply; 26+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21  7:27 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel

Stefan Monnier writes:

 > 241 is perfectly valid in multibyte strings (as well as in
 > unibyte-strings).

OK, so "invalid" was up to Emacs 22, then?

So the problem is that because characters are integers and vice versa,
there's no way for the user to let Emacs duck-type multibyte vs
unibyte strings for him.  If he cares, he needs to check.  If he
doesn't care, eventually Emacs will punish him for his lapse.

I suppose subst-char-in-string is similarly useless for Alan's
purpose, then?  What he really needs to use is something like

    (replace-in-string str "\n" "ñ")

right?






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  6:42                         ` Stephen J. Turnbull
  2009-11-21  6:49                           ` Stefan Monnier
@ 2009-11-21 12:33                           ` David Kastrup
  2009-11-21 13:55                             ` Stephen J. Turnbull
  1 sibling, 1 reply; 26+ messages in thread
From: David Kastrup @ 2009-11-21 12:33 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Stefan Monnier writes:
>
>  > I don't know what you mean.  The eight-bit "chars" were introduced
>  > to make sure that decoding+reencoding will always return the exact
>  > same byte-sequence, no matter what coding-system was used
>  > (i.e. even if the byte-sequence is invalid for that coding-system).
>  > Dunno how XEmacs handles it.
>
> Honestly, it currently doesn't, or doesn't very well, despite some
> work by Aidan.

But we don't need to make this a problem for _Emacs_.

> However, I think a well-behaved platform should by default error
> (something derived from invalid-state, in XEmacs's error hierarchy) in
> such a case; normally this means corruption in the file.

We take care that it does not mean corruption.  And more often it means
that you might have been loading with the wrong encoding (people do that
all the time).  If you edit some innocent ASCII part and save again, you
won't appreciate changes all across the file elsewhere in parts you did
not touch or see on-screen.

Sometimes there is no "right encoding".  If I load an executable or an
image file with tag strings and change one string in overwrite mode, I
want to be able to save again.  Compiled Elisp files contain binary
strings as well.  There may be source files with binary blobs in them,
there may be files with parts in different encodings and so on.

> There are special cases like utf8latex whose error messages give you a
> certain number of octets without respecting character boundaries; I
> agree there is need to handle this case.

Forget about the TeX problem: that is a red herring.  It is just one
case where irreversible corruption is not the right answer.  In fact, I
know of no case where irreversible corruption is the right answer.
"Don't touch what you don't understand" is a good rationale.  For
XEmacs, following this rationale would currently require erroring out.
And I actually recommend that you do so: you will learn the hard way
that users like the Emacs solution of "don't touch what you don't
understand", namely having artificial code points for losslessly
representing the parts Emacs does not understand in a particular
encoding, better.

> What Python 3 (PEP 383) does is provide a family of coding system
> variants which use invalid Unicode surrogates to encode "raw bytes"
> for situations where the user asks you to proceed despite invalid
> octet sequences for the coding system; since Emacs's internal code is
> UTF-8, any Unicode surrogate is invalid and could be used for this
> purpose.  This would make non-Emacs apps barf errors on such Emacs
> autosaves, but they'll probably barf on the source file, too.

We currently _have_ such a scheme in place.  We just use different
Unicode-invalid code points.

> There's a typo in the expr above, should be "multibyte-string".  The
> proposed treatment of 241 is due to the fact that it is currently
> illegal in multibyte strings AIUI.

It is a perfectly valid character ñ in multibyte strings, but not
represented by its single-byte/latin-1 equivalent.
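
The distinction can be sketched in Python, taking plain UTF-8 as an approximation of the multibyte representation (an assumption for the example; Emacs's internal format is only UTF-8-like):

```python
ch = "\u00f1"                      # the character ñ, code point 241

as_latin1 = ch.encode("latin-1")   # single byte F1
as_utf8 = ch.encode("utf-8")       # two bytes, C3 B1

print(ord(ch), as_latin1.hex(), as_utf8.hex())
```

Same character and same code point, but the stored bytes differ, so the byte F1 on its own is not how a multibyte string holds ñ.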

> Re widechar buffers: the codes for Latin-1 characters in UTF-16 and
> UTF-32 are just zero-padded extensions of the unibyte codes.

I think you may be muddling characters and their byte sequence
representations.  At least I can't read much sense into this statement
otherwise.

-- 
David Kastrup






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 12:33                           ` David Kastrup
@ 2009-11-21 13:55                             ` Stephen J. Turnbull
  2009-11-21 14:36                               ` David Kastrup
  0 siblings, 1 reply; 26+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21 13:55 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > > However, I think a well-behaved platform should by default error
 > > (something derived from invalid-state, in XEmacs's error hierarchy) in
 > > such a case; normally this means corruption in the file.
 > 
 > We take care that it does not mean corruption.

I meant pre-existing corruption, like your pre-existing disposition to
bash XEmacs.  Please take it elsewhere; it doesn't belong on Emacs
channels.  (Of course I'd prefer not to see it on XEmacs channels
either, but at least it wouldn't be entirely off-topic there.)

 > And more often it means that you might have been loading with the
 > wrong encoding (people do that all the time).  If you edit some
 > innocent ASCII part

You can't do that if the file is not in a buffer because the encoding
error aborted the conversion.  Aborting the conversion is what the
Unicode Consortium requires, too, IIRC: errors in UTF-8 (or any other
UTF for that matter) are considered *fatal* by the standard.  Exactly
what that means is up to the application to decide.  One plausible
approach would be to do what you do now, but make the buffer read-only.

 > Sometimes there is no "right encoding".

So what?  The point is that there certainly are *wrong* encodings,
namely ones that will result in corruption if you try to save the file
in that encoding.  There are usually many "usable" encodings (binary
is always available, for example).  Some will be preferred by users,
and that will be reflected in coding system precedence.

But when faced with ambiguity, it is best to refuse to guess.

 > We currently _have_ [a scheme for encoding invalid sequences of
 > code units] in place.  We just use different Unicode-invalid code
 > points [from Python].

Conceded.  I realized that later; the important difference is that
Python only uses that scheme when explicitly requested.






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 13:55                             ` Stephen J. Turnbull
@ 2009-11-21 14:36                               ` David Kastrup
  2009-11-21 17:53                                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 26+ messages in thread
From: David Kastrup @ 2009-11-21 14:36 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > > However, I think a well-behaved platform should by default error
>  > > (something derived from invalid-state, in XEmacs's error
>  > > hierarchy) in such a case; normally this means corruption in the
>  > > file.
>  > 
>  > We take care that it does not mean corruption.
>
> I meant pre-existing corruption [...]

That interpretation is not the business of the editor.  It may decide to
give a warning, but refusing to work at all does not increase its
usefulness.

>  > And more often it means that you might have been loading with the
>  > wrong encoding (people do that all the time).  If you edit some
>  > innocent ASCII part
>
> You can't do that if the file is not in a buffer because the encoding
> error aborted the conversion.

Not being able to do what I want is not a particularly enticing feature.

> Aborting the conversion is what the Unicode Consortium requires, too,
> IIRC:

An editor is not the same as a validator.  It's not its business to
decide what files I should be allowed to work with.

> errors in UTF-8 (or any other UTF for that matter) are considered
> *fatal* by the standard.  Exactly what that means is up to the
> application to decide.  One plausible approach would be to do what you
> do now, but make the buffer read-only.

Making the buffer read-only is a reasonable thing to do if it can't
possibly be written back unchanged.  For example, if I load a file in
latin-1 and insert a few non-latin-1 characters.  In this case Emacs
should not just silently write the file in utf-8 because that changes
the encoding of some preexisting characters.  The situation is different
if I load a pure ASCII file: in that case, the utf-8 decision is
feasible when compatible with the environment.

>  > Sometimes there is no "right encoding".
>
> So what?  The point is that there certainly are *wrong* encodings,
> namely ones that will result in corruption if you try to save the file
> in that encoding.

But we have a fair number of encodings (those without escape characters
IIRC) which don't imply corruption when saving.  And that is a good
feature for an editor.  For example, when working with version control
systems, you want minimal diffs.  Encoding systems with escape
characters are not good for that.  I would strongly advise against Emacs
picking any escape-character based encoding (or otherwise
non-byte-stream-preserving) automatically.

Less breakage is always a good thing.

> But when faced with ambiguity, it is best to refuse to guess.

You don't need to guess if you just preserve the byte sequence.  That
makes it somebody else's problem.  The GNU utilities have always made it
a point to work with arbitrary input without insisting on it being
"sensible".  Historically, most Unix utilities just crashed when you fed
them arbitrary garbage.  They have taken a lesson from GNU nowadays.

And I consider it a good lesson.
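
As an illustration of "just preserve the byte sequence", here is a minimal Python sketch using the scheme Python offers when explicitly requested, the `surrogateescape` error handler (PEP 383).  This is an analogy, not Emacs's internal representation, and the sample bytes are invented.  An undecodable byte is carried through decoding, an ASCII-only edit is made, and on writing back only the edited bytes differ.

```python
# Bytes that are not valid UTF-8: a lone latin-1 'ñ' (0xF1) in the second line.
raw = b"title: notes\nname: ma\xf1ana\n"

# Explicitly request the byte-preserving error handler.  The invalid byte is
# mapped to a lone surrogate code point instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Edit only an innocent ASCII part of the text.
edited = text.replace("title: notes", "title: NOTES")

# Encoding with the same handler turns the surrogate back into the raw byte.
out = edited.encode("utf-8", errors="surrogateescape")

# Offsets at which the saved bytes differ from the original bytes.
changed = [i for i, (a, b) in enumerate(zip(raw, out)) if a != b]
```

In this sketch `changed` covers only the five bytes of the edited word, and the invalid byte 0xF1 comes back out untouched, which is the minimal-diff property argued for above.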

>  > We currently _have_ [a scheme for encoding invalid sequences of
>  > code units] in place.  We just use different Unicode-invalid code
>  > points [from Python].
>
> Conceded.  I realized that later; the important difference is that
> Python only uses that scheme when explicitly requested.

All in all, it is nobody else's business what encoding Emacs uses for
internal purposes.  Making Emacs preserve byte streams means that the
user has to worry less, not more, about what Emacs might be able to work
with.  The Emacs 23 internal encoding does a better job not getting into
the hair of users with encoding issues than Emacs 22 did, because of a
better correspondence with external encodings.  But ideally, the user
should not have to worry about the difference.

-- 
David Kastrup






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 14:36                               ` David Kastrup
@ 2009-11-21 17:53                                 ` Stephen J. Turnbull
  2009-11-21 23:30                                   ` David Kastrup
  0 siblings, 1 reply; 26+ messages in thread
From: Stephen J. Turnbull @ 2009-11-21 17:53 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > I meant pre-existing corruption [...]
 > 
 > That interpretation is not the business of the editor.

Precisely my point.  The editor has *no* way to interpret at the point
of encountering the invalid sequence, and therefore it should *stop*
and ask the user what to do.  That doesn't mean it should throw away
the data, but it sure does mean that it should not continue as though
there is valid data in the buffer.

Emacs is welcome to do that, but I am sure you will get bug reports
about it.
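
For comparison, the "stop at the invalid sequence" behaviour looks roughly like this in Python, where strict decoding is the default.  This is only a sketch of the idea; the helper name `try_decode` and the sample bytes are hypothetical, and what to do after stopping (ask the user, open read-only, and so on) is left to the application.

```python
# A lone latin-1 'ñ' (0xF1) makes this byte string invalid as UTF-8.
raw = b"name: ma\xf1ana\n"


def try_decode(data):
    """Strictly decode data as UTF-8.

    Returns (text, None) on success, or (None, offset) where offset is the
    position of the first byte that could not be decoded.
    """
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        return None, err.start


text, bad_offset = try_decode(raw)
```

No text is produced at all in this case; the caller only learns where decoding stopped and still has the original bytes in `raw`, so nothing has been thrown away.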






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 17:53                                 ` Stephen J. Turnbull
@ 2009-11-21 23:30                                   ` David Kastrup
  2009-11-22  1:27                                     ` Sebastian Rose
  0 siblings, 1 reply; 26+ messages in thread
From: David Kastrup @ 2009-11-21 23:30 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>  > > I meant pre-existing corruption [...]
>  > 
>  > That interpretation is not the business of the editor.
>
> Precisely my point.  The editor has *no* way to interpret at the point
> of encountering the invalid sequence, and therefore it should *stop*
> and ask the user what to do.  That doesn't mean it should throw away
> the data, but it sure does mean that it should not continue as though
> there is valid data in the buffer.
>
> Emacs is welcome to do that, but I am sure you will get bug reports
> about it.

Why would we get a bug report about Emacs saving a file changed only in
the locations that the user actually edited?

People might complain when Emacs does not recognize some encoding
properly, but they certainly will not demand that Emacs should stop
working altogether.

-- 
David Kastrup






* Re: Inadequate documentation of silly characters on screen.
  2009-11-21 23:30                                   ` David Kastrup
@ 2009-11-22  1:27                                     ` Sebastian Rose
  2009-11-22  8:06                                       ` David Kastrup
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Rose @ 2009-11-22  1:27 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:
> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>> David Kastrup writes:
>>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>
>>  > > I meant pre-existing corruption [...]
>>  > 
>>  > That interpretation is not the business of the editor.
>>
>> Precisely my point.  The editor has *no* way to interpret at the point
>> of encountering the invalid sequence, and therefore it should *stop*
>> and ask the user what to do.  That doesn't mean it should throw away
>> the data, but it sure does mean that it should not continue as though
>> there is valid data in the buffer.
>>
>> Emacs is welcome to do that, but I am sure you will get bug reports
>> about it.
>
> Why would we get a bug report about Emacs saving a file changed only in
> the locations that the user actually edited?
>
> People might complain when Emacs does not recognize some encoding
> properly, but they certainly will not demand that Emacs should stop
> working altogether.


People do indeed complain on the emacs-orgmode mailing list and I can
reproduce their problems.

You may read the details here:

  http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html

`M-x recode-file-name' doesn't work either.


I guess this is related?





Best wishes

      Sebastian





* Re: Inadequate documentation of silly characters on screen.
  2009-11-22  1:27                                     ` Sebastian Rose
@ 2009-11-22  8:06                                       ` David Kastrup
  2009-11-22 23:52                                         ` Sebastian Rose
  0 siblings, 1 reply; 26+ messages in thread
From: David Kastrup @ 2009-11-22  8:06 UTC (permalink / raw)
  To: emacs-devel

Sebastian Rose <sebastian_rose@gmx.de> writes:

> David Kastrup <dak@gnu.org> writes:
>> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>
>>> David Kastrup writes:
>>>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>>
>>>  > > I meant pre-existing corruption [...]
>>>  > 
>>>  > That interpretation is not the business of the editor.
>>>
>>> Precisely my point.  The editor has *no* way to interpret at the point
>>> of encountering the invalid sequence, and therefore it should *stop*
>>> and ask the user what to do.  That doesn't mean it should throw away
>>> the data, but it sure does mean that it should not continue as though
>>> there is valid data in the buffer.
>>>
>>> Emacs is welcome to do that, but I am sure you will get bug reports
>>> about it.
>>
>> Why would we get a bug report about Emacs saving a file changed only in
>> the locations that the user actually edited?
>>
>> People might complain when Emacs does not recognize some encoding
>> properly, but they certainly will not demand that Emacs should stop
>> working altogether.
>
>
> People do indeed complain on the emacs-orgmode mailing list and I can
> reproduce their problems.

What meaning of "indeed" are you using here?  This is a complaint about
Emacs _not_ faithfully replicating a byte pattern that it expects to be
in a particular encoding.

>   http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html
>
> I guess this is related?

It is related, but it bolsters rather than defeats my argument.

People don't _like_ Emacs to cop out altogether.

-- 
David Kastrup






* Re: Inadequate documentation of silly characters on screen.
  2009-11-22  8:06                                       ` David Kastrup
@ 2009-11-22 23:52                                         ` Sebastian Rose
  0 siblings, 0 replies; 26+ messages in thread
From: Sebastian Rose @ 2009-11-22 23:52 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Sebastian Rose <sebastian_rose@gmx.de> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>>
>>>> David Kastrup writes:
>>>>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>>>
>>>>  > > I meant pre-existing corruption [...]
>>>>  > 
>>>>  > That interpretation is not the business of the editor.
>>>>
>>>> Precisely my point.  The editor has *no* way to interpret at the point
>>>> of encountering the invalid sequence, and therefore it should *stop*
>>>> and ask the user what to do.  That doesn't mean it should throw away
>>>> the data, but it sure does mean that it should not continue as though
>>>> there is valid data in the buffer.
>>>>
>>>> Emacs is welcome to do that, but I am sure you will get bug reports
>>>> about it.
>>>
>>> Why would we get a bug report about Emacs saving a file changed only in
>>> the locations that the user actually edited?
>>>
>>> People might complain when Emacs does not recognize some encoding
>>> properly, but they certainly will not demand that Emacs should stop
>>> working altogether.
>>
>>
>> People do indeed complain on the emacs-orgmode mailing list and I can
>> reproduce their problems.
>
> What meaning of "indeed" are you using here?  This is a complaint about
> Emacs _not_ faithfully replicating a byte pattern that it expects to be
> in a particular encoding.
>
>>   http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html
>>
>> I guess this is related?
>
> It is related, but it bolsters rather than defeats my argument.
>
> People don't _like_ Emacs to cop out altogether.


Sorry David.  This was not meant as an argument.  It was more a question
because I was a bit unsure whether this was related (I did not follow the
thread that closely).

And in that case, the OP reported that Emacs did indeed refuse to work, in
that it would not save the file (which I cannot fully reproduce).

I didn't mean to hijack this thread, though.


Thanks for your answer anyway


  Sebastian





* Re: Inadequate documentation of silly characters on screen.
  2009-11-21  7:27                             ` Stephen J. Turnbull
@ 2009-11-23  1:58                               ` Stefan Monnier
  0 siblings, 0 replies; 26+ messages in thread
From: Stefan Monnier @ 2009-11-23  1:58 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Alan Mackenzie, emacs-devel

> So the problem is that because characters are integers and vice versa,
> there's no way for the user to let Emacs duck-type multibyte vs
> unibyte strings for him.  If he cares, he needs to check.  If he
> doesn't care, eventually Emacs will punish him for his lapse.

> I suppose subst-char-in-string is similarly useless for Alan's
> purpose, then?  What he really needs to use is something like

>     (replace-in-string str "\n" "ñ")

> right?

Pretty much yes.  When chars come within strings, the multibyteness of
the string indicates what the string elements are (chars or bytes), so
as long as you only manipulate strings, Emacs is able to DTRT.
As soon as you manipulate actual chars, the ambiguity between chars and
bytes for values [128..255] can bite you unless you're careful about how
you use them (e.g. about the multibyteness of the strings with which
you combine them).

That's where `aset' bites.  I hate `aset' on strings because it has
side-effects (obviously) and because strings aren't vectors so you can't
guarantee the expected efficiency, but neither is the source of the
problem here.  So indeed subst-char-in-string suffers similarly.
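
The ambiguity for values in [128..255] can be seen outside Emacs too.  The following Python sketch is an analogy only (Python keeps text and bytes in separate types, whereas in Emacs both are plain integers); it takes the value from the original report, octal 361, and treats it once as a character and once as a byte.

```python
# The value displayed as "\361" in the original report: octal 361 = 241 = 0xF1.
code = 0o361

# Read as a character, 241 is U+00F1, the letter 'ñ'.
as_char = chr(code)

# Read as a raw byte, 241 is just the single byte 0xF1.
as_byte = bytes([code])

# The character 'ñ' is two bytes in UTF-8, so it is not the same thing
# as the raw byte with the same integer value.
char_on_disk = as_char.encode("utf-8")
same_bytes = char_on_disk == as_byte

# Only under latin-1 do the byte and the character coincide.
byte_as_latin1 = as_byte.decode("latin-1")


def is_valid_utf8(data):
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# On its own, the byte 0xF1 is not valid UTF-8 at all.
lone_byte_valid = is_valid_utf8(as_byte)
```

The integer alone does not say which reading is meant; some surrounding context has to, which in Emacs is the multibyteness of the string the value ends up in.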


        Stefan





end of thread, other threads:[~2009-11-23  1:58 UTC | newest]

Thread overview: 26+ messages
2009-11-18  9:37 Inadequate documentation of silly characters on screen Alan Mackenzie
2009-11-18  9:40 ` Miles Bader
2009-11-18 10:15   ` Alan Mackenzie
2009-11-18 12:03     ` Jason Rumney
2009-11-18 15:02     ` Stefan Monnier
  -- strict thread matches above, loose matches on Subject: below --
2009-11-18 19:12 [acm@muc.de: Re: Inadequate documentation of silly characters on screen.] Alan Mackenzie
2009-11-19  1:27 ` Fwd: Re: Inadequate documentation of silly characters on screen Stefan Monnier
2009-11-19  8:20   ` Alan Mackenzie
2009-11-19  8:50     ` Miles Bader
2009-11-19 14:08     ` Fwd: " Stefan Monnier
2009-11-19 14:50       ` Jason Rumney
2009-11-19 15:27         ` Stefan Monnier
2009-11-19 23:12           ` Miles Bader
2009-11-20  2:16             ` Stefan Monnier
2009-11-20  3:37             ` Stephen J. Turnbull
2009-11-20  4:30               ` Stefan Monnier
2009-11-20  7:18                 ` Stephen J. Turnbull
2009-11-20 14:16                   ` Stefan Monnier
2009-11-21  4:13                     ` Stephen J. Turnbull
2009-11-21  5:24                       ` Stefan Monnier
2009-11-21  6:42                         ` Stephen J. Turnbull
2009-11-21  6:49                           ` Stefan Monnier
2009-11-21  7:27                             ` Stephen J. Turnbull
2009-11-23  1:58                               ` Stefan Monnier
2009-11-21 12:33                           ` David Kastrup
2009-11-21 13:55                             ` Stephen J. Turnbull
2009-11-21 14:36                               ` David Kastrup
2009-11-21 17:53                                 ` Stephen J. Turnbull
2009-11-21 23:30                                   ` David Kastrup
2009-11-22  1:27                                     ` Sebastian Rose
2009-11-22  8:06                                       ` David Kastrup
2009-11-22 23:52                                         ` Sebastian Rose

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
