Re: 23.0.60; end-of-sentence and non-breaking space


* Re: 23.0.60; end-of-sentence and non-breaking space
@ 2009-01-01  3:47 Chong Yidong
  2009-01-02  1:25 ` Richard M Stallman
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Chong Yidong @ 2009-01-01  3:47 UTC (permalink / raw)
  To: emacs-devel; +Cc: 1726, rms, 1727

From bug#1726 and bug#1727:

> forward-sentence does not treat non-breaking space as a space for
> purposes of sentence ends.
...
> When I type C-x = at a non-breaking space, it tells me that it
> has code 160, hex a0.  But when I execute (insert "\xa0"),
> it inserts something that displays as `\240' and for which C-x =
> displays this:
>
>    Char:   (4194208, #o17777640, #x3fffa0, raw-byte) point=198 of 211
>    (93%) column=5
>
> Is that a bug?  It seems quite confusing to me.

ISTR that there was an extended discussion about classifying
non-breaking spaces on this list a while back.  But I can't find it now.
Does anyone remember the details?





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-01  3:47 23.0.60; end-of-sentence and non-breaking space Chong Yidong
@ 2009-01-02  1:25 ` Richard M Stallman
  2009-01-02  2:38 ` bug#1727: " Drew Adams
  2009-01-02  4:11 ` Stefan Monnier
  2 siblings, 0 replies; 18+ messages in thread
From: Richard M Stallman @ 2009-01-02  1:25 UTC (permalink / raw)
  To: Chong Yidong; +Cc: 1726, 1727, emacs-devel

    > When I type C-x = at a non-breaking space, it tells me that it
    > has code 160, hex a0.  But when I execute (insert "\xa0"),
    > it inserts something that displays as `\240' and for which C-x =
    > displays this:

    >    Char:   (4194208, #o17777640, #x3fffa0, raw-byte) point=198 of 211
    >    (93%) column=5
    >
    > Is that a bug?  It seems quite confusing to me.

    ISTR that there was an extended discussion about classifying
    non-breaking spaces on this list a while back.  But I can't find it now.
    Does anyone remember the details?

I am not sure we are talking about the same question.
The issue I am raising is not one of classifying it;
it is that these two different character codes get used
and I don't see an explanation of what's going on.





* RE: bug#1727: 23.0.60; end-of-sentence and non-breaking space
  2009-01-01  3:47 23.0.60; end-of-sentence and non-breaking space Chong Yidong
  2009-01-02  1:25 ` Richard M Stallman
@ 2009-01-02  2:38 ` Drew Adams
  2009-01-02  4:11 ` Stefan Monnier
  2 siblings, 0 replies; 18+ messages in thread
From: Drew Adams @ 2009-01-02  2:38 UTC (permalink / raw)
  To: 'Chong Yidong', 1727, emacs-devel; +Cc: 1726, rms

> ISTR that there was an extended discussion about classifying
> non-breaking spaces on this list a while back.  But I can't 
> find it now. Does anyone remember the details?

Dunno if this is what you were thinking of, but there was this discussion about
treating (classifying) nonbreaking space as whitespace:

http://lists.gnu.org/archive/html/emacs-devel/2007-06/msg01089.html






* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-01  3:47 23.0.60; end-of-sentence and non-breaking space Chong Yidong
  2009-01-02  1:25 ` Richard M Stallman
  2009-01-02  2:38 ` bug#1727: " Drew Adams
@ 2009-01-02  4:11 ` Stefan Monnier
  2009-01-02 17:13   ` Richard M Stallman
  2 siblings, 1 reply; 18+ messages in thread
From: Stefan Monnier @ 2009-01-02  4:11 UTC (permalink / raw)
  To: Chong Yidong; +Cc: 1726, 1727, rms, emacs-devel

>> has code 160, hex a0.  But when I execute (insert "\xa0"),
>> it inserts something that displays as `\240' and for which C-x =
>> displays this:
>> 
>> Char:   (4194208, #o17777640, #x3fffa0, raw-byte) point=198 of 211
>> (93%) column=5
>> 
>> Is that a bug?  It seems quite confusing to me.

This raw-byte char is what used to be called an eight-bit-control (or
eight-bit-graphic depending on the actual value) char.

I.e. "\xa0" is treated as a string that contains the \xa0 byte (i.e. an
eight-bit-* (aka raw-byte) char) rather than the \xa0 char (a latin-1
non-breaking space).


        Stefan
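
The distinction can be checked from Lisp.  The following is only a
sketch, assuming a recent Emacs; the 23.0.60 pretest discussed here may
have differed in detail.  It uses only the built-in functions
multibyte-string-p, string-to-multibyte, and string-to-char.

```elisp
;; "\xa0" reads as a unibyte string holding the single byte #xa0.
(multibyte-string-p "\xa0")            ; nil: unibyte
;; Converting it to multibyte maps the byte to a raw-byte character.
(aref (string-to-multibyte "\xa0") 0)  ; 4194208, i.e. #x3fffa0
;; "\u00a0" reads as a multibyte string holding U+00A0 NO-BREAK SPACE.
(multibyte-string-p "\u00a0")          ; t: multibyte
(string-to-char "\u00a0")              ; 160, i.e. #xa0
```

The value 4194208 matches the raw-byte code that C-x = reported in the
original bug report.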





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-02  4:11 ` Stefan Monnier
@ 2009-01-02 17:13   ` Richard M Stallman
  2009-01-03  3:06     ` Stefan Monnier
  0 siblings, 1 reply; 18+ messages in thread
From: Richard M Stallman @ 2009-01-02 17:13 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cyd, 1726, emacs-devel

    This raw-byte char is what used to be called an eight-bit-control (or
    eight-bit-graphic depending on the actual value) char.

    I.e. "\xa0" is treated as a string that contains the \xa0 byte (i.e. an
    eight-bit-* (aka raw-byte) char) rather than the \xa0 char (a latin-1
    non-breaking space).

1. Is that the right thing for \xa0 in a string to mean?
Or should it mean the character with code xa0?

2. I find it hard to think about that question since I don't see any
documentation explaining how this ought to work.  That documentation
is essential.





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-02 17:13   ` Richard M Stallman
@ 2009-01-03  3:06     ` Stefan Monnier
  2009-01-03  9:54       ` Eli Zaretskii
  2009-01-03 15:21       ` Richard M Stallman
  0 siblings, 2 replies; 18+ messages in thread
From: Stefan Monnier @ 2009-01-03  3:06 UTC (permalink / raw)
  To: rms; +Cc: cyd, 1726, emacs-devel

>     This raw-byte char is what used to be called an eight-bit-control (or
>     eight-bit-graphic depending on the actual value) char.

>     I.e. "\xa0" is treated as a string that contains the \xa0 byte (i.e. an
>     eight-bit-* (aka raw-byte) char) rather than the \xa0 char (a latin-1
>     non-breaking space).

> 1. Is that the right thing for \xa0 in a string to mean?
> Or should it mean the character with code xa0?

> 2. I find it hard to think about that question since I don't see any
> documentation explaining how this ought to work.  That documentation
> is essential.

Good point.  Especially because I think this changed from Emacs-20 to
Emacs-21, and I think it also changed now from Emacs-22 to Emacs-23.

IIUC if you want the character with code #xa0, then using \u00a0 would
seem like the most unambiguous option (I notice that "\ua0" gives
a weird error "Non-hex digit used for Unicode escape").

Not sure what \NNN or \xMM should do.


        Stefan





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-03  3:06     ` Stefan Monnier
@ 2009-01-03  9:54       ` Eli Zaretskii
  2009-01-03 15:21       ` Richard M Stallman
  1 sibling, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2009-01-03  9:54 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cyd, emacs-devel, rms, 1726

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Fri, 02 Jan 2009 22:06:21 -0500
> Cc: cyd@stupidchicken.com, 1726@emacsbugs.donarmstrong.com, emacs-devel@gnu.org
> 
> Good point.  Especially because I think this changed from Emacs-20 to
> Emacs-21, and I think it also changed now from Emacs-22 to Emacs-23.

I think the change in Emacs 23 is OK, but it needs to be consistent
between characters and strings.

> IIUC if you want the character with code #xa0, then using \u00a0 would
> seem like the most unambiguous option

Agreed.  Using \uNNNN is an unambiguous way of saying you want a Unicode
character whose codepoint is NNNN in hex.

> Not sure what \NNN or \xMM should do.

I think they should insert a raw byte with that code, and I think they
should do that both in characters and in strings, so that the
inconsistent behavior reported by Richard will become consistent.
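
The difference between characters and strings can be seen with a sketch
like the following.  It assumes a recent Emacs and a multibyte temporary
buffer (the default); the pretest's behavior may have differed.

```elisp
;; As a character literal, ?\xa0 is the character with code 160.
(with-temp-buffer
  (insert ?\xa0)
  (char-before))   ; 160

;; As a string literal, "\xa0" holds a raw byte; inserted into a
;; multibyte buffer it becomes the raw-byte character #x3fffa0.
(with-temp-buffer
  (insert "\xa0")
  (char-before))   ; 4194208
```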


* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-03  3:06     ` Stefan Monnier
  2009-01-03  9:54       ` Eli Zaretskii
@ 2009-01-03 15:21       ` Richard M Stallman
  2009-01-03 16:44         ` Eli Zaretskii
  1 sibling, 1 reply; 18+ messages in thread
From: Richard M Stallman @ 2009-01-03 15:21 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cyd, 1726, emacs-devel

    IIUC if you want the character with code #xa0, then using \u00a0 would
    seem like the most unambiguous option (I notice that "\ua0" gives
    a weird error "Non-hex digit used for Unicode escape").

I expected \xa0 to give me that character.  It still seems strange
that it would do anything else.

When I read the documentation of \u, I thought it meant "unicode" as
opposed to "Emacs's internal code".  Since I knew that Emacs now
follows unicode for these characters, I saw no reason to consider
using \u.


* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-03 15:21       ` Richard M Stallman
@ 2009-01-03 16:44         ` Eli Zaretskii
  2009-01-04  2:16           ` Richard M Stallman
  2009-01-05  6:37           ` Kenichi Handa
  0 siblings, 2 replies; 18+ messages in thread
From: Eli Zaretskii @ 2009-01-03 16:44 UTC (permalink / raw)
  To: rms; +Cc: cyd, emacs-devel, monnier, 1726

> From: Richard M Stallman <rms@gnu.org>
> Date: Sat, 03 Jan 2009 10:21:58 -0500
> Cc: cyd@stupidchicken.com, 1726@emacsbugs.donarmstrong.com, emacs-devel@gnu.org
> 
>     IIUC if you want the character with code #xa0, then using \u00a0 would
>     seem like the most unambiguous option (I notice that "\ua0" gives
>     a weird error "Non-hex digit used for Unicode escape").
> 
> I expected \xa0 to give me that character.  It still seems strange
> that it would do anything else.

We need some way of inserting raw 8-bit bytes, because otherwise code
that encodes and decodes text in Lisp will not work.  For inserting
characters, we have the \u alternative; but I don't think there's an
alternative for raw bytes other than \xNN.





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-03 16:44         ` Eli Zaretskii
@ 2009-01-04  2:16           ` Richard M Stallman
  2009-01-04  4:18             ` Eli Zaretskii
                               ` (2 more replies)
  2009-01-05  6:37           ` Kenichi Handa
  1 sibling, 3 replies; 18+ messages in thread
From: Richard M Stallman @ 2009-01-04  2:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cyd, 1726, monnier, emacs-devel

    We need some way of inserting raw 8-bit bytes, because otherwise code
    that encodes and decodes text in Lisp will not work.  For inserting
    characters, we have the \u alternative; but I don't think there's an
    alternative for raw bytes other than \xNN.

Maybe that is a valid reason for the current behavior, but that
doesn't alter the need for the manual to document the behavior.

Meanwhile, the Chinese and Chinese-derived character codes
do not follow Unicode.  So you can't enter them with \u.
What is the way to enter them?





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-04  2:16           ` Richard M Stallman
@ 2009-01-04  4:18             ` Eli Zaretskii
  2009-01-04 21:42               ` Richard M Stallman
  2009-01-04  4:29             ` bug#1726: " Jason Rumney
  2009-01-05  7:11             ` Kenichi Handa
  2 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2009-01-04  4:18 UTC (permalink / raw)
  To: rms; +Cc: cyd, 1726, monnier, emacs-devel

> From: Richard M Stallman <rms@gnu.org>
> CC: cyd@stupidchicken.com, emacs-devel@gnu.org,
> 	monnier@iro.umontreal.ca, 1726@emacsbugs.donarmstrong.com
> Date: Sat, 03 Jan 2009 21:16:21 -0500
> 
>     We need some way of inserting raw 8-bit bytes, because otherwise code
>     that encodes and decodes text in Lisp will not work.  For inserting
>     characters, we have the \u alternative; but I don't think there's an
>     alternative for raw bytes other than \xNN.
> 
> Maybe that is a valid reason for the current behavior, but that
> doesn't alter the need for the manual to document the behavior.

That was an attempt to explain the reasons, not a claim that they don't
need to be documented.

> Meanwhile, the Chinese and Chinese-derived character codes
> do not follow Unicode.  So you can't enter them with \u.
> What is the way to enter them?

The problem at hand exists only for codes that are less than FF hex.





* Re: bug#1726: 23.0.60; end-of-sentence and non-breaking space
  2009-01-04  2:16           ` Richard M Stallman
  2009-01-04  4:18             ` Eli Zaretskii
@ 2009-01-04  4:29             ` Jason Rumney
  2009-01-04 16:45               ` Eli Zaretskii
  2009-01-05  7:11             ` Kenichi Handa
  2 siblings, 1 reply; 18+ messages in thread
From: Jason Rumney @ 2009-01-04  4:29 UTC (permalink / raw)
  To: rms, 1726; +Cc: Eli Zaretskii, cyd, emacs-devel

Richard M Stallman wrote:
> Meanwhile, the Chinese and Chinese-derived character codes
> do not follow Unicode.

They do in Emacs 23, though I think if you enter \x1234, it will be 
treated the same as \u1234, as characters with more than 8 bits are 
clearly not eight-bit raw bytes.






* Re: bug#1726: 23.0.60; end-of-sentence and non-breaking space
  2009-01-04  4:29             ` bug#1726: " Jason Rumney
@ 2009-01-04 16:45               ` Eli Zaretskii
  0 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2009-01-04 16:45 UTC (permalink / raw)
  To: Jason Rumney; +Cc: cyd, 1726, rms, emacs-devel

> Date: Sun, 04 Jan 2009 12:29:05 +0800
> From: Jason Rumney <jasonr@gnu.org>
> Cc: Eli Zaretskii <eliz@gnu.org>, cyd@stupidchicken.com, emacs-devel@gnu.org
> 
> Richard M Stallman wrote:
> > Meanwhile, the Chinese and Chinese-derived character codes
> > do not follow Unicode.
> 
> They do in Emacs 23

Not all of them, I think.





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-04  4:18             ` Eli Zaretskii
@ 2009-01-04 21:42               ` Richard M Stallman
  0 siblings, 0 replies; 18+ messages in thread
From: Richard M Stallman @ 2009-01-04 21:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cyd, emacs-devel, 1726, monnier

    > Meanwhile, the Chinese and Chinese-derived character codes
    > do not follow Unicode.  So you can't enter them with \u.
    > What is the way to enter them?

    The problem at hand exists only for codes that are less than FF hex.

Maybe, but isn't there a similar problem for Chinese-derived
characters?  How does one specify these codes in a string constant?
Shouldn't there be some way?

Maybe there is never a need to do it; maybe we don't need to add a
feature for it.  But if we don't, we should document that there is
currently no way.


* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-03 16:44         ` Eli Zaretskii
  2009-01-04  2:16           ` Richard M Stallman
@ 2009-01-05  6:37           ` Kenichi Handa
  1 sibling, 0 replies; 18+ messages in thread
From: Kenichi Handa @ 2009-01-05  6:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cyd, 1726, rms, monnier, emacs-devel

In article <utz8gmj9t.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > From: Richard M Stallman <rms@gnu.org>
> > Date: Sat, 03 Jan 2009 10:21:58 -0500
> > Cc: cyd@stupidchicken.com, 1726@emacsbugs.donarmstrong.com, emacs-devel@gnu.org
> > 
> >     IIUC if you want the character with code #xa0, then using \u00a0 would
> >     seem like the most unambiguous option (I notice that "\ua0" gives
> >     a weird error "Non-hex digit used for Unicode escape").
> > 
> > I expected \xa0 to give me that character.  It still seems strange
> > that it would do anything else.

> We need some way of inserting raw 8-bit bytes, because otherwise code
> that encodes and decodes text in Lisp will not work.  For inserting
> characters, we have the \u alternative; but I don't think there's an
> alternative for raw bytes other than \xNN.

I modified read_escape to treat "\xXX" as a raw-byte code
but to treat "\xXXX.." as the character code U+XXX...  As far as
I remember, this was done to keep backward compatibility.

And we do have an alternative for raw bytes: the octal form,
for example "\240".

---
Kenichi Handa
handa@m17n.org
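
A sketch of the three escape forms described above, assuming a recent
Emacs (results in the 23.0.60 pretest may have differed):

```elisp
;; Two-digit hex escape #xa0: read as a raw byte, so the string is unibyte.
(multibyte-string-p "\xa0")    ; nil
;; Longer hex escape: read as a character code, so the string is multibyte.
(multibyte-string-p "\x3b1")   ; t
(string-to-char "\x3b1")       ; 945, i.e. U+03B1 GREEK SMALL LETTER ALPHA
;; Octal form: another way to write the same raw byte.
(equal "\240" "\xa0")          ; t
```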





* Re: 23.0.60; end-of-sentence and non-breaking space
  2009-01-04  2:16           ` Richard M Stallman
  2009-01-04  4:18             ` Eli Zaretskii
  2009-01-04  4:29             ` bug#1726: " Jason Rumney
@ 2009-01-05  7:11             ` Kenichi Handa
  2009-01-06  0:01               ` bug#1726: " Richard M Stallman
  2 siblings, 1 reply; 18+ messages in thread
From: Kenichi Handa @ 2009-01-05  7:11 UTC (permalink / raw)
  To: rms; +Cc: eliz, emacs-devel, cyd, monnier, 1726

In article <E1LJIXN-0008Vg-Pe@fencepost.gnu.org>, Richard M Stallman <rms@gnu.org> writes:

>     We need some way of inserting raw 8-bit bytes, because otherwise code
>     that encodes and decodes text in Lisp will not work.  For inserting
>     characters, we have the \u alternative; but I don't think there's an
>     alternative for raw bytes other than \xNN.

> Maybe that is a valid reason for the current behavior, but that
> doesn't alter the need for the manual to document the behavior.

> Meanwhile, the Chinese and Chinese-derived character codes
> do not follow Unicode.  So you can't enter them with \u.
> What is the way to enter them?

Most Chinese and Chinese-derived character codes are
unified into the Unicode area.  Only a few codes can't be
unified with Unicode, and those are decoded into the character
space above #x110000.  But in that sense, Chinese and
Chinese-derived character codes are not special.  There
exist several non-Chinese character sets (e.g. Tibetan)
containing characters that don't exist in Unicode, and
they are decoded into the character space above #x110000 too.

But all of them can be accessed by "\U00XXXXXX".

---
Kenichi Handa
handa@m17n.org


* Re: bug#1726: 23.0.60; end-of-sentence and non-breaking space
  2009-01-05  7:11             ` Kenichi Handa
@ 2009-01-06  0:01               ` Richard M Stallman
  2009-01-06  4:08                 ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Richard M Stallman @ 2009-01-06  0:01 UTC (permalink / raw)
  To: Kenichi Handa, 1726
  Cc: cyd, bug-submit-list, bug-gnu-emacs, 1726, emacs-devel

      There
    exist several non-Chinese character sets (e.g. Tibetan)
    containing characters that don't exist in Unicode, and
    they are decoded into the character space above #x110000 too.

    But all of them can be accessed by "\U00XXXXXX".

Can you please document this (and the rest of what we have discussed
in this thread)?





* Re: bug#1726: 23.0.60; end-of-sentence and non-breaking space
  2009-01-06  0:01               ` bug#1726: " Richard M Stallman
@ 2009-01-06  4:08                 ` Eli Zaretskii
  0 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2009-01-06  4:08 UTC (permalink / raw)
  To: rms; +Cc: 1726, emacs-devel, bug-gnu-emacs, cyd, handa

> From: Richard M Stallman <rms@gnu.org>
> Date: Mon, 05 Jan 2009 19:01:02 -0500
> Cc: cyd@stupidchicken.com, bug-submit-list@donarmstrong.com,
> 	bug-gnu-emacs@gnu.org, 1726@emacsbugs.donarmstrong.com, emacs-devel@gnu.org
> 
>       There
>     exist several non-Chinese character sets (e.g. Tibetan)
>     containing characters that don't exist in Unicode, and
>     they are decoded into the character space above #x110000 too.
> 
>     But all of them can be accessed by "\U00XXXXXX".
> 
> Can you please document this (and the rest of what we have discussed
> in this thread)?

You already asked me to do this, and it's on my TODO.




