unofficial mirror of emacs-devel@gnu.org 
* [PATCH] Unicode Lisp reader escapes
@ 2006-04-29 15:35 Aidan Kehoe
  2006-04-29 23:26 ` Stefan Monnier
                   ` (2 more replies)
  0 siblings, 3 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-04-29 15:35 UTC (permalink / raw)



I realise you are all focused on the release with an intensity that would
scare small children, were any of them let near, but if any of you have a
minute free, I’d love to hear philosophical and technical objections to the
below.

The background is that it hasn’t ever been possible to consistently specify
a non-Latin-1 character by means of a general escape sequence, since what
character a given integer represents varies from release to release and even
from invocation to invocation. The below allows you to specify a backslash
escape with exactly four or exactly eight hexadecimal digits in a character
or string, and have the editor interpret them as the corresponding Unicode
code point. So, ?\u20AC would be interpreted as the Euro sign, "\u0448" as
Cyrillic sha, ?\U0001D0ED as Byzantine musical symbol arktiko ke. 
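
To make the rule concrete, here is a minimal standalone C sketch. It is an
illustration only, not the patch and not Emacs reader code; the names
parse_fixed_hex and decode_unicode_escape are made up for this example. It
takes the text that follows the backslash and insists on exactly four hex
digits after u, or exactly eight after U:

```c
#include <assert.h>
#include <stddef.h>

/* Parse exactly NDIGITS hex digits from S.  Returns the value, or -1 if
   a non-hex character (including the terminating NUL) is seen first.  */
long long parse_fixed_hex(const char *s, int ndigits)
{
    long long value = 0;
    assert(s != NULL);
    for (int k = 0; k < ndigits; k++) {
        char c = s[k];
        int d;
        /* Explicit ranges, so the result does not depend on the locale.  */
        if (c >= '0' && c <= '9')      d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;
        value = (value << 4) + d;
    }
    return value;
}

/* Decode the text after a backslash: 'u' plus four digits, or 'U' plus
   eight digits.  Returns the code point, or -1 for anything else.  */
long long decode_unicode_escape(const char *s)
{
    assert(s != NULL);
    if (s[0] == 'u') return parse_fixed_hex(s + 1, 4);
    if (s[0] == 'U') return parse_fixed_hex(s + 1, 8);
    return -1;
}
```

Under these assumptions, decode_unicode_escape("u20AC") yields 0x20AC and
decode_unicode_escape("U0001D0ED") yields 0x1D0ED, while a form with only
seven digits after U is rejected with -1.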

Why not wait until the Unicode branch is merged? Well, that won’t solve the
problem either; people naturally want their code to be as compatible as
possible, so they will avoid the assumption that the integer-to-character
mapping is Unicode compatible as long as there are editors in the wild for
which that is not true. If this is integrated a good bit before the Unicode
branch is (which is what I would like), it will mean people can use this
syntax (which most modern programming languages have already, and which
people use) and be sure it’s compatible years before what would otherwise be
the case. 

lispref/ChangeLog addition:

2006-04-29  Aidan Kehoe  <kehoea@parhasard.net>

	* objects.texi (Character Type):
	Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF
	specifies Unicode characters U+ABCD and U+ABCDEF respectively. 
	

src/ChangeLog addition:

2006-04-29  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape):
	Provide a Unicode character escape syntax; \u followed by exactly
	four or \U followed by exactly eight hex digits in a character
	constant or string is read as a Unicode character with that code point. 


GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi

Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi	6 Feb 2006 11:55:10 -0000	1.51
+++ lispref/objects.texi	29 Apr 2006 15:15:09 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points.  @samp{?\uABCD} will give you an Emacs character that maps to
+the code point @samp{U+ABCD} in Unicode-based representations (UTF-8
+text files, Unicode-oriented fonts, etc.) There is a slightly different
+syntax for specifying characters with code points above @samp{#xFFFF};
+@samp{\U00ABCDEF} will give you an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	29 Apr 2006 15:15:10 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight.  Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
 	return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: exactly eight hex chars.  */
+      unicode_hex_count = 8;	/* Fall through to the \u case.  */
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+	 not arbitrarily in the source code as in some other languages. */
+      {
+	int i = 0;
+	int count = 0;
+	Lisp_Object lisp_char;
+	while (++count <= unicode_hex_count)
+	  {
+	    c = READCHAR;
+	    /* isdigit(), isalpha() may be locale-specific, which we don't
+	       want. */
+	    if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+	    else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+	    else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+	    else
+	      {
+		error ("Non-hex digit used for Unicode escape");
+		break;
+	      }
+	  }
+
+	lisp_char = call2(intern("decode-char"), intern("ucs"),
+			  make_number(i));
+
+	if (EQ(Qnil, lisp_char))
+	  {
+	    /* This is ugly and horrible and trashes the user's data. */
+	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, 
+				       34 + 128, 46 + 128));
+	    return i;
+	  }
+	else
+	  {
+	    return XFASTINT (lisp_char);
+	  }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c = read_multibyte (c, readcharfun);

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-29 15:35 [PATCH] Unicode Lisp reader escapes Aidan Kehoe
@ 2006-04-29 23:26 ` Stefan Monnier
  2006-04-30  8:26   ` Aidan Kehoe
  2006-04-30  3:04 ` Richard Stallman
  2006-05-02  6:43 ` Kenichi Handa
  2 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-04-29 23:26 UTC (permalink / raw)
  Cc: emacs-devel

> The background is that it hasn’t ever been possible to consistently
> specify a non-Latin-1 character by means of a general escape sequence,
> since what character a given integer represents varies from release to
> release and even from invocation to invocation.

There are two known workarounds:
- encode your file in utf-8.
- use an elisp expression like (decode-char 'ucs <foo>).
Neither of them is quite what you want, but I've found them good enough for
the cases I've had to deal with.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-29 15:35 [PATCH] Unicode Lisp reader escapes Aidan Kehoe
  2006-04-29 23:26 ` Stefan Monnier
@ 2006-04-30  3:04 ` Richard Stallman
  2006-04-30  8:14   ` Aidan Kehoe
  2006-05-02  6:43 ` Kenichi Handa
  2 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-04-30  3:04 UTC (permalink / raw)
  Cc: emacs-devel

    +  Emacs provides a syntax for specifying characters by their Unicode code
    +points.  @samp{?\uABCD} 

These are Lisp expressions, right?  So they should use @code, not @samp.

			     will give you an Emacs character that maps to

Please stick to present tense: change "will give you an" to "represents the".

    +text files, Unicode-oriented fonts, etc.) There is a slightly different

You need a period at the end of that sentence.  The period inside the
parentheses does not count for this.

    +syntax for specifying characters with code points above @samp{#xFFFF};
    +@samp{\U00ABCDEF} will give you an Emacs character that maps to the code

What is the reason for needing both \u and \U, and the difference?
Why not use a syntax like that of \x?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30  3:04 ` Richard Stallman
@ 2006-04-30  8:14   ` Aidan Kehoe
  2006-04-30 20:53     ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-04-30  8:14 UTC (permalink / raw)
  Cc: emacs-devel

 
 On the twenty-ninth of April, Richard Stallman wrote:
 
 > [Comments on the text taken into account in the revised patch below.] 
 > 
 > [...] 
 > 
 > What is the reason for needing both \u and \U, and the difference? Why 
 > not use a syntax like that of \x? 
 
They are both fixed-length expressions, which is good, because people get 
into the habit of typing "\u0123As I walked out one evening" instead of the 
more disastrous "\u123As I walked out one evening". We could provide the 
same functionality with just the \U00ABCDEF syntax, but since the code 
points above #xFFFF are very rarely used, the need to provide the initial 
four zeroes would be very annoying for the majority of the time.  
 
The reason for not using variable-length constants, as \x does, is exactly
the "\u0123As I" versus "\u123As I walked out" issue above.
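
The difference can be seen in a small standalone C sketch. It is only an
illustration under assumed names (read_greedy_hex and read_fixed_hex are
hypothetical and are not part of lread.c). Both functions report how many
characters they consumed, which is where the two rules diverge:

```c
#include <assert.h>
#include <stddef.h>

/* Value of hex digit C, or -1 if C is not a hex digit.  */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Variable-length rule: consume hex digits until the first non-hex
   character.  Stores the value in *OUT, returns the characters used.  */
size_t read_greedy_hex(const char *s, unsigned long *out)
{
    size_t n = 0;
    unsigned long v = 0;
    assert(s != NULL && out != NULL);
    while (hex_digit(s[n]) >= 0)
        v = (v << 4) + (unsigned long) hex_digit(s[n++]);
    *out = v;
    return n;
}

/* Fixed-length rule: consume exactly WIDTH hex digits.  Returns WIDTH
   on success, or 0 (leaving *OUT untouched) if a digit is missing.  */
size_t read_fixed_hex(const char *s, size_t width, unsigned long *out)
{
    unsigned long v = 0;
    assert(s != NULL && out != NULL);
    for (size_t k = 0; k < width; k++) {
        int d = hex_digit(s[k]);
        if (d < 0)
            return 0;
        v = (v << 4) + (unsigned long) d;
    }
    *out = v;
    return width;
}
```

Given the digits-and-text "123As I walked", the greedy reader swallows the
"A" and produces 0x123A, leaving "s I walked". The fixed-width reader given
"0123As I walked" stops after four digits with 0x0123 and leaves the "A"
alone.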

lispref/ChangeLog addition:

2006-04-30  Aidan Kehoe  <kehoea@parhasard.net>

	* objects.texi (Character Type):
        Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF 
        specifies Unicode characters U+ABCD and U+ABCDEF respectively.  
	

src/ChangeLog addition:

2006-04-30  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape):
        Provide a Unicode character escape syntax; \u followed by exactly 
        four or \U followed by exactly eight hex digits in a character 
        constant or string is read as a Unicode character with that code point.  
	

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi

Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi	6 Feb 2006 11:55:10 -0000	1.51
+++ lispref/objects.texi	30 Apr 2006 08:08:05 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points.  @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.).  There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	30 Apr 2006 08:08:07 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight.  Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
 	return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: exactly eight hex chars.  */
+      unicode_hex_count = 8;	/* Fall through to the \u case.  */
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+	 not arbitrarily in the source code as in some other languages. */
+      {
+	int i = 0;
+	int count = 0;
+	Lisp_Object lisp_char;
+	while (++count <= unicode_hex_count)
+	  {
+	    c = READCHAR;
+	    /* isdigit(), isalpha() may be locale-specific, which we don't
+	       want. */
+	    if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+	    else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+	    else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+	    else
+	      {
+		error ("Non-hex digit used for Unicode escape");
+		break;
+	      }
+	  }
+
+	lisp_char = call2(intern("decode-char"), intern("ucs"),
+			  make_number(i));
+
+	if (EQ(Qnil, lisp_char))
+	  {
+	    /* This is ugly and horrible and trashes the user's data. */
+	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, 
+				       34 + 128, 46 + 128));
+	    return i;
+	  }
+	else
+	  {
+	    return XFASTINT (lisp_char);
+	  }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c = read_multibyte (c, readcharfun);

-- 
In the beginning God created the heavens and the earth. And God was a 
bug-eyed, hexagonal smurf with a head of electrified hair; and God said: 
“Si, mi chiamano Mimi...”

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-29 23:26 ` Stefan Monnier
@ 2006-04-30  8:26   ` Aidan Kehoe
  0 siblings, 0 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-04-30  8:26 UTC (permalink / raw)
  Cc: emacs-devel


 On the twenty-ninth of April, Stefan Monnier wrote:

 > There are two known workarounds:
 > - encode your file in utf-8.
 > - use an elisp expression like (decode-char 'ucs <foo>).
 > Neither of them is quite what you want, but I've found them good enough for
 > the cases I've had to deal with.

Sure, and encoding your file as ISO-8859-1 and using (int-to-char #xff)
would be possible were the escapes for Latin-1 not available. I will
certainly be using both your approaches for years to come, since making my
code pointlessly incompatible with existing editors is not a good idea. The
\u syntax is sugar, but it is used quite a bit in those languages that have
it, which seems to testify to its usefulness.

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30  8:14   ` Aidan Kehoe
@ 2006-04-30 20:53     ` Richard Stallman
  2006-04-30 21:04       ` Andreas Schwab
                         ` (2 more replies)
  0 siblings, 3 replies; 202+ messages in thread
From: Richard Stallman @ 2006-04-30 20:53 UTC (permalink / raw)
  Cc: emacs-devel

    They are both fixed-length expressions, which is good, because people get 
    into the habit of typing "\u0123As I walked out one evening" instead of the 
    more disastrous "\u123As I walked out one evening".

I see, you are talking about using them in strings.
Still, I don't like having both \u and \U--it is ugly.

I think it would be better to put an explicit terminator into
the construct.  Perhaps #.  So you would write "\u123#As I walked"

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 20:53     ` Richard Stallman
@ 2006-04-30 21:04       ` Andreas Schwab
  2006-04-30 21:57         ` Aidan Kehoe
  2006-05-01 18:32         ` Richard Stallman
  2006-04-30 21:56       ` Aidan Kehoe
  2006-05-05 23:15       ` Juri Linkov
  2 siblings, 2 replies; 202+ messages in thread
From: Andreas Schwab @ 2006-04-30 21:04 UTC (permalink / raw)
  Cc: Aidan Kehoe, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> I think it would be better to put an explicit terminator into
> the construct.  Perhaps #.  So you would write "\u123#As I walked"

There is already the possibility to use `\ ' as a terminator.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
       [not found] <E1FaJ0b-0008G8-8u@monty-python.gnu.org>
@ 2006-04-30 21:16 ` Jonathan Yavner
  2006-05-01 18:32   ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Jonathan Yavner @ 2006-04-30 21:16 UTC (permalink / raw)


Richard wrote:
>I don't like having both \u and \U--it is ugly.

> I think it would be better to put an explicit terminator into
> the construct.  Perhaps #.  So you would write "\u123#As I walked"

That would be nonstandard.  Standards are better, even if ugly.

http://ftp.python.org/doc/ref/strings.html
http://www.gnu.org/software/coreutils/manual/html_chapter/coreutils_15.html

et cetera, et cetera

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 20:53     ` Richard Stallman
  2006-04-30 21:04       ` Andreas Schwab
@ 2006-04-30 21:56       ` Aidan Kehoe
  2006-05-01  1:44         ` Miles Bader
  2006-05-05 23:15       ` Juri Linkov
  2 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-04-30 21:56 UTC (permalink / raw)
  Cc: emacs-devel


 On the thirtieth of April, Richard Stallman wrote:

 >     They are both fixed-length expressions, which is good, because people
 >     get into the habit of typing "\u0123As I walked out one evening"
 >     instead of the more disastrous "\u123As I walked out one evening".
 > 
 > I see, you are talking about using them in strings.

Indeed, as I mentioned in the documentation.

 > Still, I don't like having both \u and \U--it is ugly.
 > 
 > I think it would be better to put an explicit terminator into
 > the construct.  Perhaps #.  So you would write "\u123#As I walked"

I find _that_ distinctly ugly, but more of a problem with it than the
aesthetics is that it’s unfamiliar to everyone, Lisp people and Java people
alike.

Another alternative to providing both \u and \U is to do what Java does;
only allow \u, and require code points above #xFFFF to use surrogate pairs. 
So "\uDA6F\uDCDE" would be how one would encode U+ABCDE. But I think that’s
very inconvenient. 
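
For reference, the surrogate arithmetic behind that example can be sketched
in standalone C. This is an illustration only; to_surrogates and
from_surrogates are names invented here, and the sketch assumes a code point
in the range U+10000 to U+10FFFF:

```c
#include <assert.h>
#include <stddef.h>

/* Split code point CP (U+10000..U+10FFFF) into a UTF-16 surrogate pair:
   subtract 0x10000, then put the top ten bits of the remainder on
   0xD800 and the bottom ten bits on 0xDC00.  */
void to_surrogates(unsigned long cp, unsigned *high, unsigned *low)
{
    unsigned long v;
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    assert(high != NULL && low != NULL);
    v = cp - 0x10000UL;
    *high = 0xD800u + (unsigned) (v >> 10);
    *low  = 0xDC00u + (unsigned) (v & 0x3FFu);
}

/* Inverse operation: rebuild the code point from a surrogate pair.  */
unsigned long from_surrogates(unsigned high, unsigned low)
{
    assert(high >= 0xD800u && high <= 0xDBFFu);
    assert(low >= 0xDC00u && low <= 0xDFFFu);
    return 0x10000UL
           + ((unsigned long) (high - 0xD800u) << 10)
           + (unsigned long) (low - 0xDC00u);
}
```

For U+ABCDE this gives 0xDA6F and 0xDCDE, matching the pair quoted above;
every character outside the BMP would need two escapes written this way.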

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 21:04       ` Andreas Schwab
@ 2006-04-30 21:57         ` Aidan Kehoe
  2006-04-30 22:14           ` Andreas Schwab
  2006-05-01 18:32         ` Richard Stallman
  1 sibling, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-04-30 21:57 UTC (permalink / raw)
  Cc: emacs-devel


 On the thirtieth of April, Andreas Schwab wrote:

 > Richard Stallman <rms@gnu.org> writes:
 > 
 > > I think it would be better to put an explicit terminator into
 > > the construct.  Perhaps #.  So you would write "\u123#As I walked"
 > 
 > There is already the possibility to use `\ ' as a terminator.

I don’t understand what you mean there--I imagine Richard meant a terminator
that would not be interpreted as part of the succeeding string, something
not the case for the space character.

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 21:57         ` Aidan Kehoe
@ 2006-04-30 22:14           ` Andreas Schwab
  0 siblings, 0 replies; 202+ messages in thread
From: Andreas Schwab @ 2006-04-30 22:14 UTC (permalink / raw)
  Cc: emacs-devel

Aidan Kehoe <kehoea@parhasard.net> writes:

>  On the thirtieth of April, Andreas Schwab wrote:
>
>  > Richard Stallman <rms@gnu.org> writes:
>  > 
>  > > I think it would be better to put an explicit terminator into
>  > > the construct.  Perhaps #.  So you would write "\u123#As I walked"
>  > 
>  > There is already the possibility to use `\ ' as a terminator.
>
> I don’t understand what you mean there--I imagine Richard meant a terminator
> that would not be interpreted as part of the succeeding string, something
> not the case for the space character.

`\ ' is ignored in a string (*Note (elisp)Non-ASCII in Strings::)

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 21:56       ` Aidan Kehoe
@ 2006-05-01  1:44         ` Miles Bader
  2006-05-01  3:12           ` Stefan Monnier
  0 siblings, 1 reply; 202+ messages in thread
From: Miles Bader @ 2006-05-01  1:44 UTC (permalink / raw)
  Cc: rms, emacs-devel

Aidan Kehoe <kehoea@parhasard.net> writes:
> I find _that_ distinctly ugly, but more of a problem with it than the
> aesthetics is that it’s unfamiliar to everyone, Lisp people and Java people
> alike.

How about supporting both the "standard" syntax ("\u0123")
and a flexible-length syntax like "\u{123}" (I seem to recall
a syntax like this being discussed on this list)?

-Miles
-- 
Freedom's just another word, for nothing left to lose   --Janis Joplin

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01  1:44         ` Miles Bader
@ 2006-05-01  3:12           ` Stefan Monnier
  2006-05-01  3:41             ` Miles Bader
  0 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-01  3:12 UTC (permalink / raw)
  Cc: Aidan Kehoe, rms, emacs-devel

>> I find _that_ distinctly ugly, but more of a problem with it than the
>> aesthetics is that it’s unfamiliar to everyone, Lisp people and Java people
>> alike.

> How about supporting both the "standard" syntax ("\u0123")
> and a flexible-length syntax like "\u{123}" (I seem to recall
> a syntax like this being discussed on this list)?

Currently the syntax for \xNNNN hexadecimal escapes is that it ends whenever
reaching a non-hexadecimal char, and if you need your \xNNN escape to be
followed by a hexadecimal char, then you have to separate the two with "\ "
(and the Lisp printer does that automatically, of course).
Is there a strong reason not to use the same rule for \u ?
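
As a rough standalone C sketch of that rule (an illustration only; unescape_x
is a hypothetical name and this is not how lread.c is organised), a reader
that treats \x as variable-length and backslash-space as "no character"
could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Value of hex digit C, or -1 if C is not a hex digit.  */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode SRC into at most CAP character codes stored in OUT.
   Only two escapes are handled: \x followed by any number of hex
   digits, and backslash-space, which contributes nothing and so can
   end a preceding \x escape.  Returns the number of codes stored.  */
size_t unescape_x(const char *src, unsigned long *out, size_t cap)
{
    size_t n = 0;
    assert(src != NULL && out != NULL);
    while (*src != '\0' && n < cap) {
        if (src[0] == '\\' && src[1] == ' ') {
            src += 2;                      /* terminator: emit nothing */
        } else if (src[0] == '\\' && src[1] == 'x') {
            unsigned long v = 0;
            src += 2;
            while (hexval(*src) >= 0)      /* greedy: stop at non-hex */
                v = (v << 4) + (unsigned long) hexval(*src++);
            out[n++] = v;
        } else {
            out[n++] = (unsigned char) *src++;
        }
    }
    return n;
}
```

With this sketch the text \x123\ As decodes to three codes (0x123, 'A',
's'), whereas \x123As decodes to two (0x123A, 's'), which is the run-together
case under discussion.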


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01  3:12           ` Stefan Monnier
@ 2006-05-01  3:41             ` Miles Bader
  2006-05-01 12:29               ` Stefan Monnier
  0 siblings, 1 reply; 202+ messages in thread
From: Miles Bader @ 2006-05-01  3:41 UTC (permalink / raw)
  Cc: Aidan Kehoe, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> Currently the syntax for \xNNNN hexadecimal escapes is that it ends whenever
> reaching a non-hexadecimal char, and if you need your \xNNN escape to be
> followed by a hexadecimal char, then you have to separate the two with "\ "
> (and the Lisp printer does that automatically, of course).
> Is there a strong reason not to use the same rule for \u ?

That might be sufficient for programmatic output, but anything involving
a significant space seems problematic in general...

-Miles
-- 
x
y
Z!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01  3:41             ` Miles Bader
@ 2006-05-01 12:29               ` Stefan Monnier
  0 siblings, 0 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-01 12:29 UTC (permalink / raw)
  Cc: Aidan Kehoe, rms, emacs-devel

>> Currently the syntax for \xNNNN hexadecimal escapes is that it ends whenever
>> reaching a non-hexadecimal char, and if you need your \xNNN escape to be
>> followed by a hexadecimal char, then you have to separate the two with "\ "
>> (and the Lisp printer does that automatically, of course).
>> Is there a strong reason not to use the same rule for \u ?

> That might be sufficient for programmatic output, but anything involving
> a significant space seems problematic in general...

Sorry, I don't understand what you mean by "significant space".


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 21:04       ` Andreas Schwab
  2006-04-30 21:57         ` Aidan Kehoe
@ 2006-05-01 18:32         ` Richard Stallman
  2006-05-01 19:03           ` Oliver Scholz
                             ` (2 more replies)
  1 sibling, 3 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-01 18:32 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

    > I think it would be better to put an explicit terminator into
    > the construct.  Perhaps #.  So you would write "\u123#As I walked"

    There is already the possibility to use `\ ' as a terminator.

That is true.  The worry is that people might forget and run the
unicode constant together with the following text.  People might not
remember to use `\ ' when it is needed, if they usually don't need it.

But it is no great disaster to make such an error--it will be obvious
when you see the output.  So perhaps there's no need to do anything
to avoid the problem.

One other question occurs to me.  In the Unicode branch,
doesn't \x do this job?  If so, \u would be redundant once we
merge in that code.  It would have no lasting purpose.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 21:16 ` Jonathan Yavner
@ 2006-05-01 18:32   ` Richard Stallman
  2006-05-02  5:03     ` Jonathan Yavner
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-01 18:32 UTC (permalink / raw)
  Cc: emacs-devel

    > I think it would be better to put an explicit terminator into
    > the construct.  Perhaps #.  So you would write "\u123#As I walked"

    That would be nonstandard.  Standards are better, even if ugly.

What standard are you talking about?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01 18:32         ` Richard Stallman
@ 2006-05-01 19:03           ` Oliver Scholz
  2006-05-02  4:45             ` Richard Stallman
  2006-05-02  0:46           ` Kenichi Handa
  2006-05-02  6:41           ` Aidan Kehoe
  2 siblings, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-01 19:03 UTC (permalink / raw)


Richard Stallman <rms@gnu.org> writes:

>     > I think it would be better to put an explicit terminator into
>     > the construct.  Perhaps #.  So you would write "\u123#As I walked"
>
>     There is already the possibility to use `\ ' as a terminator.
>
> That is true.  The worry is that people might forget and run the
> unicode constant together with the following text.  People might not
> remember to use `\ ' when it is needed, if they usually don't need it.
>
> But it is no great disaster to make such an error--it will be obvious
> when you see the output.  So perhaps there's no need to do anything
> to avoid the problem.

At any rate the syntax for \u and \x should be entirely in parallel,
IMNSHO.

> One other question occurs to me.  In the Unicode branch,
> doesn't \x do this job?  If so, \u would be redundant once we
> merge in that code.  It would have no lasting purpose.

There would still be a conceptual difference. \x refers to the
internal representation of a character in Emacs, while \u refers to an
abstract character. In the Unicode branch the hex numbers would be the
same in both cases, but conceptually it is still different. Like
writing `?a' in Lisp code instead of just `97' or like using
`(null list)' instead of `(not list)'.


    Oliver
-- 
12 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01 18:32         ` Richard Stallman
  2006-05-01 19:03           ` Oliver Scholz
@ 2006-05-02  0:46           ` Kenichi Handa
  2006-05-02  6:41           ` Aidan Kehoe
  2 siblings, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-02  0:46 UTC (permalink / raw)
  Cc: kehoea, schwab, emacs-devel

In article <E1FadBn-0007Ny-5U@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

> One other question occurs to me.  In the Unicode branch,
> doesn't \x do this job?  

Yes, it does.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01 19:03           ` Oliver Scholz
@ 2006-05-02  4:45             ` Richard Stallman
  0 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-02  4:45 UTC (permalink / raw)
  Cc: emacs-devel

    > One other question occurs to me.  In the Unicode branch,
    > doesn't \x do this job?  If so, \u would be redundant once we
    > merge in that code.  It would have no lasting purpose.

    There would still be a conceptual difference. \x refers to the
    internal representation of a character in Emacs, while \u refers to an
    abstract character.

If \xabcd and \uabcd will forever be equivalent, I don't see
much benefit in having syntax to indicate the conceptual difference.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01 18:32   ` Richard Stallman
@ 2006-05-02  5:03     ` Jonathan Yavner
  0 siblings, 0 replies; 202+ messages in thread
From: Jonathan Yavner @ 2006-05-02  5:03 UTC (permalink / raw)
  Cc: emacs-devel

RMS wrote:
> JYavner wrote:
>> Standards are better, even if ugly. 
> What standard are you talking about?

ISO standard C99, also known as WG14/N1124:
    http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

This document contains the text:
> 6.4.3 Universal character names
> Syntax
>        universal-character-name:
>               \u hex-quad
>               \U hex-quad hex-quad

This syntax is also used in Java, Perl, Python, etc.  The main place 
where it *doesn't* seem to work is gcc -- its "C99 status" document 
says \u and \U are supposed to be working, but I get screwy results.
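
A small standalone check of that syntax, as a sketch only: it assumes a
compiler that also accepts C11 u8 string literals (the u8 prefix is not part
of C99 and is used here just to pin the encoding to UTF-8), and the function
name euro_ucn_is_utf8 is made up for the example.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the u8 literal spelled with the universal character
   name \u20AC consists of the UTF-8 bytes E2 82 AC plus a NUL.  */
int euro_ucn_is_utf8(void)
{
    static const unsigned char expected[] = { 0xE2, 0x82, 0xAC, 0x00 };
    assert(sizeof(u8"\u20AC") == sizeof expected);
    return memcmp(u8"\u20AC", expected, sizeof expected) == 0;
}
```

Without the u8 prefix, the bytes a plain "\u20AC" literal produces depend on
the compiler's execution character set.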

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01 18:32         ` Richard Stallman
  2006-05-01 19:03           ` Oliver Scholz
  2006-05-02  0:46           ` Kenichi Handa
@ 2006-05-02  6:41           ` Aidan Kehoe
  2006-05-02 21:36             ` Richard Stallman
  2 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-02  6:41 UTC (permalink / raw)
  Cc: emacs-devel


 On the first of May, Richard Stallman wrote:

 >     > I think it would be better to put an explicit terminator into
 >     > the construct.  Perhaps #.  So you would write "\u123#As I walked"
 > 
 >     There is already the possibility to use `\ ' as a terminator.
 > 
 > That is true.  The worry is that people might forget and run the
 > unicode constant together with the following text.  People might not
 > remember to use `\ ' when it is needed, if they usually don't need it.
 > 
 > But it is no great disaster to make such an error--it will be obvious
 > when you see the output.  So perhaps there's no need to do anything
 > to avoid the problem.

One problem with that is that people writing portable code have never had
the option of assuming (equal "\ " ""). Now, of course, you may prefer that
people not write portable code; but it’s still a problem, because people
will try.

 > One other question occurs to me. In the Unicode branch, doesn't \x do
 > this job? If so, \u would be redundant once we merge in that code. It
 > would have no lasting purpose.

I addressed that in my first mail, and I quote: 

“Why not wait until the Unicode branch is merged? Well, that won’t solve the
problem either; people naturally want their code to be as compatible as
possible, so they will avoid the assumption that the integer-to-character
mapping is Unicode compatible as long as there are editors in the wild for
which that is not true. If this is integrated a good bit before the Unicode
branch is (which is what I would like), it will mean people can use this
syntax (which most modern programming languages have already, and which
people use) and be sure it’s compatible years before what would otherwise be
the case.” 

-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-29 15:35 [PATCH] Unicode Lisp reader escapes Aidan Kehoe
  2006-04-29 23:26 ` Stefan Monnier
  2006-04-30  3:04 ` Richard Stallman
@ 2006-05-02  6:43 ` Kenichi Handa
  2006-05-02  7:00   ` Aidan Kehoe
  2006-05-02 10:36   ` Eli Zaretskii
  2 siblings, 2 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-02  6:43 UTC (permalink / raw)
  Cc: emacs-devel

In article <17491.34779.959316.484740@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes:
[...]
> 2006-04-29  Aidan Kehoe  <kehoea@parhasard.net>

> 	* lread.c (read_escape):
> 	Provide a Unicode character escape syntax; \u followed by exactly
> 	four or \U followed by exactly eight hex digits in a character or
> 	string is read as a Unicode character with that code point. 
[...]
> +	lisp_char = call2(intern("decode-char"), intern("ucs"),
> +			  make_number(i));
> +

First of all, is it safe to call Lisp program in
read_escape?  Don't we have to care about GC and
buffer/string-data relocation?

---
Kenichi Handa
handa@m17n.org


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02  6:43 ` Kenichi Handa
@ 2006-05-02  7:00   ` Aidan Kehoe
  2006-05-02 10:45     ` Eli Zaretskii
  2006-05-02 11:33     ` Kenichi Handa
  2006-05-02 10:36   ` Eli Zaretskii
  1 sibling, 2 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-02  7:00 UTC (permalink / raw)
  Cc: emacs-devel


 On the second day of May, Kenichi Handa wrote:

 > > +	lisp_char = call2(intern("decode-char"), intern("ucs"),
 > > +			  make_number(i));
 > > +
 > 
 > First of all, is it safe to call Lisp program in read_escape? Don't we
 > have to care about GC and buffer/string-data relocation?

Yay, a technical objection. 

If it isn’t safe to call a Lisp program in read_escape, then the function is
full of bugs already. It’s called with three arguments, a Lisp_Object
readcharfun, an integer, and a pointer to an integer. If readcharfun is a
Lisp function (it may not be, it may be a buffer, a marker, or a string),
then that Lisp function is called on line 348. Cf. the documentation of
`read', which describes that the input may be from a function. 
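
The point about readcharfun can be sketched as follows, under the assumption of a much-simplified reader with only two kinds of source. The struct, enum and function names are hypothetical; the real READCHAR in lread.c handles more source types than this.

```c
#include <stddef.h>

enum source_kind { SOURCE_STRING, SOURCE_FUNCTION };

/* A simplified stand-in for readcharfun: either a C string with a
   read position, or a callback with its private data.  */
struct char_source
{
  enum source_kind kind;
  const char *string;           /* used when kind == SOURCE_STRING */
  size_t pos;                   /* next index to read from STRING */
  int (*function) (void *);     /* used when kind == SOURCE_FUNCTION */
  void *closure;                /* passed to FUNCTION */
};

/* Return the next character from SRC, or -1 at the end of a string.  */
static int
read_char (struct char_source *src)
{
  if (src->kind == SOURCE_FUNCTION)
    /* Arbitrary user code runs here, on every single read.  */
    return src->function (src->closure);

  if (src->string[src->pos] == '\0')
    return -1;
  return (unsigned char) src->string[src->pos++];
}

/* Example callback: counts its invocations and always returns '-'.  */
static int
counting_reader (void *closure)
{
  int *calls = closure;
  ++*calls;
  return '-';
}
```

With a function source, every character fetched inside an escape parser is a call into user code, which is why the safety question applies to the existing reads as much as to a new decode-char call.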

-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02  6:43 ` Kenichi Handa
  2006-05-02  7:00   ` Aidan Kehoe
@ 2006-05-02 10:36   ` Eli Zaretskii
  2006-05-02 10:59     ` Aidan Kehoe
  2006-05-03  2:59     ` Kenichi Handa
  1 sibling, 2 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-02 10:36 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Tue, 02 May 2006 15:43:16 +0900
> Cc: emacs-devel@gnu.org
> 
> > +	lisp_char = call2(intern("decode-char"), intern("ucs"),
> > +			  make_number(i));
> > +
> 
> First of all, is it safe to call Lisp program in
> read_escape?

Whether it is safe or not, I think it's certainly better to implement
the guts of decode-char in C, if it's gonna be called from
read_escape.  All those guts do is simple arithmetic, which will be
much faster in C.
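
A rough sketch of that arithmetic in C follows. The range boundaries are my recollection of the mule-unicode-0100-24ff, mule-unicode-2500-33ff and mule-unicode-e000-ffff charsets and should be treated as assumptions, as should every name in the code. The translation tables are deliberately left out.

```c
/* Which (hypothetical) target the code point falls into.  */
enum sketch_charset
{
  CS_NONE,          /* not representable in this sketch */
  CS_DIRECT,        /* below 0x100: used as-is */
  CS_0100_24FF,
  CS_2500_33FF,
  CS_E000_FFFF
};

struct sketch_char
{
  enum sketch_charset charset;
  int c1;           /* first position code (or the code itself) */
  int c2;           /* second position code */
};

/* Map a Unicode code point onto a 96x96 charset: subtract the base of
   the range, then split the offset into two position codes that each
   start at 32.  */
static struct sketch_char
sketch_decode_ucs (unsigned long cp)
{
  struct sketch_char r = { CS_NONE, 0, 0 };
  unsigned long base;

  if (cp < 0x100)
    {
      r.charset = CS_DIRECT;
      r.c1 = (int) cp;
      return r;
    }
  else if (cp < 0x2500)
    {
      r.charset = CS_0100_24FF;
      base = 0x100;
    }
  else if (cp < 0x3400)
    {
      r.charset = CS_2500_33FF;
      base = 0x2500;
    }
  else if (cp >= 0xE000 && cp < 0x10000)
    {
      r.charset = CS_E000_FFFF;
      base = 0xE000;
    }
  else
    return r;

  r.c1 = (int) ((cp - base) / 96) + 32;
  r.c2 = (int) ((cp - base) % 96) + 32;
  return r;
}
```

For U+0448 the offset from 0x100 is 840, so the position codes come out as 840 / 96 + 32 = 40 and 840 % 96 + 32 = 104.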

Moreover, I think the fact that decode-char uses translation tables to
support unify-8859-on-*coding-mode (and thus might produce characters
other than mule-unicode-*) could even be a misfeature: do we really
want read_escape to produce Unicode or non-Unicode characters when it
sees \uNNNN, depending on the current user settings?


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02  7:00   ` Aidan Kehoe
@ 2006-05-02 10:45     ` Eli Zaretskii
  2006-05-02 11:13       ` Aidan Kehoe
  2006-05-02 11:33     ` Kenichi Handa
  1 sibling, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-02 10:45 UTC (permalink / raw)
  Cc: emacs-devel, handa

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Tue, 2 May 2006 09:00:52 +0200
> Cc: emacs-devel@gnu.org
> 
>  > First of all, is it safe to call Lisp program in read_escape? Don't we
>  > have to care about GC and buffer/string-data relocation?
> 
> Yay, a technical objection. 

I don't know what you mean: the other objections were technical as
well.

> If it isn't safe to call a Lisp program in read_escape, then the function is
> full of bugs already.

``Full of bugs''?


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 10:36   ` Eli Zaretskii
@ 2006-05-02 10:59     ` Aidan Kehoe
  2006-05-02 19:26       ` Eli Zaretskii
  2006-05-03  2:59     ` Kenichi Handa
  1 sibling, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-02 10:59 UTC (permalink / raw)
  Cc: emacs-devel


 On the second day of May, Eli Zaretskii wrote:

 > Whether it is safe or not, I think it's certainly better to implement
 > the guts of decode-char in C, if it's gonna be called from
 > read_escape. 

If it’s only going to be called rarely (twice a file for non-byte-compiled
files, at a liberal guess, never for byte-compiled files), and after
decode-char is already loaded--both of which are the case--I don’t see the
argument for that.

 > All those guts do is simple arithmetic, which will be much faster in C.
 >
 > Moreover, I think the fact that decode-char uses translation tables to
 > support unify-8859-on-*coding-mode (and thus might produce characters
 > other than mule-unicode-*) could even be a misfeature: do we really
 > want read_escape to produce Unicode or non-Unicode characters when it
 > sees \uNNNN, depending on the current user settings?

This is not significantly different from the question “do we really want
(decode-char 'ucs #xABCD) to produce Unicode or non-Unicode characters
depending on the current user settings?”, since making string escapes
inconsistent with the Unicode coding systems does not make any sense. And
that question has already been answered. Cf. 
http://article.gmane.org/gmane.emacs.bugs/3422 . 

-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 10:45     ` Eli Zaretskii
@ 2006-05-02 11:13       ` Aidan Kehoe
  2006-05-02 19:31         ` Eli Zaretskii
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-02 11:13 UTC (permalink / raw)
  Cc: emacs-devel


 On the second day of May, Eli Zaretskii wrote:

 > >  > First of all, is it safe to call Lisp program in read_escape? Don't we
 > >  > have to care about GC and buffer/string-data relocation?
 > > 
 > > Yay, a technical objection. 
 > 
 > I don't know what you mean: the other objections were technical as
 > well.

I would rate questions of aesthetics (“ugliness”) and prose style as
non-technical. I don’t propose to impose that judgement on you, but I do
think it reasonable.

 > > If it isn't safe to call a Lisp program in read_escape, then the
 > > function is full of bugs already.
 > 
 > ``Full of bugs''?

Indeed; each READCHAR can call arbitrary Lisp, so something like

    case 'M':
      c = READCHAR;
      if (c != '-')
	error ("Invalid escape character syntax");
      c = READCHAR;
      if (c == '\\')
	c = read_escape (readcharfun, 0, byterep);
      return c | meta_modifier;

has two clear bugs in eight lines. 

-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02  7:00   ` Aidan Kehoe
  2006-05-02 10:45     ` Eli Zaretskii
@ 2006-05-02 11:33     ` Kenichi Handa
  2006-05-02 22:50       ` Aidan Kehoe
  1 sibling, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-02 11:33 UTC (permalink / raw)
  Cc: emacs-devel

In article <17495.932.70900.796282@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes:

>> First of all, is it safe to call Lisp program in read_escape? Don't we
>> have to care about GC and buffer/string-data relocation?

> Yay, a technical objection. 

> If it isn’t safe to call a Lisp program in read_escape, then the function is
> full of bugs already. It’s called with three arguments, a Lisp_Object
> readcharfun, an integer, and a pointer to an integer. If readcharfun is a
> Lisp function (it may not be, it may be a buffer, a marker, or a string),
> then that Lisp function is called on line 348. Cf. the documentation of
> `read', which describes that the input may be from a function. 

What concerns me is the case where readcharfun is a string or a
buffer.  In that case, of course, the current code doesn't
call Lisp in read_escape, so there's no need to GCPRO
readcharfun.  But if Lisp is called even when readcharfun is
a string, I think we should GCPRO it.  Is that already done?
(Sorry, I don't have time to check lread.c myself.)
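
For readers unfamiliar with the idea behind GCPRO, here is a toy sketch in C of a chain of protected variables that a collector could walk to find live objects. It is an illustration only: the names are hypothetical and this is not the definition of the GCPRO macros.

```c
#include <stddef.h>

/* One entry in a toy chain of protected variables.  The entry lives
   on the C stack of the function that needs the protection.  */
struct toy_root
{
  struct toy_root *next;    /* previously registered entry */
  void **var;               /* address of the protected variable */
};

/* Head of the chain; a collector would start scanning here.  */
static struct toy_root *toy_root_list;

/* Register VAR as protected, using caller-provided storage ENTRY.  */
static void
toy_protect (struct toy_root *entry, void **var)
{
  entry->var = var;
  entry->next = toy_root_list;
  toy_root_list = entry;
}

/* Drop the most recently registered entry, if there is one.  */
static void
toy_unprotect (void)
{
  if (toy_root_list)
    toy_root_list = toy_root_list->next;
}

/* Return 1 if VAR is currently on the chain, else 0.  */
static int
toy_is_protected (void **var)
{
  struct toy_root *r;

  for (r = toy_root_list; r; r = r->next)
    if (r->var == var)
      return 1;
  return 0;
}
```

In this picture, the question is whether some caller further up the stack has already registered readcharfun before read_escape gets to run any Lisp.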

---
Kenichi Handa
handa@m17n.org


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 10:59     ` Aidan Kehoe
@ 2006-05-02 19:26       ` Eli Zaretskii
  0 siblings, 0 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-02 19:26 UTC (permalink / raw)
  Cc: emacs-devel

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Tue, 2 May 2006 12:59:43 +0200
> Cc: emacs-devel@gnu.org
> 
>  > Whether it is safe or not, I think it's certainly better to implement
>  > the guts of decode-char in C, if it's gonna be called from
>  > read_escape. 
> 
> If it's only going to be called rarely (twice a file for non-byte-compiled
> files, at a liberal guess, never for byte-compiled files), and after
> decode-char is already loaded--both of which are the case--I don't see the
> argument for that.

And I don't see why we should assume anything for something as basic
as a subroutine of readevalloop.  It could be used to read anything,
not just .el files.

>  > Moreover, I think the fact that decode-char uses translation tables to
>  > support unify-8859-on-*coding-mode (and thus might produce characters
>  > other than mule-unicode-*) could even be a misfeature: do we really
>  > want read_escape to produce Unicode or non-Unicode characters when it
>  > sees \uNNNN, depending on the current user settings?
> 
> This is not significantly different from the question "do we really want
> (decode-char 'ucs #xABCD) to produce Unicode or non-Unicode characters
> depending on the current user settings?"

Maybe it's the same question, but since you are proposing to have
decode-char become part of routine reading of Lisp, this feature's
impact becomes much more important to discuss.

> since making string escapes inconsistent with the Unicode coding
> systems does not make any sense.

I'm not sure you are right; it should be discussed.

> And that question has already been answered. Cf.
> http://article.gmane.org/gmane.emacs.bugs/3422

Don't see any answers there about this, perhaps I'm too dumb.


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 11:13       ` Aidan Kehoe
@ 2006-05-02 19:31         ` Eli Zaretskii
  2006-05-02 20:25           ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-02 19:31 UTC (permalink / raw)
  Cc: emacs-devel

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Tue, 2 May 2006 13:13:00 +0200
> Cc: emacs-devel@gnu.org
> 
>  > I don't know what you mean: the other objections were technical as
>  > well.
> 
> I would rate questions of aesthetics ``ugliness'' and prose style as
> non-technical. I don't propose to impose that judgement on you, but I do
> think it reasonable.

The discussion was about quite a few technical issues, only one of
which was aesthetics.

>  > ``Full of bugs''?
> 
> Indeed; each READCHAR can call arbitrary Lisp, so something like
> 
>     case 'M':
>       c = READCHAR;
>       if (c != '-')
> 	error ("Invalid escape character syntax");
>       c = READCHAR;
>       if (c == '\\')
> 	c = read_escape (readcharfun, 0, byterep);
>       return c | meta_modifier;
> 
> has two clear bugs in eight lines. 

Yeah, right.  If you want your suggestions and opinions to be
considered seriously, my advice is to drop the attitude.  But I won't
impose that advice on you.


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 19:31         ` Eli Zaretskii
@ 2006-05-02 20:25           ` Aidan Kehoe
  2006-05-02 22:16             ` Oliver Scholz
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-02 20:25 UTC (permalink / raw)
  Cc: emacs-devel


 On the second day of May, Eli Zaretskii wrote:

 > >  > I don't know what you mean: the other objections were technical as
 > >  > well.
 > > 
 > > I would rate questions of aesthetics ``ugliness'' and prose style as
 > > non-technical. I don't propose to impose that judgement on you, but I do
 > > think it reasonable.
 > 
 > The discussion was about quite a few technical issues, only one of
 > which was aesthetics.

I proposed a working patch, Richard Stallman suggested an alternative
approach on the grounds that having both '\u' and '\U' was ugly. (He made
that clear after asking what the reason for having both of them was.) He
then commented that the functionality of the patch would be available in GNU
Emacs once the Unicode branch was merged, apparently ignoring what I had
written on that in my first mail.

Stefan Monnier commented that workarounds were available; that was more a
relevant comment than an objection, IMO.

Jonathan Yavner then objected to Richard’s objection, on the basis that my
already submitted patch followed a widely-implemented standard that
Richard’s alternative didn’t.

Miles Bader proposed an alternative to my patch, without objecting, to which
I didn’t follow up, because I wanted to see how people would react to
Jonathan’s mentioning of the existing standardisation of the escape.

Oliver Scholz said that the syntax for \u and \x should be entirely in
parallel “I[h]NSHO.”

And that is what had been posted directly in relation to my patch (as
opposed to in reaction to Richard’s proposed alternative) when you said that
the other objections were technical as well.

It seems to me that the only objections there are Richard’s, on the grounds
of ugliness, and Oliver’s, on the unexplained grounds of what I imagine is
his individual philosophy. I’d love to know what other objections you saw
before your posting; my email etiquette is far from perfect, and feedback is
always welcome.

 > >  > ``Full of bugs''?
 > > 
 > > Indeed; each READCHAR can call arbitrary Lisp, so something like
 > > 
 > >     case 'M':
 > >       c = READCHAR;
 > >       if (c != '-')
 > > 	error ("Invalid escape character syntax");
 > >       c = READCHAR;
 > >       if (c == '\\')
 > > 	c = read_escape (readcharfun, 0, byterep);
 > >       return c | meta_modifier;
 > > 
 > > has two clear bugs in eight lines. 
 > 
 > Yeah, right.  If you want your suggestions and opinions to be
 > considered seriously, my advice is to drop the attitude.  But I won't
 > impose that advice on you.

I would refer you to Kenichi Handa’s reply to that mail (that is, to
17495.932.70900.796282@parhasard.net ) for pointers on how to write what,
IM, especially Humble this time, O, is a much more constructive answer.

Best regards, 

	Aidan
-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02  6:41           ` Aidan Kehoe
@ 2006-05-02 21:36             ` Richard Stallman
  0 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-02 21:36 UTC (permalink / raw)
  Cc: emacs-devel

     > One other question occurs to me. In the Unicode branch, doesn't \x do
     > this job? If so, \u would be redundant once we merge in that code. It
     > would have no lasting purpose.

    I addressed that in my first mail, and I quote: 

Yes, but I don't think the argument is that strong, because I don't
see a big hurry.

On the other hand, given that C and Java use these constructs, compatibility
in Emacs would be useful.


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 20:25           ` Aidan Kehoe
@ 2006-05-02 22:16             ` Oliver Scholz
  0 siblings, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-02 22:16 UTC (permalink / raw)
  Cc: kehoea

Aidan Kehoe <kehoea@parhasard.net> writes:

[...]
> It seems to me that the only objections there are Richard’s, on the grounds
> of ugliness, and Oliver’s, on the unexplained grounds of what I imagine is
> his individual philosophy.

Though unexplained, not based on individual philosophy but based on
principles which I assume to be shared by most readers here (and which
therefore don't need explanation unless somebody explicitly asks for
one). The principle is that the Lisp API should be as consistent and
regular as possible in order to minimise possible sources of surprise
for the user. \x and \u are so similar in what they do, that there
should be very strong reasons for a difference in their syntax.

As for \u and \U with fixed numbers of digits: it might be standard in
other languages, but for Lisp it is entirely alien.

My comment was not an objection. On the contrary I am a believer here.
I think having a syntax for UCS characters in the next release would
be a very important addition. (That's why I raised my voice in the
first place.) You mentioned the reasons already so there's no need to
repeat them here.

The way *I* understand the discussion, the only real objection still
standing in the room is Richard's concern that \u would become
obsolete as soon as Emacs switches to an internal UCS encoding. I
still disagree, but I see his point. The question is whether the
portability provided by \u is considered to be more important than the
(perceived) redundancy in the future.

As for implementing decode-char in C: that should be really trivial.


    Oliver
-- 
13 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 11:33     ` Kenichi Handa
@ 2006-05-02 22:50       ` Aidan Kehoe
  2006-05-03  7:43         ` Kenichi Handa
  2006-05-03 17:21         ` Kevin Rodgers
  0 siblings, 2 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-02 22:50 UTC (permalink / raw)
  Cc: emacs-devel


 On the second day of May, Kenichi Handa wrote:

 > > Yay, a technical objection. 
 > 
 > > If it isn’t safe to call a Lisp program in read_escape, then the
 > > function is full of bugs already. It’s called with three arguments, a
 > > Lisp_Object readcharfun, an integer, and a pointer to an integer. If
 > > readcharfun is a Lisp function (it may not be, it may be a buffer, a
 > > marker, or a string), then that Lisp function is called on line 348. 
 > > Cf. the documentation of `read', which describes that the input may be
 > > from a function.
 > 
 > What concerns me is the case where readcharfun is a string or a buffer.
 > In that case, of course, the current code doesn't call Lisp in
 > read_escape, so there's no need to GCPRO readcharfun.
 >
 > But if Lisp is called even when readcharfun is a string, I think we
 > should GCPRO it. Is that already done? (Sorry, I don't have time to
 > check lread.c myself.)

I’m reasonably sure it’s already done in the callers of read1, but I don’t
have graphing software to hand, and the English for the reasoning I’ve
written out is unreadably tedious. So, sure, GCPROing seems worth the time. 
Do you mean to GCPRO independent of what type readcharfun is? 

-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 10:36   ` Eli Zaretskii
  2006-05-02 10:59     ` Aidan Kehoe
@ 2006-05-03  2:59     ` Kenichi Handa
  2006-05-03  8:47       ` Eli Zaretskii
  1 sibling, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-03  2:59 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

In article <ufyjsemrn.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

>> From: Kenichi Handa <handa@m17n.org>
>> Date: Tue, 02 May 2006 15:43:16 +0900
>> Cc: emacs-devel@gnu.org
>> 
>> > +	lisp_char = call2(intern("decode-char"), intern("ucs"),
>> > +			  make_number(i));
>> > +
>> 
>> First of all, is it safe to call Lisp program in
>> read_escape?

> Whether it is safe or not, I think it's certainly better to implement
> the guts of decode-char in C, if it's gonna be called from
> read_escape.  All those guts do is simple arithmetic, which will be
> much faster in C.

I agree.

> Moreover, I think the fact that decode-char uses translation tables to
> support unify-8859-on-*coding-mode (and thus might produce characters
> other than mule-unicode-*) could even be a misfeature:

Decode-char doesn't support unify-8859-on-*coding-mode but
supports utf-fragment-on-decoding and
utf-translate-cjk-mode.

> do we really want read_escape to produce Unicode or
> non-Unicode characters when it sees \uNNNN, depending on
> the current user settings?

I think, at least, CJK characters should be decoded into one
of CJK charsets because there are no other charsets.

---
Kenichi Handa
handa@m17n.org


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 22:50       ` Aidan Kehoe
@ 2006-05-03  7:43         ` Kenichi Handa
  2006-05-03 17:21         ` Kevin Rodgers
  1 sibling, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-03  7:43 UTC (permalink / raw)
  Cc: emacs-devel

In article <17495.57895.90438.848865@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes:

>> But if Lisp is called even when readcharfun is a string, I think we
>> should GCPRO it. Is that already done? (Sorry, I don't have time to
>> check lread.c myself.)

> I’m reasonably sure it’s already done in the callers of read1, but I don’t
> have graphing software to hand, and the English for the reasoning I’ve
> written out is unreadably tedious. So, sure, GCPROing seems worth the time. 
> Do you mean to GCPRO independent of what type readcharfun is? 

I have not yet considered in depth what we should GCPRO and
where to do that.

But, as I replied to Eli's mail, I now think that
implementing decode-char in C is better provided that it is
decided to handle \u.... in read_escape.

---
Kenichi Handa
handa@m17n.org


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03  2:59     ` Kenichi Handa
@ 2006-05-03  8:47       ` Eli Zaretskii
  2006-05-03 14:21         ` Stefan Monnier
  2006-05-04  1:26         ` Kenichi Handa
  0 siblings, 2 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-03  8:47 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: kehoea@parhasard.net, emacs-devel@gnu.org
> Date: Wed, 03 May 2006 11:59:52 +0900
> 
> > Moreover, I think the fact that decode-char uses translation tables to
> > support unify-8859-on-*coding-mode (and thus might produce characters
> > other than mule-unicode-*) could even be a misfeature:
> 
> Decode-char doesn't support unify-8859-on-*coding-mode but
> supports utf-fragment-on-decoding and
> utf-translate-cjk-mode.

Sorry, I meant utf-fragment-on-decoding, which decodes Cyrillic and
Greek into ISO 8859.  (I always get confused and lost in the maze of
those twisted *-on-decoding passages, all alike.)

> > do we really want read_escape to produce Unicode or
> > non-Unicode characters when it sees \uNNNN, depending on
> > the current user settings?
> 
> I think, at least, CJK characters should be decoded into one
> of CJK charsets because there are no other charsets.

Right, but what about Cyrillic and Greek?  The merits and demerits of
depending on utf-fragment-on-decoding are not clear when the Lisp
reader is involved.


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03  8:47       ` Eli Zaretskii
@ 2006-05-03 14:21         ` Stefan Monnier
  2006-05-03 18:26           ` Eli Zaretskii
  2006-05-04  1:33           ` Kenichi Handa
  2006-05-04  1:26         ` Kenichi Handa
  1 sibling, 2 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-03 14:21 UTC (permalink / raw)
  Cc: kehoea, emacs-devel, Kenichi Handa

>> I think, at least, CJK characters should be decoded into one
>> of CJK charsets because there are no other charsets.

> Right, but what about Cyrillic and Greek?  The merits and demerits of
> depending on utf-fragment-on-decoding are not clear when the Lisp
> reader is involved.

I think we should treat them as much as possible consistently with the rest
of the treatment of unicode chars.  If we start down the path of "OK, we can
do it like this for those chars but not these, oh and as for those ones over
there, we'll do it yet some other way", I think we're headed for headaches
with no real benefit.


        Stefan


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-02 22:50       ` Aidan Kehoe
  2006-05-03  7:43         ` Kenichi Handa
@ 2006-05-03 17:21         ` Kevin Rodgers
  2006-05-03 18:51           ` Andreas Schwab
  1 sibling, 1 reply; 202+ messages in thread
From: Kevin Rodgers @ 2006-05-03 17:21 UTC (permalink / raw)


Aidan Kehoe wrote:
>  On the second day of May, Kenichi Handa wrote:
> 
>  > > Yay, a technical objection. 
>  > 
>  > > If it isn’t safe to call a Lisp program in read_escape, then the
>  > > function is full of bugs already. It’s called with three arguments, a
>  > > Lisp_Object readcharfun, an integer, and a pointer to an integer. If
>  > > readcharfun is a Lisp function (it may not be, it may be a buffer, a
>  > > marker, or a string), then that Lisp function is called on line 348. 
>  > > Cf. the documentation of `read', which describes that the input may be
>  > > from a function.
>  > 
>  > What concerns me is the case where readcharfun is a string or a buffer.
>  > In that case, of course, the current code doesn't call Lisp in
>  > read_escape, so there's no need to GCPRO readcharfun.
>  >
>  > But if Lisp is called even when readcharfun is a string, I think we
>  > should GCPRO it. Is that already done? (Sorry, I don't have time to
>  > check lread.c myself.)
> 
> I’m reasonably sure it’s already done in the callers of read1, but I don’t
> have graphing software to hand, and the English for the reasoning I’ve
> written out is unreadably tedious. So, sure, GCPROing seems worth the time. 
> Do you mean to GCPRO independent of what type readcharfun is? 

readcharfun is declared as a Lisp_Object in read1, so it should be
possible to check its type and only GCPRO when necessary.

-- 
Kevin


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03 14:21         ` Stefan Monnier
@ 2006-05-03 18:26           ` Eli Zaretskii
  2006-05-03 21:12             ` Ken Raeburn
  2006-05-04 14:17             ` Richard Stallman
  2006-05-04  1:33           ` Kenichi Handa
  1 sibling, 2 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-03 18:26 UTC (permalink / raw)
  Cc: kehoea, emacs-devel, handa

> Cc: Kenichi Handa <handa@m17n.org>,  kehoea@parhasard.net,
> 	  emacs-devel@gnu.org
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Wed, 03 May 2006 10:21:15 -0400
> 
> > Right, but what about Cyrillic and Greek?  The merits and demerits of
> > depending on utf-fragment-on-decoding are not clear when the Lisp
> > reader is involved.
> 
> I think we should treat them as much as possible consistently with the rest
> of the treatment of unicode chars.

IIRC, we don't support such a consistency when we load Lisp files,
because we don't want loading and byte-compiling to depend on user
settings.

Of course, the same effect can also be achieved by binding
utf-fragment-on-decoding etc. to appropriate values.


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03 17:21         ` Kevin Rodgers
@ 2006-05-03 18:51           ` Andreas Schwab
  2006-05-04 21:14             ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Andreas Schwab @ 2006-05-03 18:51 UTC (permalink / raw)
  Cc: emacs-devel

Kevin Rodgers <ihs_4664@yahoo.com> writes:

> readcharfun is declared as a Lisp_Object in read1, so it should be
> possible to check its type and only GCPRO when necessary.

I don't see any need to GCPRO readcharfun.  When called from Lisp the
arguments are already protected by being part of the call frame, and all
uses from C protect the object by other means (eg, by being put on
eval-buffer-list).

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03 18:26           ` Eli Zaretskii
@ 2006-05-03 21:12             ` Ken Raeburn
  2006-05-04 14:17             ` Richard Stallman
  1 sibling, 0 replies; 202+ messages in thread
From: Ken Raeburn @ 2006-05-03 21:12 UTC (permalink / raw)
  Cc: kehoea, handa, Stefan Monnier, emacs-devel

On May 3, 2006, at 14:26, Eli Zaretskii wrote:
> IIRC, we don't support such a consistency when we load Lisp files,
> because we don't want loading and byte-compiling to depend on user
> settings.

Not sure if it matters here, but shouldn't eval-last-sexp in a user's  
text buffer follow the user's (buffer's) settings?

Ken


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03  8:47       ` Eli Zaretskii
  2006-05-03 14:21         ` Stefan Monnier
@ 2006-05-04  1:26         ` Kenichi Handa
  1 sibling, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-04  1:26 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

In article <uy7xjcx5s.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

>> > do we really want read_escape to produce Unicode or
>> > non-Unicode characters when it sees \uNNNN, depending on
>> > the current user settings?
>> 
>> I think, at least, CJK characters should be decoded into one
>> of CJK charsets because there are no other charsets.

> Right, but what about Cyrillic and Greek?  The merits and demerits of
> depending on utf-fragment-on-decoding are not clear when the Lisp
> reader is involved.

I don't see any strong reason for not following
utf-fragment-on-decoding in read_escape, leaving aside the question
of the usefulness of this option.

---
Kenichi Handa
handa@m17n.org


* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03 14:21         ` Stefan Monnier
  2006-05-03 18:26           ` Eli Zaretskii
@ 2006-05-04  1:33           ` Kenichi Handa
  2006-05-04  8:23             ` Oliver Scholz
  2006-05-04 16:32             ` Eli Zaretskii
  1 sibling, 2 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-04  1:33 UTC (permalink / raw)
  Cc: kehoea, eliz, emacs-devel

In article <87odyfnqcj.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> I think, at least, CJK characters should be decoded into one
>>> of CJK charsets because there's no other charsets.

>> Right, but what about Cyrillic and Greek?  The merits and demerits of
>> depending on utf-fragment-on-decoding are not clear when the Lisp
>> reader is involved.

> I think we should treat them as much as possible consistently with the rest
> of the treatment of unicode chars.  If we start down the path of "OK, we can
> do it like this for those chars but not these, oh and as for those ones over
> there, we'll do it yet some other way", I think we're headed for headaches
> with no real benefit.

I agree.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04  1:33           ` Kenichi Handa
@ 2006-05-04  8:23             ` Oliver Scholz
  2006-05-04 11:57               ` Kim F. Storm
  2006-05-04 16:32             ` Eli Zaretskii
  1 sibling, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-04  8:23 UTC (permalink / raw)


For what it's worth, I made a stab at implementing \u analogous to
\x---including a port of the core functionality of `decode-char' to C.

As for the current discussion: I regard both e.g. \u3b1 and
(decode-char 'ucs #x3b1) as a means to say "Give me that abstract
character---the Greek letter alpha---I don't care about your internal
encoding, *just use your defaults*, but give me that character." So,
effectively the respective functions should deal with fragmentation
and the like. It would matter, for instance, if the fontset specifies
different glyphs for the same abstract character depending on the
charsets.

But I see Eli's point. Ideally, the conversion (to ISO 8859-X)
wouldn't take place when reading the string, but when it is
displayed/inserted into a buffer. Logically, because that's when the
difference between abstract character and internal representation
should become effective. Practically, because: if the user loads a
library containing strings with \u escapes (or `decode-char'
expressions eval'ed at load-time) and *then* customises the value of
`utf-fragment-on-decoding', the change won't affect those characters.
However, I believe that this is rather a minor obscurity than a bug; I
don't believe that anybody would get bitten by this seriously.

    Oliver

Here's the patch, only slightly tested:

Index: src/lread.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	4 May 2006 08:00:53 -0000
@@ -1731,6 +1731,102 @@
   return str[0];
 }
 
+
+#define READ_HEX_ESCAPE(i, c)                                         \
+  while (1)                                                           \
+    {                                                                 \
+      c = READCHAR;                                                   \
+      if (c >= '0' && c <= '9')                                       \
+        {                                                             \
+          i *= 16;                                                    \
+          i += c - '0';                                               \
+        }                                                             \
+      else if ((c >= 'a' && c <= 'f')                                 \
+               || (c >= 'A' && c <= 'F'))                             \
+        {                                                             \
+          i *= 16;                                                    \
+          if (c >= 'a' && c <= 'f')                                   \
+            i += c - 'a' + 10;                                        \
+          else                                                        \
+            i += c - 'A' + 10;                                        \
+        }                                                             \
+      else                                                            \
+        {                                                             \
+          UNREAD (c);                                                 \
+          break;                                                      \
+        }                                                             \
+    }
+
+
+
+/* Return the internal character corresponding to a UCS code point.  */
+
+int
+ucs_to_internal (ucs)
+     int ucs;
+{
+  int c = 0;
+  Lisp_Object tmp_char;
+
+  if (! EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-mode"))))
+    /* cf. `utf-lookup-subst-table-for-decode' */
+    {
+      if (EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-lang-env"))))
+        call0 (intern ("utf-translate-cjk-load-tables"));
+      tmp_char = Fgethash (make_number (ucs),
+                           Fget (intern ("utf-subst-table-for-decode"),
+                                 intern ("translation-hash-table")),
+                           Qnil);
+      if (! EQ (Qnil, tmp_char))
+        {
+          CHECK_NUMBER (tmp_char);
+          c = XFASTINT (tmp_char);
+        }
+    }
+
+  if (c)
+    /* We found the character already in the translation hash table.
+       Do nothing. */
+    ;
+  else if (ucs < 160)
+    c = ucs;
+  else if (ucs < 256)
+    c = MAKE_CHAR (charset_latin_iso8859_1, ucs, 0);
+  else if (ucs < 0x2500)
+    {
+      ucs -= 0x0100;
+      c = MAKE_CHAR (charset_mule_unicode_0100_24ff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  else if (ucs < 0x3400)
+    {
+      ucs -= 0x2500;
+      c = MAKE_CHAR (charset_mule_unicode_2500_33ff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  else if ((ucs >= 0xE000) && (ucs < 0x10000))
+    {
+      ucs -= 0xE000;
+      c = MAKE_CHAR (charset_mule_unicode_e000_ffff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  
+  if (c)
+    {
+      Lisp_Object vect = Fget (intern ("utf-translation-table-for-decode"),
+                               intern ("translation-table"));
+      tmp_char = Faref (vect, make_number (c));
+      if (! EQ (Qnil, tmp_char))
+        return XFASTINT (tmp_char);
+      return c;
+    }
+  else error ("Invalid or unsupported UCS character: %x", ucs);
+}
+
+      
 /* Read a \-escape sequence, assuming we already read the `\'.
    If the escape sequence forces unibyte, store 1 into *BYTEREP.
    If the escape sequence forces multibyte, store 2 into *BYTEREP.
@@ -1879,34 +1975,24 @@
       /* A hex escape, as in ANSI C.  */
       {
 	int i = 0;
-	while (1)
-	  {
-	    c = READCHAR;
-	    if (c >= '0' && c <= '9')
-	      {
-		i *= 16;
-		i += c - '0';
-	      }
-	    else if ((c >= 'a' && c <= 'f')
-		     || (c >= 'A' && c <= 'F'))
-	      {
-		i *= 16;
-		if (c >= 'a' && c <= 'f')
-		  i += c - 'a' + 10;
-		else
-		  i += c - 'A' + 10;
-	      }
-	    else
-	      {
-		UNREAD (c);
-		break;
-	      }
-	  }
-
+        READ_HEX_ESCAPE (i, c);
 	*byterep = 2;
 	return i;
       }
 
+    case 'u':
+      /* A hexadecimal reference to a UCS character.  */
+      {
+        int i = 0;
+        Lisp_Object lisp_char;
+        
+        READ_HEX_ESCAPE (i, c);
+        *byterep = 2;
+
+        return ucs_to_internal (i);
+
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c = read_multibyte (c, readcharfun);

    
-- 
15 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!
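
[Editorial sketch, not part of the original message.]  The range
arithmetic in ucs_to_internal above can be tried outside Emacs.  The
following is a minimal standalone sketch that reproduces only the
division into ranges and the (ucs / 96) + 32, (ucs % 96) + 32 position
codes from the patch.  The names ucs_range, split_ucs and
split_ucs_code are hypothetical; MAKE_CHAR, the translation tables and
the CJK hash-table lookup are deliberately left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the charset_* symbols used in the patch.  */
enum ucs_range
{
  RANGE_PLAIN,        /* ucs < 160: the code is used unchanged */
  RANGE_LATIN_1,      /* 160 <= ucs < 256 */
  RANGE_0100_24FF,    /* mule-unicode-0100-24ff */
  RANGE_2500_33FF,    /* mule-unicode-2500-33ff */
  RANGE_E000_FFFF,    /* mule-unicode-e000-ffff */
  RANGE_UNSUPPORTED   /* anything else, e.g. CJK without the hash table */
};

struct split_ucs
{
  enum ucs_range range;
  int c1;   /* first position code, or the code itself for the first two ranges */
  int c2;   /* second position code, 0 when unused */
};

/* Split a UCS code point the way the patch's if/else chain does.  */
static struct split_ucs
split_ucs_code (int ucs)
{
  struct split_ucs r = { RANGE_UNSUPPORTED, 0, 0 };
  int base;

  assert (ucs >= 0);

  if (ucs < 160)
    {
      r.range = RANGE_PLAIN;
      r.c1 = ucs;
      return r;
    }
  if (ucs < 256)
    {
      r.range = RANGE_LATIN_1;
      r.c1 = ucs;
      return r;
    }

  if (ucs < 0x2500)
    {
      r.range = RANGE_0100_24FF;
      base = 0x0100;
    }
  else if (ucs < 0x3400)
    {
      r.range = RANGE_2500_33FF;
      base = 0x2500;
    }
  else if (ucs >= 0xE000 && ucs < 0x10000)
    {
      r.range = RANGE_E000_FFFF;
      base = 0xE000;
    }
  else
    return r;   /* 0x3400..0xDFFF and everything above 0xFFFF */

  /* Each mule-unicode charset is a 96x96 grid with codes starting at 32.  */
  r.c1 = (ucs - base) / 96 + 32;
  r.c2 = (ucs - base) % 96 + 32;
  return r;
}
```

For example, split_ucs_code (0x3b1), the Greek alpha mentioned above,
falls in the 0100-24ff range with position codes 39 and 49, while
0x4E00 (a CJK ideograph) is reported as unsupported because this
sketch has no substitute for the translation hash table.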

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04  8:23             ` Oliver Scholz
@ 2006-05-04 11:57               ` Kim F. Storm
  2006-05-04 12:18                 ` Stefan Monnier
  2006-05-04 13:07                 ` Oliver Scholz
  0 siblings, 2 replies; 202+ messages in thread
From: Kim F. Storm @ 2006-05-04 11:57 UTC (permalink / raw)
  Cc: emacs-devel

Oliver Scholz <alkibiades@gmx.de> writes:

> Here's the patch, only slightly tested:

> +
> +/* Return the internal character corresponding to a UCS code point.  */
> +
> +int
> +ucs_to_internal (ucs)
> +     int ucs;
> +{
> +  if (! EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-mode"))))
> +      if (EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-lang-env"))))
> +        call0 (intern ("utf-translate-cjk-load-tables"));
> +                           Fget (intern ("utf-subst-table-for-decode"),
> +                                 intern ("translation-hash-table")),
> +      Lisp_Object vect = Fget (intern ("utf-translation-table-for-decode"),
> +                               intern ("translation-table"));
> +}

That's 7 lisp vars accessed from C - for decoding one character!?!

How often does this happen?

If it is only/primarily used for interactive use, I guess it doesn't matter.
Otherwise, I think those vars should be declared in C, to avoid the overhead
of interning them at run-time...

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 11:57               ` Kim F. Storm
@ 2006-05-04 12:18                 ` Stefan Monnier
  2006-05-04 12:21                   ` Kim F. Storm
  2006-05-04 16:31                   ` Eli Zaretskii
  2006-05-04 13:07                 ` Oliver Scholz
  1 sibling, 2 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-04 12:18 UTC (permalink / raw)
  Cc: emacs-devel, Oliver Scholz

> That's 7 lisp vars accessed from C - for decoding one character!?!

> How often does this happen?

> If it is only/primarily used for interactive use, I guess it doesn't matter.
> Otherwise, I think those vars should be declared in C, to avoid the overhead
> of interning them at run-time...

I'd vote to keep the code in elisp.  After all, it's there, it works, and as
mentioned: there's no evidence that the decoding time of \u escapes is ever
going to need to be fast.  And it'll become fast in Emacs-unicode anyway, so
it doesn't seem to be worth the trouble.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 12:18                 ` Stefan Monnier
@ 2006-05-04 12:21                   ` Kim F. Storm
  2006-05-04 16:31                   ` Eli Zaretskii
  1 sibling, 0 replies; 202+ messages in thread
From: Kim F. Storm @ 2006-05-04 12:21 UTC (permalink / raw)
  Cc: Oliver Scholz, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> That's 7 lisp vars accessed from C - for decoding one character!?!
>
>> How often does this happen?
>
>> If it is only/primarily used for interactive use, I guess it doesn't matter.
>> Otherwise, I think those vars should be declared in C, to avoid the overhead
>> of interning them at run-time...
>
> I'd vote to keep the code in elisp.  After all, it's there, it works, and as
> mentioned: there's no evidence that the decoding time of \u escapes is ever
> going to need to be fast.  And it'll become fast in Emacs-unicode anyway, so
> it doesn't seem to be worth the trouble.

Ok.  Pls. disregard my query.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 11:57               ` Kim F. Storm
  2006-05-04 12:18                 ` Stefan Monnier
@ 2006-05-04 13:07                 ` Oliver Scholz
  1 sibling, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-04 13:07 UTC (permalink / raw)


storm@cua.dk (Kim F. Storm) writes:

> Oliver Scholz <alkibiades@gmx.de> writes:
>
>> Here's the patch, only slightly tested:
>
>> +
>> +/* Return the internal character corresponding to a UCS code point.  */
>> +
>> +int
>> +ucs_to_internal (ucs)
>> +     int ucs;
>> +{
>> +  if (! EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-mode"))))
>> +      if (EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-lang-env"))))
>> +        call0 (intern ("utf-translate-cjk-load-tables"));
>> +                           Fget (intern ("utf-subst-table-for-decode"),
>> +                                 intern ("translation-hash-table")),
>> +      Lisp_Object vect = Fget (intern ("utf-translation-table-for-decode"),
>> +                               intern ("translation-table"));
>> +}
>
> That's 7 lisp vars accessed from C - for decoding one character!?!

Nearly inevitable, if you want to DTRT with CJK.

> How often does this happen?

Every time a character specified with \u is decoded. The call0,
however, probably just once per Emacs session.

> If it is only/primarily used for interactive use, I guess it doesn't matter.
> Otherwise, I think those vars should be declared in C, to avoid the overhead
> of interning them at run-time...

I tend to agree; but that is probably too intrusive.


    Oliver
-- 
15 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03 18:26           ` Eli Zaretskii
  2006-05-03 21:12             ` Ken Raeburn
@ 2006-05-04 14:17             ` Richard Stallman
  2006-05-04 16:41               ` Aidan Kehoe
  1 sibling, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-04 14:17 UTC (permalink / raw)
  Cc: kehoea, handa, monnier, emacs-devel

Regarding \u: the question is whether an Emacs escape for Unicode
characters should be compatible with C string syntax for Unicode
characters, or coherent with the Emacs \x escape.

I think one relevant question is to what extent the C and Emacs Lisp
string syntax are compatible in the first place.  Emacs Lisp string
syntax was largely based on C string syntax in 1984, but I don't know
how C has developed since 1990.  Can someone report on this question?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 12:18                 ` Stefan Monnier
  2006-05-04 12:21                   ` Kim F. Storm
@ 2006-05-04 16:31                   ` Eli Zaretskii
  2006-05-04 21:40                     ` Stefan Monnier
  1 sibling, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-04 16:31 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, storm

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Thu, 04 May 2006 08:18:02 -0400
> Cc: emacs-devel@gnu.org, Oliver Scholz <alkibiades@gmx.de>
> 
> I'd vote to keep the code in elisp.

And I think it's ugly and hackish to call Lisp from within C code,
when all that Lisp does is simple integer arithmetic.

IIRC, `decode-char' was originally coded in Lisp because it was added
at the last moment before some past release happened.  That was cool
as long as it was a rarely-used vehicle for converting Unicode
codepoints to the Emacs internal representation, but it's certainly
NOT cool, IMO, as part of the Lisp reader.

> After all, it's there, it works, and as mentioned: there's no
> evidence that the decoding time of \u escapes is ever going to need
> to be fast.

??? inside the Lisp reader, everything needs to be fast, IMO.

> And it'll become fast in Emacs-unicode anyway

Which will be when? 5 years from now?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04  1:33           ` Kenichi Handa
  2006-05-04  8:23             ` Oliver Scholz
@ 2006-05-04 16:32             ` Eli Zaretskii
  2006-05-04 20:55               ` Aidan Kehoe
  2006-05-05 19:05               ` Richard Stallman
  1 sibling, 2 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-04 16:32 UTC (permalink / raw)
  Cc: kehoea, monnier, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Thu, 04 May 2006 10:33:55 +0900
> Cc: kehoea@parhasard.net, eliz@gnu.org, emacs-devel@gnu.org
> 
> > I think we should treat them as much as possible consistently with the rest
> > of the treatment of unicode chars.  If we start down the path of "OK, we can
> > do it like this for those chars but not these, oh and as for those ones over
> > there, we'll do it yet some other way", I think we're headed for headaches
> > with no real benefit.
> 
> I agree.

What happens when a Lisp file is byte-compiled--do we want the result
to depend on the local settings?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 14:17             ` Richard Stallman
@ 2006-05-04 16:41               ` Aidan Kehoe
  2006-05-05 10:39                 ` Eli Zaretskii
  2006-05-05 19:05                 ` Richard Stallman
  0 siblings, 2 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-04 16:41 UTC (permalink / raw)
  Cc: emacs-devel


 On the fourth day of May, Richard Stallman wrote:

 > Regarding \u: the question is whether an Emacs escape for Unicode
 > characters should be compatible with C string syntax for Unicode
 > characters, or coherent with the Emacs \x escape.

The thing with the Emacs \x escape is that anyone using it for characters
outside of ASCII is asking for pain, and always has been. It has only ever
been clearly defined for that character set; any existing code in the
repository for other characters, for example, _will definitely_ break with
the merging of the Unicode branch.

Now, there is lots of code in 21.4’s source tree that uses the syntax for
things that are conceptually numbers and not Emacs characters. That code is
not broken, but it is bad style; that’s what the #x syntax is for.

So when people have been using the variable-length syntax with a length
greater than two, they are either writing buggy code, or using bad style.
I’m not sure that merits emulation. 

 > I think one relevant question is to what extent the C and Emacs Lisp
 > string syntax are compatible in the first place.  Emacs Lisp string
 > syntax was largely based on C string syntax in 1984, but I don't know
 > how C has developed since 1990.  Can someone report on this question?

The \u syntax (with a fixed number of digits) came into wide use with Java
in 1996. The necessity for the \U extension arose with progress towards
version 3.0 of Unicode and its ~1.1 million available code points. That
version of the standard was released in 1999; the C99 ISO standard for C of
the same year included both \u and \U. Various other C-oriented programming
languages have incorporated the syntax since. 

-- 
Aidan Kehoe, http://www.parhasard.net/
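
[Editorial sketch, not part of the original message.]  The contrast
between the variable-length \x escape and the fixed-width \u and \U
forms discussed above can be shown in a few lines of C.  This is a
minimal sketch under stated assumptions, not the reader code from
either patch: hex_value, parse_variable_hex and parse_fixed_hex are
hypothetical names, and the upper limit 0x10FFFF is the Unicode code
space boundary rather than anything taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Value of the hexadecimal digit C, or -1 if C is not a hex digit.  */
static int
hex_value (int c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

/* \x style: consume hex digits greedily and store how many were read
   in *CONSUMED.  Very long digit runs simply wrap around.  */
static unsigned long
parse_variable_hex (const char *s, size_t *consumed)
{
  unsigned long value = 0;
  size_t n = 0;

  assert (s != NULL && consumed != NULL);
  while (hex_value ((unsigned char) s[n]) >= 0)
    {
      value = value * 16 + (unsigned long) hex_value ((unsigned char) s[n]);
      n++;
    }
  *consumed = n;
  return value;
}

/* \u (NDIGITS == 4) or \U (NDIGITS == 8) style: require exactly
   NDIGITS hex digits.  Return the code point, or -1 if a digit is
   missing or the value lies beyond U+10FFFF.  */
static long
parse_fixed_hex (const char *s, int ndigits)
{
  unsigned long value = 0;
  int i;

  assert (s != NULL && (ndigits == 4 || ndigits == 8));
  for (i = 0; i < ndigits; i++)
    {
      int d = hex_value ((unsigned char) s[i]);
      if (d < 0)
        return -1;   /* also reached at the terminating NUL */
      value = value * 16 + (unsigned long) d;
    }
  return value > 0x10FFFFUL ? -1 : (long) value;
}
```

The practical difference shows up when the escape is followed by
ordinary text: given "e9dit", the greedy parser consumes three
characters and yields 0xE9D, because the "d" happens to be a hex
digit, whereas parse_fixed_hex ("00e9dit", 4) stops after four digits
and yields 0xE9.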

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 16:32             ` Eli Zaretskii
@ 2006-05-04 20:55               ` Aidan Kehoe
  2006-05-05  9:33                 ` Oliver Scholz
  2006-05-05 19:05               ` Richard Stallman
  1 sibling, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-04 20:55 UTC (permalink / raw)
  Cc: emacs-devel


 On the fourth day of May, Eli Zaretskii wrote:

 > > > I think we should treat them as much as possible consistently with
 > > > the rest of the treatment of unicode chars.  If we start down the
 > > > path of "OK, we can do it like this for those chars but not these, oh
 > > > and as for those ones over there, we'll do it yet some other way", I
 > > > think we're headed for headaches with no real benefit.
 > > 
 > > I agree.
 > 
 > What happens when a Lisp file is byte-compiled--do we want the result
 > to depend on the local settings?

It does currently, to the extent of local settings preventing successful
compilation. Cf. this code (on Unix):

(let ((our-test-file-name "/tmp/testing-byte-compile.el"))
  (let ((coding-system-for-write 'iso-8859-1))
    (set-buffer (get-buffer-create our-test-file-name))
    (insert (concat
	     ";; -*- coding: utf-8 -*-\n\n"
	     "(require 'cl)\n\n"
	     "(defun describe-our-string ()\n"
	     "  (let ((our-char ?"
	     (format "%c%c%c" ?\345 ?\215 ?\227)
	     "))\n"
	     "  (message (format \"\%c maps to \%s\n\" our-char "
	     "(split-char our-char)))))\n"))
    (write-file our-test-file-name nil)
    (kill-buffer (current-buffer)))
  (utf-translate-cjk-mode 1)
  (byte-compile-file our-test-file-name)
  (load-file (concat our-test-file-name "c")) 
  (describe-our-string)
  (delete-file (concat our-test-file-name "c")) 
  (utf-translate-cjk-mode 0)
  ;; The following byte compilation fails for me; error
  ;; Compiling file /tmp/testing-byte-compile.el at Thu May  4 22:49:00 2006
  ;; testing-byte-compile.el:4:1:Error: Invalid read syntax: "?"
  ;; 
  (byte-compile-file our-test-file-name)
  (load-file (concat our-test-file-name "c")) 
  (describe-our-string))


-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-03 18:51           ` Andreas Schwab
@ 2006-05-04 21:14             ` Aidan Kehoe
  2006-05-08  1:31               ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-04 21:14 UTC (permalink / raw)
  Cc: emacs-devel


 On the third day of May, Andreas Schwab wrote:

 > Kevin Rodgers <ihs_4664@yahoo.com> writes:
 > 
 > > readcharfun is declared as a Lisp_Object in read1, so it should be
 > > possible to check it's type and only GCPRO when necessary.
 > 
 > I don't see any need to GCPRO readcharfun.  When called from Lisp the
 > arguments are already protected by being part of the call frame, and all
 > uses from C protect the object by other means (eg, by being put on
 > eval-buffer-list).

That was my understanding of the code too. 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 16:31                   ` Eli Zaretskii
@ 2006-05-04 21:40                     ` Stefan Monnier
  2006-05-05 10:25                       ` Eli Zaretskii
  0 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-04 21:40 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, storm

>> After all, it's there, it works, and as mentioned: there's no
>> evidence that the decoding time of \u escapes is ever going to need
>> to be fast.

> ??? inside the Lisp reader, everything needs to be fast, IMO.

Why?  IMO the only things that need to be fast are those things whose
performance has a visible impact.  I see no evidence that there'll ever be
a case where the speed with which we can read \u escapes will matter.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 20:55               ` Aidan Kehoe
@ 2006-05-05  9:33                 ` Oliver Scholz
  2006-05-05 10:02                   ` Oliver Scholz
                                     ` (2 more replies)
  0 siblings, 3 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-05  9:33 UTC (permalink / raw)


Aidan Kehoe <kehoea@parhasard.net> writes:

>  On the fourth day of May, Eli Zaretskii wrote:

[...]
>  > What happens when a Lisp file is byte-compiled--do we want the result
>  > to depend on the local settings?

Oy! This might be a bit more serious than what I called a "minor
obscurity".

I guess, you have a similar problem when the source file is encoded in
either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and
Thai?). Unless I am much mistaken, the encoding of the characters in
the .elc file would also depend on the value of
`utf-fragment-on-decoding'. A difference might be that this is much
more obvious in the case of an ISO 8859-[57] encoded file; and that it
is more obscure and more likely to cause puzzlement in the case of a
couple of letters specified with \u.

I have no opinion on how serious that is. One might say, that is just
one of the glitches of emacs-mule. Or maybe not. I don't know.

At least I don't see a proper solution to this.

> It does currently, to the extent of local settings preventing successful
> compilation. Cf. this code (on Unix):

[...]
>     (insert (concat
> 	     ";; -*- coding: utf-8 -*-\n\n"
> 	     "(require 'cl)\n\n"
> 	     "(defun describe-our-string ()\n"
> 	     "  (let ((our-char ?"
> 	     (format "%c%c%c" ?\345 ?\215 ?\227)
[...]
>   (utf-translate-cjk-mode 0)
[...]
>   (byte-compile-file our-test-file-name)
[...]

I am afraid that is not relevant here. This just tells Emacs not to
deal with UTF-8 encoded CJK characters and then tells it to deal with
such a character.


    Oliver
-- 
16 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05  9:33                 ` Oliver Scholz
@ 2006-05-05 10:02                   ` Oliver Scholz
  2006-05-05 18:33                   ` Aidan Kehoe
  2006-05-06 14:24                   ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-05 10:02 UTC (permalink / raw)


Correction.

Oliver Scholz <alkibiades@gmx.de> writes:

> I guess, you have a similar problem when the source file is encoded in
> either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and
> Thai?). 

I meant to say: a source file in a UCS encoding containing characters
from the range of ISO 8859-[57].

> Unless I am much mistaken, the encoding of the characters in the
> .elc file would also depend on the value of
> `utf-fragment-on-decoding'.



    Oliver
-- 
16 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 21:40                     ` Stefan Monnier
@ 2006-05-05 10:25                       ` Eli Zaretskii
  2006-05-05 12:31                         ` Oliver Scholz
  2006-05-05 13:05                         ` Stefan Monnier
  0 siblings, 2 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-05 10:25 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

> Cc: storm@cua.dk,  emacs-devel@gnu.org,  alkibiades@gmx.de
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Thu, 04 May 2006 17:40:28 -0400
> 
> >> After all, it's there, it works, and as mentioned: there's no
> >> evidence that the decoding time of \u escapes is ever going to need
> >> to be fast.
> 
> > ??? inside the Lisp reader, everything needs to be fast, IMO.
> 
> Why?

Because the Lisp reader is the backbone of the Lisp interpreter.

> IMO the only things that need to be fast are those things whose
> performance has a visible impact.  I see no evidence that there'll ever be
> a case where the speed with which we can read \u escapes will matter.

You don't need to see evidence of a collapsing bridge to know that
it must be several times stronger than any imaginable load that could
ever be put on it.

In other words, not everything is empirical; there's a thing called
``good engineering practice.''

Sorry for being overly didactic, I'm sure you know all that.  I'm just
amazed that such a fundamental issue needs evidence.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 16:41               ` Aidan Kehoe
@ 2006-05-05 10:39                 ` Eli Zaretskii
  2006-05-05 16:35                   ` Aidan Kehoe
  2006-05-05 19:05                 ` Richard Stallman
  1 sibling, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-05 10:39 UTC (permalink / raw)
  Cc: rms, emacs-devel

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Thu, 4 May 2006 18:41:17 +0200
> Cc: , emacs-devel@gnu.org
> 
>  > I think one relevant question is to what extent the C and Emacs Lisp
>  > string syntax are compatible in the first place.  Emacs Lisp string
>  > syntax was largely based on C string syntax in 1984, but I don't know
>  > how C has developed since 1990.  Can someone report on this question?
> 
> The \u syntax (with a fixed number of digits) came into wide use with Java
> in 1996. The necessity for the \U extension arose with progress towards
> version 3.0 of Unicode and its ~1.1 million available code points. That
> version of the standard was released in 1999; the C99 ISO standard for C of
> the same year included both \u and \U. Various other C-oriented programming
> languages have incorporated the syntax since. 

I think Richard was asking for a simple summary of the current C
string syntax, with special emphasis on the standard escapes.  \u and
\U are only part of the story.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 10:25                       ` Eli Zaretskii
@ 2006-05-05 12:31                         ` Oliver Scholz
  2006-05-05 18:08                           ` Stuart D. Herring
  2006-05-05 13:05                         ` Stefan Monnier
  1 sibling, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-05 12:31 UTC (permalink / raw)


Eli Zaretskii <eliz@gnu.org> writes:

>> Cc: storm@cua.dk,  emacs-devel@gnu.org,  alkibiades@gmx.de
>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Date: Thu, 04 May 2006 17:40:28 -0400
>> 
>> >> After all, it's there, it works, and as mentioned: there's no
>> >> evidence that the decoding time of \u escapes is ever going to need
>> >> to be fast.
>> 
>> > ??? inside the Lisp reader, everything needs to be fast, IMO.
>> 
>> Why?
>
> Because the Lisp reader is the backbone of the Lisp interpreter.
>
>> IMO the only things that need to be fast are those things whose
>> performance has a visible impact.  I see no evidence that there'll ever be
>> a case where the speed with which we can read \u escapes will matter.
>
> You don't need to see evidence of a collapsing bridge to know that
> it must be several times stronger than any imaginable load that could
> ever be put on it.
>
> In other words, not everything is empirical; there's a thing called
> ``good engineering practice.''
>
> Sorry for being overly didactic, I'm sure you know all that.  I'm just
> amazed that such a fundamental issue needs evidence.

For the sake of peace: my opinion probably doesn't matter much, but
personally I believe that the changes necessary for a C implementation
of `decode-char' are local enough to be safe. I wouldn't like to
change anything in mule.el or utf-8.el---not even a defvar---, but
maybe I could just make the necessary symbols available to C in
syms_of_lread:

    Qutf_translate_cjk_mode = intern ("utf-translate-cjk-mode");
    staticpro (&Qutf_translate_cjk_mode);

    [...]

    Qutf_subst_table_for_decode = intern ("utf-subst-table-for-decode");
    staticpro (&Qutf_subst_table_for_decode);

    Qtranslation_hash_table = intern ("translation-hash-table");
    staticpro (&Qtranslation_hash_table);


And then access them from my port of `decode-char's core functionality
like this:    
    
    SYMBOL_VALUE (Qutf_translate_cjk_mode)
    [...]
    Fget (Qutf_subst_table_for_decode, Qtranslation_hash_table)
  

[Well, in fact I already did in my copy of Emacs.]    

    
    Oliver
-- 
16 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!
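
[Editorial sketch, not part of the original message.]  The trade-off
debated above, interning a symbol name on every call versus looking it
up once at start-up and keeping the result in a static variable, can
be modelled without Emacs.  This is a toy illustration only: the names
toy_symbol, toy_obarray, toy_intern, toy_syms_init and Qtoy_cjk_mode
are hypothetical, the table is a linear array rather than Emacs's
hashed obarray, and there is no garbage collector, so nothing
corresponding to staticpro is needed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a Lisp symbol: a name and an integer value.  */
struct toy_symbol
{
  const char *name;
  int value;
};

static struct toy_symbol toy_obarray[] = {
  { "utf-translate-cjk-mode", 1 },
  { "utf-subst-table-for-decode", 0 },
  { "translation-hash-table", 0 },
};

/* Counts how many name lookups have been performed.  */
static unsigned long toy_lookups;

/* Find the symbol called NAME by string comparison; NULL if absent.  */
static struct toy_symbol *
toy_intern (const char *name)
{
  size_t i;

  toy_lookups++;
  for (i = 0; i < sizeof toy_obarray / sizeof toy_obarray[0]; i++)
    if (strcmp (toy_obarray[i].name, name) == 0)
      return &toy_obarray[i];
  return NULL;
}

/* Cached pointer, filled in once, in the spirit of syms_of_lread.  */
static struct toy_symbol *Qtoy_cjk_mode;

static void
toy_syms_init (void)
{
  Qtoy_cjk_mode = toy_intern ("utf-translate-cjk-mode");
  assert (Qtoy_cjk_mode != NULL);
}

/* Looks the name up again on every call.  */
static int
cjk_mode_uncached (void)
{
  struct toy_symbol *sym = toy_intern ("utf-translate-cjk-mode");

  assert (sym != NULL);
  return sym->value;
}

/* Uses the pointer cached by toy_syms_init; no lookup at all.  */
static int
cjk_mode_cached (void)
{
  assert (Qtoy_cjk_mode != NULL);
  return Qtoy_cjk_mode->value;
}
```

After one call to toy_syms_init, any number of calls to
cjk_mode_cached leaves toy_lookups at 1, while each call to
cjk_mode_uncached adds one more lookup.  Whether that saving matters
for \u escapes in practice is exactly the point the thread leaves
open.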

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 10:25                       ` Eli Zaretskii
  2006-05-05 12:31                         ` Oliver Scholz
@ 2006-05-05 13:05                         ` Stefan Monnier
  2006-05-05 17:23                           ` Oliver Scholz
  1 sibling, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-05 13:05 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

>> > ??? inside the Lisp reader, everything needs to be fast, IMO.
>> Why?
> Because the Lisp reader is the backbone of the Lisp interpreter.

>> IMO the only things that need to be fast are those things whose
>> performance has a visible impact.  I see no evidence that there'll ever be
>> a case where the speed with which we can read \u escapes will matter.
> You don't need to see an evidence of a collapsing bridge to know that
> it must be several times stronger than any imaginable load that could
> ever be put on it.

We're talking performance here, not correctness.

> In other words, not everything is empirical; there's a thing called
> ``good engineering practice.''

And we're talking about a micro-optimization, not an algorithmic
optimization, so the only good engineering principle I know in this domain
is: don't micro-optimize before you know it's necessary.

> Sorry for being overly didactic, I'm sure you know all that.  I'm just
> amazed that such a fundamental issue needs evidence.

I don't need evidence to accept the C code version, but I need such evidence
before I can accept "performance" as the motivation for the use of C code
rather than elisp.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 10:39                 ` Eli Zaretskii
@ 2006-05-05 16:35                   ` Aidan Kehoe
  0 siblings, 0 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-05 16:35 UTC (permalink / raw)
  Cc: rms, emacs-devel


 On the fifth day of May, Eli Zaretskii wrote:

 > >  > I think one relevant question is to what extent the C and Emacs Lisp
 > >  > string syntax are compatible in the first place. Emacs Lisp string
 > >  > syntax was largely based on C string syntax in 1984, but I don't
 > >  > know how C has developed since 1990. Can someone report on this
 > >  > question?
 > > 
 > > The \u syntax (with a fixed number of digits) came into wide use with
 > > Java in 1996. The necessity for the \U extension arose with progress
 > > towards version 3.0 of Unicode and its ~1.1 million available code
 > > points. That version of the standard was released in 1999; the C99 ISO
 > > standard for C of the same year included both \u and \U. Various other
 > > C-oriented programming languages have incorporated the syntax since.
 > 
 > I think Richard was asking for a simple summary of the current C
 > string syntax, with special emphasis on the standard escapes.  \u and
 > \U are only part of the story.

Well, I read it as him asking how C has developed since 1990 in its string
syntax, and \u and \U are most of that story. Your parse is more reasonable;
the question is not clear, though. 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 13:05                         ` Stefan Monnier
@ 2006-05-05 17:23                           ` Oliver Scholz
  0 siblings, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-05 17:23 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 817 bytes --]

For what it's worth, I just tried the attached little stress test on
an updated C port of `decode-char' in order to check whether it
returns equivalent results. It does. (Well, except for intentional
differences, such as `ucs_to_internal' signalling an error where
`decode-char' returns nil.)

Basically the test runs through all positive integers up to MAX_CHAR
and inserts an alist into a temp buffer with each car being the
integer and each cdr being a character in the \u syntax (e.g.
`?\u3b1'). It then reads that alist again and checks whether
`decode-char' on its car is `eq' to its cdr. I tried it with and
without `utf-translate-cjk-mode' and with and without
`utf-fragment-on-decoding'. Since all tests succeed, ucs_to_internal
and `decode-char' are functionally equivalent on all supported
characters.
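
The arithmetic half of that round-trip check can be sketched in plain C.
This is only an illustration under stated assumptions: it models the
(ucs / 96) + 32 and (ucs % 96) + 32 position codes that the patch below
passes to MAKE_CHAR, and the names `struct pos', `offset_to_pos',
`pos_to_offset' and `check_range' are hypothetical helpers, not Emacs
functions. It does not touch the translation tables at all.

```c
#include <assert.h>

/* Hypothetical stand-in for the two position codes given to MAKE_CHAR.  */
struct pos
{
  int row;
  int col;
};

/* Split an offset within a 96x96 charset into two position codes in
   the range 32..127, mirroring (ucs / 96) + 32 and (ucs % 96) + 32.  */
static struct pos
offset_to_pos (int offset)
{
  struct pos p;

  assert (offset >= 0 && offset < 96 * 96);
  p.row = offset / 96 + 32;
  p.col = offset % 96 + 32;
  return p;
}

/* Inverse of offset_to_pos.  */
static int
pos_to_offset (struct pos p)
{
  return (p.row - 32) * 96 + (p.col - 32);
}

/* Count the code points in [base, limit) whose position codes fall
   outside 32..127 or fail to map back to the same code point.  */
static int
check_range (int base, int limit)
{
  int failures = 0;
  int ucs;

  for (ucs = base; ucs < limit; ucs++)
    {
      struct pos p = offset_to_pos (ucs - base);

      if (p.row < 32 || p.row > 127 || p.col < 32 || p.col > 127
          || pos_to_offset (p) + base != ucs)
        failures++;
    }
  return failures;
}
```

Calling check_range on the three ranges the patch handles (0x0100 to
0x2500, 0x2500 to 0x3400, 0xE000 to 0x10000) should report zero
failures, since each range spans at most 96 * 96 code points.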

The test: 

[-- Attachment #2: ucs-test.el --]
[-- Type: application/emacs-lisp, Size: 1517 bytes --]

[-- Attachment #3: Type: text/plain, Size: 20 bytes --]


The updated patch: 

[-- Attachment #4: ucs-escapes.diff --]
[-- Type: text/plain, Size: 6643 bytes --]

Index: src/lread.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	5 May 2006 17:09:37 -0000
@@ -87,6 +87,9 @@
 Lisp_Object Qbackquote, Qcomma, Qcomma_at, Qcomma_dot, Qfunction;
 Lisp_Object Qinhibit_file_name_operation;
 Lisp_Object Qeval_buffer_list, Veval_buffer_list;
+Lisp_Object Qutf_translate_cjk_mode, Qutf_translate_cjk_lang_env, Qutf_translate_cjk_load_tables;
+Lisp_Object Qutf_subst_table_for_decode, Qtranslation_hash_table;
+Lisp_Object Qutf_translation_table_for_decode, Qtranslation_table;
 
 extern Lisp_Object Qevent_symbol_element_mask;
 extern Lisp_Object Qfile_exists_p;
@@ -1731,6 +1734,110 @@
   return str[0];
 }
 
+
+#define READ_HEX_ESCAPE(i, c)                                         \
+  while (1)                                                           \
+    {                                                                 \
+      c = READCHAR;                                                   \
+      if (c >= '0' && c <= '9')                                       \
+        {                                                             \
+          i *= 16;                                                    \
+          i += c - '0';                                               \
+        }                                                             \
+      else if ((c >= 'a' && c <= 'f')                                 \
+               || (c >= 'A' && c <= 'F'))                             \
+        {                                                             \
+          i *= 16;                                                    \
+          if (c >= 'a' && c <= 'f')                                   \
+            i += c - 'a' + 10;                                        \
+          else                                                        \
+            i += c - 'A' + 10;                                        \
+        }                                                             \
+      else                                                            \
+        {                                                             \
+          UNREAD (c);                                                 \
+          break;                                                      \
+        }                                                             \
+    }
+
+
+
+/* Return the internal character corresponding to a UCS code point.  */
+
+int
+ucs_to_internal (ucs)
+     int ucs;
+{
+  int c = 0;
+  Lisp_Object tmp_char;
+
+  if (! EQ (Qnil, SYMBOL_VALUE (Qutf_translate_cjk_mode)))
+    /* cf. `utf-lookup-subst-table-for-decode' */
+    {
+      Lisp_Object hash;
+      
+      if (EQ (Qnil, SYMBOL_VALUE (Qutf_translate_cjk_lang_env)))
+        call0 (Qutf_translate_cjk_load_tables);
+
+      hash = Fget (Qutf_subst_table_for_decode, Qtranslation_hash_table);
+
+      if (HASH_TABLE_P (hash))
+        {
+          tmp_char = Fgethash (make_number (ucs), hash, Qnil);
+          if (! EQ (Qnil, tmp_char))
+            {
+              CHECK_NUMBER (tmp_char);
+              c = XFASTINT (tmp_char);
+            }
+        }
+    }
+
+  if (c)
+    /* We found the character already in the translation hash table.
+       Do nothing. */
+    ;
+  else if (ucs < 160)
+    c = ucs;
+  else if (ucs < 256)
+    c = MAKE_CHAR (charset_latin_iso8859_1, ucs, 0);
+  else if (ucs < 0x2500)
+    {
+      ucs -= 0x0100;
+      c = MAKE_CHAR (charset_mule_unicode_0100_24ff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  else if (ucs < 0x3400)
+    {
+      ucs -= 0x2500;
+      c = MAKE_CHAR (charset_mule_unicode_2500_33ff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  else if ((ucs >= 0xE000) && (ucs < 0x10000))
+    {
+      ucs -= 0xE000;
+      c = MAKE_CHAR (charset_mule_unicode_e000_ffff,
+                     ((ucs / 96) + 32),
+                     ((ucs % 96) + 32));
+    }
+  
+  if (c || ucs == 0) /* U+0000 is also a valid character. */
+    {
+      Lisp_Object vect = Fget (Qutf_translation_table_for_decode,
+                               Qtranslation_table);
+      if (CHAR_TABLE_P (vect))
+        {
+          tmp_char = Faref (vect, make_number (c));
+          if (! EQ (Qnil, tmp_char))
+            return XFASTINT (tmp_char);
+        }
+      return c;
+    }
+  else error ("Invalid or unsupported UCS character: %x", ucs);
+}
+
+      
 /* Read a \-escape sequence, assuming we already read the `\'.
    If the escape sequence forces unibyte, store 1 into *BYTEREP.
    If the escape sequence forces multibyte, store 2 into *BYTEREP.
@@ -1879,34 +1986,23 @@
       /* A hex escape, as in ANSI C.  */
       {
 	int i = 0;
-	while (1)
-	  {
-	    c = READCHAR;
-	    if (c >= '0' && c <= '9')
-	      {
-		i *= 16;
-		i += c - '0';
-	      }
-	    else if ((c >= 'a' && c <= 'f')
-		     || (c >= 'A' && c <= 'F'))
-	      {
-		i *= 16;
-		if (c >= 'a' && c <= 'f')
-		  i += c - 'a' + 10;
-		else
-		  i += c - 'A' + 10;
-	      }
-	    else
-	      {
-		UNREAD (c);
-		break;
-	      }
-	  }
-
+        READ_HEX_ESCAPE (i, c);
 	*byterep = 2;
 	return i;
       }
 
+    case 'u':
+      /* A hexadecimal reference to an UCS character. */
+      {
+        int i = 0;
+        
+        READ_HEX_ESCAPE (i, c);
+        *byterep = 2;
+
+        return ucs_to_internal (i);
+
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c = read_multibyte (c, readcharfun);
@@ -4121,6 +4217,27 @@
 
   Vloads_in_progress = Qnil;
   staticpro (&Vloads_in_progress);
+
+  Qutf_translate_cjk_mode = intern ("utf-translate-cjk-mode");
+  staticpro (&Qutf_translate_cjk_mode);
+  
+  Qutf_translate_cjk_lang_env = intern ("utf-translate-cjk-lang-env");
+  staticpro (&Qutf_translate_cjk_lang_env);
+  
+  Qutf_translate_cjk_load_tables = intern ("utf-translate-cjk-load-tables");
+  staticpro (&Qutf_translate_cjk_load_tables);
+  
+  Qutf_subst_table_for_decode = intern ("utf-subst-table-for-decode");
+  staticpro (&Qutf_subst_table_for_decode);
+  
+  Qtranslation_hash_table = intern ("translation-hash-table");
+  staticpro (&Qtranslation_hash_table);
+
+  Qutf_translation_table_for_decode = intern ("utf-translation-table-for-decode");
+  staticpro (&Qutf_translation_table_for_decode);
+  
+  Qtranslation_table = intern ("translation-table");
+  staticpro (&Qtranslation_table);
 }
 
 /* arch-tag: a0d02733-0f96-4844-a659-9fd53c4f414d

[-- Attachment #5: Type: text/plain, Size: 87 bytes --]



    Oliver
-- 
16 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 12:31                         ` Oliver Scholz
@ 2006-05-05 18:08                           ` Stuart D. Herring
  0 siblings, 0 replies; 202+ messages in thread
From: Stuart D. Herring @ 2006-05-05 18:08 UTC (permalink / raw)
  Cc: emacs-devel

> syms_of_lread:
>
>     Qutf_translate_cjk_mode = intern ("utf-translate-cjk-mode");
>     staticpro (&Qutf_translate_cjk_mode);
>
>     [...]
>
>     Qutf_subst_table_for_decode = intern ("utf-subst-table-for-decode");
>     staticpro (&Qutf_subst_table_for_decode);
>
>     Qtranslation_hash_table = intern ("translation-hash-table");

I'd suggest here:

>-    staticpro (&Qutf_subst_table_for_decode);
>+    staticpro (&Qtranslation_hash_table);

Davis

-- 
This product is sold by volume, not by mass.  If it appears too dense or
too sparse, it is because mass-energy conversion has occurred during
shipping.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05  9:33                 ` Oliver Scholz
  2006-05-05 10:02                   ` Oliver Scholz
@ 2006-05-05 18:33                   ` Aidan Kehoe
  2006-05-05 18:42                     ` Oliver Scholz
  2006-05-05 21:37                     ` Eli Zaretskii
  2006-05-06 14:24                   ` Richard Stallman
  2 siblings, 2 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-05 18:33 UTC (permalink / raw)
  Cc: emacs-devel


 On the fifth day of May, Oliver Scholz wrote:

 > >  > What happens when a Lisp file is byte-compiled--do we want the result
 > >  > to depend on the local settings?
 >
 > [...]
 >
 > > It does currently, to the extent of local settings preventing successful
 > > compilation. Cf. this code (on Unix):
 > 
 > [...]
 > >     (insert (concat
 > > 	     ";; -*- coding: utf-8 -*-\n\n"
 > > 	     "(require 'cl)\n\n"
 > > 	     "(defun describe-our-string ()\n"
 > > 	     "  (let ((our-char ?"
 > > 	     (format "%c%c%c" ?\345 ?\215 ?\227)
 > [...]
 > >   (utf-translate-cjk-mode 0)
 > [...]
 > >   (byte-compile-file our-test-file-name)
 > [...]
 > 
 > I am afraid that is not relevant here. This just tells Emacs not to
 > deal with UTF-8 encoded CJK characters and then tells it to deal with
 > such a character.

It byte compiles a file, changes a local setting, and byte-compiles the file
again with a different result. That is relevant to Eli’s question. 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 18:33                   ` Aidan Kehoe
@ 2006-05-05 18:42                     ` Oliver Scholz
  2006-05-05 21:37                     ` Eli Zaretskii
  1 sibling, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-05 18:42 UTC (permalink / raw)


Aidan Kehoe <kehoea@parhasard.net> writes:

>  On the fifth day of May, Oliver Scholz wrote:
>
>  > >  > What happens when a Lisp file is byte-compiled--do we want the result
>  > >  > to depend on the local settings?
>  >
>  > [...]
>  >
>  > > It does currently, to the extent of local settings preventing successful
>  > > compilation. Cf. this code (on Unix):
>  > 
>  > [...]
>  > >     (insert (concat
>  > > 	     ";; -*- coding: utf-8 -*-\n\n"
>  > > 	     "(require 'cl)\n\n"
>  > > 	     "(defun describe-our-string ()\n"
>  > > 	     "  (let ((our-char ?"
>  > > 	     (format "%c%c%c" ?\345 ?\215 ?\227)
>  > [...]
>  > >   (utf-translate-cjk-mode 0)
>  > [...]
>  > >   (byte-compile-file our-test-file-name)
>  > [...]
>  > 
>  > I am afraid that is not relevant here. This just tells Emacs not to
>  > deal with UTF-8 encoded CJK characters and then tells it to deal with
>  > such a character.
>
> It byte compiles a file, changes a local setting, and byte-compiles the file
> again with a different result. That is relevant to Eli’s question. 

Sure, and I can put

(eval-after-load "bytecomp"
  '(fset 'byte-compile-file
         (lambda (&rest ignore) (error "lirum larum"))))

into my .emacs and bytecompiling will also yield different results
depending on local setting. I guess that would also be relevant here.
         
    Oliver
-- 
16 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 16:32             ` Eli Zaretskii
  2006-05-04 20:55               ` Aidan Kehoe
@ 2006-05-05 19:05               ` Richard Stallman
  2006-05-05 21:43                 ` Eli Zaretskii
  1 sibling, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-05 19:05 UTC (permalink / raw)
  Cc: kehoea, emacs-devel, monnier, handa

    What happens when a Lisp file is byte-compiled--do we want the result
    to depend on the local settings?

No, we want that to depend only on local variables of the file itself.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 16:41               ` Aidan Kehoe
  2006-05-05 10:39                 ` Eli Zaretskii
@ 2006-05-05 19:05                 ` Richard Stallman
  2006-05-05 19:20                   ` Aidan Kehoe
  1 sibling, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-05 19:05 UTC (permalink / raw)
  Cc: emacs-devel

     > Regarding \u: the question is whether an Emacs escape for Unicode
     > characters should be compatible with C string syntax for Unicode
     > characters, or coherent with the Emacs \x escape.

    The thing with the Emacs \x escape is that anyone using it for characters
    outside of ASCII is asking for pain, and always has been. It has only ever
    been clearly defined for that character set; any existing code in the
    repository for other characters, for example, _will definitely_ break with
    the merging of the Unicode branch.

We are miscommunicating.  Whether it is wise to use \x is not the
question.  The issue I am talking about is that of _coherence_
(parallelism of syntax) between \x and \u.

     > I think one relevant question is to what extent the C and Emacs Lisp
     > string syntax are compatible in the first place.  Emacs Lisp string
     > syntax was largely based on C string syntax in 1984, but I don't know
     > how C has developed since 1990.  Can someone report on this question?

    The \u syntax (with a fixed number of digits) came into wide use with Java
    in 1996. The necessity for the \U extension arose with progress towards
    version 3.0 of Unicode and its ~1.1 million available code points. That
    version of the standard was released in 1999; the C99 ISO standard for C of
    the same year included both \u and \U. Various other C-oriented programming
    languages have incorporated the syntax since. 

Thank you, but my question here is not about \u.  Rather, it is about
whether there are OTHER incompatibilities between Emacs Lisp and C
string syntax.

I want to see that information before deciding what to do here.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 19:05                 ` Richard Stallman
@ 2006-05-05 19:20                   ` Aidan Kehoe
  2006-05-05 19:57                     ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-05 19:20 UTC (permalink / raw)
  Cc: emacs-devel


 On the fifth day of May, Richard Stallman wrote:

 >     [...] The thing with the Emacs \x escape is that anyone using it for
 >     characters outside of ASCII is asking for pain, and always has been. 
 >     It has only ever been clearly defined for that character set; any
 >     existing code in the repository for other characters, for example,
 >     _will definitely_ break with the merging of the Unicode branch.
 > 
 > We are miscommunicating.  Whether it is wise to use \x is not the
 > question.  The issue I am talking about is that of _coherence_
 > (parallelism of syntax) between \x and \u.

Indeed. And one of the paragraphs you snipped indicated my doubts as to
whether it is wise to be coherent with something that is either bad style or
broken.

 > [...] Thank you, but my question here is not about \u. Rather, it is
 > about whether there are OTHER incompatibilities between Emacs Lisp and C
 > string syntax.
 >
 > I want to see that information before deciding what to do here.

There aren’t, to my knowledge; C is a pretty conservative language. GCC and
its conscientious approach to the standards is a big part of why that is so,
as I understand it. 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 19:20                   ` Aidan Kehoe
@ 2006-05-05 19:57                     ` Aidan Kehoe
  2006-05-06 14:25                       ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-05 19:57 UTC (permalink / raw)
  Cc: emacs-devel


 On the fifth day of May, Aidan Kehoe wrote:

 >  > [...] Thank you, but my question here is not about \u. Rather, it is
 >  > about whether there are OTHER incompatibilities between Emacs Lisp and C
 >  > string syntax.
 >  >
 >  > I want to see that information before deciding what to do here.
 > 
 > There aren’t, to my knowledge; C is a pretty conservative language. GCC
 > and its conscientious approach to the standards is a big part of why that
 > is so, as I understand it.

Sorry, that is false. There are other incompatibilities:

\d in Emacs Lisp string and character syntax gives \0177
\e in Emacs Lisp string and character syntax gives \033
\M-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x8000000
\S-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x2000000
\H-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x1000000
\A-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x400000
\s-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x800000
\C-<CHAR> and \^<CHAR> in Emacs Lisp string and character syntax gives the
control version of CHAR, which for non-ASCII characters is CHAR + #x4000000

All of these are incompatibilities on the Emacs Lisp side; except for the
Unicode escapes, a C programmer can use any C escape desired in Emacs Lisp.
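
For reference, the modifier offsets in that list are single bits, and
spelling them as powers of two makes them easy to check. The sketch
below is an illustration only: the MOD_* names and `with_modifiers' are
made up for this example, and the alt, super and hyper values follow the
Lisp manual's statement that the bits are 2**22, 2**23 and 2**24
respectively.

```c
#include <assert.h>

/* Modifier bits as powers of two.  The MOD_* names are hypothetical,
   chosen for this example only.  */
enum
{
  MOD_ALT   = 1 << 22,  /* \A-  #x400000  */
  MOD_SUPER = 1 << 23,  /* \s-  #x800000  */
  MOD_HYPER = 1 << 24,  /* \H-  #x1000000 */
  MOD_SHIFT = 1 << 25,  /* \S-  #x2000000 */
  MOD_CTRL  = 1 << 26,  /* \C-  #x4000000, for non-ASCII characters */
  MOD_META  = 1 << 27   /* \M-  #x8000000 */
};

/* Combine a bare character code with one or more modifier bits.
   The character itself must fit below the lowest modifier bit.  */
static int
with_modifiers (int ch, int mods)
{
  assert (ch >= 0 && ch < MOD_ALT);
  return ch | mods;
}
```

With these definitions, with_modifiers ('a', MOD_META) is #x8000061 on
an ASCII system, which is what ?\M-a reads as.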

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 18:33                   ` Aidan Kehoe
  2006-05-05 18:42                     ` Oliver Scholz
@ 2006-05-05 21:37                     ` Eli Zaretskii
  1 sibling, 0 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-05 21:37 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Fri, 5 May 2006 20:33:56 +0200
> Cc: emacs-devel@gnu.org
> 
>  > I am afraid that is not relevant here. This just tells Emacs not to
>  > deal with UTF-8 encoded CJK characters and then tells it to deal with
>  > such a character.
> 
> It byte compiles a file, changes a local setting, and byte-compiles the file
> again with a different result. That is relevant to Eli's question. 

It's not necessarily relevant, because I didn't mean theoretical
exercises, I meant normal byte-compiling of Lisp files which just
happen to have \u escapes in them.  Such files usually won't be
encoded in some arbitrary encoding.  Use of 8-bit \nnn characters is
also discouraged due to the ambiguity of their interpretation.

Emacs gives us enough rope to hang ourselves, but that doesn't mean we
should actually do that whenever we have a few moments of free time ;-)

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 19:05               ` Richard Stallman
@ 2006-05-05 21:43                 ` Eli Zaretskii
  2006-05-06 14:25                   ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-05 21:43 UTC (permalink / raw)
  Cc: kehoea, emacs-devel, monnier, handa

> From: Richard Stallman <rms@gnu.org>
> CC: handa@m17n.org, kehoea@parhasard.net, monnier@iro.umontreal.ca,
> 	emacs-devel@gnu.org
> Date: Fri, 05 May 2006 15:05:30 -0400
> 
>     What happens when a Lisp file is byte-compiled--do we want the result
>     to depend on the local settings?
> 
> No, we want that to depend only on local variables of the file itself.

So I think this means that if we eventually decide to use decode-char,
we should at least explicitly bind utf-fragment-on-decoding so as to
disable this translation.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 20:53     ` Richard Stallman
  2006-04-30 21:04       ` Andreas Schwab
  2006-04-30 21:56       ` Aidan Kehoe
@ 2006-05-05 23:15       ` Juri Linkov
  2006-05-06 23:36         ` Richard Stallman
  2 siblings, 1 reply; 202+ messages in thread
From: Juri Linkov @ 2006-05-05 23:15 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

>     They are both fixed-length expressions, which is good, because
>     people get into the habit of typing "\u0123As I walked out one
>     evening" instead of the more disastrous "\u123As I walked out one
>     evening".
>
> I see, you are talking about using them in strings.
> Still, I don't like having both \u and \U--it is ugly.

Are there reasons not to use Perl's notation for Unicode characters,
i.e. "\x{...}"?  The Unicode code point of the desired character is
placed in the braces in hexadecimal, and has no fixed length.
Examples: "\x{DF}", "\x{0448}", "\x{001D0ED}".

I think this notation is well suited to Emacs, because \x{...}
indicates that a hexadecimal value is expected in the braces.
And in the Unicode branch it will be just another way to specify
a hexadecimal value with a variable length.
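
To make the proposal concrete, here is a rough sketch of how a reader
might parse the braced form. It is an assumption-laden illustration, not
code from Emacs or Perl: `parse_braced_hex' and `hex_digit_value' are
hypothetical names, the input is taken to be the text immediately after
"\x", and rejecting values above #x10FFFF is a choice made for this
example.

```c
#include <assert.h>
#include <stddef.h>

/* Return the value of hex digit C, or -1 if C is not a hex digit.  */
static int
hex_digit_value (int c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

/* Parse "{HEX...}" at the start of S (the text following "\x").
   On success, store the code point in *OUT and return the number of
   characters consumed, including both braces.  Return 0 if the text
   is malformed: no opening brace, no digits, a non-hex character, a
   missing closing brace, or a value above #x10FFFF.  */
static size_t
parse_braced_hex (const char *s, unsigned long *out)
{
  unsigned long value = 0;
  size_t i = 1;

  assert (s != NULL && out != NULL);
  if (s[0] != '{' || s[1] == '}')
    return 0;
  for (; s[i] != '}'; i++)
    {
      /* A terminating NUL is not a hex digit, so an unclosed brace
         is rejected here as well.  */
      int d = hex_digit_value ((unsigned char) s[i]);

      if (d < 0)
        return 0;
      value = value * 16 + (unsigned long) d;
      if (value > 0x10FFFF)
        return 0;
    }
  *out = value;
  return i + 1;
}
```

Because the closing brace delimits the number, text that follows the
escape cannot be swallowed by it, whatever its first character is.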

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05  9:33                 ` Oliver Scholz
  2006-05-05 10:02                   ` Oliver Scholz
  2006-05-05 18:33                   ` Aidan Kehoe
@ 2006-05-06 14:24                   ` Richard Stallman
  2006-05-06 15:01                     ` Oliver Scholz
       [not found]                     ` <877j4z5had.fsf@gmx.de>
  2 siblings, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-06 14:24 UTC (permalink / raw)
  Cc: emacs-devel

    I guess, you have a similar problem when the source file is encoded in
    either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and
    Thai?). Unless I am much mistaken, the encoding of the characters in
    the .elc file would also depend on the value of
    `utf-fragment-on-decoding'.

Are you talking about how the compiler would write the .elc file?
Or are you talking about how the .elc file would be interpreted?

If it is the latter, I don't think so.  Fload will disregard this
variable because Fload does not do decoding in the usual way.

The compiler should output the file in the representation that
Fload will read, and it should do so by binding any relevant variables.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 19:57                     ` Aidan Kehoe
@ 2006-05-06 14:25                       ` Richard Stallman
  2006-05-06 17:26                         ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-06 14:25 UTC (permalink / raw)
  Cc: emacs-devel

    All of these are incompatibilities on the Emacs Lisp side; except for the
    Unicode escapes, a C programmer can use any C escape desired in Emacs Lisp.

That being so, I think it is useful to keep that true, and implement
\u and \U in a way that is compatible with C.

We could install this now if someone writes changes for etc/NEWS and
the Lisp manual, as well as the code.
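
For comparison, this is what the C99 side of that compatibility looks
like. The snippet is a small illustration, not part of any patch;
`encoded_length' is a hypothetical helper, and the byte counts mentioned
afterwards assume the compiler's execution character set is UTF-8 (the
GCC default), which the C standard does not guarantee.

```c
#include <assert.h>
#include <string.h>

/* C99 universal character names: \u takes exactly four hex digits,
   \U exactly eight.  */
static const char sha[] = "\u0428";        /* CYRILLIC CAPITAL LETTER SHA */
static const char alpha[] = "\U0001D6E2";  /* MATHEMATICAL ITALIC CAPITAL ALPHA */

/* Number of bytes the compiler emitted for the literal S, not counting
   the terminating NUL.  The result depends on the execution character
   set.  */
static size_t
encoded_length (const char *s)
{
  assert (s != NULL);
  return strlen (s);
}
```

Under the UTF-8 assumption, encoded_length (sha) is 2 and
encoded_length (alpha) is 4.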

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 21:43                 ` Eli Zaretskii
@ 2006-05-06 14:25                   ` Richard Stallman
  0 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-06 14:25 UTC (permalink / raw)
  Cc: kehoea, handa, monnier, emacs-devel

    So I think this means that if we eventually decide to use decode-char,
    we should at least explicitly bind utf-fragment-on-decoding so as to
    disable this translation.

Yes.  The byte compiler should bind all variables that users are
likely to change that would affect the way a file is turned into Lisp
code.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-06 14:24                   ` Richard Stallman
@ 2006-05-06 15:01                     ` Oliver Scholz
       [not found]                     ` <877j4z5had.fsf@gmx.de>
  1 sibling, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-06 15:01 UTC (permalink / raw)
  Cc: Richard Stallman

Richard Stallman <rms@gnu.org> writes:

>     I guess, you have a similar problem when the source file is encoded in
>     either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and
>     Thai?). Unless I am much mistaken, the encoding of the characters in
>     the .elc file would also depend on the value of
>     `utf-fragment-on-decoding'.

[Note the correction in a follow-up of mine: I made a mistake in
phrasing that paragraph. I am actually talking about Elisp source
files encoded in UTF-8 (or another UCS encoding) that contain
characters from the repertoire of ISO 8859-5 or ISO 8859-7.

There's a similar case with source files encoded in some of the ISO
8859 encodings and `unify-8859-on-decoding-mode', though.]

> Are you talking about how the compiler would write the .elc file?
> Or are you talking about how the .elc file would be interpreted?

It's the former. Meanwhile I have tested it. After some more thought I
think that the case of UTF-8 encoded source files containing
characters from the Greek or Cyrillic repertoires is in fact entirely
analogous to what would happen with \u.

In other words: that particular bug is already there.


    Oliver, who still thinks that \u and \U are really ugly.
-- 
17 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-06 14:25                       ` Richard Stallman
@ 2006-05-06 17:26                         ` Aidan Kehoe
  2006-05-07  5:01                           ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-06 17:26 UTC (permalink / raw)
  Cc: emacs-devel


 On the sixth day of May, Richard Stallman wrote:

 >     All of these are incompatibilities on the Emacs Lisp side; except for
 >     the Unicode escapes, a C programmer can use any C escape desired in
 >     Emacs Lisp.
 > 
 > That being so, I think it is useful to keep that true, and implement
 > \u and \U in a way that is compatible with C.
 > 
 > We could install this now if someone writes changes for etc/NEWS and
 > the Lisp manual, as well as the code.

Okay. I’ve already signed papers; the patch below includes updates
to the NEWS file, the code and the Lisp manual. 

One mostly open question, which the below patch takes a clear stand on, is
whether it is acceptable to call decode-char (which is implemented in Lisp)
from the Lisp reader. I share Stefan Monnier’s judgement on this:

“I'd vote to keep the code in elisp.  After all, it's there, it works, and
as mentioned: there's no evidence that the decoding time of \u escapes is
ever going to need to be fast.  And it'll become fast in Emacs-unicode
anyway, so it doesn't seem to be worth the trouble.” 

I have no objection to implementing decode-char in C in general; it would
mean that handle_one_event in xterm.c could be made much more robust, for
example. It is currently the case that Unicode keysyms are handled
inconsistently with the Unicode coding systems and that code points above
#xFFFF are simply dropped; it doesn’t even try to convert them to Emacs
characters. But integrating it into Emacs for the sake of this patch seems
to carry too much potential instability for too little benefit.

Another thing; if this patch is to be integrated, there is some Lisp in the
source tree using \u in strings (incorrectly) that will need to be changed
to use \\u.

etc/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* NEWS:
	Describe the Unicode string and character escape
	

lispref/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* objects.texi (Character Type):
        Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF 
        specifies Unicode characters U+ABCD and U+ABCDEF respectively.  


src/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape):
        Provide a Unicode character escape syntax; \u followed by exactly 
        four or \U followed by exactly eight hex digits in a character or 
        string is read as a Unicode character with that code point.  
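
For readers who want to see the digit-counting rule in isolation, here is a
minimal standalone sketch in C. It is not the patch itself: the names
parse_fixed_hex and parse_unicode_escape are hypothetical, and the sketch
works on a NUL-terminated string rather than on the reader's READCHAR
stream. It only illustrates the "exactly four or exactly eight hex digits"
rule, and it does not call decode-char.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of lread.c: read exactly NDIGITS
   hexadecimal digits from S and store their value in *CODE_POINT.
   Returns 1 on success, 0 if any of the first NDIGITS characters
   (including a terminating NUL) is not a hex digit.  */
int
parse_fixed_hex (const char *s, int ndigits, unsigned long *code_point)
{
  unsigned long value = 0;
  int i;

  assert (s != NULL && code_point != NULL);

  for (i = 0; i < ndigits; i++)
    {
      int c = (unsigned char) s[i];

      /* Explicit ASCII ranges, as in the patch, so the result does
         not depend on the C locale.  */
      if (c >= '0' && c <= '9')
        value = (value << 4) + (unsigned long) (c - '0');
      else if (c >= 'a' && c <= 'f')
        value = (value << 4) + (unsigned long) (c - 'a') + 10;
      else if (c >= 'A' && c <= 'F')
        value = (value << 4) + (unsigned long) (c - 'A') + 10;
      else
        return 0;
    }

  *code_point = value;
  return 1;
}

/* Parse the text that follows the backslash, e.g. "u20AC" or
   "U0001D6E2".  'u' takes four digits, 'U' takes eight.  */
int
parse_unicode_escape (const char *s, unsigned long *code_point)
{
  assert (s != NULL);

  if (s[0] == 'u')
    return parse_fixed_hex (s + 1, 4, code_point);
  if (s[0] == 'U')
    return parse_fixed_hex (s + 1, 8, code_point);
  return 0;
}
```

With this sketch, parse_unicode_escape ("u20AC", &cp) succeeds and stores
#x20AC, while "u20A" is rejected because only three hex digits precede the
end of the string.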
	

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi etc/NEWS

Index: etc/NEWS
===================================================================
RCS file: /sources/emacs/emacs/etc/NEWS,v
retrieving revision 1.1337
diff -u -u -r1.1337 NEWS
--- etc/NEWS	2 May 2006 01:47:57 -0000	1.1337
+++ etc/NEWS	6 May 2006 16:57:54 -0000
@@ -3772,6 +3772,13 @@
 been declared obsolete.
 
 +++
+*** New syntax: \uXXXX and \UXXXXXXXX specify Unicode code points in hex.
+Use "\u0428" to specify a string consisting of CYRILLIC CAPITAL LETTER SHA,
+or "\U0001D6E2" to specify one consisting of MATHEMATICAL ITALIC CAPITAL
+ALPHA (the latter is greater than #xFFFF and thus needs the longer
+syntax). Also available for characters. 
+
++++
 ** Displaying warnings to the user.
 
 See the functions `warn' and `display-warning', or the Lisp Manual.
Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.53
diff -u -u -r1.53 objects.texi
--- lispref/objects.texi	1 May 2006 15:05:48 -0000	1.53
+++ lispref/objects.texi	6 May 2006 16:57:56 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points.  @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.).  There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	6 May 2006 16:57:57 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u expects exactly four hex digits, \U exactly eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
 	return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: Up to eight hex chars */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+	 not arbitrarily in the source code as in some other languages. */
+      {
+	int i = 0;
+	int count = 0;
+	Lisp_Object lisp_char;
+	while (++count <= unicode_hex_count)
+	  {
+	    c = READCHAR;
+	    /* isdigit(), isalpha() may be locale-specific, which we don't
+	       want. */
+	    if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+	    else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+	    else
+	      {
+		error ("Non-hex digit used for Unicode escape");
+		break;
+	      }
+	  }
+
+	lisp_char = call2(intern("decode-char"), intern("ucs"),
+			  make_number(i));
+
+	if (EQ(Qnil, lisp_char))
+	  {
+	    /* This is ugly and horrible and trashes the user's data. */
+	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, 
+				       34 + 128, 46 + 128));
+            return i;
+	  }
+	else
+	  {
+	    return XFASTINT (lisp_char);
+	  }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c = read_multibyte (c, readcharfun);

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-05 23:15       ` Juri Linkov
@ 2006-05-06 23:36         ` Richard Stallman
  2006-05-09 20:43           ` Juri Linkov
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-06 23:36 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

    Are there reasons not to use Perl's notation for Unicode characters,
    i.e. "\x{...}"?  The Unicode code for the desired character,
    is placed in the braces in hexadecimal, and has no fixed length.
    Examples: "\x{DF}", "\x{0448}", "\x{001D0ED}".

We could support this form, as well as \u and \U for compatibility
with other languages.
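
To make the difference from the fixed-width \u and \U forms concrete, here
is a minimal sketch in C of how a variable-length, brace-delimited form
could be parsed. Nothing here comes from lread.c: the function name
parse_braced_hex is hypothetical, the input is assumed to be the
NUL-terminated text following "\x", and the upper bound of #x10FFFF is my
assumption about which values should be accepted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parse "{HEX...}" with one or more hex digits
   and store the value in *CODE_POINT.  Returns the number of
   characters consumed (braces included), or 0 on malformed input,
   an empty "{}", or a value above 0x10FFFF.  */
size_t
parse_braced_hex (const char *s, unsigned long *code_point)
{
  unsigned long value = 0;
  size_t i = 1;

  assert (s != NULL && code_point != NULL);

  if (s[0] != '{')
    return 0;

  for (; s[i] != '}'; i++)
    {
      int c = (unsigned char) s[i];
      unsigned long digit;

      if (c >= '0' && c <= '9')
        digit = (unsigned long) (c - '0');
      else if (c >= 'a' && c <= 'f')
        digit = (unsigned long) (c - 'a') + 10;
      else if (c >= 'A' && c <= 'F')
        digit = (unsigned long) (c - 'A') + 10;
      else
        return 0;               /* Also catches an unterminated "{12".  */

      value = (value << 4) + digit;

      /* Checking after every digit keeps VALUE small, so the shift
         above can never overflow.  */
      if (value > 0x10FFFFUL)
        return 0;
    }

  if (i == 1)
    return 0;                   /* "{}" contains no digits.  */

  *code_point = value;
  return i + 1;
}
```

Under these assumptions the three examples from the quoted message,
"{DF}", "{0448}" and "{001D0ED}", yield #xDF, #x448 and #x1D0ED
respectively; the leading zeros are accepted but not required.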

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
       [not found]                     ` <877j4z5had.fsf@gmx.de>
@ 2006-05-07  5:00                       ` Richard Stallman
  2006-05-07 12:38                         ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-07  5:00 UTC (permalink / raw)
  Cc: emacs-devel, handa

    > Are you talking about how the compiler would write the .elc file?
    > Or are you talking about how the .elc file would be interpreted?

    It's the former. Meanwhile I have tested it. After some more thought I
    think that the case of UTF-8 encoded source files containing
    characters from the Greek or Cyrillic repertoires is in fact entirely
    analogous to what would happen with \u.

    In other words: that particular bug is already there.

Handa-san, could you please investigate the bug?
Oliver, could you email Handa-san a description of the
bug, in the form which already exists?  Can you find a test case
which fails in the present development sources?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-06 17:26                         ` Aidan Kehoe
@ 2006-05-07  5:01                           ` Richard Stallman
  2006-05-07  6:38                             ` Aidan Kehoe
  2006-05-07 16:50                             ` Aidan Kehoe
  0 siblings, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-07  5:01 UTC (permalink / raw)
  Cc: emacs-devel

    Okay. I’ve already signed papers;

When was that?  The only papers recorded in our file for you are for Gnus.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07  5:01                           ` Richard Stallman
@ 2006-05-07  6:38                             ` Aidan Kehoe
  2006-05-07  7:00                               ` David Kastrup
  2006-05-07 16:50                             ` Aidan Kehoe
  1 sibling, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-07  6:38 UTC (permalink / raw)
  Cc: emacs-devel


 On the seventh of May, Richard Stallman wrote: 

 >     Okay. I’ve already signed papers;
 > 
 > When was that?  The only papers recorded in our file for you are for Gnus.

Those papers say:

1.(a) Developer hereby agrees to assign and does hereby assign to FSF
Developer’s copyright in changes and/or enhancements to the suite of programs
known as GNU Emacs (herein called the Program), including any accompanying
documentation files and supporting files as well as the actual program code.
These changes and/or enhancements are herein called the Works. 

(b) The assignment of par. 1(a) above applies to all past and future works
of Developer that constitute changes and enhancements to the Program. 

[...]

Do you mean to say that agreement does not cover changes in C? 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07  6:38                             ` Aidan Kehoe
@ 2006-05-07  7:00                               ` David Kastrup
  2006-05-07  7:15                                 ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: David Kastrup @ 2006-05-07  7:00 UTC (permalink / raw)
  Cc: rms, emacs-devel

Aidan Kehoe <kehoea@parhasard.net> writes:

>  On the seventh of May, Richard Stallman wrote: 
>
>  >     Okay. I’ve already signed papers;
>  > 
>  > When was that?  The only papers recorded in our file for you are for Gnus.
>
> Those papers say:
>
> 1.(a) Developer hereby agrees to assign and does hereby assign to FSF
> Developer’s copyright in changes and/or enhancements to the suite of programs
> known as GNU Emacs (herein called the Program), including any accompanying
> documentation files and supporting files as well as the actual program code.
> These changes and/or enhancements are herein called the Works. 
>
> (b) The assignment of par. 1(a) above applies to all past and future works
> of Developer that constitute changes and enhancements to the Program. 
>
> [...]
>
> Do you mean to say that agreement does not cover changes in C? 

Oh, it would.  But in the record on electronic file, your only listing
is under "GNUS".  Did you sign several assignments or just one?  In
either case, there probably has been some oversight by the copyright
clerk, or your signed copy did not reach the FSF for some reason.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07  7:00                               ` David Kastrup
@ 2006-05-07  7:15                                 ` Aidan Kehoe
  0 siblings, 0 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-07  7:15 UTC (permalink / raw)
  Cc: rms, emacs-devel


 On the seventh of May, David Kastrup wrote: 

 > > Those papers say:
 > >
 > > 1.(a) Developer hereby agrees to assign and does hereby assign to FSF
 > > Developer’s copyright in changes and/or enhancements to the suite of
 > > programs known as GNU Emacs (herein called the Program), including any
 > > accompanying documentation files and supporting files as well as the
 > > actual program code. These changes and/or enhancements are herein
 > > called the Works.
 > >
 > > (b) The assignment of par. 1(a) above applies to all past and future
 > > works of Developer that constitute changes and enhancements to the
 > > Program.
 > >
 > > [...]
 > >
 > > Do you mean to say that agreement does not cover changes in C? 
 > 
 > Oh, it would.  But in the record on electronic file, your only listing
 > is under "GNUS".  Did you sign several assignments or just one?  

Just one. 

 > In either case, there probably has been some oversight by the copyright
 > clerk, or your signed copy did not reach the FSF for some reason.

My signed copy certainly reached the FSF. I copied out the above text from
the courtesy copy posted back by them. 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07  5:00                       ` Richard Stallman
@ 2006-05-07 12:38                         ` Kenichi Handa
  2006-05-07 21:26                           ` Oliver Scholz
  2006-05-08  7:36                           ` Richard Stallman
  0 siblings, 2 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-07 12:38 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

In article <E1FcbO2-0002U6-0r@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>> Are you talking about how the compiler would write the .elc file?
>> Or are you talking about how the .elc file would be interpreted?

>     It's the former. Meanwhile I have tested it. After some more thought I
>     think that the case of UTF-8 encoded source files containing
>     characters from the Greek or Cyrillic repertoires is in fact entirely
>     analogous to what would happen with \u.

>     In other words: that particular bug is already there.

> Handa-san, could you please investigate the bug?
> Oliver, could you email Handa-san a description of the
> bug, in the form which already exists?  Can you find a test case
> which fails in the present development sources?

When you byte-compile an x.el file, the file is first
decoded.  How it is decoded depends on many things,
and so, of course, the resulting x.elc files
differ.  If you call that a bug, I think there's no way
to fix it.

The very simple testcase is this:

(progn
  (let ((str "(setq x \"\300\300\")\n")
	(coding-system-for-write 'no-conversion))
    (write-region str nil "~/test1.el")
    (write-region str nil "~/test2.el"))
  (set-language-environment "Latin-1")
  (byte-compile-file "~/test1.el")
  (set-language-environment "Japanese")
  (byte-compile-file "~/test2.el"))

Although the source files are exactly the same, the
resulting test1.elc contains a string of two Latin-1
characters whereas test2.elc contains a string of a single
Japanese character.
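
Why the same two bytes can come out as two characters or as one can be
sketched in C. This is an illustration only and makes assumptions the
message does not state: that the Japanese language environment ends up
decoding the file with an EUC-JP-style coding system, and that a simplified
model of EUC-JP (two-byte characters only, no SS2/SS3 sequences) is good
enough to show the effect. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Latin-1: every byte is exactly one character.  */
size_t
count_chars_latin1 (const unsigned char *bytes, size_t len)
{
  assert (bytes != NULL || len == 0);
  return len;
}

/* Simplified EUC-JP model (an assumption, see above): two
   consecutive bytes in the range 0xA1..0xFE form one two-byte
   character; any other byte is counted as one character.  */
size_t
count_chars_eucjp_simplified (const unsigned char *bytes, size_t len)
{
  size_t i = 0;
  size_t count = 0;

  assert (bytes != NULL || len == 0);

  while (i < len)
    {
      if (bytes[i] >= 0xA1 && bytes[i] <= 0xFE
          && i + 1 < len
          && bytes[i + 1] >= 0xA1 && bytes[i + 1] <= 0xFE)
        i += 2;                 /* One two-byte character.  */
      else
        i += 1;                 /* ASCII or a stray byte.  */
      count++;
    }

  return count;
}
```

The string "\300\300" in the test case is the byte pair 0xC0 0xC0. The
first function counts two characters for it and the second counts one,
which matches the difference between test1.elc and test2.elc described
above.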

I hope I misunderstand what you claim here.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07  5:01                           ` Richard Stallman
  2006-05-07  6:38                             ` Aidan Kehoe
@ 2006-05-07 16:50                             ` Aidan Kehoe
  2006-05-08 22:28                               ` Richard Stallman
  1 sibling, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-07 16:50 UTC (permalink / raw)
  Cc: emacs-devel


 On the seventh of May, Richard Stallman wrote: 

 >     Okay. I’ve already signed papers;
 > 
 > When was that?  The only papers recorded in our file for you are for Gnus.

To be clearer: I’ve signed a declaration of assignment for Gnus, and that
declaration is headed “ASSIGNMENT - GNU Gnus” and contains the following
text:

  1.(a) Developer hereby agrees to assign and does hereby assign to FSF
  Developer’s copyright in changes and/or enhancements to the suite of programs
  known as GNU Emacs (herein called the Program), including any accompanying
  documentation files and supporting files as well as the actual program code.
  These changes and/or enhancements are herein called the Works. 
 
  (b) The assignment of par. 1(a) above applies to all past and future works
  of Developer that constitute changes and enhancements to the Program. 

Now, if Gnus is not to be interpreted as one of the “suite of programs known
as GNU Emacs,” then I need to sign a separate declaration of assignment for
Gnus. In that event, please send me one, by email or by post; my current
address is

Wisbyerstr. 10c
10439 Berlin 
Germany

Best regards

	- Aidan
-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07 12:38                         ` Kenichi Handa
@ 2006-05-07 21:26                           ` Oliver Scholz
  2006-05-08  1:14                             ` Kenichi Handa
  2006-05-08 22:29                             ` Richard Stallman
  2006-05-08  7:36                           ` Richard Stallman
  1 sibling, 2 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-07 21:26 UTC (permalink / raw)
  Cc: emacs-devel, rms, alkibiades

Kenichi Handa <handa@m17n.org> writes:

> In article <E1FcbO2-0002U6-0r@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

[...]
> When you byte-compile an x.el file, the file is first
> decoded.  How it is decoded depends on many things,
> and so, of course, the resulting x.elc files
> differ.

Yes, that's what I meant.

> If you say that is a bug, I think there's no way to fix it.
>
> The very simple testcase is this:
>
> (progn
>   (let ((str "(setq x \"\300\300\")\n")
> 	(coding-system-for-write 'no-conversion))
>     (write-region str nil "~/test1.el")
>     (write-region str nil "~/test2.el"))
>   (set-language-environment "Latin-1")
>   (byte-compile-file "~/test1.el")
>   (set-language-environment "Japanese")
>   (byte-compile-file "~/test2.el"))

That's not exactly what I meant. This happens basically because Emacs
has no indication on how to decode that file properly. Here's a test
case for what I had in mind:

(let ((str1 (format "\
;; -*- coding: utf-8 -*-
\(defvar my-string \"The Greek letter alpha: %c\")" (decode-char 'ucs #x3B1)))
      (str2 (format "\
;; -*- coding: iso-8859-7 -*-
\(defvar my-string \"The Greek letter alpha: %c\")" (decode-char 'ucs #x3B1))))

  (let ((coding-system-for-write 'utf-8))
    (write-region str1 nil "~/fragment-test-1.el")
    (write-region str1 nil "~/fragment-test-2.el"))

  (let ((coding-system-for-write 'iso-8859-7))
    (write-region str2 nil "~/unify-test-1.el")
    (write-region str2 nil "~/unify-test-2.el"))

  (unify-8859-on-decoding-mode -1)
  (byte-compile-file "~/unify-test-1.el") ; ch. 2913 from
					  ; greek-iso8859-7
  (unify-8859-on-decoding-mode 1)
  (byte-compile-file "~/unify-test-2.el") ; ch. 332721 from
					  ; mule-unicode-0100-24ff

  ;; Assuming `utf-fragment-on-decoding' is nil.
  (byte-compile-file "~/fragment-test-1.el") ; ch. 332721 from
					     ; mule-unicode-0100-24ff

  ;; AFAICS there is no way to change the settings associated with
  ;; `utf-fragment-on-decoding' programmatically. However, the
  ;; following (taken from the variable's `defcustom' declaration)
  ;; should have the same effect as customizing it.
  (progn
    (define-translation-table 'utf-translation-table-for-decode
      utf-fragmentation-table)
    (unless (eq (get 'utf-translation-table-for-encode
                     'translation-table)
                ucs-mule-to-mule-unicode)
      (define-translation-table 'utf-translation-table-for-encode
        utf-defragmentation-table)))
  (byte-compile-file "~/fragment-test-2.el") ; ch. 2913 from
					     ; greek-iso8859-7
  )


As Richard wrote, the fix would be to change the settings to
their default, unless the files set a specific variable.

But given the work this would require and given that the value of
changing the defaults is IMO somewhat dubious, you could as well
just document it in etc/PROBLEMS.


    Oliver
-- 
Oliver Scholz               18 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07 21:26                           ` Oliver Scholz
@ 2006-05-08  1:14                             ` Kenichi Handa
  2006-05-08 22:29                             ` Richard Stallman
  1 sibling, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-08  1:14 UTC (permalink / raw)
  Cc: emacs-devel, rms, alkibiades

In article <87irohfrx1.fsf@gmx.de>, Oliver Scholz <alkibiades@gmx.de> writes:

>> (progn
>> (let ((str "(setq x \"\300\300\")\n")
>> (coding-system-for-write 'no-conversion))
>> (write-region str nil "~/test1.el")
>> (write-region str nil "~/test2.el"))
>> (set-language-environment "Latin-1")
>> (byte-compile-file "~/test1.el")
>> (set-language-environment "Japanese")
>> (byte-compile-file "~/test2.el"))

> That's not exactly what I meant. This happens basically because Emacs
> has no indication on how to decode that file properly. Here's a test
> case for what I had in mind:

The underlying problem is the same.  In your test case
too, even if you add coding: tags, the exact decoding
varies depending on many other things, and thus the resulting
*.elc files differ.

[...]
> As Richard wrote, the fix would be to change the settings to
> their default, unless the files set a specific variable.

Then you'll get different results in these two cases:

(1) visit *.el and M-x eval-current-buffer
(2) byte-compile *.el and load *.elc.

I think that is more like a bug.

> But given the work this would require and given that the value of
> changing the defaults is IMO somewhat dubious, you could as well
> just document it in etc/PROBLEMS.

I agree that is the best solution.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-04 21:14             ` Aidan Kehoe
@ 2006-05-08  1:31               ` Kenichi Handa
  2006-05-08  6:54                 ` Aidan Kehoe
  2006-05-08 13:55                 ` Stefan Monnier
  0 siblings, 2 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-08  1:31 UTC (permalink / raw)
  Cc: schwab, emacs-devel

In article <17498.28361.392872.954484@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes:

>> Kevin Rodgers <ihs_4664@yahoo.com> writes:
>> 
>> > readcharfun is declared as a Lisp_Object in read1, so it should be
>> > possible to check its type and only GCPRO when necessary.
>> 
>> I don't see any need to GCPRO readcharfun.  When called from Lisp the
>> arguments are already protected by being part of the call frame, and all
>> uses from C protect the object by other means (eg, by being put on
>> eval-buffer-list).

> That was my understanding of the code too. 

For instance, Fread is called from Fcall_interactively as
below:

		Lisp_Object tem;
[...]
		tem = Fread_from_minibuffer (build_string (callint_message),
					     Qnil, Qnil, Qnil, Qnil, Qnil,
					     Qnil, Qnil);
		if (! STRINGP (tem) || SCHARS (tem) == 0)
		  args[i] = Qnil;
		else
		  args[i] = Fread (tem);

In the calling sequence of
Fread->read_internal_start->read0->read1, I see no place
where the original `tem' is GCPROed.  Do I overlook
something?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08  1:31               ` Kenichi Handa
@ 2006-05-08  6:54                 ` Aidan Kehoe
  2006-05-08 13:55                 ` Stefan Monnier
  1 sibling, 0 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-08  6:54 UTC (permalink / raw)
  Cc: schwab, emacs-devel


 On the eighth of May, Kenichi Handa wrote: 

 > >> > readcharfun is declared as a Lisp_Object in read1, so it should be
 > >> > possible to check its type and only GCPRO when necessary.
 > >> 
 > >> I don't see any need to GCPRO readcharfun. When called from Lisp the
 > >> arguments are already protected by being part of the call frame, and
 > >> all uses from C protect the object by other means (eg, by being put on
 > >> eval-buffer-list).
 > 
 > > That was my understanding of the code too. 
 > 
 > For instance, Fread is called from Fcall_interactively as
 > below:
 > 
 > 		Lisp_Object tem;
 > [...]
 > 		tem = Fread_from_minibuffer (build_string (callint_message),
 > 					     Qnil, Qnil, Qnil, Qnil, Qnil,
 > 					     Qnil, Qnil);
 > 		if (! STRINGP (tem) || SCHARS (tem) == 0)
 > 		  args[i] = Qnil;
 > 		else
 > 		  args[i] = Fread (tem);
 > 
 > In the calling sequence of Fread->read_internal_start->read0->read1, I
 > see no place where the original `tem' is GCPROed. Do I overlook
 > something?

I believe not; it does need to be protected. 

Also, my understanding of the above code is that build_string allocates
memory for a Lisp string, that is not visible from Lisp, and that will not
be GCPROed. So if garbage collection happens during Fread_from_minibuffer,
it may disappear. Ben Wing, in the XEmacs internals manual, says this: 

12. Be careful of traps, like calling `Fcons()' in the argument to another
function.  By the "caller protects" law, you should be `GCPRO'ing the
newly-created cons, but you aren't.  A certain number of functions that are
commonly called on freshly created stuff (e.g. `nconc2()', `Fsignal()')
break the "caller protects" law and go ahead and `GCPRO' their arguments so
as to simplify things, but make sure and check if it's OK whenever doing
something like this. 

This seems to me equivalent to calling Fcons in the argument to another
function. Is GNU Emacs different in this?

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07 12:38                         ` Kenichi Handa
  2006-05-07 21:26                           ` Oliver Scholz
@ 2006-05-08  7:36                           ` Richard Stallman
  2006-05-08  7:50                             ` Kenichi Handa
  1 sibling, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-08  7:36 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

      (set-language-environment "Latin-1")
      (byte-compile-file "~/test1.el")
      (set-language-environment "Japanese")
      (byte-compile-file "~/test2.el"))

    Although the source files are exactly the same, the
    resulting test1.elc contains a string of two Latin-1
    characters whereas the test2.elc contains a string of single
    Japanese character.

Is the difference due solely to the choice of coding system for
decoding the file?  That heuristic choice of coding system depends on
lots of things, but Lisp files can prevent variation by specifying
-*- coding: ...; -*-.

When the file does that, does that eliminate the problem?

Anyway, is the specific problem I asked you to look at a matter of
choice of coding system?  (I don't know the details, since I don't
know what that variable does--I just know it relates to Mule.)

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08  7:36                           ` Richard Stallman
@ 2006-05-08  7:50                             ` Kenichi Handa
  0 siblings, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-08  7:50 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

In article <E1Fd0IY-0007sF-KE@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>       (set-language-environment "Latin-1")
>       (byte-compile-file "~/test1.el")
>       (set-language-environment "Japanese")
>       (byte-compile-file "~/test2.el"))

>     Although the source files are exactly the same, the
>     resulting test1.elc contains a string of two Latin-1
>     characters whereas the test2.elc contains a string of single
>     Japanese character.

> Is the difference due solely to the choice of coding system for
> decoding the file?

Yes, in the above example.

> That heuristic choice of coding system depends on
> lots of things, but Lisp files can prevent variation by specifying
> -*- coding: ...; -*-.

> When the file does that, does that eliminate the problem?

Yes, in the above example.

> Anyway, is the specific problem I asked you to look at a matter of
> choice of coding system?  (I don't know the details, since I don't
> know what that variable does--I just know it relates to Mule.)

"A matter of choice of coding system" is just one of the
problems.  Even if a coding system is deterministically
chosen, there are several options that control the decoding
of utf-*.  And binding all of them to their default values
while byte-compiling leads to another problem, as I wrote in
the previous mail.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08  1:31               ` Kenichi Handa
  2006-05-08  6:54                 ` Aidan Kehoe
@ 2006-05-08 13:55                 ` Stefan Monnier
  2006-05-08 14:24                   ` Aidan Kehoe
  2006-05-09  0:36                   ` Kenichi Handa
  1 sibling, 2 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-08 13:55 UTC (permalink / raw)
  Cc: Aidan Kehoe, schwab, emacs-devel

> For instance, Fread is called from Fcall_interactively as
> below:

> 		Lisp_Object tem;
> [...]
> 		tem = Fread_from_minibuffer (build_string (callint_message),
> 					     Qnil, Qnil, Qnil, Qnil, Qnil,
> 					     Qnil, Qnil);
> 		if (! STRINGP (tem) || SCHARS (tem) == 0)
> 		  args[i] = Qnil;
> 		else
> 		  args[i] = Fread (tem);

> In the calling sequence of
> Fread-> read_internal_start->read0->read1, I see no place
> where the original `tem' is GCPROed.  Do I overlook
> something?

Why would it need to be protected? it's not used afterwards.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 13:55                 ` Stefan Monnier
@ 2006-05-08 14:24                   ` Aidan Kehoe
  2006-05-08 15:32                     ` Stefan Monnier
  2006-05-09  0:36                   ` Kenichi Handa
  1 sibling, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-08 14:24 UTC (permalink / raw)
  Cc: emacs-devel


 On the eighth of May, Stefan Monnier wrote: 

 > > In the calling sequence of
 > > Fread-> read_internal_start->read0->read1, I see no place
 > > where the original `tem' is GCPROed.  Do I overlook
 > > something?
 > 
 > Why would it need to be protected? it's not used afterwards.

It can theoretically disappear in the middle of being used. With my patch,
if the string consisted of "\u20AC one two", Lisp will be called, the
garbage collector may be invoked, and the string overwritten, since to the
GC it’s not in use. Then the READCHAR -> retry loop may end up reading
incorrect data. 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 14:24                   ` Aidan Kehoe
@ 2006-05-08 15:32                     ` Stefan Monnier
  2006-05-08 16:39                       ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-08 15:32 UTC (permalink / raw)
  Cc: emacs-devel

>> > In the calling sequence of
>> > Fread-> read_internal_start->read0->read1, I see no place
>> > where the original `tem' is GCPROed.  Do I overlook
>> > something?
>> 
>> Why would it need to be protected? it's not used afterwards.

> It can theoretically disappear in the middle of being used. With my patch,
> if the string consisted of "\u20AC one two", Lisp will be called, the
> garbage collector may be invoked, and the string overwritten, since to the
> GC it’s not in use. Then the READCHAR -> retry loop may end up reading
> incorrect data. 

That's of no concern to Fcall_interactively.  It's Fread that should GCPRO its
argument when needed.  So it seems the bug is that read_internal_start calls
read0 (which can GC) and uses `stream' afterwards without having GCPRO'd it.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 15:32                     ` Stefan Monnier
@ 2006-05-08 16:39                       ` Aidan Kehoe
  2006-05-08 17:39                         ` Stefan Monnier
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-08 16:39 UTC (permalink / raw)
  Cc: emacs-devel


 On the eighth of May, Stefan Monnier wrote: 

 > >> > In the calling sequence of Fread-> read_internal_start -> read0 ->
 > >> > read1, I see no place where the original `tem' is GCPROed. Do I
 > >> > overlook something?
 > >> 
 > >> Why would it need to be protected? it's not used afterwards.
 > 
 > > It can theoretically disappear in the middle of being used. With my patch,
 > > if the string consisted of "\u20AC one two", Lisp will be called, the
 > > garbage collector may be invoked, and the string overwritten, since to the
 > > GC it’s not in use. Then the READCHAR -> retry loop may end up reading
 > > incorrect data. 
 > 
 > That's of no concern to Fcall_interactively.  It's Fread that should GCPRO its
 > argument when needed. 

Fread is intended to be called from Lisp (it’s a subr). Functions called
from Lisp do not need to GCPRO their arguments, because the garbage
collector knows about the arguments, as it knows about all objects allocated
in Lisp.

C code that calls functions intended to be called from Lisp is optimistic at
best if, without having checked, it relies on the assumption that the
arguments to those functions will be GCPROed.

 > So it seems the bug is that read_internal_start calls
 > read0 (which can GC) and uses `stream' afterwards without having GCPRO'd it.

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 16:39                       ` Aidan Kehoe
@ 2006-05-08 17:39                         ` Stefan Monnier
  2006-05-09  7:04                           ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-08 17:39 UTC (permalink / raw)
  Cc: emacs-devel

> Fread is intended to be called from Lisp (it’s a subr).  Functions called
> from Lisp do not need to GCPRO their arguments, because the garbage
> collector knows about the arguments, as it knows about all objects
> allocated in Lisp.

s/called/callable/
Are you sure we have such a convention?

> C code that calls functions intended to be called from Lisp is optimistic
> at best if, without having checked, it relies on the assumption that
> the arguments to those functions will be GCPROed.

As far as I know, the GCPRO convention for arguments is mostly the
following:

GCPRO args you pass to functions iff those functions can GC and you need
to use the arg after the function returns.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07 16:50                             ` Aidan Kehoe
@ 2006-05-08 22:28                               ` Richard Stallman
  0 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-08 22:28 UTC (permalink / raw)
  Cc: emacs-devel

Gnus IS part of GNU Emacs.  So the question is whether your assignment
covers only Gnus, or all of GNU Emacs.

It sounds now like it covers all of Emacs.  I will have the clerk
check it up.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-07 21:26                           ` Oliver Scholz
  2006-05-08  1:14                             ` Kenichi Handa
@ 2006-05-08 22:29                             ` Richard Stallman
  2006-05-09  3:42                               ` Eli Zaretskii
  2006-05-09  5:13                               ` Kenichi Handa
  1 sibling, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-08 22:29 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, handa

    As Richard wrote, the fix would be to change the settings to
    their default, unless the files set a specific variable.

    But given the work this would require and given that the value of
    changing the defaults is IMO somewhat dubious, you could as well
    just document it in etc/PROBLEMS.

We seem to be talking about two variables here:
unify-8859-on-decoding-mode and utf-fragment-on-decoding.  Are there
any others involved?

I do not know what those variables mean.  Do they affect the
choice of coding system?  Or do they take effect by altering
the meaning of a given coding system?

If it is the former, the Lisp source file can defend against this
problem by specifying coding in the -*- line.  We tell people to do this
in Lisp source files.

If it is the latter, there are two possible solutions:
1. to make the compiler bind these variables to their default values.
2. to tell people that all Lisp files for which this is relevant
   should specify these variables explicitly.

If it is just those two variables, I think #1 is easy and preferable.

Are there any other variables for which this arises?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 13:55                 ` Stefan Monnier
  2006-05-08 14:24                   ` Aidan Kehoe
@ 2006-05-09  0:36                   ` Kenichi Handa
  1 sibling, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-09  0:36 UTC (permalink / raw)
  Cc: kehoea, schwab, emacs-devel

In article <jwvwtcwzkn9.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> For instance, Fread is called from Fcall_interactively as
>> below:

>> Lisp_Object tem;
>> [...]
>> tem = Fread_from_minibuffer (build_string (callint_message),
>> Qnil, Qnil, Qnil, Qnil, Qnil,
>> Qnil, Qnil);
>> if (! STRINGP (tem) || SCHARS (tem) == 0)
>> args[i] = Qnil;
>> else
>> args[i] = Fread (tem);

>> In the calling sequence of
>> Fread-> read_internal_start->read0->read1, I see no place
>> where the original `tem' is GCPROed.  Do I overlook
>> something?

> Why would it need to be protected? it's not used afterwards.

It's not used in Fcall_interactively afterwards.  So
Fcall_interactively doesn't have to protect it.  But, read1
or read_escape have to protect the argument READCHARFUN,
don't they?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 22:29                             ` Richard Stallman
@ 2006-05-09  3:42                               ` Eli Zaretskii
  2006-05-09 20:41                                 ` Richard Stallman
  2006-05-09  5:13                               ` Kenichi Handa
  1 sibling, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-09  3:42 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

> From: Richard Stallman <rms@gnu.org>
> Date: Mon, 08 May 2006 18:29:13 -0400
> Cc: alkibiades@gmx.de, emacs-devel@gnu.org, handa@m17n.org
> 
> We seem to be talking about two variables here:
> unify-8859-on-decoding-mode and utf-fragment-on-decoding.  Are there
> any others involved?
> 
> I do not know what those variables mean.  Do they affect the
> choice of coding system?  Or do they take effect by altering
> the meaning of a given coding system?

They select the target character set.  When Emacs decodes text with
Latin or Cyrillic or Greek characters, it could produce either Unicode
charset or one of the ISO 8859 charsets.  These variables control
that.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 22:29                             ` Richard Stallman
  2006-05-09  3:42                               ` Eli Zaretskii
@ 2006-05-09  5:13                               ` Kenichi Handa
  2006-05-10  3:20                                 ` Richard Stallman
  1 sibling, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-09  5:13 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

In article <E1FdEE5-0008RE-Tp@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

> We seem to be talking about two variables here:
> unify-8859-on-decoding-mode and utf-fragment-on-decoding.

Their roles are as Eli wrote.

> Are there any others involved?

utf-translate-cjk-mode also plays a role in decoding utf-*.

> I do not know what those variables mean.  Do they affect the
> choice of coding system?  Or do they take effect by altering
> the meaning of a given coding system?

> If it is the former, the Lisp source file can defend against this
> problem by specifying coding in the -*- line.  We tell people to do this
> in Lisp source files.

> If it is the latter, there are two possible solutions:
> 1. to make the compiler bind these variables to their default values.
> 2. to tell people that all Lisp files for which this is relevant
>    should specify these variables explicitly.

The latter.

> If it is just those two variables, I think #1 is easy and preferable.

> Are there any other variables for which this arises?

Just setting those variables doesn't work; they should be
customized.  In addition, the default value of
utf-translate-cjk-mode is t, and which CJK charsets Han
characters of Unicode are decoded into depends on these:

(1) current-language-environment

(2) utf-translate-cjk-unicode-range (which also
should be customized to take effect),

(3) utf-translate-cjk-charsets

(4) the contents of the hash table ucs-unicode-to-mule-cjk
(users can freely express their preference for how to decode a
Unicode character by modifying this hash table).

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-08 17:39                         ` Stefan Monnier
@ 2006-05-09  7:04                           ` Aidan Kehoe
  2006-05-09 19:05                             ` Eli Zaretskii
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-09  7:04 UTC (permalink / raw)
  Cc: emacs-devel


 On the eighth of May, Stefan Monnier wrote:

 > > Fread is intended to be called from Lisp (it’s a subr).  Functions called
 > > from Lisp do not need to GCPRO their arguments, because the garbage
 > > collector knows about the arguments, as it knows about all objects
 > > allocated in Lisp.
 > 
 > s/called/callable/

The two are not mutually exclusive :-) . 

 > Are you sure we have such a convention?

That in particular is not really a convention, it’s part of the semantics of
the Lisp implementation. Objects visible to Lisp are visible to the garbage
collector, except in the very specific case where they’re only visible from
weak hash tables. 

 > > C code that calls functions intended to be called from Lisp is optimistic
 > > at best if, without having checked, it relies on the assumption that
 > > the arguments to those functions will be GCPROed.
 > 
 > As far as I know, the GCPRO convention for arguments is mostly the
 > following:
 > 
 > GCPRO args you pass to functions iff those functions can GC and you need
 > to use the arg after the function returns.

Okay. Do you know of any document detailing that? No-one followed up to my
reference to what Ben Wing writes on the subject. 

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-09  7:04                           ` Aidan Kehoe
@ 2006-05-09 19:05                             ` Eli Zaretskii
  2006-05-10  6:05                               ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-09 19:05 UTC (permalink / raw)
  Cc: emacs-devel

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Tue, 9 May 2006 09:04:50 +0200
> Cc: emacs-devel@gnu.org
> 
>  > As far as I know, the GCPRO convention for arguments is mostly the
>  > following:
>  > 
>  > GCPRO args you pass to functions iff those functions can GC and you need
>  > to use the arg after the function returns.
> 
> Okay. Do you know of any document detailing that?

Does the excerpt below from the Lisp manual answer your concerns?

> No-one followed up to my reference to what Ben Wing writes on the
> subject.

AFAIU, he is wrong, or at least inaccurate.  But maybe I misunderstand
something.

From (elisp)Writing Emacs Primitives:

       Within the function `For' itself, note the use of the macros
    `GCPRO1' and `UNGCPRO'.  `GCPRO1' is used to "protect" a variable from
    garbage collection--to inform the garbage collector that it must look
    in that variable and regard its contents as an accessible object.  This
    is necessary whenever you call `Feval' or anything that can directly or
    indirectly call `Feval'.  At such a time, any Lisp object that you
    intend to refer to again must be protected somehow.  `UNGCPRO' cancels
    the protection of the variables that are protected in the current
    function.  It is necessary to do this explicitly.

       It suffices to ensure that at least one pointer to each object is
    GC-protected; as long as the object is not recycled, all pointers to it
    remain valid.  So if you are sure that a local variable points to an
    object that will be preserved by some other pointer, that local
    variable does not need a `GCPRO'.  (Formerly, strings were an exception
    to this rule; in older Emacs versions, every pointer to a string needed
    to be marked by GC.)
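
For readers who want something concrete, the rule in the excerpt above can
be illustrated with a toy mark-and-sweep model.  This is only a sketch under
stated assumptions: every name in it (toy_object, toy_gcpro, toy_ungcpro,
toy_gc, toy_callee_that_can_gc) is hypothetical and is not the real Emacs
implementation of GCPRO1/UNGCPRO.  It shows only the idea that a collector
which scans registered C variables keeps what is registered and may recycle
what is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are real Emacs internals. */
#define MAX_OBJECTS 16
#define MAX_ROOTS   16

typedef struct { bool live; bool marked; int value; } toy_object;

static toy_object heap[MAX_OBJECTS];
static toy_object **roots[MAX_ROOTS];  /* addresses of protected C variables */
static size_t root_top;

/* Allocate a fresh object; returns NULL when the toy heap is full. */
toy_object *toy_alloc(int value) {
    for (size_t i = 0; i < MAX_OBJECTS; i++)
        if (!heap[i].live) {
            heap[i] = (toy_object){ .live = true, .marked = false, .value = value };
            return &heap[i];
        }
    return NULL;
}

/* Loose analogue of GCPRO1: register the address of a local variable. */
void toy_gcpro(toy_object **var) {
    assert(root_top < MAX_ROOTS);
    roots[root_top++] = var;
}

/* Loose analogue of UNGCPRO: drop the most recently registered variable. */
void toy_ungcpro(void) {
    assert(root_top > 0);
    root_top--;
}

/* Mark what the registered variables point to, free everything else.
   Returns the number of objects freed. */
size_t toy_gc(void) {
    size_t freed = 0;
    for (size_t i = 0; i < root_top; i++)
        if (*roots[i] != NULL)
            (*roots[i])->marked = true;
    for (size_t i = 0; i < MAX_OBJECTS; i++) {
        if (heap[i].live && !heap[i].marked) {
            heap[i].live = false;
            freed++;
        }
        heap[i].marked = false;  /* reset for the next collection */
    }
    return freed;
}

/* Stand-in for any callee that "can GC" (e.g. by reaching an evaluator). */
void toy_callee_that_can_gc(toy_object *arg) {
    (void) arg;  /* the callee does not protect its argument in this model */
    toy_gc();
}
```

In this model, a caller that registers a variable with toy_gcpro before
calling toy_callee_that_can_gc still finds its object live afterwards, while
an object passed without registration is recycled during the call.  Whether
the caller or the callee ought to do the registering is exactly the
convention being debated in this thread; the sketch takes no side on that.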

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-09  3:42                               ` Eli Zaretskii
@ 2006-05-09 20:41                                 ` Richard Stallman
  2006-05-09 21:03                                   ` Stefan Monnier
  2006-05-10  3:33                                   ` Eli Zaretskii
  0 siblings, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-09 20:41 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

    > I do not know what those variables mean.  Do they affect the
    > choice of coding system?  Or do they take effect by altering
    > the meaning of a given coding system?

    They select the target character set.  When Emacs decodes text with
    Latin or Cyrillic or Greek characters, it could produce either Unicode
    charset or one of the ISO 8859 charsets.  These variables control
    that.

I cannot determine clearly, from your response, the answer to my
questions.  Do these variables affect the choice of coding system?  Or
do they take effect by altering the meaning of a given coding system?

I think perhaps you are saying it is the latter, but I am not sure.

If it is the latter, perhaps the best solution is to say
that every Lisp file should specify these variables, in the -*-
line or local variables list, if they affect it.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-06 23:36         ` Richard Stallman
@ 2006-05-09 20:43           ` Juri Linkov
  2006-05-11  3:44             ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Juri Linkov @ 2006-05-09 20:43 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

>     Are there reasons not to use Perl's notation for Unicode characters,
>     i.e. "\x{...}"?  The Unicode code for the desired character,
>     is placed in the braces in hexadecimal, and has no fixed length.
>     Examples: "\x{DF}", "\x{0448}", "\x{001D0ED}".
>
> We could support this form, as well as \u and \U for compatibility
> with other languages.

Support for \u and \U in Emacs Lisp would be good since other _Lisp_ languages
support \uXXXX and \UXXXXXXXX as well.  But other Lisp languages support
also Lisp notation for Unicode characters.  I think Emacs should support
it too.  In this notation Unicode characters are written as #\u3042 or
#\u0002a6b2 with the leading hash mark.

Also it would be good to support a syntax for named Unicode characters.
Common Lisp has the syntax #\euro_sign, and Perl - \N{EURO SIGN}.
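
To make the difference between the fixed-length forms (\uXXXX, \UXXXXXXXX)
and the brace-delimited form (\x{...}) concrete, here is a minimal sketch in
C.  It is not the reader code from the patch under discussion; the names
hex_value, read_fixed_hex and parse_unicode_escape are hypothetical, and
rejecting values above U+10FFFF is an assumption of this sketch, not
something the thread specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if C is not a hex digit. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Read exactly COUNT hex digits from S; return their value, or -1 if
   fewer than COUNT digits are present. */
static long long read_fixed_hex(const char *s, size_t count) {
    long long value = 0;
    for (size_t i = 0; i < count; i++) {
        int digit = hex_value(s[i]);
        if (digit < 0) return -1;  /* also stops at the terminating NUL */
        value = value * 16 + digit;
    }
    return value;
}

/* Parse one escape at the start of S.  On success, return the code point
   and store the number of characters consumed in *CONSUMED.  On failure,
   return -1 and store 0. */
long long parse_unicode_escape(const char *s, size_t *consumed) {
    assert(s != NULL && consumed != NULL);
    *consumed = 0;
    if (s[0] != '\\') return -1;

    long long value = -1;
    size_t length = 0;
    if (s[1] == 'u') {                       /* exactly four digits */
        value = read_fixed_hex(s + 2, 4);
        length = 6;
    } else if (s[1] == 'U') {                /* exactly eight digits */
        value = read_fixed_hex(s + 2, 8);
        length = 10;
    } else if (s[1] == 'x' && s[2] == '{') { /* any number of digits */
        size_t i = 3;
        value = 0;
        while (hex_value(s[i]) >= 0 && value <= 0x10FFFF) {
            value = value * 16 + hex_value(s[i]);
            i++;
        }
        if (i == 3 || s[i] != '}') return -1; /* empty or unterminated */
        length = i + 1;
    }
    if (value < 0 || value > 0x10FFFF) return -1;
    *consumed = length;
    return value;
}
```

The fixed-length forms need no terminator, so they can be followed
immediately by further hex-looking text; the brace form pays for its
variable length with the closing brace.  Named characters (#\euro_sign,
\N{EURO SIGN}) would additionally need a name table, which this sketch
leaves out.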

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-09 20:41                                 ` Richard Stallman
@ 2006-05-09 21:03                                   ` Stefan Monnier
  2006-05-10  3:33                                   ` Eli Zaretskii
  1 sibling, 0 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-09 21:03 UTC (permalink / raw)
  Cc: alkibiades, Eli Zaretskii, handa, emacs-devel

>> I do not know what those variables mean.  Do they affect the
>> choice of coding system?  Or do they take effect by altering
>> the meaning of a given coding system?

>     They select the target character set.  When Emacs decodes text with
>     Latin or Cyrillic or Greek characters, it could produce either Unicode
>     charset or one of the ISO 8859 charsets.  These variables control
>     that.

> I cannot determine clearly, from your response, the answer to my
> questions.  Do these variables affect the choice of coding system?  Or
> do they take effect by altering the meaning of a given coding system?

> I think perhaps you are saying it is the latter, but I am not sure.

It is the latter (except that they actually affect pretty much all coding
systems).

> If it is the latter, perhaps the best solution is to say
> that every Lisp file should specify these variables, in the -*-
> line or local variables list, if they affect it.

But currently those settings are global AFAICT.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-09  5:13                               ` Kenichi Handa
@ 2006-05-10  3:20                                 ` Richard Stallman
  2006-05-10  5:37                                   ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-10  3:20 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

      In addition, the default value of
    utf-translate-cjk-mode is t, and which CJK charsets Han
    characters of Unicode are decoded into depends on these:

    (1) current-language-environment

What effect does this have?  (Aside from the choice of coding system,
that is.)

    (4) the contents of the hash table ucs-unicode-to-mule-cjk
    (a user can freely reflect one's preference on how to decode
    Unicode character by modifying this hash table).

Could you tell me some examples for how users are really expected
to use this?

Overall:

With so many different variables that might affect the reading of
these characters, it is just too inconvenient for every file to
specify them all.  So I think we need a new feature to make that easy
to do.

Here's one idea.

Add a new "variable" `buffer-coding' which is analogous to `coding'.
Whereas `coding' specifies the encoding in the file, `buffer-coding'
specifies the in-buffer encoding to produce in the buffer.  Its value
could be a list or plist, which would specify the values of all these
many variables.

What do you think?  If you think this is a good idea, could
you try designing the details?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-09 20:41                                 ` Richard Stallman
  2006-05-09 21:03                                   ` Stefan Monnier
@ 2006-05-10  3:33                                   ` Eli Zaretskii
  1 sibling, 0 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-10  3:33 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

> From: Richard Stallman <rms@gnu.org>
> CC: alkibiades@gmx.de, alkibiades@gmx.de, emacs-devel@gnu.org,
> 	handa@m17n.org
> Date: Tue, 09 May 2006 16:41:29 -0400
> 
>     > I do not know what those variables mean.  Do they affect the
>     > choice of coding system?  Or do they take effect by altering
>     > the meaning of a given coding system?
> 
>     They select the target character set.  When Emacs decodes text with
>     Latin or Cyrillic or Greek characters, it could produce either Unicode
>     charset or one of the ISO 8859 charsets.  These variables control
>     that.
> 
> I cannot determine clearly, from your response, the answer to my
> questions.  Do these variables affect the choice of coding system?  Or
> do they take effect by altering the meaning of a given coding system?
> 
> I think perhaps you are saying it is the latter, but I am not sure.

It's certainly not the former.  I didn't say it's the latter because
``the meaning of a coding system'' is something I cannot define
clearly.  Instead, I described what is the precise effect of using
these variables.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10  3:20                                 ` Richard Stallman
@ 2006-05-10  5:37                                   ` Kenichi Handa
  2006-05-10  7:22                                     ` Stefan Monnier
                                                       ` (2 more replies)
  0 siblings, 3 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-10  5:37 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

In article <E1FdfFt-0006ux-Pm@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>       In addition, the default value of
>     utf-translate-cjk-mode is t, and which CJK charsets Han
>     characters of Unicode are decoded into depends on these:

>     (1) current-language-environment

> What effect does this have?  (Aside from the choice of coding system,
> that is.)

Some Han characters in Unicode can be decoded into several
CJK charsets (e.g. chinese-gb2312, chinese-big5-1,
japanese-jisx0208).  current-language-environment decides
which of them to use.

>     (4) the contents of the hash table ucs-unicode-to-mule-cjk
>     (a user can freely reflect one's preference on how to decode
>     Unicode character by modifying this hash table).

> Could you tell me some examples for how users are really expected
> to use this?

I don't know a concrete example, but I can imagine this.
U+9AD9 is a variant of U+9AD8, but japanese-jisx0208
contains only the latter.  Actually, none of the legacy CJK
charsets contains U+9AD9.  But, as it is just a variant of
U+9AD8, just for reading, one may want to decode it into
japanese-jisx0208.  In such a case, one can simply do this:

(puthash #x9AD9 ?高 ucs-unicode-to-mule-cjk)
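
The effect of such a per-user entry can be pictured with a small C sketch.
It is a toy model only: override_put and decode_with_overrides are
hypothetical names, plain Unicode code points stand in for Mule characters,
and the real decoding in utf-8.el and ucs-tables.el involves far more than
one table lookup.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OVERRIDES 32

typedef struct { unsigned long from; unsigned long to; } override_entry;

static override_entry overrides[MAX_OVERRIDES];
static size_t override_count;

/* Record that FROM should decode as TO, replacing any earlier entry. */
void override_put(unsigned long from, unsigned long to) {
    for (size_t i = 0; i < override_count; i++)
        if (overrides[i].from == from) {
            overrides[i].to = to;
            return;
        }
    assert(override_count < MAX_OVERRIDES);
    overrides[override_count].from = from;
    overrides[override_count].to = to;
    override_count++;
}

/* Return the user's override for CODE if one exists, else CODE itself. */
unsigned long decode_with_overrides(unsigned long code) {
    for (size_t i = 0; i < override_count; i++)
        if (overrides[i].from == code)
            return overrides[i].to;
    return code;
}
```

After override_put(0x9AD9, 0x9AD8), decoding U+9AD9 yields the same result
as decoding U+9AD8.  Because the table is process-wide state, two sessions
with different entries decode the same file differently.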

> Overall:

> With so many different variables that might affect the reading of
> these characters, it is just too inconvenient for every file to
> specify them all.  So I think we need a new feature to make that easy
> to do.

> Here's one idea.

> Add a new "variable" `buffer-coding' which is analogous to `coding'.
> Whereas `coding' specifies the encoding in the file, `buffer-coding'
> specifies the in-buffer encoding to produce in the buffer.  Its value
> could be a list or plist, which would specify the values of all these
> many variables.

> What do you think?  If you think this is a good idea, could
> you try designing the details?

No, it's an incredibly hard and heavy task.  When you read
utf-8.el and ucs-tables.el, you'll soon realize that.  I
believe it's just a waste of time to work on such a thing.

We have already done lots of workarounds for workarounds for
workarounds for not using Unicode internally, but there's a
limit.  I believe no one is pleased by producing the same
*.elc in such a situation.

Please accept this problem as a bad feature (not a bug), and
write something in etc/PROBLEMS.  If not, please decide to
shift to emacs-unicode just now.  It's the right thing to
solve this problem.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-09 19:05                             ` Eli Zaretskii
@ 2006-05-10  6:05                               ` Aidan Kehoe
  2006-05-10 17:49                                 ` Eli Zaretskii
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-05-10  6:05 UTC (permalink / raw)
  Cc: emacs-devel


 On the ninth of May, Eli Zaretskii wrote:

 > >  > As far as I know, the GCPRO convention for arguments is mostly the
 > >  > following:
 > >  > 
 > >  > GCPRO args you pass to functions iff those functions can GC and you
 > >  > need to use the arg after the function returns.
 > > 
 > > Okay. Do you know of any document detailing that?
 > 
 > Does the excerpt below from the Lisp manual answer your concerns?

It read ambiguously to me. “Any Lisp object that you intend to refer to
again” could be one that you intend to refer to in the bodies of the
functions you call.

 > > No-one followed up to my reference to what Ben Wing writes on the
 > > subject.
 > 
 > AFAIU, he is wrong, or at least inaccurate.  

He’s not wrong, he’s describing conventions within XEmacs, and XEmacs source
code does follow those conventions. My question was, are those conventions
followed in GNU Emacs? You and Stefan are telling me they’re not. Okay,
you’ve answered my question, thank you.

[excerpt snipped]

-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10  5:37                                   ` Kenichi Handa
@ 2006-05-10  7:22                                     ` Stefan Monnier
  2006-05-11  3:45                                       ` Richard Stallman
  2006-05-11  3:44                                     ` Richard Stallman
  2006-05-11  3:44                                     ` Richard Stallman
  2 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-10  7:22 UTC (permalink / raw)
  Cc: emacs-devel, rms, alkibiades

>> What do you think?  If you think this is a good idea, could
>> you try designing the details?
> No, it's an incredibly hard and heavy task.

Agreed.
It's just a lot of work for very little benefit: people have lived with this
problem for a while now and haven't found the workarounds to be serious
(basically: don't use utf-8 for those files and don't use
unify-8859-on-decoding if you manipulate such files).  Such a "feature"
would only be an ugly workaround anyway.
As for a real fix: fixing it is what emacs-unicode is all about.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10  6:05                               ` Aidan Kehoe
@ 2006-05-10 17:49                                 ` Eli Zaretskii
  2006-05-10 21:37                                   ` Luc Teirlinck
                                                     ` (3 more replies)
  0 siblings, 4 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-10 17:49 UTC (permalink / raw)
  Cc: emacs-devel

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Wed, 10 May 2006 08:05:32 +0200
> Cc: emacs-devel@gnu.org
> 
>  > >  > As far as I know, the GCPRO convention for arguments is mostly the
>  > >  > following:
>  > >  > 
>  > >  > GCPRO args you pass to functions iff those functions can GC and you
>  > >  > need to use the arg after the function returns.
>  > > 
>  > > Okay. Do you know of any document detailing that?
>  > 
>  > Does the excerpt below from the Lisp manual answer your concerns?
> 
> It read ambiguously to me. ``Any Lisp object that you intend to refer to
> again'' could be one that you intend to refer to in the bodies of the
> functions you call.

Can someone in the know (Richard?) state a clear rule?  I think the
ELisp manual should be unequivocal about the GCPRO issue.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10 17:49                                 ` Eli Zaretskii
@ 2006-05-10 21:37                                   ` Luc Teirlinck
  2006-05-11  3:45                                     ` Eli Zaretskii
  2006-05-10 21:48                                   ` Luc Teirlinck
                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 202+ messages in thread
From: Luc Teirlinck @ 2006-05-10 21:37 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

Eli Zaretskii wrote:

   > It read ambiguously to me. ``Any Lisp object that you intend to refer to
   > again'' could be one that you intend to refer to in the bodies of the
   > functions you call.

   Can someone in the know (Richard?) state a clear rule?  I think the
   ELisp manual should be unequivocal about the GCPRO issue.

Reading in the Elisp manual:

    This is necessary whenever you call `Feval' or anything that can
    directly or indirectly call `Feval'.  At such a time, any Lisp
    object that you intend to refer to again must be protected
    somehow.

I have always interpreted this as meaning that as long as Feval is not
directly or indirectly called, there is no problem whatsoever.  If
Feval gets called, directly or indirectly, the memory for the object
may have been freed by gc, unless the object is protected some way or
the other, for instance by a GCPRO.  If the object was not protected
in any way, then if after the call to Feval it gets referenced any way
whatsoever, directly or in the functions you call, from C or from
Lisp, trouble can result because its memory may have been freed.

Is there any _other_ way to understand the above quote from the Elisp
manual or am I just completely misunderstanding the issue?  If the
description in the above paragraph would not be accurate, then the
text would indeed be very misleading.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10 17:49                                 ` Eli Zaretskii
  2006-05-10 21:37                                   ` Luc Teirlinck
@ 2006-05-10 21:48                                   ` Luc Teirlinck
  2006-05-11  1:08                                   ` Luc Teirlinck
  2006-05-11  3:46                                   ` Richard Stallman
  3 siblings, 0 replies; 202+ messages in thread
From: Luc Teirlinck @ 2006-05-10 21:48 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

Eli Zaretskii wrote:

   Can someone in the know (Richard?) state a clear rule?  I think the
   ELisp manual should be unequivocal about the GCPRO issue.

I should have mentioned that I do not consider myself as "someone in
the know".  I just wanted to point out that to me the Elisp manual
sounds unequivocal.  So, if I am actually wrong, then I believe that
there is a real doc problem.  If I am right, then I do not believe so,
unless somebody points out another plausible way to understand the text.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10 17:49                                 ` Eli Zaretskii
  2006-05-10 21:37                                   ` Luc Teirlinck
  2006-05-10 21:48                                   ` Luc Teirlinck
@ 2006-05-11  1:08                                   ` Luc Teirlinck
  2006-05-11  2:29                                     ` Luc Teirlinck
  2006-05-11  3:46                                   ` Richard Stallman
  3 siblings, 1 reply; 202+ messages in thread
From: Luc Teirlinck @ 2006-05-11  1:08 UTC (permalink / raw)
  Cc: kehoea, emacs-devel


Eli Zaretskii wrote:

   > It read ambiguously to me. ``Any Lisp object that you intend to refer to
   > again'' could be one that you intend to refer to in the bodies of the
   > functions you call.

   Can someone in the know (Richard?) state a clear rule?  I think the
   ELisp manual should be unequivocal about the GCPRO issue.

I probably responded too quickly to this and misunderstood the
question.  I am still not completely sure what the concrete question
is, but maybe Richard will understand.

In the meantime, after rereading the thread, there are some things
that seem confusing to me and maybe my two questions imply your
question.

My questions concern two responses by Stefan to quotes from Aidan Kehoe:

   > Fread is intended to be called from Lisp (it’s a subr).  Functions called
   > from Lisp do not need to GCPRO their arguments, because the garbage
   > collector knows about the arguments, as it knows about all objects
   > allocated in Lisp.

   s/called/callable/
   Are you sure we have such a convention?

`(elisp)Writing Emacs Primitives' discusses writing primitives, gives
For as an example, which carefully GCPROs its ARGS argument, then
talks about how important GCPROing variables of type Lisp_Object is
(if Feval is called and so on...)  and then states that there is an
exception: Lisp primitives that take a variable number of args at the
Lisp level (other than special forms) do not need to GCPRO the args
they are to receive at the Lisp level: that responsibility rests with
their caller, because what is passed as an arg at the C level is a
Lisp_Object * pointer to a C vector containing those Lisp args.

To me, this leads to the "obvious" conclusion that Lisp primitives can
safely forget about GCPROing their args iff (they take a variable
number of args and are not special forms).

Apparently Aidan Kehoe's assertion that Lisp primitives do not need to
GCPRO their args is not fully accurate, because For does.  Maybe that
is because For is a special form, but if so, this is apparently
nowhere pointed out in `(elisp)Writing Emacs Primitives'.

On the other hand, my "obvious" conclusion from reading
`(elisp)Writing Emacs Primitives' seems to be wrong too.  Both
Fdirectory_file_name and Fmake_directory_internal take a fixed number
of args, one, `directory', and do not GCPRO it, even though they both
call `call2' which calls Feval and they both still refer to their
`directory' arg afterwards.

   > C code that calls functions intended to be called from Lisp is optimistic
   > at best if, without having checked, it relies on the assumption that
   > the arguments to those functions will be GCPROed.

   As far as I know, the GCPRO convention for arguments is mostly the
   following:

   GCPRO args you pass to functions iff those functions can GC and you need
   to use the arg after the function returns.

All C functions that call Fdirectory_file_name or
Fmake_directory_internal still use `directory' after those latter
functions return and they all GCPRO it.

But what if a C function called Fdirectory_file_name or
Fmake_directory_internal without using directory afterward?  Would
it need to GCPRO `directory'?  To me, the logical answer would seem
no, since it is the responsibility of the called function to protect
its args.  Do these two functions implicitly do that by the way the
garbage collector is implemented or not?

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11  1:08                                   ` Luc Teirlinck
@ 2006-05-11  2:29                                     ` Luc Teirlinck
  0 siblings, 0 replies; 202+ messages in thread
From: Luc Teirlinck @ 2006-05-11  2:29 UTC (permalink / raw)
  Cc: kehoea, eliz, emacs-devel

From my previous reply:

   In the meantime, after rereading the thread, there are some things
   that seem confusing to me and maybe my two questions imply your
   question.

   My questions concern two responses by Stefan to quotes from Aidan Kehoe:

Sorry, forget about this and my entire long message.  It was silly.
I somehow just forgot to see that the calls to call2 in the two
primitives I mentioned were in a return statement.  So _obviously_ no
GCPROing was necessary.  I should have paid closer attention.

Sorry for the confusion.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-09 20:43           ` Juri Linkov
@ 2006-05-11  3:44             ` Richard Stallman
  2006-05-11 12:03               ` Juri Linkov
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-11  3:44 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

    Support for \u and \U in Emacs Lisp would be good since other _Lisp_ languages
    support \uXXXX and \UXXXXXXXX as well.  But other Lisp languages support
    also Lisp notation for Unicode characters.  I think Emacs should support
    it too.  In this notation Unicode characters are written as #\u3042 or
    #\u0002a6b2 with the leading hash mark.

We do not in general try to be compatible with Common Lisp on input
syntax for characters.  So forget this.

    Also it would be good to support a syntax for named Unicode characters.
    Common Lisp has the syntax #\euro_sign, and Perl - \N{EURO SIGN}.

I tend to think we should not do this now.
Does Emacs have a table of these names?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10  5:37                                   ` Kenichi Handa
  2006-05-10  7:22                                     ` Stefan Monnier
@ 2006-05-11  3:44                                     ` Richard Stallman
  2006-05-11  3:44                                     ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-11  3:44 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

    Some Han characters in Unicode can be decoded into several
    CJK charsets (e.g. chinese-gb2312, chinese-big5-1,
    japanese-jisx0208).  current-language-environment decides
    which of them to use.

Through what mechanism does current-language-environment control this
decision?

Can we make a new variable to control this, and have each language
environment set that new variable accordingly?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10  5:37                                   ` Kenichi Handa
  2006-05-10  7:22                                     ` Stefan Monnier
  2006-05-11  3:44                                     ` Richard Stallman
@ 2006-05-11  3:44                                     ` Richard Stallman
  2006-05-11  7:31                                       ` Kenichi Handa
  2006-05-11  9:44                                       ` Oliver Scholz
  2 siblings, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-11  3:44 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

    > Add a new "variable" `buffer-coding' which is analogous to `coding'.
    > Whereas `coding' specifies the encoding in the file, `buffer-coding'
    > specifies the in-buffer encoding to produce in the buffer.  Its value
    > could be a list or plist, which would specify the values of all these
    > many variables.

    > What do you think?  If you think this is a good idea, could
    > you try designing the details?

    No, it's an incredibly hard and heavy task.

I am surprised you think so, and this means there is some sort of
misunderstanding between us.

You've listed around 6 variables that affect the decoding.  So it
seems to me that if we make a convenient way for each Lisp file to
specify those 6 variables, we solve the problem.  It looks easy to me.

If you think it is difficult, could you explain where the difficulty
is?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10  7:22                                     ` Stefan Monnier
@ 2006-05-11  3:45                                       ` Richard Stallman
  2006-05-11 12:41                                         ` Stefan Monnier
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-11  3:45 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, handa

    It's just a lot of work for very little benefit: people have lived with this
    problem for a while now and haven't found the workarounds to be serious
    (basically: don't use utf-8 for those files and don't use
    unify-8859-on-decoding if you manipulate such files).

I don't follow.  Are you saying that this problem only occurs
for utf-8 encoding?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10 21:37                                   ` Luc Teirlinck
@ 2006-05-11  3:45                                     ` Eli Zaretskii
  0 siblings, 0 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-05-11  3:45 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

> Date: Wed, 10 May 2006 16:37:56 -0500 (CDT)
> From: Luc Teirlinck <teirllm@dms.auburn.edu>
> CC: kehoea@parhasard.net, emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> 
>    > It read ambiguously to me. ``Any Lisp object that you intend to refer to
>    > again'' could be one that you intend to refer to in the bodies of the
>    > functions you call.
> 
>    Can someone in the know (Richard?) state a clear rule?  I think the
>    ELisp manual should be unequivocal about the GCPRO issue.
> 
> Reading in the Elisp manual:
> 
>     This is necessary whenever you call `Feval' or anything that can
>     directly or indirectly call `Feval'.  At such a time, any Lisp
>     object that you intend to refer to again must be protected
>     somehow.
> 
> I have always interpreted this as meaning that as long as Feval is not
> directly or indirectly called, there is no problem whatsoever.

That is not the issue for which I asked for clarifications.  The issue
was what ``refer to again'' means, precisely.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-10 17:49                                 ` Eli Zaretskii
                                                     ` (2 preceding siblings ...)
  2006-05-11  1:08                                   ` Luc Teirlinck
@ 2006-05-11  3:46                                   ` Richard Stallman
  3 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-11  3:46 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

    > It read ambiguously to me. ``Any Lisp object that you intend to refer to
    > again'' could be one that you intend to refer to in the bodies of the
    > functions you call.

    Can someone in the know (Richard?) state a clear rule?  I think the
    ELisp manual should be unequivocal about the GCPRO issue.

I clarified this.
Thanks.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11  3:44                                     ` Richard Stallman
@ 2006-05-11  7:31                                       ` Kenichi Handa
  2006-05-12  4:14                                         ` Richard Stallman
  2006-05-11  9:44                                       ` Oliver Scholz
  1 sibling, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-11  7:31 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

In article <E1Fe26h-0007re-Mx@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

> You've listed around 6 variables that affect the decoding.  So it
> seems to me that if we make a convenient way for each Lisp file to
> specify those 6 variables, we solve the problem.  It looks easy to me.

> If you think it is difficult, could you explain where the difficulty
> is?

I don't know a convenient way to specify values of huge
char-tables and hash-tables in each file.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11  3:44                                     ` Richard Stallman
  2006-05-11  7:31                                       ` Kenichi Handa
@ 2006-05-11  9:44                                       ` Oliver Scholz
  1 sibling, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-11  9:44 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, Kenichi Handa

Richard Stallman <rms@gnu.org> writes:

>     > Add a new "variable" `buffer-coding' which is analogous to `coding'.
>     > Whereas `coding' specifies the encoding in the file, `buffer-coding'
>     > specifies the in-buffer encoding to produce in the buffer.  Its value
>     > could be a list or plist, which would specify the values of all these
>     > many variables.
>
>     > What do you think?  If you think this is a good idea, could
>     > you try designing the details?
>
>     No, it's an incredibly hard and heavy task.
>
> I am surprised you think so, and this means there is some sort of
> misunderstanding between us.
>
> You've listed around 6 variables that affect the decoding.  So it
> seems to me that if we make a convenient way for each Lisp file to
> specify those 6 variables, we solve the problem.  It looks easy to me.

Yes, I think there is a misunderstanding. It is not the value of those
variables that affects decoding. But changing the value of those
variables via their corresponding minor mode functions or via
customize initialises translation tables (char tables and arrays) and
in some cases adjusts coding systems to use those tables. See
`ucs-unify-8859' and its counterpart `ucs-fragment-8859' for an
example. In most, if not all affected cases, binding variables to
another value would have no effect whatsoever. In some cases like
`utf-fragment-on-decoding' you'd first have to write functions to
programmatically cause the associated effect.
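
A toy model of that point, as a minimal sketch: every name below is
hypothetical and nothing here is the real Mule code.  It assumes only
that the decoder consults a table and never the user-visible flag, so
let-binding the flag changes nothing, while calling the mode function
rebuilds the table.

```elisp
;; All names are hypothetical stand-ins, not real Emacs variables.
(defvar demo-unify-flag nil
  "Stand-in for a user option such as `unify-8859-on-decoding-mode'.")

(defvar demo-translation-table (make-hash-table)
  "Stand-in for the translation table that the decoder consults.")

(defun demo-unify-mode (enable)
  "Set `demo-unify-flag' to ENABLE and rebuild `demo-translation-table'."
  (setq demo-unify-flag enable)
  (clrhash demo-translation-table)
  (when enable
    ;; Map a made-up \"fragmented\" code to a made-up \"unified\" one.
    (puthash 1000 2000 demo-translation-table)))

(defun demo-decode (code)
  "Translate CODE through the table only; the flag is never read."
  (gethash code demo-translation-table code))

;; Going through the mode function changes what the decoder does.
(demo-unify-mode t)
(demo-decode 1000)

;; Let-binding the flag leaves the table, and so the result, untouched.
(let ((demo-unify-flag nil))
  (demo-decode 1000))
```

In this model, a per-file setting that merely binds the flag would be
ignored by the decoder, which is the difficulty described above.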

In fact, it might be easier (and even safer) to just change the
encoding of *.elc files from emacs-mule to utf-8.

Then there may be follow-on problems. For instance,
Handa-san mentioned an example of how a user could have reason to
modify the affected translation tables. Are users supposed to do that?
(I'd argue, they should rather change the fontset.) If yes, you'd need
to make sure that such changes are preserved after swapping and/or
redefining all those translation tables during byte compilation. Maybe
they are anyways, but we are talking about a lot of mule code here.

Finally, users might encounter *either* behaviour in a way that makes
them think it is a bug. If byte compilation is modified the way you
propose, then what some users will probably just see is that the
glyphs of some characters coming from a byte compiled file differ from
what they specified in their .emacs. That will come as a surprise to
them and investigating it is not exactly easy, if you are not familiar
with Emacs' internal handling of characters. So it might be a good
idea to document even the fix of the bug we are discussing in
etc/PROBLEMS, because it *is* a design problem of `emacs-mule'.

(And as Handa-san mentioned, things like that are the actual reason
that changing the internal encoding to UTF-8 is a worthwhile enterprise
in the first place.)


    Oliver
-- 
Oliver Scholz               22 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11  3:44             ` Richard Stallman
@ 2006-05-11 12:03               ` Juri Linkov
  2006-05-11 13:16                 ` Kenichi Handa
  2006-05-12  4:15                 ` Richard Stallman
  0 siblings, 2 replies; 202+ messages in thread
From: Juri Linkov @ 2006-05-11 12:03 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

>     Support for \u and \U in Emacs Lisp would be good since other _Lisp_ languages
>     support \uXXXX and \UXXXXXXXX as well.  But other Lisp languages support
>     also Lisp notation for Unicode characters.  I think Emacs should support
>     it too.  In this notation Unicode characters are written as #\u3042 or
>     #\u0002a6b2 with the leading hash mark.
>
> We do not in general try to be compatible with Common Lisp on input
> syntax for characters.  So forget this.

The initial `#' character is a valid Emacs hash notation for writing
integers in various bases.  After adding `\uXXXX' it seems reasonable
to add `#\uXXXX' as well.  However, there is one difference: Emacs Lisp
hash notation doesn't use the backslash `\' after `#', e.g. `#x42',
but other Lisps use the backslash in the notation of Unicode characters,
e.g. `#\u3042'.  I have no opinion which notation is better.
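
To make the difference concrete, here is a small sketch.  The constant
names are made up for illustration, and the last two forms assume an
Emacs that has the proposed \u escape and uses Unicode code points as
character codes, which is not the case for the Emacs-CVS discussed in
this thread.

```elisp
;; Hash notation yields an integer; no backslash follows the `#'.
(defconst demo-hash-int #x42)

;; Character syntax is `?' followed by a backslash escape.
(defconst demo-hex-char ?\x42)

;; Assumption: \u is supported and character codes are Unicode code
;; points, so these denote U+3042 and a one-character string U+0448.
(defconst demo-hiragana-a ?\u3042)
(defconst demo-sha-string "\u0448")
```

Under that assumption `demo-hiragana-a' is simply the integer #x3042,
so a `#\u3042' form would add a second spelling for a value that
`?\u3042' already denotes.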

>     Also it would be good to support a syntax for named Unicode characters.
>     Common Lisp has the syntax #\euro_sign, and Perl - \N{EURO SIGN}.
>
> I tend to think we should not do this now.
> Does Emacs have a table of these names?

The variable `describe-char-unicodedata-file' points to the file
`UnicodeData.txt' not distributed currently with Emacs.  This could be
done in the emacs-unicode branch.  I think this question should be
considered after the release.

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11  3:45                                       ` Richard Stallman
@ 2006-05-11 12:41                                         ` Stefan Monnier
  2006-05-11 12:51                                           ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-11 12:41 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, handa

>     It's just a lot of work for very little benefit: people have lived
>     with this problem for a while now and haven't found the workarounds to
>     be serious (basically: don't use utf-8 for those files and don't use
>     unify-8859-on-decoding if you manipulate such files).

> I don't follow.  Are you saying that this problem only occurs
> for utf-8 encoding?

IIRC such tables are used either during encoding to unicode (i.e. if you
save as utf-8) or upon decoding (but only if you've enabled
unify-8859-on-decoding).  My memory is fuzzy, tho.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11 12:41                                         ` Stefan Monnier
@ 2006-05-11 12:51                                           ` Kenichi Handa
  2006-05-11 16:46                                             ` Stefan Monnier
  0 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-11 12:51 UTC (permalink / raw)
  Cc: emacs-devel, rms, alkibiades

In article <87lkt8ybui.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> It's just a lot of work for very little benefit: people have lived
>> with this problem for a while now and haven't found the workarounds to
>> be serious (basically: don't use utf-8 for those files and don't use
>> unify-8859-on-decoding if you manipulate such files).

>> I don't follow.  Are you saying that this problem only occurs
>> for utf-8 encoding?

> IIRC such tables are used either during encoding to unicode (i.e. if you
> save as utf-8) or upon decoding (but only if you've enabled
> unify-8859-on-decoding).  My memory is fuzzy, tho.

unify-8859-on-decoding-mode affects iso-8859-* coding
systems.  If it is on, characters in a file of those coding
systems are decoded into iso-8859-1 or
mule-unicode-0100-24FF.  That's the meaning of "unify 8859".

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11 12:03               ` Juri Linkov
@ 2006-05-11 13:16                 ` Kenichi Handa
  2006-05-12  4:15                 ` Richard Stallman
  1 sibling, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-11 13:16 UTC (permalink / raw)
  Cc: kehoea, rms, emacs-devel

In article <878xp8g2a9.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

>> Also it would be good to support a syntax for named Unicode characters.
>> Common Lisp has the syntax #\euro_sign, and Perl - \N{EURO SIGN}.
>> 
>> I tend to think we should not do this now.
>> Does Emacs have a table of these names?

> The variable `describe-char-unicodedata-file' points to the file
> `UnicodeData.txt' not distributed currently with Emacs.  This could be
> done in the emacs-unicode branch.  I think this question should be
> considered after the release.

Actually, emacs-unicode already contains various data
(including names) extracted from UnicodeData.txt, and
get-char-code-property is extended to return information about a
character that is provided by UnicodeData.txt.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11 12:51                                           ` Kenichi Handa
@ 2006-05-11 16:46                                             ` Stefan Monnier
  0 siblings, 0 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-11 16:46 UTC (permalink / raw)
  Cc: emacs-devel, rms, alkibiades

>>> It's just a lot of work for very little benefit: people have lived
>>> with this problem for a while now and haven't found the workarounds to
>>> be serious (basically: don't use utf-8 for those files and don't use
>>> unify-8859-on-decoding if you manipulate such files).

>>> I don't follow.  Are you saying that this problem only occurs
>>> for utf-8 encoding?

>> IIRC such tables are used either during encoding to unicode (i.e. if you
>> save as utf-8) or upon decoding (but only if you've enabled
>> unify-8859-on-decoding).  My memory is fuzzy, tho.

> unify-8859-on-decoding-mode affects iso-8859-* coding
> systems.  If it is on, characters in a file of those coding
> systems are decoded into iso-8859-1 or
> mule-unicode-0100-24FF.  That's the meaning of "unify 8859".

We seem to be in violent agreement.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11  7:31                                       ` Kenichi Handa
@ 2006-05-12  4:14                                         ` Richard Stallman
  2006-05-12  5:26                                           ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-12  4:14 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

    I don't know a convenient way to specify values of huge
    char-tables and hash-tables in each file.

Obviously we need to find another way to specify the information.

Please try to find a solution; don't give up just because it is
nontrivial.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-11 12:03               ` Juri Linkov
  2006-05-11 13:16                 ` Kenichi Handa
@ 2006-05-12  4:15                 ` Richard Stallman
  2006-06-03 18:44                   ` Aidan Kehoe
       [not found]                   ` <17537.54719.354843.89030@parhasard.net>
  1 sibling, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-12  4:15 UTC (permalink / raw)
  Cc: kehoea, emacs-devel

    The initial `#' character is a valid Emacs hash notation for writing
    integers in various bases.  After adding `\uXXXX' it seems reasonable
    to add `#\uXXXX' as well.  However, there is one difference: Emacs Lisp
    hash notation doesn't use the backslash `\' after `#', e.g. `#x42',
    but other Lisps use the backslash in the notation of Unicode characters,
    e.g. `#\u3042'.  I have no opinion which notation is better.

I think it is better to be consistent with the existing Emacs Lisp
constructs.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-12  4:14                                         ` Richard Stallman
@ 2006-05-12  5:26                                           ` Kenichi Handa
  2006-05-13  4:52                                             ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-12  5:26 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

In article <E1FeP3C-0002Jp-JW@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     I don't know a convenient way to specify values of huge
>     char-tables and hash-tables in each file.

> Obviously we need to find another way to specify the information.

> Please try to find a solution; don't give up just because it is
> nontrivial.

At least you now understand it's not trivial.  Why do you
think it's worth doing at this stage even if it requires
nontrivial work?

How about just asking users to use emacs-mule coding system
for *.el files if they want them decoded the same way
independent of various settings on byte-compiling?  Such
*.elc files are still loadable by emacs-unicode.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-12  5:26                                           ` Kenichi Handa
@ 2006-05-13  4:52                                             ` Richard Stallman
  2006-05-13 13:25                                               ` Stefan Monnier
  2006-05-15  5:13                                               ` Kenichi Handa
  0 siblings, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-13  4:52 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

    At least you now understand it's not trivial.  Why do you
    think it's worth doing at this stage even if it requires
    nontrivial work?

Because this is a serious cause of unreliability.
It is a bug, or something pretty close to a bug.

    How about just asking users to use emacs-mule coding system
    for *.el files if they want them decoded the same way
    independent of various settings on byte-compiling?

Maybe that is a good enough solution.  Does this solution
solve the whole problem?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-13  4:52                                             ` Richard Stallman
@ 2006-05-13 13:25                                               ` Stefan Monnier
  2006-05-13 20:41                                                 ` Richard Stallman
  2006-05-15  5:13                                               ` Kenichi Handa
  1 sibling, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-13 13:25 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, Kenichi Handa

>     At least you now understand it's not trivial.  Why do you
>     think it's worth doing at this stage even if it requires
>     nontrivial work?
> Because this is a serious cause of unreliability.

I don't see why you'd think so.  Could you expand?


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-13 13:25                                               ` Stefan Monnier
@ 2006-05-13 20:41                                                 ` Richard Stallman
  2006-05-14 13:32                                                   ` Stefan Monnier
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-13 20:41 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

    > Because this is a serious cause of unreliability.

    I don't see why you'd think so.  Could you expand?

If Lisp files get executed and compiled in different ways
according to the user's settings, this is unreliability of a
very bad kind.

Handa says that telling people "don't use utf-8" solves the problem.
If that is a good solution, I think the problem is solved.
Does everyone agree that that solution works?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-13 20:41                                                 ` Richard Stallman
@ 2006-05-14 13:32                                                   ` Stefan Monnier
  2006-05-14 23:29                                                     ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-14 13:32 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

>> Because this is a serious cause of unreliability.

>     I don't see why you'd think so.  Could you expand?

> If Lisp files get executed and compiled in different ways
> according to the user's settings, this is unreliability of a
> very bad kind.

In theory I agree.  But the problem is fixed in emacs-unicode, there are
known workarounds in Emacs-CVS, and fixing it in Emacs-CVS is going to be
difficult.

> Handa says that telling people "don't use utf-8" solves the problem.

Additionally to "don't use unify-8859-on-decoding" which causes similar
problems (which we already bumped into a few years ago when we included
unify-8859-on-decoding) with iso8859 chars and coding systems like iso-2022.

> If that is a good solution, I think the problem is solved.

OK, good.

> Does everyone agree that that solution works?

Don't know about everyone, but at least I do,


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-14 13:32                                                   ` Stefan Monnier
@ 2006-05-14 23:29                                                     ` Richard Stallman
  2006-05-15  0:55                                                       ` Stefan Monnier
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-14 23:29 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

    > Handa says that telling people "don't use utf-8" solves the problem.

    Additionally to "don't use unify-8859-on-decoding" which causes similar
    problems (which we already bumped into a few years ago when we included
    unify-8859-on-decoding) with iso8859 chars and coding systems like iso-2022.

There is a way for a Lisp file to specify a coding system which isn't
utf-8.  Is there a way for a Lisp file to specify that
unify-8859-on-decoding should not be used when reading it?

If not, maybe we should make one.

Here's one idea: if the -*- line specifies `coding' and specifies
the mode `emacs-lisp' then force unify-8859-on-decoding to nil
for that file. 

That idea has the advantage that most of the Lisp files where
this issue might arise won't need any change in order to be
assured of DTRT.
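
A minimal sketch of just the test in that idea, under stated
assumptions: `demo-cookie-forces-no-unify-p' is a hypothetical helper,
it looks only at the text of the first line, and it recognizes the
explicit `mode: emacs-lisp' form but not the `-*- emacs-lisp -*-'
shorthand.  How decoding would then be overridden for the file is not
shown.

```elisp
(defun demo-cookie-forces-no-unify-p (first-line)
  "Return t if FIRST-LINE's -*- cookie names a coding system and emacs-lisp mode.
Hypothetical helper for illustration; it only inspects the string."
  (and (string-match-p "-\\*-.*coding:[ \t]*[^ \t;]+.*-\\*-" first-line)
       (string-match-p "-\\*-.*mode:[ \t]*emacs-lisp[ \t;].*-\\*-" first-line)
       t))

;; Both `coding' and the mode are given, so unification would be forced off.
(demo-cookie-forces-no-unify-p
 ";;; demo.el --- demo  -*- mode: emacs-lisp; coding: iso-2022-7bit -*-")

;; Only `coding' is given, so the file is left alone.
(demo-cookie-forces-no-unify-p
 ";;; demo.el --- demo  -*- coding: iso-2022-7bit -*-")
```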

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-14 23:29                                                     ` Richard Stallman
@ 2006-05-15  0:55                                                       ` Stefan Monnier
  2006-05-15  2:49                                                         ` Oliver Scholz
  2006-05-15 20:37                                                         ` Richard Stallman
  0 siblings, 2 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-15  0:55 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

>> Handa says that telling people "don't use utf-8" solves the problem.
>     Additionally to "don't use unify-8859-on-decoding" which causes
>     similar problems (which we already bumped into a few years ago when we
>     included unify-8859-on-decoding) with iso8859 chars and coding systems
>     like iso-2022.

> There is a way for a Lisp file to specify a coding system which isn't
> utf-8.  Is there a way for a Lisp file to specify that
> unify-8859-on-decoding should not be used when reading it?

> If not, maybe we should make one.

> Here's one idea: if the -*- line specifies `coding' and specifies
> the mode `emacs-lisp' then force unify-8859-on-decoding to nil
> for that file.

Forcing it to nil for a particular file is maybe too much work to implement
compared to the benefit.
Maybe an easier solution is to add a file-local variable
`no-8859-unification' such that if that file is loaded in an Emacs which
is configured to use unify-8859-on-decoding it signals an error.

It could then be added to files like ucs-tables.el.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  0:55                                                       ` Stefan Monnier
@ 2006-05-15  2:49                                                         ` Oliver Scholz
  2006-05-15  3:27                                                           ` Stefan Monnier
  2006-05-15 20:37                                                           ` Richard Stallman
  2006-05-15 20:37                                                         ` Richard Stallman
  1 sibling, 2 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-15  2:49 UTC (permalink / raw)
  Cc: emacs-devel, rms, handa, alkibiades

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> Handa says that telling people "don't use utf-8" solves the problem.
>>     Additionally to "don't use unify-8859-on-decoding" which causes
>>     similar problems (which we already bumped into a few years ago when we
>>     included unify-8859-on-decoding) with iso8859 chars and coding systems
>>     like iso-2022.
>
>> There is a way for a Lisp file to specify a coding system which isn't
>> utf-8.  Is there a way for a Lisp file to specify that
>> unify-8859-on-decoding should not be used when reading it?
>
>> If not, maybe we should make one.
>
>> Here's one idea: if the -*- line specifies `coding' and specifies
>> the mode `emacs-lisp' then force unify-8859-on-decoding to nil
>> for that file.

Besides the work already mentioned, this would also require turning
unify-8859-on-decoding-mode into a buffer-local minor mode, which in
turn would require making the necessary translation tables somehow (!)
buffer-local.

> Forcing it to nil for a particular file is maybe too much work to implement
> compared to the benefit.
> Maybe an easier solution is to add a file-local variable
> `no-8859-unification' such that if that file is loaded in an Emacs which
> is configured to use unify-8859-on-decoding it signals an error.
>
> It could then be added to files like ucs-tables.el.

[Nitpick: ucs-tables.el is encoded in ISO 2022. Most of Emacs' files
containing m17n characters are, AFAIK. I don't know the reason. Maybe
because it's 7bit, but still ASCII compatible.]

How about just issuing a warning with the warning message containing a
description of the effects and of what to do to change the settings?

e.g.:

(when (and (memq (coding-system-base buffer-file-coding-system)
                 '(mule-utf-8 utf-7 mule-utf-16
                         ; ...
                         mule-utf-16be-with-signature))
           utf-fragment-on-decoding ; default is nil
           (let ((charsets (find-charset-region (point-min) (point-max))))
             (or (memq 'greek-iso8859-7 charsets)
                 (memq 'cyrillic-iso8859-5 charsets))))
  (warn "You have enabled ... but this source file contains
characters from ... Emacs has ... This might or might not be what
you want ... To restore the defaults do ... bla bla ...
... you might want to use `emacs-mule' as coding system for Emacs Lisp
source files ..."))

And similar for the other cases.

[FWIW, I think that `emacs-mule'---as Handa suggested---is a perfectly
valid file encoding for Emacs Lisp source files. Since it is, by
definition, unambiguous w.r.t. the specified charsets, emacs-mule has
none of the problems we are discussing. Of course, Emacs is probably
the only text editor that can deal with emacs-mule, but that would
hardly matter for Elisp sources. I can think only of two drawbacks: 1.
You can't simply insert or attach such files to mail or usenet
postings. You have to zip, tar, base64 etc. them first. 2. Specifying
particular charsets might be exactly what an author does *not* want. --

Though, the only way to deal with the latter would be to modify the
Lisp printer for writing *.elc files so that it escapes non-ascii
characters wherever possible with the new \u syntax. This would be
another solution to the problem we are discussing.]


    Oliver
-- 
Oliver Scholz               26 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  2:49                                                         ` Oliver Scholz
@ 2006-05-15  3:27                                                           ` Stefan Monnier
  2006-05-15 10:20                                                             ` Oliver Scholz
  2006-05-15 20:37                                                           ` Richard Stallman
  1 sibling, 1 reply; 202+ messages in thread
From: Stefan Monnier @ 2006-05-15  3:27 UTC (permalink / raw)
  Cc: handa, rms, emacs-devel

>> Forcing it to nil for a particular file is maybe too much work to implement
>> compared to the benefit.
>> Maybe an easier solution is to add a file-local variable
>> `no-8859-unification' such that if that file is loaded in an Emacs which
>> is configured to use unify-8859-on-decoding it signals an error.
>> 
>> It could then be added to files like ucs-tables.el.

> [Nitpick: ucs-tables.el is encoded in ISO 2022.  Most of Emacs' files
> containing m17n characters are, AFAIK.  I don't know the reason.  Maybe
> because it's 7bit, but still ASCII compatible. ]

I'm not sure I understand the nitpick:
- the reason most files use iso-2022 is because it was the only mildly
  standard generic encoding well supported by Emacs (utf-8 is slowly
  getting there, but Emacs-CVS's support for it is still behind).

- ucs-tables.el, if saved as utf-8, would not do the same any more: it
  relies on the various "equivalent" 8859 chars to be distinguished (as is
  done in iso-2022, and as can't be done in utf-8).  That's also why opening
  it with unify-8859-on-decoding is wrong: you're not looking at the right
  code any more because you basically get what you'd get if it had been
  saved in a unified encoding such as utf-8.

> How about just issuing a warning with the warning message containing a
> description of the effects and of what to do to change the settings?

>   (warn "You have enabled ... but this source file contains
> characters from ... Emacs has ... This might or might not be what
> you want ... To restore the defaults do ... bla bla ...
> ... you might want to use `emacs-mule' as coding system for Emacs Lisp
> source files ..."))

I'm actually not sure if using emacs-mule instead of iso-2022 helps.
It depends on whether or not unify-8859-on-decoding is also applied to
emacs-mule "decoding".

> Though, the only way to deal with the latter would be to modify the
> Lisp printer for writing *.elc files so that it escapes non-ascii
> characters wherever possible with the new \u syntax. This would be
> another solution to the problem we are discussing.]

This would break the compilation of ucs-tables.el.


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-13  4:52                                             ` Richard Stallman
  2006-05-13 13:25                                               ` Stefan Monnier
@ 2006-05-15  5:13                                               ` Kenichi Handa
  2006-05-15  8:06                                                 ` Kim F. Storm
                                                                   ` (2 more replies)
  1 sibling, 3 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-15  5:13 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

In article <E1Fem6n-0002Qv-9c@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     How about just asking users to use emacs-mule coding system
>     for *.el files if they want them decoded the same way
>     independent of various settings on byte-compiling?

> Maybe that is a good enough solution.  Does this solution
> solve the whole problem?

Yes, as far as I know.

> Handa says that telling people "don't use utf-8" solves the problem.

That's NOT what I said.  I said "use emacs-mule".  The
other coding systems are affected by
unify-8859-on-decoding-mode, and also by the user's setting of
standard-translation-table-for-decode.

> There is a way for a Lisp file to specify a coding system which isn't
> utf-8.  Is there a way for a Lisp file to specify that
> unify-8859-on-decoding should not be used when reading it?

No.

> If not, maybe we should make one.

But, as emacs-mule is not affected by
unify-8859-on-decoding, we don't have to invent one as long
as we advise people to use emacs-mule in the problematic
case.
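
For illustration, a source file can request this with a coding cookie on
its first line.  The file name below is made up, and this assumes an
Emacs in which emacs-mule is still the native internal encoding:

```elisp
;;; my-tables.el --- hypothetical example file -*- coding: emacs-mule -*-
```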

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  5:13                                               ` Kenichi Handa
@ 2006-05-15  8:06                                                 ` Kim F. Storm
  2006-05-15  9:04                                                   ` Andreas Schwab
  2006-05-15 20:38                                                   ` Richard Stallman
  2006-05-15 14:08                                                 ` Stefan Monnier
  2006-05-15 20:37                                                 ` Richard Stallman
  2 siblings, 2 replies; 202+ messages in thread
From: Kim F. Storm @ 2006-05-15  8:06 UTC (permalink / raw)
  Cc: alkibiades, rms, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> But, as emacs-mule is not affected by
> unify-8859-on-decoding, we don't have to invent it as long
> as we suggest people to use emacs-mule in a problematic
> case.

So why not _always_ use emacs-mule for .elc files (both on write
and on load)?  

Could it be a problem with existing *.elc files?


-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  8:06                                                 ` Kim F. Storm
@ 2006-05-15  9:04                                                   ` Andreas Schwab
  2006-05-15 20:38                                                   ` Richard Stallman
  1 sibling, 0 replies; 202+ messages in thread
From: Andreas Schwab @ 2006-05-15  9:04 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, rms, Kenichi Handa

storm@cua.dk (Kim F. Storm) writes:

> Kenichi Handa <handa@m17n.org> writes:
>
>> But, as emacs-mule is not affected by
>> unify-8859-on-decoding, we don't have to invent it as long
>> as we suggest people to use emacs-mule in a problematic
>> case.
>
> So why not _always_ use emacs-mule for .elc files (both on write
> and on load)?  

Don't we already?

(setq file-coding-system-alist
      '(("\\.elc\\'" . (emacs-mule . emacs-mule))
      ...)

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  3:27                                                           ` Stefan Monnier
@ 2006-05-15 10:20                                                             ` Oliver Scholz
  2006-05-15 11:12                                                               ` Oliver Scholz
  0 siblings, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-15 10:20 UTC (permalink / raw)
  Cc: emacs-devel, rms, handa, Oliver Scholz

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> I'm not sure I understand the nitpick:
[...]

That is entirely my mistake. It was late when I wrote the message and
my mind was occupied with utf-8 and `utf-fragment-on-decoding'. So I
misunderstood you as implying that ucs-tables.el was encoded in UTF-8.

> I'm actually not sure if using emacs-mule instead of iso-2022 helps.
> It depends on whether or not unify-8859-on-decoding is also applied to
> emacs-mule "decoding".

It doesn't. decode_coding_emacs_mule in coding.c doesn't refer to
Vstandard_translation_table_for_decode at all, which would be
necessary for unification.

>> Though, the only way to deal with the latter would be to modify the
>> Lisp printer for writing *.elc files so that it escapes non-ascii
>> characters wherever possible with the new \u syntax. This would be
>> another solution to the problem we are discussing.]
>
> This would break the compilation of ucs-tables.el.

Ah, of course, I had not thought about that. Well, there would have
to be an exception. I am not saying that this idea of mine is a good
idea, though, because I don't know how hairy it is to implement this.
IIRC `encode-char' and `decode-char' are not entirely symmetric, that
is, there are characters that `encode-char' can encode, but
`decode-char' can't decode. IIRC. But it would be the solution that
DTRT from the user's point of view. And it *could* be less hairy than
any of the other options discussed here, save "use emacs-mule!" and
"warn/throw an error/document the problem", of course.


    Oliver
-- 
Oliver Scholz               26 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15 10:20                                                             ` Oliver Scholz
@ 2006-05-15 11:12                                                               ` Oliver Scholz
  0 siblings, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-15 11:12 UTC (permalink / raw)
  Cc: emacs-devel, Stefan Monnier, handa, rms

Oliver Scholz <alkibiades@gmx.de> writes:

[...]
>>> Though, the only way to deal with the latter would be to modify the
>>> Lisp printer for writing *.elc files so that it escapes non-ascii
>>> characters wherever possible with the new \u syntax. This would be
>>> another solution to the problem we are discussing.]

[...]
> But it would be the solution that DTRT from the user's point of
> view.

Scrap that. Again, I was only thinking about UCS fragmentation. Sorry.
It would *not* DTRT for ISO 8859 encoded files if unification on
decoding is OFF (as is the default), since that would unify
everything.


    Oliver
-- 
Oliver Scholz               26 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  5:13                                               ` Kenichi Handa
  2006-05-15  8:06                                                 ` Kim F. Storm
@ 2006-05-15 14:08                                                 ` Stefan Monnier
  2006-05-15 20:37                                                 ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-15 14:08 UTC (permalink / raw)
  Cc: alkibiades, rms, emacs-devel

> That's NOT what I said.  I said "use emacs-mule".  The other coding
> systems are affected by unify-8859-on-decoding-mode, and also by users
> setting of standard-translation-table-for-decode.

Ah, that's good, thanks,


        Stefan

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  0:55                                                       ` Stefan Monnier
  2006-05-15  2:49                                                         ` Oliver Scholz
@ 2006-05-15 20:37                                                         ` Richard Stallman
  1 sibling, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-15 20:37 UTC (permalink / raw)
  Cc: emacs-devel, handa, alkibiades

    > Here's one idea: if the -*- line specifies `coding' and specifies
    > the mode `emacs-lisp' then force unify-8859-on-decoding to nil
    > for that file.

    Forcing it to nil for a particular file is maybe too much work to implement
    compared to the benefit.

What makes it hard?

    Maybe an easier solution is to add a file-local variable
    `no-8859-unification' such that if that file is loaded in an Emacs which
    is configured to use unify-8859-on-decoding it signals an error.

Why is it much harder to switch to the nil mode
than to signal an error?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  2:49                                                         ` Oliver Scholz
  2006-05-15  3:27                                                           ` Stefan Monnier
@ 2006-05-15 20:37                                                           ` Richard Stallman
  2006-05-16  9:49                                                             ` Oliver Scholz
  1 sibling, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-15 20:37 UTC (permalink / raw)
  Cc: emacs-devel, monnier, handa, alkibiades

    >> Here's one idea: if the -*- line specifies `coding' and specifies
    >> the mode `emacs-lisp' then force unify-8859-on-decoding to nil
    >> for that file.

    Besides the work already mentioned, this would also require to turn
    unify-8859-on-decoding-mode into a buffer-local minor mode.

That is not the only possible implementation mechanism.  The commands
that read and write the buffer could change it temporarily and change
it back.
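
A minimal sketch of that alternative, assuming that binding
`standard-translation-table-for-decode' to nil for the duration of the
read is enough to bypass unification (an assumption, not something
checked against coding.c).  `my-insert-file-no-unification' is a made-up
name:

```elisp
(defun my-insert-file-no-unification (file)
  "Insert FILE at point with the standard decode translation table disabled.
The dynamic binding is undone automatically when the `let' exits,
so the global setting is restored even if reading signals an error."
  (let ((standard-translation-table-for-decode nil))
    (insert-file-contents file)))
```

Because the change is a dynamic binding rather than a `setq', there is
no separate "change it back" step to forget.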

However, it seems like a really bad thing to have a minor mode
that CAN'T be buffer-local.  Why can't it be?  What is the difficulty?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  5:13                                               ` Kenichi Handa
  2006-05-15  8:06                                                 ` Kim F. Storm
  2006-05-15 14:08                                                 ` Stefan Monnier
@ 2006-05-15 20:37                                                 ` Richard Stallman
  2006-05-16 10:07                                                   ` Oliver Scholz
  2006-05-18  0:31                                                   ` Kenichi Handa
  2 siblings, 2 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-15 20:37 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

    > Handa says that telling people "don't use utf-8" solves the problem.

    That's NOT what I said.  I said "use emacs-mule".  The
    other coding systems are affected by
    unify-8859-on-decoding-mode, and also by users setting of
    standard-translation-table-for-decode.

Ok, I stand corrected.

However, people have pointed out that there are practical drawbacks
to using emacs-mule, and that iso-2022 is more convenient.
Let's see if we can arrange for iso-2022 to work properly.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15  8:06                                                 ` Kim F. Storm
  2006-05-15  9:04                                                   ` Andreas Schwab
@ 2006-05-15 20:38                                                   ` Richard Stallman
  1 sibling, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-15 20:38 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, handa

    > But, as emacs-mule is not affected by
    > unify-8859-on-decoding, we don't have to invent it as long
    > as we suggest people to use emacs-mule in a problematic
    > case.

    So why not _always_ use emacs-mule for .elc files (both on write
    and on load)?  

.elc files do not undergo decoding.  The issue is about compilation
and loading of Lisp source files.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15 20:37                                                           ` Richard Stallman
@ 2006-05-16  9:49                                                             ` Oliver Scholz
  2006-05-16 11:16                                                               ` Kim F. Storm
  2006-05-17  3:45                                                               ` Richard Stallman
  0 siblings, 2 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-16  9:49 UTC (permalink / raw)
  Cc: emacs-devel, monnier, handa, Oliver Scholz

Richard Stallman <rms@gnu.org> writes:

>     >> Here's one idea: if the -*- line specifies `coding' and specifies
>     >> the mode `emacs-lisp' then force unify-8859-on-decoding to nil
>     >> for that file.
>
>     Besides the work already mentioned, this would also require to turn
>     unify-8859-on-decoding-mode into a buffer-local minor mode.
>
> That is not the only possible implementation mechanism.  The commands
> that read and write the buffer could change it temporarily and change
> it back.
>
> However, it seems like a really bad thing to have a minor mode
> that CAN'T be buffer-local.  Why can't it be?  What is the difficulty?

Well, as already mentioned, unification and fragmentation are
implemented by means of translation tables. Unification, for instance,
for non-CCL decodings happens by means of modifying the parent of the
char table in the variable `standard-translation-table-for-decode'.
This is accessed as Vstandard_translation_table_for_decode in the
various decode_coding_XXX functions, for instance
decode_coding_iso2022, which affects many of the ISO 8859 coding
systems.

I have no idea whether it is simple to make this variable buffer local
or not. But, well, it's certainly intrusive to change such things at
the very heart and core of Emacs' decoding/encoding apparatus. (And
I'd like to second Kenichi Handa here: you might prefer to switch to
Unicode Emacs *now* rather than making this kind of modification. The
Emacs Unicode branch is in sync with the current HEAD. Weeding out
the remaining coding issues is possibly not much more work
and possibly *not* much more destabilizing than some of the
modifications we are discussing here.)

As for CCL-based coding systems, it is even a bit more difficult. CCL
coding systems do the translation table lookup in the CCL program
(with the CCL command `translate-character'). A named translation
table is *not* stored in a variable; it is stored in a
`translation-table' symbol property of the translation table's name.
The translation table relevant for unification in CCL decoding is
`ucs-translation-table-for-decode' (AFAICS only the cyrillic encodings
make use of this). The translation table relevant for fragmentation of
UCS coding systems is `utf-translation-table-for-decode'. You'd have
to find a way to make *that* buffer local.
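
To make the two storage schemes concrete, here is a small sketch using a
throwaway table.  `my-demo-translation-table' and both helpers are
hypothetical names, and the mapping from `a' to `b' is arbitrary:

```elisp
;; A named translation table lives on the symbol's `translation-table'
;; property, not in the symbol's value cell.
(define-translation-table 'my-demo-translation-table
  '((?a . ?b)))

(defun my-named-translation-table (name)
  "Return the char-table registered under the symbol NAME, or nil."
  (get name 'translation-table))

;; By contrast, the table used by the non-CCL decoders is an ordinary
;; variable; unification works by changing the parent of that char-table.
;; The `char-table-p' guard is there because the variable may be nil.
(defun my-standard-decode-table-parent ()
  "Return the parent of `standard-translation-table-for-decode', if any."
  (and (char-table-p standard-translation-table-for-decode)
       (char-table-parent standard-translation-table-for-decode)))
```

Since the named table hangs off a symbol property, there is no variable
binding that `make-local-variable' could act on, which is the difficulty
described above.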

As for being bad ... no, I don't think that it is bad that those minor
modes are global. They are a means to tune some details of Emacs'
internal handling of coding systems. *Internal* is the key point here.
It is nothing that conceptually relates to a particular file. (This
whole issue is something users should IMO not concern themselves with.
The benefit of changing the defaults is IMO dubious, anyways. I expect
that "unify on decoding of ISO 8859-*" and "fragmentation of UCS" will
mostly be abused for dealing with glyph issues -- i.e. something that
should be dealt with by adjusting the fontset.)

We are discussing a *very* special case here; it affects only Emacs
Lisp source files, because compilation of those, so to say, "freezes"
the particular settings for unification/fragmentation in the *.elc
file.


    Oliver
-- 
Oliver Scholz               27 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15 20:37                                                 ` Richard Stallman
@ 2006-05-16 10:07                                                   ` Oliver Scholz
  2006-05-18  0:31                                                   ` Kenichi Handa
  1 sibling, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-16 10:07 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, Kenichi Handa

Richard Stallman <rms@gnu.org> writes:

>     > Handa says that telling people "don't use utf-8" solves the problem.
>
>     That's NOT what I said.  I said "use emacs-mule".  The
>     other coding systems are affected by
>     unify-8859-on-decoding-mode, and also by users setting of
>     standard-translation-table-for-decode.
>
> Ok, I stand corrected.
>
> However, people have pointed out that there are practical drawbacks
> to using emacs-mule, and that iso-2022 is more convenient.
> Let's see if we can arrange for iso-2022 to work properly.

The same here. decode_coding_iso2022 (which is also responsible for
some ISO 8859 encodings) refers to
Vstandard_translation_table_for_decode.

The practical drawbacks *I* mentioned are basically the same with ISO
2022-7bit. (Disclaimer: I don't really understand ISO 2022. I am not
even sure whether this particular ISO standard specifies an encoding
(character set + transfer encoding) or is rather a standard *for*
specifying encodings.) I see that ISO-2022-JP-2 is, thanks to Kenichi
Handa, a registered IANA encoding. (But that is probably not the same
as ISO 2022-7bit?) But that means only that you are not, strictly
speaking, violating a standard if you use it in mail or news. In
practice, however, I very much doubt that outside of Japan there are
any editors, mail clients or news clients other than Emacs that are
able to deal with it.

I don't know whether being "8 bit clean" is still an issue for
networking connections today. If it is, then ISO-2022-7bit might have
an advantage for files in a CVS repository. But that's pretty much the
only advantage in practice.


    Oliver
-- 
Oliver Scholz               27 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16  9:49                                                             ` Oliver Scholz
@ 2006-05-16 11:16                                                               ` Kim F. Storm
  2006-05-16 11:39                                                                 ` Romain Francoise
                                                                                   ` (2 more replies)
  2006-05-17  3:45                                                               ` Richard Stallman
  1 sibling, 3 replies; 202+ messages in thread
From: Kim F. Storm @ 2006-05-16 11:16 UTC (permalink / raw)
  Cc: handa, rms, monnier, emacs-devel

Oliver Scholz <alkibiades@gmx.de> writes:

> We are discussing a *very* special case here; it affects only Emacs
> Lisp source files, because compilation of those, so to say, "freezes"
> the particular settings for unification/fragmentation in the *.elc
> file.

I really wonder why this has suddenly become such a big issue.

In practice, these things have worked fine for ages, so why bother _now_
when we should focus on finalizing the release of 22.1 ?

IIRC, the current issue was raised because someone suggested adding \u
and \U for Unicode to the Lisp reader -- something we have also lived
without for ages.

I would suggest to leave the entire issue for after the release!

Then everything will be solved the right way, as we will migrate to
unicode internally for 23.x.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16 11:16                                                               ` Kim F. Storm
@ 2006-05-16 11:39                                                                 ` Romain Francoise
  2006-05-16 11:58                                                                 ` Oliver Scholz
  2006-05-17  3:45                                                                 ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Romain Francoise @ 2006-05-16 11:39 UTC (permalink / raw)
  Cc: handa, emacs-devel, rms, monnier, Oliver Scholz

storm@cua.dk (Kim F. Storm) writes:

> I would suggest to leave the entire issue for after the release!

I concur.

-- 
Romain Francoise <romain@orebokech.com> | The sea! the sea! the open
it's a miracle -- http://orebokech.com/ | sea! The blue, the fresh, the
                                        | ever free! --Bryan W. Procter

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16 11:16                                                               ` Kim F. Storm
  2006-05-16 11:39                                                                 ` Romain Francoise
@ 2006-05-16 11:58                                                                 ` Oliver Scholz
  2006-05-16 14:24                                                                   ` Kim F. Storm
                                                                                     ` (2 more replies)
  2006-05-17  3:45                                                                 ` Richard Stallman
  2 siblings, 3 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-16 11:58 UTC (permalink / raw)
  Cc: handa, emacs-devel, rms, monnier, Oliver Scholz

storm@cua.dk (Kim F. Storm) writes:

> Oliver Scholz <alkibiades@gmx.de> writes:
>
>> We are discussing a *very* special case here; it affects only Emacs
>> Lisp source files, because compilation of those, so to say, "freezes"
>> the particular settings for unification/fragmentation in the *.elc
>> file.
>
> I really wonder why this has suddenly become such a big issue.
>
> In practice, these things have worked fine for ages, so why bother _now_
> when we should focus on finalizing the release of 22.1 ?

Unification and UCS fragmentation are new in Emacs 22.

[...]
> I would suggest to leave the entire issue for after the release!

I agree, in principle. IIRC, I was the first one here to suggest
just documenting the issue and being done with it. But documenting it would
be a good idea. Something along these lines: "When using non-ASCII
characters in Emacs Lisp source files, beware that compilation
"freezes" some of your current settings for character unification
and/or fragmentation. This might be exactly what you want. But if you
compile Emacs Lisp files with the intention to give the compiled files
to other users, you should make sure that the following settings are
at their default value: ..."


    Oliver
-- 
Oliver Scholz               27 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16 11:58                                                                 ` Oliver Scholz
@ 2006-05-16 14:24                                                                   ` Kim F. Storm
  2006-05-17  3:45                                                                   ` Richard Stallman
  2006-05-17 15:15                                                                   ` Stefan Monnier
  2 siblings, 0 replies; 202+ messages in thread
From: Kim F. Storm @ 2006-05-16 14:24 UTC (permalink / raw)
  Cc: emacs-devel, rms, monnier, handa

Oliver Scholz <alkibiades@gmx.de> writes:

> storm@cua.dk (Kim F. Storm) writes:
>
>> Oliver Scholz <alkibiades@gmx.de> writes:
>>
>>> We are discussing a *very* special case here; it affects only Emacs
>>> Lisp source files, because compilation of those, so to say, "freezes"
>>> the particular settings for unification/fragmentation in the *.elc
>>> file.
>>
>> I really wonder why this has suddenly become such a big issue.
>>
>> In practice, these things have worked fine for ages, so why bother _now_
>> when we should focus on finalizing the release of 22.1 ?
>
> Unification and UCS fragmentation are new in Emacs 22.

But Emacs 22 already has a large user community, and I don't
recall anyone actually complaining about it!

>
> [...]
>> I would suggest to leave the entire issue for after the release!
>
> I agree, in principle. IIRC, I was the first one here to suggest to
> just document the issue and be done with it. But documenting it would
> be a good idea. 

Indeed.

>                 Something along the lines: "When using non-ASCII
> characters in Emacs Lisp source files, beware that compilation
> "freezes" some of your current settings for character unification
> and/or fragmentation. This might exactly be what you want. But if you
> compile Emacs Lisp files with the intention to give the compiled files
> to other users, you should make sure that the following settings are
> at their default value: ..."

Is this a problem for any of the Lisp files included in CVS Emacs?

If so, it could be a problem for distributing pre-built versions.

Otherwise, I still think it is better left alone -- and documented
as you suggested (or simply advise against distributing such files
to anybody and let them compile the files themselves).

Could the byte-compiler warn if it encounters a non-ascii character
that may cause problems?
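
A rough sketch of the kind of check such a warning could start from.
The helper name `my-first-non-ascii-position' is hypothetical (it is
not part of Emacs or of bytecomp.el), and the sketch ignores the harder
question of which non-ASCII characters are actually affected by
unification or fragmentation:

```elisp
;; Hypothetical helper, not part of Emacs or bytecomp.el.
(defun my-first-non-ascii-position ()
  "Return the position of the first non-ASCII character in the
current buffer, or nil if the buffer contains only ASCII."
  (save-excursion
    (goto-char (point-min))
    ;; `[:nonascii:]' matches any character outside the ASCII range.
    (when (re-search-forward "[[:nonascii:]]" nil t)
      (match-beginning 0))))
```

The compiler could presumably call something like this on the source
buffer and warn when the result is non-nil; whether that would be too
noisy for files that use non-ASCII characters harmlessly is a separate
question.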

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16  9:49                                                             ` Oliver Scholz
  2006-05-16 11:16                                                               ` Kim F. Storm
@ 2006-05-17  3:45                                                               ` Richard Stallman
  2006-05-17  8:53                                                                 ` Oliver Scholz
  1 sibling, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-17  3:45 UTC (permalink / raw)
  Cc: emacs-devel, monnier, handa, alkibiades

    Well, as already mentioned, unification and fragmentation are
    implemented by means of translation tables. Unification, for instance,
    for non-CCL decodings happens by means of modifying the parent of the
    char table in the variable `standard-translation-table-for-decode'.

This suggests another implementation: make two such tables, one for
unification and one not for unification, and the mode can choose
which one gets used.

I'd like to understand a little more of the current design.  The
translation table that unification alters is the parent of the one in
`standard-translation-table-for-decode'.  What is the purpose of
making these child maps?
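
For readers unfamiliar with char-table parents, here is a minimal
sketch of the inheritance being discussed.  The tables and the
mappings in them are made-up stand-ins, not the real contents of
`standard-translation-table-for-decode'.  The point is only that a
lookup in the child falls through to the parent where the child has no
entry, so detaching the parent switches the inherited mappings off
without touching the child's own entries:

```elisp
;; Illustrative tables only; the names and mappings are assumptions.
(defvar my-parent-table (make-char-table 'translation-table))
(defvar my-child-table (make-char-table 'translation-table))

(aset my-parent-table ?a ?b)   ; mapping supplied through the parent
(aset my-child-table ?x ?y)    ; mapping the child holds on its own

(defun my-translate (table char)
  "Return the translation of CHAR in TABLE, or CHAR itself if none."
  (or (aref table char) char))

(set-char-table-parent my-child-table my-parent-table)
(my-translate my-child-table ?a)   ; inherited from the parent

(set-char-table-parent my-child-table nil)
(my-translate my-child-table ?a)   ; parent detached, ?a is left alone
(my-translate my-child-table ?x)   ; the child's own entry still applies
```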

    I have no idea whether it is simple to make this variable buffer local
    or not. But, well, it's certainly intrusive to change such things at
    the very heart and core of Emacs' decoding/encoding apparatus.

This kind of change is not intrusive at all.  It ought to be pretty trivial.

    The translation table relevant for unification in CCL decoding is
    `ucs-translation-table-for-decode' (AFAICS only the cyrillic encodings
    make use of this). The translation table relevant for fragmentation of
    UCS coding systems is `utf-translation-table-for-decode'. You'd have
    to find a way to make *that* buffer local.

It looks very easy to make the choice of table dynamic
for the Cyrillic coding systems.

What does "fragmentation" mean?  I do not recall seeing that term
in this context.

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16 11:16                                                               ` Kim F. Storm
  2006-05-16 11:39                                                                 ` Romain Francoise
  2006-05-16 11:58                                                                 ` Oliver Scholz
@ 2006-05-17  3:45                                                                 ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-17  3:45 UTC (permalink / raw)
  Cc: emacs-devel, monnier, handa, alkibiades

    IIRC, the current issue was raised because someone suggested to add \u
    and \U for unicode to the Lisp reader -- something we have also lived
    without for ages.

That suggestion led me to recognize that we have a problem,
but it isn't logically related to the problem.

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16 11:58                                                                 ` Oliver Scholz
  2006-05-16 14:24                                                                   ` Kim F. Storm
@ 2006-05-17  3:45                                                                   ` Richard Stallman
  2006-05-17  8:37                                                                     ` Oliver Scholz
                                                                                       ` (2 more replies)
  2006-05-17 15:15                                                                   ` Stefan Monnier
  2 siblings, 3 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-17  3:45 UTC (permalink / raw)
  Cc: alkibiades, handa, emacs-devel, monnier, storm

    I agree, in principle. IIRC, I was the first one here to suggest to
    just document the issue and be done with it. But documenting it would
    be a good idea. Something along the lines: "When using non-ASCII
    characters in Emacs Lisp source files, beware that compilation
    "freezes" some of your current settings for character unification
    and/or fragmentation.

I want to fix this bug, not document it.

As far as I can see, people are overestimating the difficulty of
fixing it, by focusing on certain approaches (which are difficult)
rather than looking for the ways that are easy.

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17  3:45                                                                   ` Richard Stallman
@ 2006-05-17  8:37                                                                     ` Oliver Scholz
  2006-05-17 20:09                                                                       ` Richard Stallman
  2006-05-17 12:37                                                                     ` Oliver Scholz
  2006-05-18  1:09                                                                     ` Kenichi Handa
  2 siblings, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-17  8:37 UTC (permalink / raw)
  Cc: storm, emacs-devel, monnier, handa, Oliver Scholz

Richard Stallman <rms@gnu.org> writes:

[...]
> I want to fix this bug, not document it.
>
> As far as I can see, people are overestimating the difficulty of
> fixing it, by focusing on certain approaches (which are difficult)
> rather than looking for the ways that are easy.

Well, the solution that is both the easiest and the *cleanest* is to
remove `unify-8859-on-decoding-mode' and `utf-fragment-on-decoding'. I
am not kidding. I don't see the need for having those.


    Oliver
-- 
Oliver Scholz               28 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17  3:45                                                               ` Richard Stallman
@ 2006-05-17  8:53                                                                 ` Oliver Scholz
  2006-05-17 20:09                                                                   ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-17  8:53 UTC (permalink / raw)
  Cc: emacs-devel, monnier, handa, Oliver Scholz

Richard Stallman <rms@gnu.org> writes:

[...]
> What does "fragmentation" mean?  I do not recall seeing that term
> in this context.

It's the opposite of unification. In this context it can mean two
different things:

1. Undo the effects of `unify-8859-on-decoding' mode. That is, when
   decoding some encodings like cyrillic or some ISO 8859 encodings,
   then decode them to characters from appropriate mule charsets (e.g.
   `greek-iso8859-7') rather than to characters from the charset
   `mule-unicode-0100-24ff'. This is the default.

2. When decoding UCS encodings like UTF-8, decode characters from
   certain repertoires, e.g. "Greek", to different mule charsets like
   `greek-iso8859-7'. The default is to decode them all to characters
   from `mule-unicode-0100-24ff'. A user can turn this behaviour on by
   customizing `utf-fragment-on-decoding'.


    Oliver
-- 
Oliver Scholz               28 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17  3:45                                                                   ` Richard Stallman
  2006-05-17  8:37                                                                     ` Oliver Scholz
@ 2006-05-17 12:37                                                                     ` Oliver Scholz
  2006-05-19  2:05                                                                       ` Richard Stallman
  2006-05-18  1:09                                                                     ` Kenichi Handa
  2 siblings, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-17 12:37 UTC (permalink / raw)
  Cc: storm, emacs-devel, monnier, handa, Oliver Scholz

Growing a bit tired of this discussion, I hacked a kludge that might
do what you want. It introduces a variable
`byte-compile-no-char-translation' that is meant to be put into the
Local Variables section of an Emacs Lisp source file in order to
inhibit the effects of `utf-fragment-on-decoding' and
`unify-8859-on-decoding-mode'. In other words: This patch deals only with
the issues that *I* can understand. I seem to recall that Handa also
mentioned some effects of certain CJK language environments.

It is *absolutely vital*, that Kenichi Handa reviews this patch. I am
not entirely sure whether this breaks something or not.

With my patch, in decode_coding_iso2022 looking up characters in
Vstandard_translation_table_for_decode is inhibited entirely if
`byte-compile-no-char-translation' is non-nil. This might be wrong.
Vstandard_translation_table_for_decode is not empty by default. I
guess instead of inhibiting its use one could just temporarily set its
parent at about the same place. But maybe this is unnecessary.

decode_coding_sjis_big5 refers to
Vstandard_translation_table_for_decode, too. I did not modify it,
though, thus introducing a possible inconsistency. The reason is that
I don't understand CJK issues and I don't understand this encoding.

Note: Even with the remaining issues weeded out, IMNSHO this patch is
worse than the two other solutions: (1) tell users to use emacs-mule,
or (2) remove `unify-8859-on-decoding-mode' and
`utf-fragment-on-decoding'. The reasoning goes as follows:

    Check: Are `unify-8859-on-decoding-mode' and
    `utf-fragment-on-decoding' useful options?

    If no: Remove them, since they cause only trouble.

    If yes: then a user who sets them will want them for all affected
            characters. The choice of unification/fragmentation should
            not be the choice of the programmer of the Lisp package;
            it should be the choice of the user.

            (To quote a future user, complaining on gnu-emacs-help:
            "The heck! Why do I have only hollow boxes for my Greek
            characters after byte compilation??? It's all fine in the
            source file!!!")

    Exception: In the event that the particular choice of charsets is
    important for a Lisp Package: Use `emacs-mule'!

    
    Oliver

Index: lisp/files.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/files.el,v
retrieving revision 1.836
diff -u -r1.836 files.el
--- lisp/files.el	16 May 2006 18:33:31 -0000	1.836
+++ lisp/files.el	17 May 2006 12:08:43 -0000
@@ -2361,6 +2361,7 @@
 	(left-margin                     . integerp) ;; C source code
 	(no-update-autoloads             . booleanp)
 	(tab-width                       . integerp) ;; C source code
+        (byte-compile-no-char-translation . booleanp) ;; C source code
 	(truncate-lines                  . booleanp))) ;; C source code
 
 (put 'c-set-style 'safe-local-eval-function t)
Index: lisp/emacs-lisp/bytecomp.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/emacs-lisp/bytecomp.el,v
retrieving revision 2.185
diff -u -r2.185 bytecomp.el
--- lisp/emacs-lisp/bytecomp.el	16 May 2006 10:05:09 -0000	2.185
+++ lisp/emacs-lisp/bytecomp.el	17 May 2006 12:08:45 -0000
@@ -1673,6 +1673,14 @@
 	    (enable-local-eval nil))
 	;; Arg of t means don't alter enable-local-variables.
         (normal-mode t)
+
+        ;; KLUDGE: `byte-compile-no-char-translation' should affect
+        ;; how characters are decoded. But at this point decoding
+        ;; already happened. So we insert the file contents again.
+        (when byte-compile-no-char-translation
+          (erase-buffer)
+          (insert-file-contents filename))
+        
         (setq filename buffer-file-name))
       ;; Set the default directory, in case an eval-when-compile uses it.
       (setq default-directory (file-name-directory filename)))
Index: src/coding.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/coding.c,v
retrieving revision 1.336
diff -u -r1.336 coding.c
--- src/coding.c	8 May 2006 05:25:02 -0000	1.336
+++ src/coding.c	17 May 2006 12:08:50 -0000
@@ -405,6 +405,15 @@
 
 Lisp_Object Qcoding_system_p, Qcoding_system_error;
 
+/* This variable is meant to turn off character translation during byte
+   compilation. */
+
+Lisp_Object Vbyte_compile_no_char_translation;
+
+Lisp_Object empty_translation_table;
+Lisp_Object Qucs_translation_table_for_decode, Qutf_translation_table_for_decode;
+Lisp_Object Qunify_8859_on_decoding_mode, Qutf_fragment_on_decoding;
+
 /* Coding system emacs-mule and raw-text are for converting only
    end-of-line format.  */
 Lisp_Object Qemacs_mule, Qraw_text;
@@ -1849,7 +1858,7 @@
   else
     {
       translation_table = coding->translation_table_for_decode;
-      if (NILP (translation_table))
+      if (NILP (translation_table) && NILP (Vbyte_compile_no_char_translation))
 	translation_table = Vstandard_translation_table_for_decode;
     }
 
@@ -4938,8 +4947,48 @@
 	  dst_bytes--;
 	  extra = coding->spec.ccl.cr_carryover;
 	}
-      ccl_coding_driver (coding, source, destination + extra,
-			 src_bytes, dst_bytes, 0);
+
+      /* KLUDGE: Inhibit unification and/or fragmentation. This is
+        meant for byte compiling Emacs Lisp source files. For CCL
+        based coding systems it has to be done here, because we want
+        it only for decoding. We temporarily swap the affected
+        translation tables in Vtranslation_table_vector with an empty
+        translation table.*/
+      if (! NILP (Vbyte_compile_no_char_translation)
+          && (! NILP (SYMBOL_VALUE (Qunify_8859_on_decoding_mode))
+              || ! NILP (SYMBOL_VALUE (Qutf_fragment_on_decoding))))
+        {
+          if (NILP (empty_translation_table))
+            {
+              empty_translation_table =
+                call0 (intern ("make-translation-table"));
+            }
+
+          Lisp_Object ucs_tt = Fget (Qucs_translation_table_for_decode, Qtranslation_table);
+          Lisp_Object ucs_id = Fget (Qucs_translation_table_for_decode, Qtranslation_table_id);
+
+          Lisp_Object utf_tt = Fget (Qutf_translation_table_for_decode, Qtranslation_table);
+          Lisp_Object utf_id = Fget (Qutf_translation_table_for_decode, Qtranslation_table_id);
+
+          /* Should this be `unwind-protect'ed? */
+
+          Faset (Vtranslation_table_vector, ucs_id, Fcons (Qucs_translation_table_for_decode,
+                                                           empty_translation_table));
+          Faset (Vtranslation_table_vector, utf_id, Fcons (Qutf_translation_table_for_decode,
+                                                           empty_translation_table));
+
+          ccl_coding_driver (coding, source, destination + extra,
+                             src_bytes, dst_bytes, 0);
+
+          Faset (Vtranslation_table_vector, ucs_id, Fcons (Qucs_translation_table_for_decode,
+                                                           ucs_tt));
+          Faset (Vtranslation_table_vector, utf_id, Fcons (Qutf_translation_table_for_decode,
+                                                           utf_tt));
+
+        }
+      else ccl_coding_driver (coding, source, destination + extra,
+                              src_bytes, dst_bytes, 0);
+      
       if (coding->eol_type != CODING_EOL_LF)
 	{
 	  coding->produced += extra;
@@ -7852,6 +7901,34 @@
   defsubr (&Sset_coding_priority_internal);
   defsubr (&Sdefine_coding_system_internal);
 
+  DEFVAR_LISP ("byte-compile-no-char-translation", &Vbyte_compile_no_char_translation,
+               doc: /* Don't translate characters during byte compilation.
+
+Options like `utf-fragment-on-decoding' or the minor mode
+`unify-8859-on-decoding-mode' modify the way Emacs maps file encodings
+to mule charsets.  Since *.elc files are encoded in emacs-mule, such
+settings are preserved in the compiled file.  If this variable is
+non-nil, Emacs uses the default mule charsets.
+
+You can set this variable in the local variables section of a file. */);
+  Vbyte_compile_no_char_translation = Qnil;
+
+  empty_translation_table = Qnil;
+  staticpro (&empty_translation_table);
+  
+  Qucs_translation_table_for_decode = intern ("ucs-translation-table-for-decode");
+  staticpro (&Qucs_translation_table_for_decode);
+
+  Qutf_translation_table_for_decode = intern ("utf-translation-table-for-decode");
+  staticpro (&Qutf_translation_table_for_decode);
+
+  Qunify_8859_on_decoding_mode = intern ("unify-8859-on-decoding-mode");
+  staticpro (&Qunify_8859_on_decoding_mode);
+
+  Qutf_fragment_on_decoding = intern ("utf-fragment-on-decoding");
+  staticpro (&Qutf_fragment_on_decoding);
+  
+  
   DEFVAR_LISP ("coding-system-list", &Vcoding_system_list,
 	       doc: /* List of coding systems.
 
    
-- 
Oliver Scholz               28 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-16 11:58                                                                 ` Oliver Scholz
  2006-05-16 14:24                                                                   ` Kim F. Storm
  2006-05-17  3:45                                                                   ` Richard Stallman
@ 2006-05-17 15:15                                                                   ` Stefan Monnier
  2 siblings, 0 replies; 202+ messages in thread
From: Stefan Monnier @ 2006-05-17 15:15 UTC (permalink / raw)
  Cc: emacs-devel, rms, handa, Kim F. Storm

> Unification and UCS fragmentation are new in Emacs 22.

unify-8859-on-decoding-mode was introduced in Emacs-21.3.


        Stefan

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17  8:37                                                                     ` Oliver Scholz
@ 2006-05-17 20:09                                                                       ` Richard Stallman
  0 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-17 20:09 UTC (permalink / raw)
  Cc: storm, emacs-devel, monnier, handa, alkibiades

    Well, the solution that is both the easiest and the *cleanest* is to
    remove `unify-8859-on-decoding-mode' and `utf-fragment-on-decoding'.

It looks like `utf-fragment-on-decoding' is not relevant to the issue
of making Lisp files encoded in iso-2022 reliable.  So we can forget
about that.

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17  8:53                                                                 ` Oliver Scholz
@ 2006-05-17 20:09                                                                   ` Richard Stallman
  2006-05-18  9:12                                                                     ` Oliver Scholz
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-17 20:09 UTC (permalink / raw)
  Cc: emacs-devel, monnier, handa, alkibiades

    > What does "fragmentation" mean?  I do not recall seeing that term
    > in this context.

    It's the opposite of unification. In this context it can mean two
    different things:

Thanks.

    1. Undo the effects of `unify-8859-on-decoding' mode. That is, when
       decoding some encodings like cyrillic or some ISO 8859 encodings,
       then decode them to characters from appropriate mule charsets (e.g.
       `greek-iso8859-7') rather than to characters from the charset
       `mule-unicode-0100-24ff'. This is the default.

Don't you mean "Don't perform the actions of `unify-8859-on-decoding'
mode?"

    2. When decoding UCS encodings like UTF-8, decode characters from
       certain repertoires, e.g. "Greek", to different mule charsets like
       `greek-iso8859-7'. The default is to decode them all to characters
       from `mule-unicode-0100-24ff'. A user can turn this behaviour on by
       customizing `utf-fragment-on-decoding'.

That one isn't relevant to the problem we need to solve now.

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-15 20:37                                                 ` Richard Stallman
  2006-05-16 10:07                                                   ` Oliver Scholz
@ 2006-05-18  0:31                                                   ` Kenichi Handa
  1 sibling, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-18  0:31 UTC (permalink / raw)
  Cc: emacs-devel, alkibiades

In article <E1Ffjop-0000Wx-9d@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>> Handa says that telling people "don't use utf-8" solves the problem.
>     That's NOT what I said.  I said "use emacs-mule".  The
>     other coding systems are affected by
>     unify-8859-on-decoding-mode, and also by users setting of
>     standard-translation-table-for-decode.

> Ok, I stand corrected.

> However, people have pointed out that there are practical drawbacks
> to using emacs-mule, and that iso-2022 is more convenient.
> Let's see if we can arrange for iso-2022 to work properly.

For iso-2022 based coding-systems, the situation is simpler
than in the utf-* case.  We already have the variable
`enable-character-translation'; just by making it a local
variable of the buffer where a file is being read, and
setting it to nil, we can read a file in a constant way.
The only hack we need is to detect that variable in the "Local
Variables:" section before decoding starts (perhaps in
set-auto-coding).

---
Kenichi Handa
handa@m17n.org

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17  3:45                                                                   ` Richard Stallman
  2006-05-17  8:37                                                                     ` Oliver Scholz
  2006-05-17 12:37                                                                     ` Oliver Scholz
@ 2006-05-18  1:09                                                                     ` Kenichi Handa
  2006-05-21  0:57                                                                       ` Richard Stallman
  2 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-18  1:09 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

In article <E1FgCyb-0001Uq-Pk@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     I agree, in principle. IIRC, I was the first one here to suggest to
>     just document the issue and be done with it. But documenting it would
>     be a good idea. Something along the lines: "When using non-ASCII
>     characters in Emacs Lisp source files, beware that compilation
>     "freezes" some of your current settings for character unification
>     and/or fragmentation.

> I want to fix this bug, not document it.

I'm confused.  You wrote:

> Handa says that telling people "don't use utf-8" solves the problem.
> If that is a good solution, I think the problem is solved.
> Does everyone agree that that solution works?

So, I thought you accepted that kind of solution;
i.e. documenting the potential problem with decoding and
the way to avoid the ambiguity if one has a problematic *.el
file.  There are two ways to avoid it.

(1) use emacs-mule coding system
(2) use one of iso-2022 based coding systems (they include
iso-8859-X) with setting enable-character-translation to nil
in "Local Variables:" section.

(1) works now.  (2) doesn't work now but is easy to make
work as I wrote in the previous mail.
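
To make the two options concrete, here is roughly what each might look
like at the end of a *.el file.  This is only a sketch: iso-latin-1 is
an arbitrary example of an iso-2022 based coding system, and the second
form assumes the change needed to make (2) work has been made.  A file
would use one form or the other, not both.

```elisp
;; Option (1): the file is saved in emacs-mule.
;;
;; Local Variables:
;; coding: emacs-mule
;; End:

;; Option (2): keep an iso-2022 based coding system, but ask for
;; character translation to be switched off while the file is decoded.
;;
;; Local Variables:
;; coding: iso-latin-1
;; enable-character-translation: nil
;; End:
```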

Do you want something more?

---
Kenichi Handa
handa@m17n.org

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17 20:09                                                                   ` Richard Stallman
@ 2006-05-18  9:12                                                                     ` Oliver Scholz
  0 siblings, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-18  9:12 UTC (permalink / raw)
  Cc: emacs-devel, monnier, handa, Oliver Scholz

Richard Stallman <rms@gnu.org> writes:

>     > What does "fragmentation" mean?  I do not recall seeing that term
>     > in this context.
>
>     It's the opposite of unification. In this context it can mean two
>     different things:
>
> Thanks.
>
>     1. Undo the effects of `unify-8859-on-decoding' mode. That is, when
>        decoding some encodings like cyrillic or some ISO 8859 encodings,
>        then decode them to characters from appropriate mule charsets (e.g.
>        `greek-iso8859-7') rather than to characters from the charset
>        `mule-unicode-0100-24ff'. This is the default.
>
> Don't you mean "Don't perform the actions of `unify-8859-on-decoding'
> mode?"

Both actually. This is what Emacs does by default. But there's also a
function `ucs-fragment-8859' that, when called with its second
argument non-nil, reverses the effects of unification on decoding.
This is how `unify-8859-on-decoding-mode' is implemented:

(define-minor-mode unify-8859-on-decoding-mode
  "Set up translation-tables for unifying ISO 8859 characters on decoding.
[...]"
  :group 'mule
  :global t
  :init-value nil
  (if unify-8859-on-decoding-mode
      (ucs-unify-8859 nil t)
    (ucs-fragment-8859 nil t)))


>     2. When decoding UCS encodings like UTF-8, decode characters from
>        certain repertoires, e.g. "Greek", to different mule charsets like
>        `greek-iso8859-7'. The default is to decode them all to characters
>        from `mule-unicode-0100-24ff'. A user can turn this behaviour on by
>        customizing `utf-fragment-on-decoding'.
>
> That one isn't relevant to the problem we need to solve now.

Sorry, I was not aware that this decision was definitive.


    Oliver
-- 
Oliver Scholz               29 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17 12:37                                                                     ` Oliver Scholz
@ 2006-05-19  2:05                                                                       ` Richard Stallman
  2006-05-19  8:47                                                                         ` Oliver Scholz
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-19  2:05 UTC (permalink / raw)
  Cc: storm, emacs-devel, monnier, handa, alkibiades

The C code you wrote to implement byte-compile-no-char-translation
might be the right C-level feature.  We need Handa to check that, as
you said.

It should only affect unify-8859-on-decoding-mode, since that's the
only one that's relevant to iso-2022.  (We decided to give up on
stabilizing utf-8 in this way, because too many different variables
affect the behavior of utf-8.)

However, the right place to set the variable is in find-auto-coding,
not in the compiler.  Therefore, the variable's name should be
changed, since it won't be specific to compilation.  It could be
stabilize-iso-2022.

I can see three possible ways for Lisp files to set this variable:

1. Explicitly.  You should specify the variable in the -*- line or the
Local Variables list if it matters.

2. Automatically.  Whenever a file specifies Emacs-Lisp mode and
coding iso-2022, it gets set to t.

3. Both.  It gets set to t automatically, but a file can explicitly
specify nil.

I have another, further suggestion.  Rename the variable to
unify-8859-on-decoding-mode, and reimplement the function
unify-8859-on-decoding-mode to work just by setting the variable.
That would be an improvement, since it would mean you can set the mode
just by setting the variable.

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-19  2:05                                                                       ` Richard Stallman
@ 2006-05-19  8:47                                                                         ` Oliver Scholz
  0 siblings, 0 replies; 202+ messages in thread
From: Oliver Scholz @ 2006-05-19  8:47 UTC (permalink / raw)
  Cc: storm, emacs-devel, monnier, handa, Oliver Scholz

Richard Stallman <rms@gnu.org> writes:

> The C code you wrote to implement byte-compile-no-char-translation
> might be the right C-level feature.  We need Handa to check that, as
> you said.

No need to, anymore. Only those two issues you mentioned are new with
my patch: the limitation to byte compilation and the attempt to fix
UTF-8. Since you said you don't want either, and since Handa said,
Venable_character_translation can be used, my patch is meaningless.
Putting it into find-auto-coding is indeed much better, since it
avoids any inconsistencies in character encoding between the visited
*.el file and the byte compiled file.


    Oliver
-- 
Oliver Scholz               30 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-18  1:09                                                                     ` Kenichi Handa
@ 2006-05-21  0:57                                                                       ` Richard Stallman
  2006-05-22  1:33                                                                         ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-21  0:57 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

    (1) use emacs-mule coding system

    (2) use one of iso-2022 based coding systems (they include
    iso-8859-X) with setting enable-character-translation to nil
    in "Local Variables" section.

    (1) works now.  (2) doesn't work now but is easy to make
    work as I wrote in the previous mail.

People have pointed out disadvantages of (1).

Maybe (2) is a good solution.  I want to check, though.  It would turn
off *all* character translation.  We need to verify that this is ok.

Supposing that unify-8859-on-decoding-mode is off, and you read a file
in an iso-2022 coding system.  What character translation is done, or
might be done, and in what cases?

In this code,

    (defun ucs-fragment-8859 (for-encode for-decode)
      "Undo the unification done by `ucs-unify-8859'.
    With prefix arg, undo unification on encoding only, i.e. don't undo
    unification on input operations."
      (when for-decode
	;; Don't Unify 8859 on decoding.
	;; For non-CCL coding systems (e.g. iso-latin-2).
	(set-char-table-parent standard-translation-table-for-decode nil)

we turn off the parent of standard-translation-table-for-decode.
But what else might standard-translation-table-for-decode do
for some of these coding systems?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-21  0:57                                                                       ` Richard Stallman
@ 2006-05-22  1:33                                                                         ` Kenichi Handa
  2006-05-22 15:12                                                                           ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-22  1:33 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

In article <E1FhcG5-0002nB-VN@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     (1) use emacs-mule coding system
>     (2) use one of iso-2022 based coding systems (they include
>     iso-8859-X) with setting enable-character-translation to nil
>     in "Local Variables" section.

>     (1) works now.  (2) doesn't work now, but it is easy to make it
>     work, as I wrote in the previous mail.

> People have pointed out disadvantages of (1).

I don't think that is a big problem, because it seems that
it's very rare to handle *.el files with a tool other than
Emacs.

> Maybe (2) is a good solution.  I want to check, though.  It would turn
> off *all* character translation.  We need to verify that this is ok.

I believe it is ok.

> Supposing that unify-8859-on-decoding-mode is off, and you read a file
> in an iso-2022 coding system.  What character translation is done, or
> might be done, and in what cases?

> In this code,

>     (defun ucs-fragment-8859 (for-encode for-decode)
>       "Undo the unification done by `ucs-unify-8859'.
>     With prefix arg, undo unification on encoding only, i.e. don't undo
>     unification on input operations."
>       (when for-decode
> 	;; Don't Unify 8859 on decoding.
> 	;; For non-CCL coding systems (e.g. iso-latin-2).
> 	(set-char-table-parent standard-translation-table-for-decode nil)

> we turn off the parent of standard-translation-table-for-decode.
> But what else might standard-translation-table-for-decode do
> for some of these coding systems?

standard-translation-table-for-decode is for reflecting any
user preferences on decoding.  So, it can do anything.  If
one hates SOFT HYPHEN (U+00AD), one can map it to `-'.

The default value of standard-translation-table-for-decode
is not nil.  It contains a mapping for
JISX0208.1978->JISX0208.1980 and JISX0201->ASCII.  But, this
is to compensate for an encoding used in Japan a very long
time ago, and even if Emacs reads a *.el file in such an
encoding, on writing, the new encoding is used.  That means
that the mapping is not used when the file is read next
time.

So, disabling character translation on reading an iso-2022
*.el file effectively stabilizes the byte-compiling of the
file without any actual problem.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-22  1:33                                                                         ` Kenichi Handa
@ 2006-05-22 15:12                                                                           ` Richard Stallman
  2006-05-23  1:05                                                                             ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-22 15:12 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

    So, disabling character translation on reading an iso-2022
    *.el file effectively stabilizes the byte-compiling of the
    file without any actual problem.

Ok, I am convinced that disabling character translation is a good
solution _mechanism_.  The remaining question is what user interface
to use.  That is, how should Emacs determine that it should set
enable-character-translation to nil for these files?

One obvious way is an explicit specification of the variable
enable-character-translation.  But that would be cumbersome to use.

Another way is that specification of coding: together with mode:
emacs-lisp could do this automatically.

Another way is that you could specify coding: in a special way,
perhaps with ! at the end of the coding system name.

What do you think?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-22 15:12                                                                           ` Richard Stallman
@ 2006-05-23  1:05                                                                             ` Kenichi Handa
  2006-05-23  5:18                                                                               ` Juri Linkov
  2006-05-24  2:17                                                                               ` Richard Stallman
  0 siblings, 2 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-23  1:05 UTC (permalink / raw)
  Cc: alkibiades, storm, monnier, emacs-devel

In article <E1FiC4w-00075E-2u@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     So, disabling character translation on reading an iso-2022
>     *.el file effectively stabilizes the byte-compiling of the
>     file without any actual problem.

> Ok, I am convinced that disabling character translation is a good
> solution _mechanism_.  The remaining question is what user interface
> to use.  That is, how should Emacs determine that it should set
> enable-character-translation to nil for these files?

> One obvious way is an explicit specification of the variable
> enable-character-translation.  But that would be cumbersome to use.

At least, this should work for people who don't mind the
cumbersomeness.

> Another way is that specification of coding: together with mode:
> emacs-lisp could do this automatically.

I object to this because it's an incompatible change that
should be avoided at this stage.  In addition, we would then have
to invent some way to "enable" normal character translation.

> Another way is that you could specify coding: in a special way,
> perhaps with ! at the end of the coding system name.

I don't know if that is aesthetically good, but at least
it's quite a handy way.

;; xxx.el -- Do XXX.   -*- coding: latin-1!; -*-

So, I'd like to implement both the 1st and 3rd method.  Ok?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-23  1:05                                                                             ` Kenichi Handa
@ 2006-05-23  5:18                                                                               ` Juri Linkov
  2006-05-24  2:18                                                                                 ` Richard Stallman
  2006-05-24  2:17                                                                               ` Richard Stallman
  1 sibling, 1 reply; 202+ messages in thread
From: Juri Linkov @ 2006-05-23  5:18 UTC (permalink / raw)
  Cc: storm, emacs-devel, rms, monnier, alkibiades

>> Another way is that you could specify coding: in a special way,
>> perhaps with ! at the end of the coding system name.
>
> I don't know if that is aesthetically good, but at least
> it's quite a handy way.
>
> ;; xxx.el -- Do XXX.   -*- coding: latin-1!; -*-

Such a notation is not self-evident.  What about

;; xxx.el -- Do XXX.   -*- coding: latin-1; translation: no -*-

or

;; xxx.el -- Do XXX.   -*- coding: latin-1; char-trans: no -*-

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-23  1:05                                                                             ` Kenichi Handa
  2006-05-23  5:18                                                                               ` Juri Linkov
@ 2006-05-24  2:17                                                                               ` Richard Stallman
  1 sibling, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-05-24  2:17 UTC (permalink / raw)
  Cc: alkibiades, storm, monnier, emacs-devel

    So, I'd like to implement both the 1st and 3rd method.  Ok?

Ok, please do.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-23  5:18                                                                               ` Juri Linkov
@ 2006-05-24  2:18                                                                                 ` Richard Stallman
  2006-06-02  6:49                                                                                   ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-24  2:18 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, monnier, storm, handa

    > ;; xxx.el -- Do XXX.   -*- coding: latin-1!; -*-

    Such a notation is not self-evident.  What about

    ;; xxx.el -- Do XXX.   -*- coding: latin-1; translation: no -*-

It would be useful to support the latter as well,
but I think we want something terse for this.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-24  2:18                                                                                 ` Richard Stallman
@ 2006-06-02  6:49                                                                                   ` Kenichi Handa
  2006-06-02  8:00                                                                                     ` Kim F. Storm
                                                                                                       ` (2 more replies)
  0 siblings, 3 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-06-02  6:49 UTC (permalink / raw)
  Cc: juri, storm, emacs-devel, monnier, alkibiades

In article <E1Fiix4-0006Tz-RZ@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>> ;; xxx.el -- Do XXX.   -*- coding: latin-1!; -*-
>     Such a notation is not self-evident.  What about

>     ;; xxx.el -- Do XXX.   -*- coding: latin-1; translation: no -*-

> It would be useful to support the latter as well,
> but I think we want something terse for this.

I've just installed these changes.

(1) Accept something like "latin-1!" as value of coding: in
    header and local variables section.
(2) Accept "char-trans: VAL" in header and local variables section.
    I think "translation: VAL" is too ambiguous.
(3) Accept "enable-character-translation: VAL" in local
    variables section.
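
As a rough illustration of form (1) above, here is a minimal sketch in C of
how a coding: value with a trailing "!" could be split into the
coding-system name and a "no translation" flag.  The function name
`split_coding_tag` is hypothetical; this is not the code that was installed
in Emacs, only a standalone sketch of the idea.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Emacs code: copy the coding-system name
   from VALUE into NAME (at most NAME_SIZE bytes, including the
   terminating NUL).  Return 1 if VALUE ended in '!', which in the
   notation discussed here means "disable character translation",
   and 0 otherwise.  */
static int
split_coding_tag (const char *value, char *name, size_t name_size)
{
  size_t len = strlen (value);
  int no_translation = 0;

  /* A trailing '!' is the terse marker; it is not part of the name.  */
  if (len > 0 && value[len - 1] == '!')
    {
      no_translation = 1;
      len--;
    }

  if (name_size == 0)
    return no_translation;

  /* Truncate rather than overflow if NAME is too small.  */
  if (len >= name_size)
    len = name_size - 1;
  memcpy (name, value, len);
  name[len] = '\0';
  return no_translation;
}
```

Under these assumptions, passing "latin-1!" yields the name "latin-1" with
the flag set, while a plain "utf-8" yields the same name with the flag clear.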

In which NEWS section should that information go?
Previously we simply had "* Changes in Emacs XX.YY", but now
we have these sections:

* Installation Changes in Emacs 22.1
* Startup Changes in Emacs 22.1
* Incompatible Editing Changes in Emacs 22.1
* Editing Changes in Emacs 22.1
* New Modes and Packages in Emacs 22.1
* Changes in Specialized Modes and Packages in Emacs 22.1
* Changes in Emacs 22.1 on non-free operating systems
* Incompatible Lisp Changes in Emacs 22.1
* Lisp Changes in Emacs 22.1
* New Packages for Lisp Programming in Emacs 22.1

It seems that this change is "Editing Changes", but I'm not
sure whether we can declare it incompatible or not.  Previously, if
a file had "coding: latin-1!", it was treated as an invalid
coding specification.  In that sense, this change is
incompatible, but...

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02  6:49                                                                                   ` Kenichi Handa
@ 2006-06-02  8:00                                                                                     ` Kim F. Storm
  2006-06-02  9:27                                                                                     ` Juri Linkov
  2006-06-02 22:39                                                                                     ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Kim F. Storm @ 2006-06-02  8:00 UTC (permalink / raw)
  Cc: juri, alkibiades, rms, monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> It seems that this change is "Editing Changes", but I'm not
> sure whether we can declare it incompatible or not.  Previously, if
> a file had "coding: latin-1!", it was treated as an invalid
> coding specification.  In that sense, this change is
> incompatible, but...

The change is not incompatible in the sense of breaking
existing _valid_ coding specs.

The section

** Multilingual Environment (Mule) changes:

seems appropriate??

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02  6:49                                                                                   ` Kenichi Handa
  2006-06-02  8:00                                                                                     ` Kim F. Storm
@ 2006-06-02  9:27                                                                                     ` Juri Linkov
  2006-06-02 10:50                                                                                       ` Eli Zaretskii
                                                                                                         ` (2 more replies)
  2006-06-02 22:39                                                                                     ` Richard Stallman
  2 siblings, 3 replies; 202+ messages in thread
From: Juri Linkov @ 2006-06-02  9:27 UTC (permalink / raw)
  Cc: storm, emacs-devel, rms, monnier, alkibiades

> (2) Accept "char-trans: VAL" in header and local variables section.
>     I think "translation: VAL" is too ambiguous.
> (3) Accept "enable-character-translation: VAL" in local
>     variables section.

I think using different names in the first line and in the local
variables section is not good.  Users may move these settings
between these two places in the same file, and incompatible names
are the source of inconvenience and confusion.  Since the variable name
is `enable-character-translation', there should be no problem in using it
in the first line as well.

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02  9:27                                                                                     ` Juri Linkov
@ 2006-06-02 10:50                                                                                       ` Eli Zaretskii
  2006-06-02 11:39                                                                                       ` Kenichi Handa
  2006-06-02 22:39                                                                                       ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-06-02 10:50 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel

> From: Juri Linkov <juri@jurta.org>
> Date: Fri, 02 Jun 2006 12:27:01 +0300
> Cc: storm@cua.dk, emacs-devel@gnu.org, rms@gnu.org, monnier@iro.umontreal.ca,
> 	alkibiades@gmx.de
> 
> > (2) Accept "char-trans: VAL" in header and local variables section.
> >     I think "translation: VAL" is too ambiguous.
> > (3) Accept "enable-character-translation: VAL" in local
> >     variables section.
> 
> I think using different names in the first line and in the local
> variables section is not good.

But we already do that, e.g. with coding: in the header vs
coding-system in local vars.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02  9:27                                                                                     ` Juri Linkov
  2006-06-02 10:50                                                                                       ` Eli Zaretskii
@ 2006-06-02 11:39                                                                                       ` Kenichi Handa
  2006-06-02 12:12                                                                                         ` Juri Linkov
  2006-06-02 22:39                                                                                       ` Richard Stallman
  2 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-06-02 11:39 UTC (permalink / raw)
  Cc: storm, emacs-devel, rms, monnier, alkibiades

In article <87ac8vor6u.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

>> (2) Accept "char-trans: VAL" in header and local variables section.
>> I think "translation: VAL" is too ambiguous.
>> (3) Accept "enable-character-translation: VAL" in local
>> variables section.

> I think using different names in the first line and in the local
> variables section is not good.  Users may move these settings
> between these two places in the same file, and incompatible names
> are the source of inconvenience and confusion.  Since the variable name
> is `enable-character-translation', there should be no problem in using it
> in the first line as well.

But, wasn't it you who proposed "char-trans"?

Eli Zaretskii <eliz@gnu.org> writes:
> But we already do that, e.g. with coding: in the header vs
> coding-system in local vars.

No, AFAIK we only accept "coding" in local vars.

Anyway, the situation of the coding: tag is a little bit
different from the enable-character-translation case because we
don't have a variable for the former.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02 11:39                                                                                       ` Kenichi Handa
@ 2006-06-02 12:12                                                                                         ` Juri Linkov
  0 siblings, 0 replies; 202+ messages in thread
From: Juri Linkov @ 2006-06-02 12:12 UTC (permalink / raw)
  Cc: storm, emacs-devel, rms, monnier, alkibiades

>>> (2) Accept "char-trans: VAL" in header and local variables section.
>>> I think "translation: VAL" is too ambiguous.
>>> (3) Accept "enable-character-translation: VAL" in local
>>> variables section.
>
>> I think using different names in the first line and in the local
>> variables section is not good.  Users may move these settings
>> between these two places in the same file, and incompatible names
>> are the source of inconvenience and confusion.  Since the variable name
>> is `enable-character-translation', there should be no problem in using it
>> in the first line as well.
>
> But, wasn't it you who proposed "char-trans"?

I was unaware of the variable `enable-character-translation'.
It's documented nowhere.  Since such a variable really exists,
I think its name is ideal both for the local variables section
and the first line.  For cases when its name is too long to fit
on the first line, IIUC there is already an alternative syntax
with the trailing !, which you just installed, for those who want
terse syntax.

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02  6:49                                                                                   ` Kenichi Handa
  2006-06-02  8:00                                                                                     ` Kim F. Storm
  2006-06-02  9:27                                                                                     ` Juri Linkov
@ 2006-06-02 22:39                                                                                     ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-06-02 22:39 UTC (permalink / raw)
  Cc: juri, storm, emacs-devel, monnier, alkibiades

    In which NEWS section should that information go?

I think it belongs in

    * Editing Changes in Emacs 22.1


    It seems that this change is "Editing Changes", but I'm not
    sure whether we can declare it incompatible or not.  Previously, if
    a file had "coding: latin-1!", it was treated as an invalid
    coding specification.  In that sense, this change is
    incompatible, but...

Giving a meaning to something that formerly was invalid
is not incompatible.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02  9:27                                                                                     ` Juri Linkov
  2006-06-02 10:50                                                                                       ` Eli Zaretskii
  2006-06-02 11:39                                                                                       ` Kenichi Handa
@ 2006-06-02 22:39                                                                                       ` Richard Stallman
  2006-06-03  6:42                                                                                         ` Juri Linkov
  2 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-06-02 22:39 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, monnier, storm, handa

    I think using different names in the first line and in the local
    variables section is not good.

I am not sure.

				    Users may move these settings
    between these two places in the same file, and incompatible names
    are the source of inconvenience and confusion.  Since the variable name
    is `enable-character-translation', there should be no problem in using it
    in the first line as well.

No, that name is too long for the first line.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02 22:39                                                                                       ` Richard Stallman
@ 2006-06-03  6:42                                                                                         ` Juri Linkov
  2006-06-04  2:23                                                                                           ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Juri Linkov @ 2006-06-03  6:42 UTC (permalink / raw)
  Cc: alkibiades, emacs-devel, monnier, storm, handa

>     I think using different names in the first line and in the local
>     variables section is not good.
>
> I am not sure.

When the first line has enough space, there is no reason to disallow
using the same variable name that is allowed in the local variables section.
Example:

;;; -*- coding:utf-8; enable-character-translation:t -*-

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* [PATCH] Unicode Lisp reader escapes.
  2006-05-12  4:15                 ` Richard Stallman
@ 2006-06-03 18:44                   ` Aidan Kehoe
       [not found]                   ` <17537.54719.354843.89030@parhasard.net>
  1 sibling, 0 replies; 202+ messages in thread
From: Aidan Kehoe @ 2006-06-03 18:44 UTC (permalink / raw)



Jonas Jacobson just sent me confirmation that my once again signed
assignments have been received, together with PDF copies of same. Given
that, here is my final version of the patch I proposed in my first mail;
differences from that version are an entry in the NEWS file, some prose
style changes in the manual, and a GCPRO to protect readcharfun in lread.c.

etc/ChangeLog addition:

2006-06-03  Aidan Kehoe  <kehoea@parhasard.net>

	* NEWS:
	Describe the new syntax for specifying characters with Unicode
	escapes. 
	

lispref/ChangeLog addition:

2006-06-03  Aidan Kehoe  <kehoea@parhasard.net>

	* objects.texi (Character Type):
        Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF 
        specifies Unicode characters U+ABCD and U+ABCDEF respectively.  


src/ChangeLog addition:

2006-06-03  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape):
        Provide a Unicode character escape syntax; \u followed by exactly
        four or \U followed by exactly eight hex digits in a character or
        string is read as a Unicode character with that code point.
	

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi etc/NEWS

Index: etc/NEWS
===================================================================
RCS file: /sources/emacs/emacs/etc/NEWS,v
retrieving revision 1.1337
diff -u -u -r1.1337 NEWS
--- etc/NEWS	2 May 2006 01:47:57 -0000	1.1337
+++ etc/NEWS	3 Jun 2006 18:16:51 -0000
@@ -3772,6 +3772,13 @@
 been declared obsolete.
 
 +++
+*** New syntax: \uXXXX and \UXXXXXXXX specify Unicode code points in hex.
+Use "\u0428" to specify a string consisting of CYRILLIC CAPITAL LETTER SHA,
+or "\U0001D6E2" to specify one consisting of MATHEMATICAL ITALIC CAPITAL
+ALPHA (the latter is greater than #xFFFF and thus needs the longer
+syntax). Also available for characters. 
+
++++
 ** Displaying warnings to the user.
 
 See the functions `warn' and `display-warning', or the Lisp Manual.
Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.53
diff -u -u -r1.53 objects.texi
--- lispref/objects.texi	1 May 2006 15:05:48 -0000	1.53
+++ lispref/objects.texi	3 Jun 2006 18:16:52 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points.  @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.).  There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	3 Jun 2006 18:16:54 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight.  Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,52 @@
 	return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: exactly eight hex chars.  */
+      unicode_hex_count = 8;  /* Fall through to the \u case.  */
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+	 not arbitrarily in the source code as in some other languages. */
+      {
+	int i = 0;
+	int count = 0;
+	Lisp_Object lisp_char;
+	struct gcpro gcpro1;
+
+	while (++count <= unicode_hex_count)
+	  {
+	    c = READCHAR;
+	    /* isdigit(), isalpha() may be locale-specific, which we don't
+	       want. */
+	    if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+	    else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+	    else
+	      {
+		error ("Non-hex digit used for Unicode escape");
+		break;
+	      }
+	  }
+
+	GCPRO1 (readcharfun);
+	lisp_char = call2(intern("decode-char"), intern("ucs"),
+			  make_number(i));
+	UNGCPRO;
+
+	if (EQ(Qnil, lisp_char))
+	  {
+	    /* This is ugly and horrible and trashes the user's data. */
+	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, 
+				       34 + 128, 46 + 128));
+            return i;
+	  }
+	else
+	  {
+	    return XFASTINT (lisp_char);
+	  }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
 	c = read_multibyte (c, readcharfun);

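The digit-reading loop in the patch can also be tried outside Emacs.  The
following is a minimal standalone sketch, assuming a NUL-terminated input
string instead of READCHAR; the name `parse_unicode_escape_digits` is
hypothetical and does not appear in lread.c.  It only covers the hex
parsing, not the `decode-char' call that maps the code point to an Emacs
character.

```c
#include <stddef.h>

/* Hypothetical standalone helper, not the code in lread.c: parse
   exactly NDIGITS hexadecimal digits from S and store the value in
   *CODE.  Return 0 on success, or -1 if a non-hex character
   (including the terminating NUL) is met before NDIGITS digits
   have been read.  */
static int
parse_unicode_escape_digits (const char *s, int ndigits,
                             unsigned long *code)
{
  unsigned long value = 0;
  int i;

  for (i = 0; i < ndigits; i++)
    {
      char c = s[i];

      /* Explicit ranges, as in the patch, so that the result does
         not depend on the current locale.  */
      if (c >= '0' && c <= '9')
        value = (value << 4) + (unsigned long) (c - '0');
      else if (c >= 'a' && c <= 'f')
        value = (value << 4) + (unsigned long) (c - 'a' + 10);
      else if (c >= 'A' && c <= 'F')
        value = (value << 4) + (unsigned long) (c - 'A' + 10);
      else
        return -1;
    }

  *code = value;
  return 0;
}
```

With four digits requested, "0428" gives #x428 (CYRILLIC CAPITAL LETTER
SHA); with eight, "0001D6E2" gives #x1D6E2.  A short or malformed sequence
such as "042" or "04G8" is rejected, which corresponds to the error
signalled in the patch.
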
-- 
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-03  6:42                                                                                         ` Juri Linkov
@ 2006-06-04  2:23                                                                                           ` Richard Stallman
  2006-06-05  7:24                                                                                             ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-06-04  2:23 UTC (permalink / raw)
  Cc: handa, emacs-devel, monnier, storm, alkibiades

    When the first line has enough space, there is no reason to disallow
    using the same variable name that is allowed in the local variables section.
    Example:

    ;;; -*- coding:utf-8; enable-character-translation:t -*-

I agree that the name enable-character-translation ought to work
if used in the first line.

Does it fail now?

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-04  2:23                                                                                           ` Richard Stallman
@ 2006-06-05  7:24                                                                                             ` Kenichi Handa
  2006-06-05 21:31                                                                                               ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-06-05  7:24 UTC (permalink / raw)
  Cc: juri, alkibiades, storm, monnier, emacs-devel

In article <E1FmiHR-00027r-Li@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     When the first line has enough space, there is no reason to disallow
>     using the same variable name that is allowed in the local variables section.
>     Example:

>     ;;; -*- coding:utf-8; enable-character-translation:t -*-

> I agree that the name enable-character-translation ought to work
> if used in the first line.

> Does it fail now?

As I've just installed a fix, it doesn't fail now.  So, the
remaining problem is whether or not we should allow the
short name "char-trans" for it.  I tend to agree with Juri
that we don't need it because we can use "CODING!" notation.

If you agree with deleting it too, I'll install a proper
change soon.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-05  7:24                                                                                             ` Kenichi Handa
@ 2006-06-05 21:31                                                                                               ` Richard Stallman
  2006-06-07  1:24                                                                                                 ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-06-05 21:31 UTC (permalink / raw)
  Cc: juri, alkibiades, storm, monnier, emacs-devel

    As I've just installed a fix, it doesn't fail now.  So, the
    remaining problem is whether or not we should allow the
    short name "char-trans" for it.  I tend to agree with Juri
    that we don't need it because we can use "CODING!" notation.

    If you agree with deleting it too, I'll install a proper
    change soon.

Ok.

^ permalink raw reply	[flat|nested] 202+ messages in thread

* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-05 21:31                                                                                               ` Richard Stallman
@ 2006-06-07  1:24                                                                                                 ` Kenichi Handa
  0 siblings, 0 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-06-07  1:24 UTC (permalink / raw)
  Cc: juri, storm, emacs-devel, monnier, alkibiades

In article <E1FnMfX-0006Id-7e@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     As I've just installed a fix, it doesn't fail now.  So, the
>     remaining problem is whether or not we should allow the
>     short name "char-trans" for it.  I tend to agree with Juri
>     that we don't need it because we can use "CODING!" notation.

>     If you agree with deleting it too, I'll install a proper
>     change soon.

> Ok.

I've just installed a change for not handling the short-name
"char-trans".

storm@cua.dk (Kim F. Storm) writes:

> The change is not incompatible in the sense that it breaks
> existing _valid_ coding specs.

> The section

> ** Multilingual Environment (Mule) changes:

> seems appropriate??

Ok, I've just added this in that section.

*** You can disable character translation for a file using the -*-
construct.  Include `enable-character-translation: nil' inside the
-*-...-*- to disable any character translation that would otherwise be
performed by the various global and per-coding-system translation
tables.  You can also specify it in a local variable list at the end
of the file.  As a shortcut, instead of using this long variable name,
you can append the character "!" to the end of the coding-system name
specified in the -*- construct or in a local variable list.
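
As a rough illustration of the "!" shortcut described above, here is a
minimal C sketch of splitting a coding value such as "utf-8!" into the
coding-system name and a translation flag.  The struct and function
names are hypothetical and chosen only for this example; this is not
the code that was installed in Emacs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch only. */
struct coding_spec {
    char name[64];                      /* coding-system name without "!" */
    bool enable_character_translation;  /* false when "!" was appended */
};

/* Split VALUE (e.g. "utf-8" or "utf-8!") into name and flag.
   Returns false if the name is empty or does not fit in OUT->name. */
bool parse_coding_value(const char *value, struct coding_spec *out)
{
    assert(value != NULL && out != NULL);

    size_t len = strlen(value);
    bool translate = true;

    /* A trailing "!" requests that character translation be disabled. */
    if (len > 0 && value[len - 1] == '!') {
        translate = false;
        len--;
    }
    if (len == 0 || len >= sizeof out->name)
        return false;

    memcpy(out->name, value, len);
    out->name[len] = '\0';
    out->enable_character_translation = translate;
    return true;
}
```

Under these assumptions, "utf-8!" yields the name "utf-8" with
translation disabled, while plain "utf-8" leaves translation enabled.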

---
Kenichi Handa
handa@m17n.org


* Re: [PATCH] Unicode Lisp reader escapes.
       [not found]                     ` <ufyieqj0v.fsf@gnu.org>
@ 2006-06-15 18:38                       ` Aidan Kehoe
  2006-06-17 18:57                         ` Eli Zaretskii
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-06-15 18:38 UTC (permalink / raw)
  Cc: emacs-pretest-bug, emacs-devel


 > 	if (EQ(Qnil, lisp_char))
 > 	  {
 > 	    /* This is ugly and horrible and trashes the user's data. */
 > 	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, 
 > 				       34 + 128, 46 + 128));
 >             return i;
 > 	  }
 > 
 > What is this special Katakana character, and why are we producing it?

Firstly, thank you for posing the question; the character intended was not a
member of JISX0201 at all, rather of JISX0208. I yanked the wrong charset
identifier from charset.h when porting the code from XEmacs. The patch below
addresses this. 

(make-char 'japanese-jisx0208 34 46) gives U+3013 GETA MARK, a character in
JISX 0208 that is used to represent unknown or corrupted data. The
Unicode-specific equivalent is U+FFFD REPLACEMENT CHARACTER. I used the GETA
MARK because I was certain it would be available in Mule and it is
equivalent. It turns out that (make-char 'mule-unicode-e000-ffff 117 61)
gives U+FFFD, so it might be worthwhile to use that instead.
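
The position codes quoted above can be checked with a little
arithmetic.  The sketch below assumes that mule-unicode-e000-ffff is
laid out linearly as a 96x96 charset starting at U+E000, with both
position codes in the range 32..127; the function names are
hypothetical and exist only for this example.

```c
#include <assert.h>

/* Combine two 7-bit JIS X 0208 position codes into the usual 16-bit
   JIS code.  34 and 46 are 0x22 and 0x2E, giving 0x222E (row 2,
   cell 14).  The C patch passes 34 + 128 and 46 + 128, that is, the
   same two codes with the high bit set (0xA2, 0xAE). */
int jisx0208_code(int c1, int c2)
{
    assert(c1 >= 33 && c1 <= 126 && c2 >= 33 && c2 <= 126);
    return (c1 << 8) | c2;
}

/* Assumed linear mapping for mule-unicode-e000-ffff: 96 characters
   per row, first character at U+E000. */
long mule_unicode_e000_ffff_to_ucs(int c1, int c2)
{
    assert(c1 >= 32 && c1 <= 127 && c2 >= 32 && c2 <= 127);
    return 0xE000L + (long)(c1 - 32) * 96 + (c2 - 32);
}
```

With these definitions, position codes 117 and 61 give
0xE000 + 85 * 96 + 29 = 0xFFFD, which agrees with the U+FFFD result
quoted above.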

 > Is it to trigger an "Invalid character" message, or is something else
 > going on here?

It doesn’t actually trigger a message, it displays a character to be
interpreted as “the character couldn’t be interpreted.”

My feeling is that the syntax should be close in its behaviour to what the
coding systems do, and when the coding systems see a code point that is
valid but that they can’t interpret, they trash the user’s data. (Or do
something totally mad like transform invalid UTF-16 to invalid UTF-8!?)

src/ChangeLog addition:

2006-06-14  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape):
	Change charset_katakana_jisx0201 to charset_jisx0208 as it should
	have been in the first place, since we intended U+3013 GETA MARK. 
	

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c

Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.353
diff -u -u -r1.353 lread.c
--- src/lread.c	9 Jun 2006 18:22:30 -0000	1.353
+++ src/lread.c	14 Jun 2006 06:57:49 -0000
@@ -1967,7 +1967,7 @@
 	if (EQ(Qnil, lisp_char))
 	  {
 	    /* This is ugly and horrible and trashes the user's data.  */
-	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+	    XSETFASTINT (i, MAKE_CHAR (charset_jisx0208,
 				       34 + 128, 46 + 128));
             return i;
 	  }


-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes.
  2006-06-15 18:38                       ` Aidan Kehoe
@ 2006-06-17 18:57                         ` Eli Zaretskii
  2006-06-18 16:11                           ` Aidan Kehoe
  0 siblings, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-06-17 18:57 UTC (permalink / raw)
  Cc: emacs-pretest-bug, emacs-devel

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Thu, 15 Jun 2006 20:38:06 +0200
> Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org
> 
> 
>  > Is it to trigger an "Invalid character" message, or is something else
>  > going on here?
> 
> It doesn't actually trigger a message, it displays a character to be
> interpreted as ``the character couldn't be interpreted.''

But in my testing, I do see an "Invalid character" message.

Could you please show an example of using this new function to produce
this special ``character that couldn't be interpreted''?

> My feeling is that the syntax should be close in its behaviour to what the
> coding systems do, and when the coding systems see a code point that is
> valid but that they can't interpret, they trash the user's data.

This function is not about coding systems, it's about character sets.

Coding systems already replace unsupported characters with `?' (other
applications behave like that as well), so perhaps we should use some
more conventional character here.

Does anyone have an opinion?


* Re: [PATCH] Unicode Lisp reader escapes.
  2006-06-17 18:57                         ` Eli Zaretskii
@ 2006-06-18 16:11                           ` Aidan Kehoe
  2006-06-18 19:55                             ` Eli Zaretskii
  0 siblings, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-06-18 16:11 UTC (permalink / raw)
  Cc: emacs-pretest-bug, emacs-devel


 On the seventeenth of June, Eli Zaretskii wrote: 

 > >  > Is it to trigger an "Invalid character" message, or is something else
 > >  > going on here?
 > > 
 > > It doesn't actually trigger a message, it displays a character to be
 > > interpreted as ``the character couldn't be interpreted.''
 > 
 > But in my testing, I do see an "Invalid character" message.

Yes. That’s because I yanked the wrong charset from charset.h when porting
the code from XEmacs, and the attempt to create a two-dimensional character
in JISX0201 fails, as it should, since JISX0201 is a one-dimensional
character set. 

The code, as intended, doesn’t trigger the message. As it was written, to my
discredit, it did.

 > Could you please show an example of using this new function to produce
 > this special ``character that couldn't be interpreted''?

 > > My feeling is that the syntax should be close in its behaviour to what the
 > > coding systems do, and when the coding systems see a code point that is
 > > valid but that they can't interpret, they trash the user's data.
 > 
 > This function is not about coding systems, it's about character sets.

This function is about transformation from an external format to the
editor’s internal format, which is a big part of what coding systems do. So
some parallels in our approach are reasonable.

 > Coding systems already replace unsupported characters with `?'  (other
 > applications behave like that as well), so perhaps we should use some
 > more conventional character here.
 > Does anyone have an opinion?

Perhaps, indeed.

-- 
Aidan Kehoe, http://www.parhasard.net/


* Re: [PATCH] Unicode Lisp reader escapes.
  2006-06-18 16:11                           ` Aidan Kehoe
@ 2006-06-18 19:55                             ` Eli Zaretskii
  2006-06-20  2:37                               ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Eli Zaretskii @ 2006-06-18 19:55 UTC (permalink / raw)
  Cc: emacs-pretest-bug, emacs-devel

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Sun, 18 Jun 2006 18:11:06 +0200
> Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org
> 
>  > Coding systems already replace unsupported characters with `?'  (other
>  > applications behave like that as well), so perhaps we should use some
>  > more conventional character here.
>  > Does anyone have an opinion?
> 
> Perhaps, indeed.

Handa-san, could you please comment on this issue?


* Re: [PATCH] Unicode Lisp reader escapes.
  2006-06-18 19:55                             ` Eli Zaretskii
@ 2006-06-20  2:37                               ` Kenichi Handa
  2006-06-20 17:56                                 ` Richard Stallman
  2006-06-23 18:35                                 ` Aidan Kehoe
  0 siblings, 2 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-06-20  2:37 UTC (permalink / raw)
  Cc: kehoea, emacs-pretest-bug, emacs-devel

In article <uk67e2q96.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

>> From: Aidan Kehoe <kehoea@parhasard.net>
>> Date: Sun, 18 Jun 2006 18:11:06 +0200
>> Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org
>> 
>> > Coding systems already replace unsupported characters with `?'  (other
>> > applications behave like that as well), so perhaps we should use some
>> > more conventional character here.
>> > Does anyone have an opinion?
>> 
>> Perhaps, indeed.

> Handa-san, could you please comment on this issue?

First, the utf-8 coding system doesn't replace unsupported
characters with '?' on decoding.  It preserves the original
byte sequence and attaches a special text property to
display it as the Unicode replacement character U+FFFD.

But, as we can't do that in read_escape, I propose to simply
signal an error as unsupported character.  I think anything
else leads to unexpected behavior.

---
Kenichi Handa
handa@m17n.org


* Re: [PATCH] Unicode Lisp reader escapes.
  2006-06-20  2:37                               ` Kenichi Handa
@ 2006-06-20 17:56                                 ` Richard Stallman
  2006-06-23 18:35                                 ` Aidan Kehoe
  1 sibling, 0 replies; 202+ messages in thread
From: Richard Stallman @ 2006-06-20 17:56 UTC (permalink / raw)
  Cc: kehoea, emacs-pretest-bug, emacs-devel

    But, as we can't do that in read_escape, I propose to simply
    signal an error as unsupported character.  I think anything
    else leads to unexpected behavior.

That seems right to me.


* Re: [PATCH] Unicode Lisp reader escapes.
  2006-06-20  2:37                               ` Kenichi Handa
  2006-06-20 17:56                                 ` Richard Stallman
@ 2006-06-23 18:35                                 ` Aidan Kehoe
  2006-06-24  6:50                                   ` Eli Zaretskii
  1 sibling, 1 reply; 202+ messages in thread
From: Aidan Kehoe @ 2006-06-23 18:35 UTC (permalink / raw)
  Cc: emacs-pretest-bug, Eli Zaretskii, emacs-devel


 On the twentieth of June, Kenichi Handa wrote: 

 > But, as we can't do that in read_escape, I propose to simply
 > signal an error as unsupported character.  I think anything
 > else leads to unexpected behavior.

Okay, here’s a patch to implement that behaviour. 

src/ChangeLog addition:

2006-06-23  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape):
	Instead of creating a place-holder character when an unknown
	Unicode code point is encountered as a string or character escape,
	signal an error.
	

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c

Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.353
diff -u -u -r1.353 lread.c
--- src/lread.c	9 Jun 2006 18:22:30 -0000	1.353
+++ src/lread.c	23 Jun 2006 18:24:28 -0000
@@ -1964,17 +1964,12 @@
 			  make_number(i));
 	UNGCPRO;
 
-	if (EQ(Qnil, lisp_char))
+	if (NILP(lisp_char))
 	  {
-	    /* This is ugly and horrible and trashes the user's data.  */
-	    XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
-				       34 + 128, 46 + 128));
-            return i;
-	  }
-	else
-	  {
-	    return XFASTINT (lisp_char);
+	    error ("No support for Unicode code point U+%x", i);
 	  }
+
+	return XFASTINT (lisp_char);
       }
 
     default:
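
For readers without the Emacs sources at hand, the behaviour this
patch settles on can be modelled in a self-contained way.  The sketch
below is not the lread.c code: the status enum, the function names,
and the toy "supported" predicate (which accepts only non-surrogate
code points up to U+FFFF) are all assumptions made for this example.
It reads exactly four or exactly eight hexadecimal digits, as the
original proposal specifies, and reports an error status instead of
substituting a place-holder character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum escape_status { ESCAPE_OK, ESCAPE_BAD_DIGITS, ESCAPE_UNSUPPORTED };

/* Toy stand-in for the real Unicode-to-internal-character lookup.
   Hypothetical: treats non-surrogate code points up to U+FFFF as
   supported and everything else as unsupported. */
int toy_code_point_supported(unsigned long cp)
{
    return cp <= 0xFFFFUL && !(cp >= 0xD800UL && cp <= 0xDFFFUL);
}

/* Parse exactly NDIGITS (4 for \u, 8 for \U) hex digits from DIGITS.
   On success store the code point in *OUT.  An unsupported code point
   is reported to the caller rather than replaced. */
enum escape_status
read_unicode_escape(const char *digits, size_t ndigits, unsigned long *out)
{
    assert(digits != NULL && out != NULL);
    assert(ndigits == 4 || ndigits == 8);

    unsigned long cp = 0;
    for (size_t k = 0; k < ndigits; k++) {
        unsigned char ch = (unsigned char) digits[k];
        /* A non-hex character, including the terminating NUL, means
           the escape has too few digits. */
        if (!isxdigit(ch))
            return ESCAPE_BAD_DIGITS;
        int value = isdigit(ch) ? ch - '0' : tolower(ch) - 'a' + 10;
        cp = cp * 16 + (unsigned long) value;
    }

    if (!toy_code_point_supported(cp))
        return ESCAPE_UNSUPPORTED;  /* caller signals the error */

    *out = cp;
    return ESCAPE_OK;
}
```

In this model, "20AC" with four digits yields U+20AC, while a code
point that the toy predicate rejects returns ESCAPE_UNSUPPORTED, the
point at which the real patch calls error.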


-- 
Santa Maradona, pray for me!


* Re: [PATCH] Unicode Lisp reader escapes.
  2006-06-23 18:35                                 ` Aidan Kehoe
@ 2006-06-24  6:50                                   ` Eli Zaretskii
  0 siblings, 0 replies; 202+ messages in thread
From: Eli Zaretskii @ 2006-06-24  6:50 UTC (permalink / raw)
  Cc: emacs-pretest-bug, emacs-devel, handa

> From: Aidan Kehoe <kehoea@parhasard.net>
> Date: Fri, 23 Jun 2006 20:35:00 +0200
> Cc: Eli Zaretskii <eliz@gnu.org>, emacs-pretest-bug@gnu.org,
> 	emacs-devel@gnu.org
> 
>  > But, as we can't do that in read_escape, I propose to simply
>  > signal an error as unsupported character.  I think anything
>  > else leads to unexpected behavior.
> 
> Okay, here's a patch to implement that behaviour. 
> 
> src/ChangeLog addition:
> 
> 2006-06-23  Aidan Kehoe  <kehoea@parhasard.net>
> 
> 	* lread.c (read_escape):
> 	Instead of creating a place-holder character when an unknown
> 	Unicode code point is encountered as a string or character escape,
> 	signal an error.

Thanks, installed.
