* [PATCH] Unicode Lisp reader escapes
From: Aidan Kehoe @ 2006-04-29 15:35 UTC

I realise you are all focused on the release with an intensity that would
scare small children, were any of them let near, but if any of you have a
minute free, I’d love to hear philosophical and technical objections to the
below.

The background is that it hasn’t ever been possible to consistently specify
a non-Latin-1 character by means of a general escape sequence, since what
character a given integer represents varies from release to release and even
from invocation to invocation.

The below allows you to specify a backslash escape with exactly four or
exactly eight hexadecimal digits in a character or string, and have the
editor interpret them as the corresponding Unicode code point. So, ?\u20AC
would be interpreted as the Euro sign, "\u0448" as Cyrillic sha,
?\U0001D0ED as Byzantine musical symbol arktiko ke.

Why not wait until the Unicode branch is merged? Well, that won’t solve the
problem either; people naturally want their code to be as compatible as
possible, so they will avoid the assumption that the integer-to-character
mapping is Unicode compatible as long as there are editors in the wild for
which that is not true. If this is integrated a good bit before the Unicode
branch is (which is what I would like), it will mean people can use this
syntax (which most modern programming languages have already, and which
people use) and be sure it’s compatible years before what would otherwise be
the case.

lispref/ChangeLog addition:

2006-04-29  Aidan Kehoe  <kehoea@parhasard.net>

        * objects.texi (Character Type):
        Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF
        specifies Unicode characters U+ABCD and U+ABCDEF respectively.

src/ChangeLog addition:

2006-04-29  Aidan Kehoe  <kehoea@parhasard.net>

        * lread.c (read_escape):
        Provide a Unicode character escape syntax; \u followed by exactly
        four or \U followed by exactly eight hex digits in a character or
        string is read as a Unicode character with that code point.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi

Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi  6 Feb 2006 11:55:10 -0000  1.51
+++ lispref/objects.texi  29 Apr 2006 15:15:09 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points. @samp{?\uABCD} will give you an Emacs character that maps to
+the code point @samp{U+ABCD} in Unicode-based representations (UTF-8
+text files, Unicode-oriented fonts, etc.) There is a slightly different
+syntax for specifying characters with code points above @samp{#xFFFF};
+@samp{\U00ABCDEF} will give you an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c  27 Feb 2006 02:04:35 -0000  1.350
+++ src/lread.c  29 Apr 2006 15:15:10 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u allows up to four hex digits, \U up to eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
         return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: Up to eight hex chars */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+         not arbitrarily in the source code as in some other languages. */
+      {
+        int i = 0;
+        int count = 0;
+        Lisp_Object lisp_char;
+        while (++count <= unicode_hex_count)
+          {
+            c = READCHAR;
+            /* isdigit(), isalpha() may be locale-specific, which we don't
+               want. */
+            if (c >= '0' && c <= '9') i = (i << 4) + (c - '0');
+            else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10;
+            else
+              {
+                error ("Non-hex digit used for Unicode escape");
+                break;
+              }
+          }
+
+        lisp_char = call2(intern("decode-char"), intern("ucs"),
+                          make_number(i));
+
+        if (EQ(Qnil, lisp_char))
+          {
+            /* This is ugly and horrible and trashes the user's data. */
+            XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+                                       34 + 128, 46 + 128));
+            return i;
+          }
+        else
+          {
+            return XFASTINT (lisp_char);
+          }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”

* Re: [PATCH] Unicode Lisp reader escapes
From: Stefan Monnier @ 2006-04-29 23:26 UTC
Cc: emacs-devel

> The background is that it hasn’t ever been possible to consistently
> specify a non-Latin-1 character by means of a general escape sequence,
> since what character a given integer represents varies from release to
> release and even from invocation to invocation.

There are two known workarounds:
- encode your file in utf-8.
- use an elisp expression like (decode-char 'ucs <foo>).

Neither of them is quite what you want, but I've found them good enough for
the cases I've had to deal with.


        Stefan

* Re: [PATCH] Unicode Lisp reader escapes
From: Aidan Kehoe @ 2006-04-30 8:26 UTC
Cc: emacs-devel

 On the twenty-ninth of April, Stefan Monnier wrote:

 > There are two known workarounds:
 > - encode your file in utf-8.
 > - use an elisp expression like (decode-char 'ucs <foo>).
 > Neither of them is quite what you want, but I've found them good enough
 > for the cases I've had to deal with.

Sure, and encoding your file as ISO-8859-1 and using (char-to-int #xff)
would be possible were the escapes for Latin-1 not available.

I will certainly be using both your approaches for years to come, since
making my code pointlessly incompatible with existing editors is not a good
idea. The \u syntax is sugar, but it is used quite a bit in those languages
that have it, which seems to testify to its usefulness.

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”

* Re: [PATCH] Unicode Lisp reader escapes
From: Richard Stallman @ 2006-04-30 3:04 UTC
Cc: emacs-devel

    +  Emacs provides a syntax for specifying characters by their Unicode code
    +points. @samp{?\uABCD}

These are Lisp expressions, right? So they should use @code,
not @samp.

    will give you an Emacs character that maps to

Please stick to present tense: change "will give you an" to
"represents the".

    +text files, Unicode-oriented fonts, etc.) There is a slightly different

You need a period at the end of that sentence. The period inside the
parentheses does not count for this.

    +syntax for specifying characters with code points above @samp{#xFFFF};
    +@samp{\U00ABCDEF} will give you an Emacs character that maps to the code

What is the reason for needing both \u and \U, and the difference?
Why not use a syntax like that of \x?

* Re: [PATCH] Unicode Lisp reader escapes
From: Aidan Kehoe @ 2006-04-30 8:14 UTC
Cc: emacs-devel

 On the twenty-ninth of April, Richard Stallman wrote:

 > [Comments on the text taken into account in the revised patch below.]
 >
 > [...]
 >
 > What is the reason for needing both \u and \U, and the difference? Why
 > not use a syntax like that of \x?

They are both fixed-length expressions, which is good, because people get
into the habit of typing "\u0123As I walked out one evening" instead of the
more disastrous "\u123As I walked out one evening".

We could provide the same functionality with just the \U00ABCDEF syntax, but
since the code points above #xFFFF are very rarely used, the need to provide
the initial four zeroes would be very annoying for the majority of the time.
The reason the approach is not to have variable length constants as is used
with \x is exactly the "\u0123As I" versus "\u123As I walked out" issue
above.

lispref/ChangeLog addition:

2006-04-30  Aidan Kehoe  <kehoea@parhasard.net>

        * objects.texi (Character Type):
        Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF
        specifies Unicode characters U+ABCD and U+ABCDEF respectively.

src/ChangeLog addition:

2006-04-30  Aidan Kehoe  <kehoea@parhasard.net>

        * lread.c (read_escape):
        Provide a Unicode character escape syntax; \u followed by exactly
        four or \U followed by exactly eight hex digits in a character or
        string is read as a Unicode character with that code point.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi

Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi  6 Feb 2006 11:55:10 -0000  1.51
+++ lispref/objects.texi  30 Apr 2006 08:08:05 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points. @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.). There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c  27 Feb 2006 02:04:35 -0000  1.350
+++ src/lread.c  30 Apr 2006 08:08:07 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u allows up to four hex digits, \U up to eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
         return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: Up to eight hex chars */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+         not arbitrarily in the source code as in some other languages. */
+      {
+        int i = 0;
+        int count = 0;
+        Lisp_Object lisp_char;
+        while (++count <= unicode_hex_count)
+          {
+            c = READCHAR;
+            /* isdigit(), isalpha() may be locale-specific, which we don't
+               want. */
+            if (c >= '0' && c <= '9') i = (i << 4) + (c - '0');
+            else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10;
+            else
+              {
+                error ("Non-hex digit used for Unicode escape");
+                break;
+              }
+          }
+
+        lisp_char = call2(intern("decode-char"), intern("ucs"),
+                          make_number(i));
+
+        if (EQ(Qnil, lisp_char))
+          {
+            /* This is ugly and horrible and trashes the user's data. */
+            XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+                                       34 + 128, 46 + 128));
+            return i;
+          }
+        else
+          {
+            return XFASTINT (lisp_char);
+          }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”

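Editorial aside on the fixed-length argument in the message above: the
following is a minimal C sketch, not code from the patch, contrasting a
reader that takes exactly N hex digits with one that keeps consuming hex
digits until the first non-hex character. The names read_fixed_hex,
read_greedy_hex and greedy_hex_length are hypothetical, and the functions
work on plain C strings rather than on the Lisp reader's input stream.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if C is not a hex digit.  Written with
   explicit ranges because, as the patch notes, isdigit()/isalpha() may
   be locale-specific.  */
static int
hex_digit_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Fixed-width reading: take exactly WIDTH hex digits from S.
   Returns the value, or -1 if any of the WIDTH characters is not a
   hex digit (this includes running into the end of the string).  */
long long
read_fixed_hex (const char *s, int width)
{
  long long value = 0;
  int i;

  assert (s != NULL);
  for (i = 0; i < width; i++)
    {
      int digit = hex_digit_value ((unsigned char) s[i]);
      if (digit < 0)
        return -1;
      value = (value << 4) + digit;
    }
  return value;
}

/* Number of leading hex digits a greedy reader would consume from S.  */
size_t
greedy_hex_length (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  while (hex_digit_value ((unsigned char) s[n]) >= 0)
    n++;
  return n;
}

/* Greedy reading: consume hex digits up to the first non-hex character.
   This sketch does not guard against overflow on very long inputs.  */
long long
read_greedy_hex (const char *s)
{
  long long value = 0;
  size_t i, n = greedy_hex_length (s);

  for (i = 0; i < n; i++)
    value = (value << 4) + hex_digit_value ((unsigned char) s[i]);
  return value;
}
```

With these definitions, read_fixed_hex ("0123As I walked", 4) gives #x0123
and leaves "As I walked" untouched, whereas read_greedy_hex applied to
"123As I walked" also swallows the "A" and gives #x123A.
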
* Re: [PATCH] Unicode Lisp reader escapes
From: Richard Stallman @ 2006-04-30 20:53 UTC
Cc: emacs-devel

    They are both fixed-length expressions, which is good, because people get
    into the habit of typing "\u0123As I walked out one evening" instead of the
    more disastrous "\u123As I walked out one evening".

I see, you are talking about using them in strings.

Still, I don't like having both \u and \U--it is ugly.

I think it would be better to put an explicit terminator into
the construct. Perhaps #. So you would write "\u123#As I walked"

* Re: [PATCH] Unicode Lisp reader escapes
From: Andreas Schwab @ 2006-04-30 21:04 UTC
Cc: Aidan Kehoe, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> I think it would be better to put an explicit terminator into
> the construct. Perhaps #. So you would write "\u123#As I walked"

There is already the possibility to use `\ ' as a terminator.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: [PATCH] Unicode Lisp reader escapes
From: Aidan Kehoe @ 2006-04-30 21:57 UTC
Cc: emacs-devel

 On the thirtieth of April, Andreas Schwab wrote:

 > Richard Stallman <rms@gnu.org> writes:
 >
 > > I think it would be better to put an explicit terminator into
 > > the construct. Perhaps #. So you would write "\u123#As I walked"
 >
 > There is already the possibility to use `\ ' as a terminator.

I don’t understand what you mean there--I imagine Richard meant a terminator
that would not be interpreted as part of the succeeding string, something
not the case for the space character.

-- 
Aidan Kehoe, http://www.parhasard.net/

* Re: [PATCH] Unicode Lisp reader escapes
From: Andreas Schwab @ 2006-04-30 22:14 UTC
Cc: emacs-devel

Aidan Kehoe <kehoea@parhasard.net> writes:

>  On the thirtieth of April, Andreas Schwab wrote:
>
>  > Richard Stallman <rms@gnu.org> writes:
>  >
>  > > I think it would be better to put an explicit terminator into
>  > > the construct. Perhaps #. So you would write "\u123#As I walked"
>  >
>  > There is already the possibility to use `\ ' as a terminator.
>
> I don’t understand what you mean there--I imagine Richard meant a terminator
> that would not be interpreted as part of the succeeding string, something
> not the case for the space character.

`\ ' is ignored in a string (*Note (elisp)Non-ASCII in Strings::)

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: [PATCH] Unicode Lisp reader escapes
From: Richard Stallman @ 2006-05-01 18:32 UTC
Cc: kehoea, emacs-devel

    > I think it would be better to put an explicit terminator into
    > the construct. Perhaps #. So you would write "\u123#As I walked"

    There is already the possibility to use `\ ' as a terminator.

That is true. The worry is that people might forget and run the
unicode constant together with the following text. People might not
remember to use `\ ' when it is needed, if they usually don't need it.

But it is no great disaster to make such an error--it will be obvious
when you see the output. So perhaps there's no need to do anything
to avoid the problem.

One other question occurs to me. In the Unicode branch,
doesn't \x do this job? If so, \u would be redundant once we
merge in that code. It would have no lasting purpose.

* Re: [PATCH] Unicode Lisp reader escapes
From: Oliver Scholz @ 2006-05-01 19:03 UTC

Richard Stallman <rms@gnu.org> writes:

>     > I think it would be better to put an explicit terminator into
>     > the construct. Perhaps #. So you would write "\u123#As I walked"
>
>     There is already the possibility to use `\ ' as a terminator.
>
> That is true. The worry is that people might forget and run the
> unicode constant together with the following text. People might not
> remember to use `\ ' when it is needed, if they usually don't need it.
>
> But it is no great disaster to make such an error--it will be obvious
> when you see the output. So perhaps there's no need to do anything
> to avoid the problem.

At any rate the syntax for \u and \x should be entirely in parallel,
IMNSHO.

> One other question occurs to me. In the Unicode branch,
> doesn't \x do this job? If so, \u would be redundant once we
> merge in that code. It would have no lasting purpose.

There would still be a conceptual difference. \x refers to the
internal representation of a character in Emacs, while \u refers to an
abstract character. In the Unicode branch the hex numbers would be the
same in both cases, but conceptually it is still different. Like
writing `?a' in Lisp code instead of just `97' or like using
`(null list)' instead of `(not list)'.

    Oliver
-- 
12 Floréal an 214 de la Révolution
Liberté, Egalité, Fraternité!

* Re: [PATCH] Unicode Lisp reader escapes
From: Richard Stallman @ 2006-05-02 4:45 UTC
Cc: emacs-devel

    > One other question occurs to me. In the Unicode branch,
    > doesn't \x do this job? If so, \u would be redundant once we
    > merge in that code. It would have no lasting purpose.

    There would still be a conceptual difference. \x refers to the
    internal representation of a character in Emacs, while \u refers to an
    abstract character.

If \xabcd and \uabcd will forever be equivalent, I don't see much
benefit in having syntax to indicate the conceptual difference.

* Re: [PATCH] Unicode Lisp reader escapes
From: Kenichi Handa @ 2006-05-02 0:46 UTC
Cc: kehoea, schwab, emacs-devel

In article <E1FadBn-0007Ny-5U@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

> One other question occurs to me. In the Unicode branch,
> doesn't \x do this job?

Yes, it does.

---
Kenichi Handa
handa@m17n.org

* Re: [PATCH] Unicode Lisp reader escapes
From: Aidan Kehoe @ 2006-05-02 6:41 UTC
Cc: emacs-devel

 On the first of May, Richard Stallman wrote:

 >     > I think it would be better to put an explicit terminator into
 >     > the construct. Perhaps #. So you would write "\u123#As I walked"
 >
 >     There is already the possibility to use `\ ' as a terminator.
 >
 > That is true. The worry is that people might forget and run the
 > unicode constant together with the following text. People might not
 > remember to use `\ ' when it is needed, if they usually don't need it.
 >
 > But it is no great disaster to make such an error--it will be obvious
 > when you see the output. So perhaps there's no need to do anything
 > to avoid the problem.

One problem with that is that people writing portable code have never had
the option of assuming (equal "\ " ""). Now, of course, you may prefer that
people not write portable code; but it’s still a problem, because people
will try.

 > One other question occurs to me. In the Unicode branch, doesn't \x do
 > this job? If so, \u would be redundant once we merge in that code. It
 > would have no lasting purpose.

I addressed that in my first mail, and I quote:

“Why not wait until the Unicode branch is merged? Well, that won’t solve the
problem either; people naturally want their code to be as compatible as
possible, so they will avoid the assumption that the integer-to-character
mapping is Unicode compatible as long as there are editors in the wild for
which that is not true. If this is integrated a good bit before the Unicode
branch is (which is what I would like), it will mean people can use this
syntax (which most modern programming languages have already, and which
people use) and be sure it’s compatible years before what would otherwise be
the case.”

-- 
Aidan Kehoe, http://www.parhasard.net/

* Re: [PATCH] Unicode Lisp reader escapes
From: Richard Stallman @ 2006-05-02 21:36 UTC
Cc: emacs-devel

    > One other question occurs to me. In the Unicode branch, doesn't \x do
    > this job? If so, \u would be redundant once we merge in that code. It
    > would have no lasting purpose.

    I addressed that in my first mail, and I quote:

Yes, but I don't think the argument is that strong, because I don't
see a big hurry.

On the other hand, given that C and Java use these constructs,
compatibility in Emacs would be useful.

* Re: [PATCH] Unicode Lisp reader escapes
From: Aidan Kehoe @ 2006-04-30 21:56 UTC
Cc: emacs-devel

 On the thirtieth of April, Richard Stallman wrote:

 >     They are both fixed-length expressions, which is good, because people
 >     get into the habit of typing "\u0123As I walked out one evening"
 >     instead of the more disastrous "\u123As I walked out one evening".
 >
 > I see, you are talking about using them in strings.

Indeed, as I mentioned in the documentation.

 > Still, I don't like having both \u and \U--it is ugly.
 >
 > I think it would be better to put an explicit terminator into
 > the construct. Perhaps #. So you would write "\u123#As I walked"

I find _that_ distinctly ugly, but more of a problem with it than the
aesthetics is that it’s unfamiliar to everyone, Lisp people and Java people
alike.

Another alternative to providing both \u and \U is to do what Java does;
only allow \u, and require code points above #xFFFF to use surrogate pairs.
So "\uDA6F\uDCDE" would be how one would encode U+ABCDE. But I think that’s
very inconvenient.

-- 
Aidan Kehoe, http://www.parhasard.net/

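Editorial aside on the surrogate-pair example in the message above: a
minimal C sketch of the standard UTF-16 arithmetic, under the assumption
that the input is a code point in the range U+10000..U+10FFFF. The function
names high_surrogate, low_surrogate and combine_surrogates are hypothetical
and appear nowhere in the patch.

```c
#include <assert.h>

/* Leading (high) surrogate for a code point in U+10000..U+10FFFF:
   #xD800 plus the top ten bits of (code point - #x10000).  */
unsigned int
high_surrogate (unsigned long code_point)
{
  assert (code_point >= 0x10000UL && code_point <= 0x10FFFFUL);
  return 0xD800u + (unsigned int) ((code_point - 0x10000UL) >> 10);
}

/* Trailing (low) surrogate: #xDC00 plus the bottom ten bits of
   (code point - #x10000).  */
unsigned int
low_surrogate (unsigned long code_point)
{
  assert (code_point >= 0x10000UL && code_point <= 0x10FFFFUL);
  return 0xDC00u + (unsigned int) ((code_point - 0x10000UL) & 0x3FFUL);
}

/* Inverse operation: rebuild the code point from a surrogate pair.  */
unsigned long
combine_surrogates (unsigned int high, unsigned int low)
{
  assert (high >= 0xD800u && high <= 0xDBFFu);
  assert (low >= 0xDC00u && low <= 0xDFFFu);
  return 0x10000UL
         + ((unsigned long) (high - 0xD800u) << 10)
         + (unsigned long) (low - 0xDC00u);
}
```

For U+ABCDE this reproduces the pair quoted above: the high surrogate is
#xDA6F and the low surrogate is #xDCDE.
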
* Re: [PATCH] Unicode Lisp reader escapes
From: Miles Bader @ 2006-05-01 1:44 UTC
Cc: rms, emacs-devel

Aidan Kehoe <kehoea@parhasard.net> writes:
> I find _that_ distinctly ugly, but more of a problem with it than the
> aesthetics is that it’s unfamiliar to everyone, Lisp people and Java people
> alike.

How about supporting both the "standard" syntax ("\u0123") and a
flexible-length syntax like "\u{123}" (I seem to recall a syntax like
this being discussed on this list)?

-Miles
-- 
Freedom's just another word, for nothing left to lose --Janis Joplin

* Re: [PATCH] Unicode Lisp reader escapes
From: Stefan Monnier @ 2006-05-01 3:12 UTC
Cc: Aidan Kehoe, rms, emacs-devel

>> I find _that_ distinctly ugly, but more of a problem with it than the
>> aesthetics is that it’s unfamiliar to everyone, Lisp people and Java people
>> alike.

> How about supporting both the "standard" syntax ("\u0123")
> and a flexible-length syntax like "\u{123}" (I seem to recall
> a syntax like this being discussed on this list)?

Currently the syntax for \xNNNN hexadecimal escapes is that the escape ends
on reaching a non-hexadecimal char, and if you need your \xNNN escape to be
followed by a hexadecimal char, then you have to separate the two with "\ "
(and the Lisp printer does that automatically, of course).
Is there a strong reason not to use the same rule for \u?


        Stefan

* Re: [PATCH] Unicode Lisp reader escapes
From: Miles Bader @ 2006-05-01 3:41 UTC
Cc: Aidan Kehoe, rms, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> Currently the syntax for \xNNNN hexadecimal escapes is that the escape ends
> on reaching a non-hexadecimal char, and if you need your \xNNN escape to be
> followed by a hexadecimal char, then you have to separate the two with "\ "
> (and the Lisp printer does that automatically, of course).
> Is there a strong reason not to use the same rule for \u?

That might be sufficient for programmatic output, but anything involving
a significant space seems problematic in general...

-Miles
-- 
x
y
Z!

* Re: [PATCH] Unicode Lisp reader escapes
From: Stefan Monnier @ 2006-05-01 12:29 UTC
Cc: Aidan Kehoe, rms, emacs-devel

>> Currently the syntax for \xNNNN hexadecimal escapes is that the escape ends
>> on reaching a non-hexadecimal char, and if you need your \xNNN escape to be
>> followed by a hexadecimal char, then you have to separate the two with "\ "
>> (and the Lisp printer does that automatically, of course).
>> Is there a strong reason not to use the same rule for \u?

> That might be sufficient for programmatic output, but anything involving
> a significant space seems problematic in general...

Sorry, I don't understand what you mean by "significant space".


        Stefan

* Re: [PATCH] Unicode Lisp reader escapes
From: Juri Linkov @ 2006-05-05 23:15 UTC
Cc: kehoea, emacs-devel

>     They are both fixed-length expressions, which is good, because
>     people get into the habit of typing "\u0123As I walked out one
>     evening" instead of the more disastrous "\u123As I walked out one
>     evening".
>
> I see, you are talking about using them in strings.
> Still, I don't like having both \u and \U--it is ugly.

Are there reasons not to use Perl's notation for Unicode characters,
i.e. "\x{...}"? The Unicode code of the desired character is placed
in the braces in hexadecimal, and has no fixed length.
Examples: "\x{DF}", "\x{0448}", "\x{001D0ED}".

I think this notation is well suited to Emacs, because \x{...}
indicates that a hexadecimal value is expected in the braces. And in
the Unicode branch it will be just another way to specify a
hexadecimal value with a variable length.

-- 
Juri Linkov
http://www.jurta.org/emacs/

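Editorial aside on the brace-delimited notation suggested above: a minimal
C sketch of how a reader might take the text following "\x", assuming the
braces are mandatory and at most eight hex digits are allowed (both are
assumptions made for this sketch, not rules stated in the thread). The name
read_braced_hex is hypothetical; this is not how Emacs or Perl implements
the syntax.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if C is not a hex digit.  */
static int
hex_digit_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Parse "{HEX}" at the start of S, where HEX is one to eight hex
   digits.  Returns the value, or -1 if the opening or closing brace
   is missing, the braces are empty, a non-hex character appears, or
   there are more than eight digits.  */
long long
read_braced_hex (const char *s)
{
  long long value = 0;
  size_t i = 1;

  assert (s != NULL);
  if (s[0] != '{')
    return -1;
  while (s[i] != '}')
    {
      int digit;

      if (i > 8)
        return -1;              /* more than eight digits */
      digit = hex_digit_value ((unsigned char) s[i]);
      if (digit < 0)
        return -1;              /* non-hex character or end of string */
      value = (value << 4) + digit;
      i++;
    }
  return i > 1 ? value : -1;    /* reject empty braces */
}
```

Because the closing brace ends the escape, the digit count is free to vary
and the text that follows cannot be absorbed into the number: "{DF}",
"{0448}" and "{001D0ED}" give #xDF, #x448 and #x1D0ED respectively.
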
* Re: [PATCH] Unicode Lisp reader escapes
From: Richard Stallman @ 2006-05-06 23:36 UTC
Cc: kehoea, emacs-devel

    Are there reasons not to use Perl's notation for Unicode characters,
    i.e. "\x{...}"? The Unicode code of the desired character is placed
    in the braces in hexadecimal, and has no fixed length.
    Examples: "\x{DF}", "\x{0448}", "\x{001D0ED}".

We could support this form, as well as \u and \U for compatibility
with other languages.

* Re: [PATCH] Unicode Lisp reader escapes
From: Juri Linkov @ 2006-05-09 20:43 UTC
Cc: kehoea, emacs-devel

>     Are there reasons not to use Perl's notation for Unicode characters,
>     i.e. "\x{...}"? The Unicode code of the desired character is placed
>     in the braces in hexadecimal, and has no fixed length.
>     Examples: "\x{DF}", "\x{0448}", "\x{001D0ED}".
>
> We could support this form, as well as \u and \U for compatibility
> with other languages.

Support for \u and \U in Emacs Lisp would be good since other _Lisp_ languages
support \uXXXX and \UXXXXXXXX as well. But other Lisp languages also support
a Lisp notation for Unicode characters. I think Emacs should support
it too. In this notation Unicode characters are written as #\u3042 or
#\u0002a6b2 with the leading hash mark.

Also it would be good to support a syntax for named Unicode characters.
Common Lisp has the syntax #\euro_sign, and Perl has \N{EURO SIGN}.

-- 
Juri Linkov
http://www.jurta.org/emacs/

* Re: [PATCH] Unicode Lisp reader escapes 2006-05-09 20:43 ` Juri Linkov @ 2006-05-11 3:44 ` Richard Stallman 2006-05-11 12:03 ` Juri Linkov 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-11 3:44 UTC (permalink / raw) Cc: kehoea, emacs-devel Support for \u and \U in Emacs Lisp would be good since other _Lisp_ languages support \uXXXX and \UXXXXXXXX as well. But other Lisp languages support also Lisp notation for Unicode characters. I think Emacs should support it too. In this notation Unicode characters are written as #\u3042 or #\u0002a6b2 with the leading hash mark. We do not in general try to be compatible with Common Lisp on input syntax for characters. So forget this. Also it would be good to support a syntax for named Unicode characters. Common Lisp has the syntax #\euro_sign, and Perl - \N{EURO SIGN}. I tend to think we should not do this now. Does Emacs have a table of these names? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 3:44 ` Richard Stallman @ 2006-05-11 12:03 ` Juri Linkov 2006-05-11 13:16 ` Kenichi Handa 2006-05-12 4:15 ` Richard Stallman 0 siblings, 2 replies; 202+ messages in thread From: Juri Linkov @ 2006-05-11 12:03 UTC (permalink / raw) Cc: kehoea, emacs-devel > Support for \u and \U in Emacs Lisp would be good since other _Lisp_ languages > support \uXXXX and \UXXXXXXXX as well. But other Lisp languages support > also Lisp notation for Unicode characters. I think Emacs should support > it too. In this notation Unicode characters are written as #\u3042 or > #\u0002a6b2 with the leading hash mark. > > We do not in general try to be compatible with Common Lisp on input > syntax for characters. So forget this. The initial `#' character is a valid Emacs hash notation for writing integers in various bases. After adding `\uXXXX' it seems reasonable to add `#\uXXXX' as well. However, there is one difference: Emacs Lisp hash notation doesn't use the backslash `\' after `#', e.g. `#x42', but other Lisps use the backslash in the notation of Unicode characters, e.g. `#\u3042'. I have no opinion which notation is better. > Also it would be good to support a syntax for named Unicode characters. > Common Lisp has the syntax #\euro_sign, and Perl - \N{EURO SIGN}. > > I tend to think we should not do this now. > Does Emacs have a table of these names? The variable `describe-char-unicodedata-file' points to the file `UnicodeData.txt' not distributed currently with Emacs. This could be done in the emacs-unicode branch. I think this question should be considered after the release. -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 12:03 ` Juri Linkov @ 2006-05-11 13:16 ` Kenichi Handa 2006-05-12 4:15 ` Richard Stallman 1 sibling, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-11 13:16 UTC (permalink / raw) Cc: kehoea, rms, emacs-devel In article <878xp8g2a9.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes: >> Also it would be good to support a syntax for named Unicode characters. >> Common Lisp has the syntax #\euro_sign, and Perl - \N{EURO SIGN}. >> >> I tend to think we should not do this now. >> Does Emacs have a table of these names? > The variable `describe-char-unicodedata-file' points to the file > `UnicodeData.txt' not distributed currently with Emacs. This could be > done in the emacs-unicode branch. I think this question should be > considered after the release. Actually, emacs-unicode already contains various data (including names) extracted from UnicodeData.txt, and get-char-code-property is extended to provide information about a character that is supplied by UnicodeData.txt. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 12:03 ` Juri Linkov 2006-05-11 13:16 ` Kenichi Handa @ 2006-05-12 4:15 ` Richard Stallman 2006-06-03 18:44 ` Aidan Kehoe [not found] ` <17537.54719.354843.89030@parhasard.net> 1 sibling, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-12 4:15 UTC (permalink / raw) Cc: kehoea, emacs-devel The initial `#' character is a valid Emacs hash notation for writing integers in various bases. After adding `\uXXXX' it seems reasonable to add `#\uXXXX' as well. However, there is one difference: Emacs Lisp hash notation doesn't use the backslash `\' after `#', e.g. `#x42', but other Lisps use the backslash in the notation of Unicode characters, e.g. `#\u3042'. I have no opinion which notation is better. I think it is better to be consistent with the existing Emacs Lisp constructs. ^ permalink raw reply [flat|nested] 202+ messages in thread
* [PATCH] Unicode Lisp reader escapes. 2006-05-12 4:15 ` Richard Stallman @ 2006-06-03 18:44 ` Aidan Kehoe [not found] ` <17537.54719.354843.89030@parhasard.net> 1 sibling, 0 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-06-03 18:44 UTC (permalink / raw) Jonas Jacobson just sent me confirmation that my once again signed assignments have been received, together with PDF copies of same. Given that, here is my final version of the patch I proposed in my first mail; differences from that version are an entry in the NEWS file, some prose style changes in the manual, and a GCPRO to protect readcharfun in lread.c. etc/ChangeLog addition: 2006-06-03 Aidan Kehoe <kehoea@parhasard.net> * NEWS: Describe the new syntax for specifying characters with Unicode escapes. lispref/ChangeLog addition: 2006-06-03 Aidan Kehoe <kehoea@parhasard.net> * objects.texi (Character Type): Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF specifies Unicode characters U+ABCD and U+ABCDEF respectively. src/ChangeLog addition: 2006-06-03 Aidan Kehoe <kehoea@parhasard.net> * lread.c (read_escape): Provide a Unicode character escape syntax; \u followed by exactly four or \U followed by exactly eight hex digits in a comment or string is read as a Unicode character with that code point. GNU Emacs Trunk source patch: Diff command: cvs -q diff -u Files affected: src/lread.c lispref/objects.texi etc/NEWS Index: etc/NEWS =================================================================== RCS file: /sources/emacs/emacs/etc/NEWS,v retrieving revision 1.1337 diff -u -u -r1.1337 NEWS --- etc/NEWS 2 May 2006 01:47:57 -0000 1.1337 +++ etc/NEWS 3 Jun 2006 18:16:51 -0000 @@ -3772,6 +3772,13 @@ been declared obsolete. +++ +*** New syntax: \uXXXX and \UXXXXXXXX specify Unicode code points in hex. 
+Use "\u0428" to specify a string consisting of CYRILLIC CAPITAL LETTER SHA, +or "\U0001D6E2" to specify one consisting of MATHEMATICAL ITALIC CAPITAL +ALPHA (the latter is greater than #xFFFF and thus needs the longer +syntax). Also available for characters. + ++++ ** Displaying warnings to the user. See the functions `warn' and `display-warning', or the Lisp Manual. Index: lispref/objects.texi =================================================================== RCS file: /sources/emacs/emacs/lispref/objects.texi,v retrieving revision 1.53 diff -u -u -r1.53 objects.texi --- lispref/objects.texi 1 May 2006 15:05:48 -0000 1.53 +++ lispref/objects.texi 3 Jun 2006 18:16:52 -0000 @@ -431,6 +431,20 @@ bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper. @end ifnottex +@cindex unicode character escape + Emacs provides a syntax for specifying characters by their Unicode code +points. @code{?\uABCD} represents a character that maps to the code +point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files, +Unicode-oriented fonts, etc.). There is a slightly different syntax for +specifying characters with code points above @code{#xFFFF}; +@code{\U00ABCDEF} represents an Emacs character that maps to the code +point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs +character exists. + + Unlike in some other languages, while this syntax is available for +character literals, and (see later) in strings, it is not available +elsewhere in your Lisp source code. 
+ @cindex @samp{\} in character constant @cindex backslash in character constant @cindex octal character code Index: src/lread.c =================================================================== RCS file: /sources/emacs/emacs/src/lread.c,v retrieving revision 1.350 diff -u -u -r1.350 lread.c --- src/lread.c 27 Feb 2006 02:04:35 -0000 1.350 +++ src/lread.c 3 Jun 2006 18:16:54 -0000 @@ -1743,6 +1743,9 @@ int *byterep; { register int c = READCHAR; + /* \u allows up to four hex digits, \U up to eight. Default to the + behaviour for \u, and change this value in the case that \U is seen. */ + int unicode_hex_count = 4; *byterep = 0; @@ -1907,6 +1910,52 @@ return i; } + case 'U': + /* Post-Unicode-2.0: Up to eight hex chars */ + unicode_hex_count = 8; + case 'u': + + /* A Unicode escape. We only permit them in strings and characters, + not arbitrarily in the source code as in some other languages. */ + { + int i = 0; + int count = 0; + Lisp_Object lisp_char; + struct gcpro gcpro1; + + while (++count <= unicode_hex_count) + { + c = READCHAR; + /* isdigit(), isalpha() may be locale-specific, which we don't + want. */ + if (c >= '0' && c <= '9') i = (i << 4) + (c - '0'); + else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10; + else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10; + else + { + error ("Non-hex digit used for Unicode escape"); + break; + } + } + + GCPRO1 (readcharfun); + lisp_char = call2(intern("decode-char"), intern("ucs"), + make_number(i)); + UNGCPRO; + + if (EQ(Qnil, lisp_char)) + { + /* This is ugly and horrible and trashes the user's data. */ + XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, + 34 + 128, 46 + 128)); + return i; + } + else + { + return XFASTINT (lisp_char); + } + } + default: if (BASE_LEADING_CODE_P (c)) c = read_multibyte (c, readcharfun); -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
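The NEWS entry above notes that code points greater than #xFFFF need the longer syntax. A small C sketch of choosing between the two forms when writing an escape out is given below. It is purely illustrative and not part of the patch; `format_unicode_escape` is a hypothetical helper, and rejecting values above 0x10FFFF is an assumption based on the Unicode code space rather than on anything Emacs checks here.

```c
#include <assert.h>
#include <stdio.h>

/* Write the escape for code point CP into BUF, which holds SIZE bytes.
   Code points up to 0xFFFF use the four-digit \u form; larger ones
   use the eight-digit \U form.  Returns the number of characters
   written (excluding the NUL), or -1 if CP is above 0x10FFFF or BUF
   is too small. */
static int format_unicode_escape(unsigned long cp, char *buf, size_t size)
{
    int n;

    assert(buf != NULL);
    if (cp > 0x10FFFFUL)
        return -1;
    if (cp <= 0xFFFFUL)
        n = snprintf(buf, size, "\\u%04lX", cp);
    else
        n = snprintf(buf, size, "\\U%08lX", cp);
    if (n < 0 || (size_t) n >= size)
        return -1;          /* output error or truncation */
    return n;
}
```

For the two characters named in the NEWS entry, this produces `\u0428` for CYRILLIC CAPITAL LETTER SHA and `\U0001D6E2` for MATHEMATICAL ITALIC CAPITAL ALPHA.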
[parent not found: <17537.54719.354843.89030@parhasard.net>]
[parent not found: <ufyieqj0v.fsf@gnu.org>]
* Re: [PATCH] Unicode Lisp reader escapes. [not found] ` <ufyieqj0v.fsf@gnu.org> @ 2006-06-15 18:38 ` Aidan Kehoe 2006-06-17 18:57 ` Eli Zaretskii 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-06-15 18:38 UTC (permalink / raw) Cc: emacs-pretest-bug, emacs-devel > if (EQ(Qnil, lisp_char)) > { > /* This is ugly and horrible and trashes the user's data. */ > XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, > 34 + 128, 46 + 128)); > return i; > } > > What is this special Katakana character, and why are we producing it? Firstly, thank you for posing the question; the character intended was not a member of JISX0201 at all, rather of JISX0208. I yanked the wrong charset identifier from charset.h when porting the code from XEmacs. The patch below addresses this. (make-char 'japanese-jisx0208 34 46) gives U+3013 GETA MARK, a character in JISX 0208 that is used to represent unknown or corrupted data. The Unicode-specific equivalent is U+FFFD REPLACEMENT CHARACTER. I used the GETA MARK because I was certain it would be available in Mule and it is equivalent. It turns out that (make-char 'mule-unicode-e000-ffff 117 61) gives U+FFFD, so it might be worthwhile to replace that. > Is it to trigger an "Invalid character" message, or is something else > going on here? It doesn’t actually trigger a message, it displays a character to be interpreted as “the character couldn’t be interpreted.” My feeling is that the syntax should be close in its behaviour to what the coding systems do, and when the coding systems see a code point that is valid but that they can’t interpret, they trash the user’s data. (Or do something totally mad like transform invalid UTF-16 to invalid UTF-8!?) src/ChangeLog addition: 2006-06-14 Aidan Kehoe <kehoea@parhasard.net> * lread.c (read_escape): Change charset_katakana_jisx0201 to charset_jisx0208 as it should have been in the first place, since we intended U+3013 GETA MARK. 
GNU Emacs Trunk source patch: Diff command: cvs -q diff -u Files affected: src/lread.c Index: src/lread.c =================================================================== RCS file: /sources/emacs/emacs/src/lread.c,v retrieving revision 1.353 diff -u -u -r1.353 lread.c --- src/lread.c 9 Jun 2006 18:22:30 -0000 1.353 +++ src/lread.c 14 Jun 2006 06:57:49 -0000 @@ -1967,7 +1967,7 @@ if (EQ(Qnil, lisp_char)) { /* This is ugly and horrible and trashes the user's data. */ - XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, + XSETFASTINT (i, MAKE_CHAR (charset_jisx0208, 34 + 128, 46 + 128)); return i; } -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes. 2006-06-15 18:38 ` Aidan Kehoe @ 2006-06-17 18:57 ` Eli Zaretskii 2006-06-18 16:11 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-06-17 18:57 UTC (permalink / raw) Cc: emacs-pretest-bug, emacs-devel > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Thu, 15 Jun 2006 20:38:06 +0200 > Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org > > > > Is it to trigger an "Invalid character" message, or is something else > > going on here? > > It doesn't actually trigger a message, it displays a character to be > interpreted as ``the character couldn't be interpreted.'' But in my testing, I do see an "Invalid character" message. Could you please show an example of using this new function to produce this special ``character that couldn't be interpreted''? > My feeling is that the syntax should be close in its behaviour to what the > coding systems do, and when the coding systems see a code point that is > valid but that they can't interpret, they trash the user's data. This function is not about coding systems, it's about character sets. Coding systems already replace unsupported characters with `?' (other applications behave like that as well), so perhaps we should use some more conventional character here. Does anyone have an opinion? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes. 2006-06-17 18:57 ` Eli Zaretskii @ 2006-06-18 16:11 ` Aidan Kehoe 2006-06-18 19:55 ` Eli Zaretskii 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-06-18 16:11 UTC (permalink / raw) Cc: emacs-pretest-bug, emacs-devel Ar an seachtú lá déag de mí Meitheamh, scríobh Eli Zaretskii: > > > Is it to trigger an "Invalid character" message, or is something else > > > going on here? > > > > It doesn't actually trigger a message, it displays a character to be > > interpreted as ``the character couldn't be interpreted.'' > > But in my testing, I do see an "Invalid character" message. Yes. That’s because I yanked the wrong charset from charset.h when porting the code from XEmacs, and the attempt to create a two-dimensional character in JISX0201 fails, as it should, since JISX0201 is a one-dimensional character set. The code, as intended, doesn’t trigger the message. As it was written, to my discredit, it did. > Could you please show an example of using this new function to produce > this special ``character that couldn't be interpreted''? > > My feeling is that the syntax should be close in its behaviour to what the > > coding systems do, and when the coding systems see a code point that is > > valid but that they can't interpret, they trash the user's data. > > This function is not about coding systems, it's about character sets. This function is about transformation from an external format to the editor’s internal format, which is a big part of what coding systems do. So some parallels in our approach are reasonable. > Coding systems already replace unsupported characters with `?' (other > applications behave like that as well), so perhaps we should use some > more conventional character here. > Does anyone have an opinion? Perhaps, indeed. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes. 2006-06-18 16:11 ` Aidan Kehoe @ 2006-06-18 19:55 ` Eli Zaretskii 2006-06-20 2:37 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-06-18 19:55 UTC (permalink / raw) Cc: emacs-pretest-bug, emacs-devel > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Sun, 18 Jun 2006 18:11:06 +0200 > Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org > > > Coding systems already replace unsupported characters with `?' (other > > applications behave like that as well), so perhaps we should use some > > more conventional character here. > > Does anyone have an opinion? > > Perhaps, indeed. Handa-san, could you please comment on this issue? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes. 2006-06-18 19:55 ` Eli Zaretskii @ 2006-06-20 2:37 ` Kenichi Handa 2006-06-20 17:56 ` Richard Stallman 2006-06-23 18:35 ` Aidan Kehoe 0 siblings, 2 replies; 202+ messages in thread From: Kenichi Handa @ 2006-06-20 2:37 UTC (permalink / raw) Cc: kehoea, emacs-pretest-bug, emacs-devel In article <uk67e2q96.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes: >> From: Aidan Kehoe <kehoea@parhasard.net> >> Date: Sun, 18 Jun 2006 18:11:06 +0200 >> Cc: emacs-pretest-bug@gnu.org, emacs-devel@gnu.org >> >> > Coding systems already replace unsupported characters with `?' (other >> > applications behave like that as well), so perhaps we should use some >> > more conventional character here. >> > Does anyone have an opinion? >> >> Perhaps, indeed. > Handa-san, could you please comment on this issue? First of all, the coding system (utf-8) doesn't replace unsupported characters with '?' on decoding. It preserves the original byte sequence and attaches a special text property to display it as the Unicode replacement character U+FFFD. But, as we can't do that in read_escape, I propose to simply signal an error as unsupported character. I think anything else leads to unexpected behavior. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes. 2006-06-20 2:37 ` Kenichi Handa @ 2006-06-20 17:56 ` Richard Stallman 2006-06-23 18:35 ` Aidan Kehoe 1 sibling, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-06-20 17:56 UTC (permalink / raw) Cc: kehoea, emacs-pretest-bug, emacs-devel But, as we can't do that in read_escape, I propose to simply signal an error as unsupported character. I think anything else leads to unexpected behavior. That seems right to me. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes. 2006-06-20 2:37 ` Kenichi Handa 2006-06-20 17:56 ` Richard Stallman @ 2006-06-23 18:35 ` Aidan Kehoe 2006-06-24 6:50 ` Eli Zaretskii 1 sibling, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-06-23 18:35 UTC (permalink / raw) Cc: emacs-pretest-bug, Eli Zaretskii, emacs-devel Ar an fichiú lá de mí Meitheamh, scríobh Kenichi Handa: > But, as we can't do that in read_escape, I propose to simply > signal an error as unsupported character. I think anything > else leads to unexpected behavior. Okay, here’s a patch to implement that behaviour. src/ChangeLog addition: 2006-06-23 Aidan Kehoe <kehoea@parhasard.net> * lread.c (read_escape): Instead of creating a place-holder character when an unknown Unicode code point is encountered as a string or character escape, signal an error. GNU Emacs Trunk source patch: Diff command: cvs -q diff -u Files affected: src/lread.c Index: src/lread.c =================================================================== RCS file: /sources/emacs/emacs/src/lread.c,v retrieving revision 1.353 diff -u -u -r1.353 lread.c --- src/lread.c 9 Jun 2006 18:22:30 -0000 1.353 +++ src/lread.c 23 Jun 2006 18:24:28 -0000 @@ -1964,17 +1964,12 @@ make_number(i)); UNGCPRO; - if (EQ(Qnil, lisp_char)) + if (NILP(lisp_char)) { - /* This is ugly and horrible and trashes the user's data. */ - XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201, - 34 + 128, 46 + 128)); - return i; - } - else - { - return XFASTINT (lisp_char); + error ("No support for Unicode code point U+%x", i); } + + return XFASTINT (lisp_char); } default: -- Santa Maradona, priez pour moi! ^ permalink raw reply [flat|nested] 202+ messages in thread
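The policy change in this patch, reporting an error rather than silently substituting a placeholder such as GETA MARK, can be sketched outside Emacs as follows. This is an illustration under stated assumptions, not the lread.c code: `decode_code_point` is a hypothetical stand-in for the call to `decode-char`, and its notion of "unsupported" (values above 0x10FFFF, or surrogates) is assumed for the example, whereas in Emacs the answer depends on which character sets are available. Only the wording of the message follows the patch.

```c
#include <assert.h>
#include <stdio.h>

enum decode_status { DECODE_OK, DECODE_UNSUPPORTED };

/* Hypothetical stand-in for decode-char.  On success, stores the
   character in *OUT and returns DECODE_OK.  On failure, leaves *OUT
   alone, writes a message into ERRBUF and returns DECODE_UNSUPPORTED,
   so the caller cannot mistake a placeholder for real data. */
static enum decode_status
decode_code_point(unsigned long cp, unsigned long *out,
                  char *errbuf, size_t errsize)
{
    assert(out != NULL && errbuf != NULL);

    /* Assumption for this sketch only: anything outside the Unicode
       scalar value range counts as unsupported. */
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL)) {
        snprintf(errbuf, errsize,
                 "No support for Unicode code point U+%lx", cp);
        return DECODE_UNSUPPORTED;
    }
    *out = cp;
    return DECODE_OK;
}
```

The design point is the one made above by Kenichi Handa: a caller that receives an explicit failure can stop reading, while a caller that receives a substitute character has no way to tell that the source text was not honoured.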
* Re: [PATCH] Unicode Lisp reader escapes. 2006-06-23 18:35 ` Aidan Kehoe @ 2006-06-24 6:50 ` Eli Zaretskii 0 siblings, 0 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-06-24 6:50 UTC (permalink / raw) Cc: emacs-pretest-bug, emacs-devel, handa > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Fri, 23 Jun 2006 20:35:00 +0200 > Cc: Eli Zaretskii <eliz@gnu.org>, emacs-pretest-bug@gnu.org, > emacs-devel@gnu.org > > > But, as we can't do that in read_escape, I propose to simply > > signal an error as unsupported character. I think anything > > else leads to unexpected behavior. > > Okay, here's a patch to implement that behaviour. > > src/ChangeLog addition: > > 2006-06-23 Aidan Kehoe <kehoea@parhasard.net> > > * lread.c (read_escape): > Instead of creating a place-holder character when an unknown > Unicode code point is encountered as a string or character escape, > signal an error. Thanks, installed. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-04-29 15:35 [PATCH] Unicode Lisp reader escapes Aidan Kehoe 2006-04-29 23:26 ` Stefan Monnier 2006-04-30 3:04 ` Richard Stallman @ 2006-05-02 6:43 ` Kenichi Handa 2006-05-02 7:00 ` Aidan Kehoe 2006-05-02 10:36 ` Eli Zaretskii 2 siblings, 2 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-02 6:43 UTC (permalink / raw) Cc: emacs-devel In article <17491.34779.959316.484740@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes: [...] > 2006-04-29 Aidan Kehoe <kehoea@parhasard.net> > * lread.c (read_escape): > Provide a Unicode character escape syntax; \u followed by exactly > four or \U followed by exactly eight hex digits in a comment or > string is read as a Unicode character with that code point. [...] > + lisp_char = call2(intern("decode-char"), intern("ucs"), > + make_number(i)); > + First of all, is it safe to call Lisp program in read_escape? Don't we have to care about GC and buffer/string-data relocation? --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 6:43 ` Kenichi Handa @ 2006-05-02 7:00 ` Aidan Kehoe 2006-05-02 10:45 ` Eli Zaretskii 2006-05-02 11:33 ` Kenichi Handa 2006-05-02 10:36 ` Eli Zaretskii 1 sibling, 2 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-05-02 7:00 UTC (permalink / raw) Cc: emacs-devel Ar an dara lá de mí Bealtaine, scríobh Kenichi Handa: > > + lisp_char = call2(intern("decode-char"), intern("ucs"), > > + make_number(i)); > > + > > First of all, is it safe to call Lisp program in read_escape? Don't we > have to care about GC and buffer/string-data relocation? Yay, a technical objection. If it isn’t safe to call a Lisp program in read_escape, then the function is full of bugs already. It’s called with three arguments, a Lisp_Object readcharfun, an integer, and a pointer to an integer. If readcharfun is a Lisp function (it may not be, it may be a buffer, a marker, or a string), then that Lisp function is called on line 348. Cf. the documentation of `read', which describes that the input may be from a function. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 7:00 ` Aidan Kehoe @ 2006-05-02 10:45 ` Eli Zaretskii 2006-05-02 11:13 ` Aidan Kehoe 2006-05-02 11:33 ` Kenichi Handa 1 sibling, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-05-02 10:45 UTC (permalink / raw) Cc: emacs-devel, handa > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Tue, 2 May 2006 09:00:52 +0200 > Cc: emacs-devel@gnu.org > > > First of all, is it safe to call Lisp program in read_escape? Don't we > > have to care about GC and buffer/string-data relocation? > > Yay, a technical objection. I don't know what you mean: the other objections were technical as well. > If it isn't safe to call a Lisp program in read_escape, then the function is > full of bugs already. ``Full of bugs''? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 10:45 ` Eli Zaretskii @ 2006-05-02 11:13 ` Aidan Kehoe 2006-05-02 19:31 ` Eli Zaretskii 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-02 11:13 UTC (permalink / raw) Cc: emacs-devel Ar an dara lá de mí Bealtaine, scríobh Eli Zaretskii: > > > First of all, is it safe to call Lisp program in read_escape? Don't we > > > have to care about GC and buffer/string-data relocation? > > > > Yay, a technical objection. > > I don't know what you mean: the other objections were technical as > well. I would rate questions of aesthetics (“ugliness”) and prose style as non-technical. I don’t propose to impose that judgement on you, but I do think it reasonable. > > If it isn't safe to call a Lisp program in read_escape, then the > > function is full of bugs already. > > ``Full of bugs''? Indeed; each READCHAR can call arbitrary Lisp, so something like case 'M': c = READCHAR; if (c != '-') error ("Invalid escape character syntax"); c = READCHAR; if (c == '\\') c = read_escape (readcharfun, 0, byterep); return c | meta_modifier; has two clear bugs in eight lines. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 11:13 ` Aidan Kehoe @ 2006-05-02 19:31 ` Eli Zaretskii 2006-05-02 20:25 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-05-02 19:31 UTC (permalink / raw) Cc: emacs-devel > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Tue, 2 May 2006 13:13:00 +0200 > Cc: emacs-devel@gnu.org > > > I don't know what you mean: the other objections were technical as > > well. > > I would rate questions of aesthetics ``ugliness'' and prose style as > non-technical. I don't propose to impose that judgement on you, but I do > think it reasonable. The discussion was about quite a few technical issues, only one of which was aesthetics. > > ``Full of bugs''? > > Indeed; each READCHAR can call arbitrary Lisp, so something like > > case 'M': > c = READCHAR; > if (c != '-') > error ("Invalid escape character syntax"); > c = READCHAR; > if (c == '\\') > c = read_escape (readcharfun, 0, byterep); > return c | meta_modifier; > > has two clear bugs in eight lines. Yeah, right. If you want your suggestions and opinions to be considered seriously, my advice is to drop the attitude. But I won't impose that advice on you. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 19:31 ` Eli Zaretskii @ 2006-05-02 20:25 ` Aidan Kehoe 2006-05-02 22:16 ` Oliver Scholz 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-02 20:25 UTC (permalink / raw) Cc: emacs-devel Ar an dara lá de mí Bealtaine, scríobh Eli Zaretskii: > > > I don't know what you mean: the other objections were technical as > > > well. > > > > I would rate questions of aesthetics ``ugliness'' and prose style as > > non-technical. I don't propose to impose that judgement on you, but I do > > think it reasonable. > > The discussion was about quite a few technical issues, only one of > which was aesthetics. I proposed a working patch, Richard Stallman suggested an alternative approach on the grounds that having both '\u' and '\U' was ugly. (He made that clear after asking what the reason for having both of them was.) He then commented that the functionality of the patch would be available in GNU Emacs once the Unicode branch was merged, apparently ignoring what I had written on that in my first mail. Stefan Monnier commented that workarounds were available; that was more relevant comment than objection, IMO. Jonathan Yavner then objected to Richard’s objection, on the basis that my already submitted patch followed a widely-implemented standard that Richard’s alternative didn’t. Miles Bader proposed an alternative to my patch, without objecting, to which I didn’t follow up, because I wanted to see how people would react to Jonathan’s mentioning of the existing standardisation of the escape. Oliver Scholz said that the syntax for \u and \x should be entirely in parallel “I[h]NSHO.” And that is what had been posted directly in relation to my patch (as opposed to in reaction to Richard’s proposed alterative) when you said that the other objections were technical as well. 
It seems to me that the only objections there are Richard’s, on the grounds of ugliness, and Oliver’s, on the unexplained grounds of what I imagine is his individual philosophy. I’d love to know what other objections you saw before your posting; my email etiquette is far from perfect, and feedback is always welcome. > > > ``Full of bugs''? > > > > Indeed; each READCHAR can call arbitrary Lisp, so something like > > > > case 'M': > > c = READCHAR; > > if (c != '-') > > error ("Invalid escape character syntax"); > > c = READCHAR; > > if (c == '\\') > > c = read_escape (readcharfun, 0, byterep); > > return c | meta_modifier; > > > > has two clear bugs in eight lines. > > Yeah, right. If you want your suggestions and opinions to be > considered seriously, my advice is to drop the attitude. But I won't > impose that advice on you. I would refer you to Kenichi Handa’s reply to that mail (that is, to 17495.932.70900.796282@parhasard.net ) for pointers on how to write what, IM, especially Humble this time, O, is a much more constructive answer. Best regards, Aidan -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 20:25 ` Aidan Kehoe @ 2006-05-02 22:16 ` Oliver Scholz 0 siblings, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-02 22:16 UTC (permalink / raw) Cc: kehoea Aidan Kehoe <kehoea@parhasard.net> writes: [...] > It seems to me that the only objections there are Richard’s, on the grounds > of ugliness, and Oliver’s, on the unexplained grounds of what I imagine is > his individual philosophy. Though unexplained, it was not based on individual philosophy but on principles which I assume to be shared by most readers here (and which therefore don't need explanation unless somebody explicitly asks for one). The principle is that the Lisp API should be as consistent and regular as possible in order to minimise possible sources of surprise for the user. \x and \u are so similar in what they do that there should be very strong reasons for a difference in their syntax. As for \u and \U with fixed numbers of digits: it might be standard in other languages, but for Lisp it is entirely alien. My comment was not an objection. On the contrary, I am a believer here. I think having a syntax for UCS characters in the next release would be a very important addition. (That's why I raised my voice in the first place.) You mentioned the reasons already, so there's no need to repeat them here. The way *I* understand the discussion, the only real objection still standing in the room is Richard's concern that \u would become obsolete as soon as Emacs switches to an internal UCS encoding. I still disagree, but I see his point. The question is whether the portability provided by \u is considered to be more important than the (perceived) redundancy in the future. As for implementing decode-char in C: that should be really trivial. Oliver -- 13 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 7:00 ` Aidan Kehoe 2006-05-02 10:45 ` Eli Zaretskii @ 2006-05-02 11:33 ` Kenichi Handa 2006-05-02 22:50 ` Aidan Kehoe 1 sibling, 1 reply; 202+ messages in thread From: Kenichi Handa @ 2006-05-02 11:33 UTC (permalink / raw) Cc: emacs-devel In article <17495.932.70900.796282@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes: >> First of all, is it safe to call Lisp program in read_escape? Don't we >> have to care about GC and buffer/string-data relocation? > Yay, a technical objection. > If it isn’t safe to call a Lisp program in read_escape, then the function is > full of bugs already. It’s called with three arguments, a Lisp_Object > readcharfun, an integer, and a pointer to an integer. If readcharfun is a > Lisp function (it may not be, it may be a buffer, a marker, or a string), > then that Lisp function is called on line 348. Cf. the documentation of > `read', which describes that the input may be from a function. What I concern is the case that readcharfun is a string or a buffer. In that case, of course, the current code doesn't call Lisp in read_escape. So, there's no need of GCPRO readcharfun. But, if Lisp is called even if readcharfun is a string, I think we should GCPRO it. Is it already done? (Sorry, I don't have a time to check lread.c by myself) --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 11:33 ` Kenichi Handa @ 2006-05-02 22:50 ` Aidan Kehoe 2006-05-03 7:43 ` Kenichi Handa 2006-05-03 17:21 ` Kevin Rodgers 0 siblings, 2 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-05-02 22:50 UTC (permalink / raw) Cc: emacs-devel Ar an dara lá de mí Bealtaine, scríobh Kenichi Handa: > > Yay, a technical objection. > > > If it isn’t safe to call a Lisp program in read_escape, then the > > function is full of bugs already. It’s called with three arguments, a > > Lisp_Object readcharfun, an integer, and a pointer to an integer. If > > readcharfun is a Lisp function (it may not be, it may be a buffer, a > > marker, or a string), then that Lisp function is called on line 348. > > Cf. the documentation of `read', which describes that the input may be > > from a function. > > What I concern is the case that readcharfun is a string or a buffer. In > that case, of course, the current code doesn't call Lisp in read_escape. > So, there's no need of GCPRO readcharfun. > > But, if Lisp is called even if readcharfun is a string, I think we should > GCPRO it. Is it already done? (Sorry, I don't have a time to check > lread.c by myself) I’m reasonably sure it’s already done in the callers of read1, but I don’t have graphing software to hand, and the English for the reasoning I’ve written out is unreadably tedious. So, sure, GCPROing seems worth the time. Do you mean to GCPRO independent of what type readcharfun is? -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 22:50 ` Aidan Kehoe @ 2006-05-03 7:43 ` Kenichi Handa 2006-05-03 17:21 ` Kevin Rodgers 1 sibling, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-03 7:43 UTC (permalink / raw) Cc: emacs-devel In article <17495.57895.90438.848865@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes: >> But, if Lisp is called even if readcharfun is a string, I think we should >> GCPRO it. Is it already done? (Sorry, I don't have a time to check >> lread.c by myself) > I’m reasonably sure it’s already done in the callers of read1, but I don’t > have graphing software to hand, and the English for the reasoning I’ve > written out is unreadably tedious. So, sure, GCPROing seems worth the time. > Do you mean to GCPRO independent of what type readcharfun is? I have not yet considered in depth what we should GCPRO and where to do that. But, as I replied to Eli's mail, I now think that implementing decode-char in C is better, provided that it is decided to handle \u.... in read_escape. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 22:50 ` Aidan Kehoe 2006-05-03 7:43 ` Kenichi Handa @ 2006-05-03 17:21 ` Kevin Rodgers 2006-05-03 18:51 ` Andreas Schwab 1 sibling, 1 reply; 202+ messages in thread From: Kevin Rodgers @ 2006-05-03 17:21 UTC (permalink / raw) Aidan Kehoe wrote: > Ar an dara lá de mí Bealtaine, scríobh Kenichi Handa: > > > > Yay, a technical objection. > > > > > If it isn’t safe to call a Lisp program in read_escape, then the > > > function is full of bugs already. It’s called with three arguments, a > > > Lisp_Object readcharfun, an integer, and a pointer to an integer. If > > > readcharfun is a Lisp function (it may not be, it may be a buffer, a > > > marker, or a string), then that Lisp function is called on line 348. > > > Cf. the documentation of `read', which describes that the input may be > > > from a function. > > > > What I concern is the case that readcharfun is a string or a buffer. In > > that case, of course, the current code doesn't call Lisp in read_escape. > > So, there's no need of GCPRO readcharfun. > > > > But, if Lisp is called even if readcharfun is a string, I think we should > > GCPRO it. Is it already done? (Sorry, I don't have a time to check > > lread.c by myself) > > I’m reasonably sure it’s already done in the callers of read1, but I don’t > have graphing software to hand, and the English for the reasoning I’ve > written out is unreadably tedious. So, sure, GCPROing seems worth the time. > Do you mean to GCPRO independent of what type readcharfun is? readcharfun is declared as a Lisp_Object in read1, so it should be possible to check its type and only GCPRO when necessary. -- Kevin ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 17:21 ` Kevin Rodgers @ 2006-05-03 18:51 ` Andreas Schwab 2006-05-04 21:14 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Andreas Schwab @ 2006-05-03 18:51 UTC (permalink / raw) Cc: emacs-devel Kevin Rodgers <ihs_4664@yahoo.com> writes: > readcharfun is declared as a Lisp_Object in read1, so it should be > possible to check it's type and only GCPRO when necessary. I don't see any need to GCPRO readcharfun. When called from Lisp the arguments are already protected by being part of the call frame, and all uses from C protect the object by other means (eg, by being put on eval-buffer-list). Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 18:51 ` Andreas Schwab @ 2006-05-04 21:14 ` Aidan Kehoe 2006-05-08 1:31 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-04 21:14 UTC (permalink / raw) Cc: emacs-devel Ar an triú lá de mí Bealtaine, scríobh Andreas Schwab: > Kevin Rodgers <ihs_4664@yahoo.com> writes: > > > readcharfun is declared as a Lisp_Object in read1, so it should be > > possible to check its type and only GCPRO when necessary. > > I don't see any need to GCPRO readcharfun. When called from Lisp the > arguments are already protected by being part of the call frame, and all > uses from C protect the object by other means (eg, by being put on > eval-buffer-list). That was my understanding of the code too. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 21:14 ` Aidan Kehoe @ 2006-05-08 1:31 ` Kenichi Handa 2006-05-08 6:54 ` Aidan Kehoe 2006-05-08 13:55 ` Stefan Monnier 0 siblings, 2 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-08 1:31 UTC (permalink / raw) Cc: schwab, emacs-devel In article <17498.28361.392872.954484@parhasard.net>, Aidan Kehoe <kehoea@parhasard.net> writes: >> Kevin Rodgers <ihs_4664@yahoo.com> writes: >> >> > readcharfun is declared as a Lisp_Object in read1, so it should be >> > possible to check it's type and only GCPRO when necessary. >> >> I don't see any need to GCPRO readcharfun. When called from Lisp the >> arguments are already protected by being part of the call frame, and all >> uses from C protect the object by other means (eg, by being put on >> eval-buffer-list). > That was my understanding of the code too. For instance, Fread is called from Fcall_interactively as below: Lisp_Object tem; [...] tem = Fread_from_minibuffer (build_string (callint_message), Qnil, Qnil, Qnil, Qnil, Qnil, Qnil, Qnil); if (! STRINGP (tem) || SCHARS (tem) == 0) args[i] = Qnil; else args[i] = Fread (tem); In the calling sequence of Fread->read_internal_start->read0->read1, I see no place where the original `tem' is GCPROed. Do I overlook something? --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 1:31 ` Kenichi Handa @ 2006-05-08 6:54 ` Aidan Kehoe 2006-05-08 13:55 ` Stefan Monnier 1 sibling, 0 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-05-08 6:54 UTC (permalink / raw) Cc: schwab, emacs-devel Ar an t-ochtú lá de mí Bealtaine, scríobh Kenichi Handa: > >> > readcharfun is declared as a Lisp_Object in read1, so it should be > >> > possible to check it's type and only GCPRO when necessary. > >> > >> I don't see any need to GCPRO readcharfun. When called from Lisp the > >> arguments are already protected by being part of the call frame, and > >> all uses from C protect the object by other means (eg, by being put on > >> eval-buffer-list). > > > That was my understanding of the code too. > > For instance, Fread is called from Fcall_interactively as > below: > > Lisp_Object tem; > [...] > tem = Fread_from_minibuffer (build_string (callint_message), > Qnil, Qnil, Qnil, Qnil, Qnil, > Qnil, Qnil); > if (! STRINGP (tem) || SCHARS (tem) == 0) > args[i] = Qnil; > else > args[i] = Fread (tem); > > In the calling sequence of Fread->read_internal_start->read0->read1, I > see no place where the original `tem' is GCPROed. Do I overlook > something? I believe not, it does need to be protected. Also, my understanding of the above code is that build_string allocates memory for a Lisp string, that is not visible from Lisp, and that will not be GCPROed. So if garbage collection happens during Fread_from_minibuffer, it may disappear. Ben Wing, in the XEmacs internals manual, says this: 12. Be careful of traps, like calling `Fcons()' in the argument to another function. By the "caller protects" law, you should be `GCPRO'ing the newly-created cons, but you aren't. A certain number of functions that are commonly called on freshly created stuff (e.g. 
`nconc2()', `Fsignal()') break the "caller protects" law and go ahead and `GCPRO' their arguments so as to simplify things, but make sure and check if it's OK whenever doing something like this. This seems to me equivalent to calling Fcons in the argument to another function. Is GNU Emacs different in this? -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 1:31 ` Kenichi Handa 2006-05-08 6:54 ` Aidan Kehoe @ 2006-05-08 13:55 ` Stefan Monnier 2006-05-08 14:24 ` Aidan Kehoe 2006-05-09 0:36 ` Kenichi Handa 1 sibling, 2 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-08 13:55 UTC (permalink / raw) Cc: Aidan Kehoe, schwab, emacs-devel > For instance, Fread is called from Fcall_interactively as > below: > Lisp_Object tem; > [...] > tem = Fread_from_minibuffer (build_string (callint_message), > Qnil, Qnil, Qnil, Qnil, Qnil, > Qnil, Qnil); > if (! STRINGP (tem) || SCHARS (tem) == 0) > args[i] = Qnil; > else > args[i] = Fread (tem); > In the calling sequence of > Fread-> read_internal_start->read0->read1, I see no place > where the original `tem' is GCPROed. Do I overlook > something? Why would it need to be protected? it's not used afterwards. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 13:55 ` Stefan Monnier @ 2006-05-08 14:24 ` Aidan Kehoe 2006-05-08 15:32 ` Stefan Monnier 2006-05-09 0:36 ` Kenichi Handa 1 sibling, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-08 14:24 UTC (permalink / raw) Cc: emacs-devel Ar an t-ochtú lá de mí Bealtaine, scríobh Stefan Monnier: > > In the calling sequence of > > Fread-> read_internal_start->read0->read1, I see no place > > where the original `tem' is GCPROed. Do I overlook > > something? > > Why would it need to be protected? it's not used afterwards. It can theoretically disappear in the middle of being used. With my patch, if the string consisted of "\u20AC one two", Lisp will be called, the garbage collector may be invoked, and the string overwritten, since to the GC it’s not in use. Then the READCHAR -> retry loop may end up reading incorrect data. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 14:24 ` Aidan Kehoe @ 2006-05-08 15:32 ` Stefan Monnier 2006-05-08 16:39 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-08 15:32 UTC (permalink / raw) Cc: emacs-devel >> > In the calling sequence of >> > Fread-> read_internal_start->read0->read1, I see no place >> > where the original `tem' is GCPROed. Do I overlook >> > something? >> >> Why would it need to be protected? it's not used afterwards. > It can theoretically disappear in the middle of being used. With my patch, > if the string consisted of "\u20AC one two", Lisp will be called, the > garbage collector may be invoked, and the string overwritten, since to the > GC it’s not in use. Then the READCHAR -> retry loop may end up reading > incorrect data. That's of no concern to Fcall_interactively. It's Fread that should GCPRO its argument when needed. So it seems the bug is that read_internal_start calls read0 (which can GC) and uses `stream' afterwards without having GCPRO'd it. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 15:32 ` Stefan Monnier @ 2006-05-08 16:39 ` Aidan Kehoe 2006-05-08 17:39 ` Stefan Monnier 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-08 16:39 UTC (permalink / raw) Cc: emacs-devel Ar an t-ochtú lá de mí Bealtaine, scríobh Stefan Monnier: > >> > In the calling sequence of Fread-> read_internal_start -> read0 -> > >> > read1, I see no place where the original `tem' is GCPROed. Do I > >> > overlook something? > >> > >> Why would it need to be protected? it's not used afterwards. > > > It can theoretically disappear in the middle of being used. With my patch, > > if the string consisted of "\u20AC one two", Lisp will be called, the > > garbage collector may be invoked, and the string overwritten, since to the > > GC it’s not in use. Then the READCHAR -> retry loop may end up reading > > incorrect data. > > That's of not concern to Fcall_interactively. It's Fread should GCPRO its > argument when needed. Fread is intended to be called from Lisp (it’s a subr). Functions called from Lisp do not need to GCPRO their arguments, because the garbage collector knows about the arguments, as it knows about all objects allocated in Lisp. C code that calls functions intended to be called from Lisp is optimistic at best if, without having checked, it relies on the assumption that the arguments to those functions will be GCPROed. > So it seems the bug is that read_internal_start calls > read0 (which can GC) and uses `stream' afterwards without having GCPRO'd it. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 16:39 ` Aidan Kehoe @ 2006-05-08 17:39 ` Stefan Monnier 2006-05-09 7:04 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-08 17:39 UTC (permalink / raw) Cc: emacs-devel > Fread is intended to be called from Lisp (it’s a subr). Functions called > from Lisp do not need to GCPRO their arguments, because the garbage > collector knows about the arguments, as it knows about all objects > allocated in Lisp. s/called/callable/ Are you sure we have such a convention? > C code that calls functions intended to be called from Lisp is optimistic > at best if, without having checked, it relies on the assumption that that > the arguments to those functions will be GCPROed. As far as I know, the GCPRO convention for arguments is mostly the following: GCPRO args you pass to functions iff those functions can GC and you need to use the arg after the function returns. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 17:39 ` Stefan Monnier @ 2006-05-09 7:04 ` Aidan Kehoe 2006-05-09 19:05 ` Eli Zaretskii 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-09 7:04 UTC (permalink / raw) Cc: emacs-devel Ar an t-ochtú lá de mí Bealtaine, scríobh Stefan Monnier: > > Fread is intended to be called from Lisp (it’s a subr). Functions called > > from Lisp do not need to GCPRO their arguments, because the garbage > > collector knows about the arguments, as it knows about all objects > > allocated in Lisp. > > s/called/callable/ The two are not mutually exclusive :-) . > Are you sure we have such a convention? That in particular is not really a convention, it’s part of the semantics of the Lisp implementation. Objects visible to Lisp are visible to the garbage collector, except in the very specific case where they’re only visible from weak hash tables. > > C code that calls functions intended to be called from Lisp is optimistic > > at best if, without having checked, it relies on the assumption that that > > the arguments to those functions will be GCPROed. > > As far as I know, the GCPRO convention for arguments is mostly the > following: > > GCPRO args you pass to functions iff those functions can GC and you need > to use the arg after the function returns. Okay. Do you know of any document detailing that? No-one followed up to my reference to what Ben Wing writes on the subject. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-09 7:04 ` Aidan Kehoe @ 2006-05-09 19:05 ` Eli Zaretskii 2006-05-10 6:05 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-05-09 19:05 UTC (permalink / raw) Cc: emacs-devel > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Tue, 9 May 2006 09:04:50 +0200 > Cc: emacs-devel@gnu.org > > > As far as I know, the GCPRO convention for arguments is mostly the > > following: > > > > GCPRO args you pass to functions iff those functions can GC and you need > > to use the arg after the function returns. > > Okay. Do you know of any document detailing that? Does the excerpt below from the Lisp manual answer your concerns? > No-one followed up to my reference to what Ben Wing writes on the > subject. AFAIU, he is wrong, or at least inaccurate. But maybe I misunderstand something. From (elisp)Writing Emacs Primitives: Within the function `For' itself, note the use of the macros `GCPRO1' and `UNGCPRO'. `GCPRO1' is used to "protect" a variable from garbage collection--to inform the garbage collector that it must look in that variable and regard its contents as an accessible object. This is necessary whenever you call `Feval' or anything that can directly or indirectly call `Feval'. At such a time, any Lisp object that you intend to refer to again must be protected somehow. `UNGCPRO' cancels the protection of the variables that are protected in the current function. It is necessary to do this explicitly. It suffices to ensure that at least one pointer to each object is GC-protected; as long as the object is not recycled, all pointers to it remain valid. So if you are sure that a local variable points to an object that will be preserved by some other pointer, that local variable does not need a `GCPRO'. (Formerly, strings were an exception to this rule; in older Emacs versions, every pointer to a string needed to be marked by GC.) ^ permalink raw reply [flat|nested] 202+ messages in thread
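[Editorial illustration, not part of the original thread.] The manual excerpt quoted above says that GCPRO1 tells the collector to look in a C variable and treat its contents as reachable. The following is a minimal toy sketch of that idea in C. It is not Emacs's allocator, collector, or its real GCPRO macros; every name in it (toy_alloc, toy_protect, toy_unprotect, toy_gc) is invented for illustration, and the "heap" is just a small fixed array. The only point it tries to show is that an object referenced solely from an unregistered C local is recycled by a collection, while one whose variable was registered survives.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HEAP_SIZE 8
#define TOY_MAX_ROOTS 8

/* A toy heap cell.  Real Lisp objects are far richer; this is an
   assumption-laden stand-in, not Emacs's Lisp_Object.  */
struct toy_object
{
  int in_use;   /* nonzero while the cell is allocated */
  int marked;   /* set during the mark phase of toy_gc */
  int value;    /* payload, only here so tests can look at it */
};

static struct toy_object toy_heap[TOY_HEAP_SIZE];

/* Addresses of the C variables that have been registered as roots,
   loosely analogous to what GCPRO1 records.  */
static struct toy_object **toy_roots[TOY_MAX_ROOTS];
static int toy_nroots;

/* Allocate a cell, or return NULL if the toy heap is full.  */
struct toy_object *
toy_alloc (int value)
{
  int i;
  for (i = 0; i < TOY_HEAP_SIZE; i++)
    if (!toy_heap[i].in_use)
      {
        toy_heap[i].in_use = 1;
        toy_heap[i].marked = 0;
        toy_heap[i].value = value;
        return &toy_heap[i];
      }
  return NULL;
}

/* Register the variable VAR as a root (hypothetical GCPRO1 analogue).  */
void
toy_protect (struct toy_object **var)
{
  assert (toy_nroots < TOY_MAX_ROOTS);
  toy_roots[toy_nroots++] = var;
}

/* Drop the most recently registered root (hypothetical UNGCPRO analogue).  */
void
toy_unprotect (void)
{
  assert (toy_nroots > 0);
  toy_nroots--;
}

/* Mark every object found in a registered variable, then free all
   unmarked cells.  Returns the number of cells freed.  */
int
toy_gc (void)
{
  int i, freed = 0;

  for (i = 0; i < toy_nroots; i++)
    if (*toy_roots[i] != NULL)
      (*toy_roots[i])->marked = 1;

  for (i = 0; i < TOY_HEAP_SIZE; i++)
    {
      if (toy_heap[i].in_use && !toy_heap[i].marked)
        {
          toy_heap[i].in_use = 0;
          freed++;
        }
      toy_heap[i].marked = 0;   /* reset for the next collection */
    }
  return freed;
}
```

In this model, a caller that allocates two objects, registers only one of the two local variables with toy_protect, and then triggers toy_gc will find that the unregistered object's cell has been recycled. That is the hazard the thread is debating for the string passed to Fread; whether and where the real code needs protection is exactly what the participants disagree on, and this sketch takes no position on it.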
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-09 19:05 ` Eli Zaretskii @ 2006-05-10 6:05 ` Aidan Kehoe 2006-05-10 17:49 ` Eli Zaretskii 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-10 6:05 UTC (permalink / raw) Cc: emacs-devel Ar an naoiú lá de mí Bealtaine, scríobh Eli Zaretskii: > > > As far as I know, the GCPRO convention for arguments is mostly the > > > following: > > > > > > GCPRO args you pass to functions iff those functions can GC and you > > > need to use the arg after the function returns. > > > > Okay. Do you know of any document detailing that? > > Does the excerpt below from the Lisp manual answer your concerns? It read ambiguously to me. “Any Lisp object that you intend to refer to again” could be one that you intend to refer to in the bodies of the functions you call. > > No-one followed up to my reference to what Ben Wing writes on the > > subject. > > AFAIU, he is wrong, or at least inaccurate. He’s not wrong, he’s describing conventions within XEmacs, and XEmacs source code does follow those conventions. My question was, are those conventions followed in GNU Emacs? You, and Stefan are telling me they’re not. Okay, you’ve answered my question, thank you. [excerpt snipped] -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 6:05 ` Aidan Kehoe @ 2006-05-10 17:49 ` Eli Zaretskii 2006-05-10 21:37 ` Luc Teirlinck ` (3 more replies) 0 siblings, 4 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-10 17:49 UTC (permalink / raw) Cc: emacs-devel > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Wed, 10 May 2006 08:05:32 +0200 > Cc: emacs-devel@gnu.org > > > > > As far as I know, the GCPRO convention for arguments is mostly the > > > > following: > > > > > > > > GCPRO args you pass to functions iff those functions can GC and you > > > > need to use the arg after the function returns. > > > > > > Okay. Do you know of any document detailing that? > > > > Does the excerpt below from the Lisp manual answer your concerns? > > It read ambiguously to me. ``Any Lisp object that you intend to refer to > again'' could be one that you intend to refer to in the bodies of the > functions you call. Can someone in the know (Richard?) state a clear rule? I think the ELisp manual should be unequivocal about the GCPRO issue. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 17:49 ` Eli Zaretskii @ 2006-05-10 21:37 ` Luc Teirlinck 2006-05-11 3:45 ` Eli Zaretskii 2006-05-10 21:48 ` Luc Teirlinck ` (2 subsequent siblings) 3 siblings, 1 reply; 202+ messages in thread From: Luc Teirlinck @ 2006-05-10 21:37 UTC (permalink / raw) Cc: kehoea, emacs-devel Eli Zaretskii wrote: > It read ambiguously to me. ``Any Lisp object that you intend to refer to > again'' could be one that you intend to refer to in the bodies of the > functions you call. Can someone in the know (Richard?) state a clear rule? I think the ELisp manual should be unequivocal about the GCPRO issue. Reading in the Elisp manual: This is necessary whenever you call `Feval' or anything that can directly or indirectly call `Feval'. At such a time, any Lisp object that you intend to refer to again must be protected somehow. I have always interpreted this as meaning that as long as Feval is not directly or indirectly called, there is no problem whatsoever. If Feval gets called, directly or indirectly, the memory for the object may have been freed by gc, unless the object is protected some way or the other, for instance by a GCPRO. If the object was not protected in any way, then if after the call to Feval it gets referenced any way whatsoever, directly or in the functions you call, from C or from Lisp, trouble can result because its memory may have been freed. Is there any _other_ way to understand the above quote from the Elisp manual or am I just completely misunderstanding the issue? If the description in the above paragraph would not be accurate, then the text would indeed be very misleading. Sincerely, Luc. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 21:37 ` Luc Teirlinck @ 2006-05-11 3:45 ` Eli Zaretskii 0 siblings, 0 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-11 3:45 UTC (permalink / raw) Cc: kehoea, emacs-devel > Date: Wed, 10 May 2006 16:37:56 -0500 (CDT) > From: Luc Teirlinck <teirllm@dms.auburn.edu> > CC: kehoea@parhasard.net, emacs-devel@gnu.org > > Eli Zaretskii wrote: > > > It read ambiguously to me. ``Any Lisp object that you intend to refer to > > again'' could be one that you intend to refer to in the bodies of the > > functions you call. > > Can someone in the know (Richard?) state a clear rule? I think the > ELisp manual should be unequivocal about the GCPRO issue. > > Reading in the Elisp manual: > > This is necessary whenever you call `Feval' or anything that can > directly or indirectly call `Feval'. At such a time, any Lisp > object that you intend to refer to again must be protected > somehow. > > I have always interpreted this as meaning that as long as Feval is not > directly or indirectly called, there is no problem whatsoever. That is not the issue for which I asked for clarifications. The issue was what ``refer to again'' means, precisely. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 17:49 ` Eli Zaretskii 2006-05-10 21:37 ` Luc Teirlinck @ 2006-05-10 21:48 ` Luc Teirlinck 2006-05-11 1:08 ` Luc Teirlinck 2006-05-11 3:46 ` Richard Stallman 3 siblings, 0 replies; 202+ messages in thread From: Luc Teirlinck @ 2006-05-10 21:48 UTC (permalink / raw) Cc: kehoea, emacs-devel Eli Zaretskii wrote: Can someone in the know (Richard?) state a clear rule? I think the ELisp manual should be unequivocal about the GCPRO issue. I should have mentioned that I do not consider myself as "someone in the know". I just wanted to point out that to me the Elisp manual sounds unequivocal. So, if I am actually wrong, then I believe that there is a real doc problem. If I am right, then I do not believe so, unless somebody points out another plausible way to understand the text. Sincerely, Luc. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 17:49 ` Eli Zaretskii 2006-05-10 21:37 ` Luc Teirlinck 2006-05-10 21:48 ` Luc Teirlinck @ 2006-05-11 1:08 ` Luc Teirlinck 2006-05-11 2:29 ` Luc Teirlinck 2006-05-11 3:46 ` Richard Stallman 3 siblings, 1 reply; 202+ messages in thread From: Luc Teirlinck @ 2006-05-11 1:08 UTC (permalink / raw) Cc: kehoea, emacs-devel Eli Zaretskii wrote: > It read ambiguously to me. ``Any Lisp object that you intend to refer to > again'' could be one that you intend to refer to in the bodies of the > functions you call. Can someone in the know (Richard?) state a clear rule? I think the ELisp manual should be unequivocal about the GCPRO issue. I probably responded too quickly to this and misunderstood the question. I am still not completely sure what the concrete question is, but maybe Richard will understand. In the meantime, after rereading the thread, there are some things that seem confusing to me and maybe my two questions imply your question. My questions concern two responses by Stefan to quotes from Aidan Kehoe: > Fread is intended to be called from Lisp (it’s a subr). Functions called > from Lisp do not need to GCPRO their arguments, because the garbage > collector knows about the arguments, as it knows about all objects > allocated in Lisp. s/called/callable/ Are you sure we have such a convention? `(elisp)Writing Emacs Primitives' discusses writing primitives, gives For as an example, which carefully GCPROs its ARGS argument, then talks about how important GCPROing variables of type Lisp_Object is (if Feval is called and so on...) 
and then states that there is an exception: Lisp primitives that take a variable number of args at the Lisp level (other than special forms) do not need to GCPRO the args they are to receive at the Lisp level: that responsibility rests with their caller, because what is passed as an arg at the C level is a Lisp_Object * pointer to a C vector containing those Lisp args. To me, this leads to the "obvious" conclusion that Lisp primitives can safely forget about GCPROing their args iff (they take a variable number of args and are not special forms). Apparently Aidan Kehoe's assertion that Lisp primitives do not need to GCPRO their args is not fully accurate, because For does. Maybe that is because For is a special form, but if so, this is apparently nowhere pointed out in `(elisp)Writing'. On the other hand, my "obvious" conclusion from reading `(elisp)Writing Emacs Primitives' seems to be wrong too. Both Fdirectory_file_name and Fmake_directory_internal take a fixed number of args, one, `directory', and do not GCPRO it, even though they both call `call2' which calls Feval and they both still refer to their `directory' arg afterwards. > C code that calls functions intended to be called from Lisp is optimistic > at best if, without having checked, it relies on the assumption that that > the arguments to those functions will be GCPROed. As far as I know, the GCPRO convention for arguments is mostly the following: GCPRO args you pass to functions iff those functions can GC and you need to use the arg after the function returns. All C functions that call Fdirectory_file_name or Fmake_directory_internal still use `directory' after those latter functions return and they all GCPRO it. But what if a C function called Fdirectory_file_name or Fmake_directory_internal without using `directory' afterward? Would it need to GCPRO `directory'? To me, the logical answer would seem no, since it is the responsibility of the called function to protect its args. 
Do these two functions implicitly do that by the way the garbage collector is implemented or not? Sincerely, Luc. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 1:08 ` Luc Teirlinck @ 2006-05-11 2:29 ` Luc Teirlinck 0 siblings, 0 replies; 202+ messages in thread From: Luc Teirlinck @ 2006-05-11 2:29 UTC (permalink / raw) Cc: kehoea, eliz, emacs-devel From my previous reply: In the meantime, after rereading the thread, there are some things that seem confusing to me and maybe my two questions imply your question. My questions concern two responses by Stefan to quotes from Aidan Kehoe: Sorry, forget about this and my entire long message. It was silly. I somehow just forgot to see that the calls to call2 in the two primitives I mentioned were in a return statement. So _obviously_ no GCPROing was necessary. I should have paid closer attention. Sorry for the confusion. Sincerely, Luc. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 17:49 ` Eli Zaretskii ` (2 preceding siblings ...) 2006-05-11 1:08 ` Luc Teirlinck @ 2006-05-11 3:46 ` Richard Stallman 3 siblings, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-11 3:46 UTC (permalink / raw) Cc: kehoea, emacs-devel > It read ambiguously to me. ``Any Lisp object that you intend to refer to > again'' could be one that you intend to refer to in the bodies of the > functions you call. Can someone in the know (Richard?) state a clear rule? I think the ELisp manual should be unequivocal about the GCPRO issue. I clarified this. Thanks. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 13:55 ` Stefan Monnier 2006-05-08 14:24 ` Aidan Kehoe @ 2006-05-09 0:36 ` Kenichi Handa 1 sibling, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-09 0:36 UTC (permalink / raw) Cc: kehoea, schwab, emacs-devel In article <jwvwtcwzkn9.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: >> For instance, Fread is called from Fcall_interactively as >> below: >> Lisp_Object tem; >> [...] >> tem = Fread_from_minibuffer (build_string (callint_message), >> Qnil, Qnil, Qnil, Qnil, Qnil, >> Qnil, Qnil); >> if (! STRINGP (tem) || SCHARS (tem) == 0) >> args[i] = Qnil; >> else >> args[i] = Fread (tem); >> In the calling sequence of >> Fread-> read_internal_start->read0->read1, I see no place >> where the original `tem' is GCPROed. Do I overlook >> something? > Why would it need to be protected? it's not used afterwards. It's not used in Fcall_interactively afterwards. So Fcall_interactively doesn't have to protect it. But, read1 or read_escape have to protect the argument READCHARFUN, don't they? --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 6:43 ` Kenichi Handa 2006-05-02 7:00 ` Aidan Kehoe @ 2006-05-02 10:36 ` Eli Zaretskii 2006-05-02 10:59 ` Aidan Kehoe 2006-05-03 2:59 ` Kenichi Handa 1 sibling, 2 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-02 10:36 UTC (permalink / raw) Cc: kehoea, emacs-devel > From: Kenichi Handa <handa@m17n.org> > Date: Tue, 02 May 2006 15:43:16 +0900 > Cc: emacs-devel@gnu.org > > > + lisp_char = call2(intern("decode-char"), intern("ucs"), > > + make_number(i)); > > + > > First of all, is it safe to call Lisp program in > read_escape? Whether it is safe or not, I think it's certainly better to implement the guts of decode-char in C, if it's gonna be called from read_escape. All those guts do is simple arithmetics, which will be much faster in C. Moreover, I think the fact that decode-char uses translation tables to support unify-8859-on-*coding-mode (and thus might produce characters other than mule-unicode-*) could even be a misfeature: do we really want read_escape to produce Unicode or non-Unicode characters when it sees \uNNNN, depending on the current user settings? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 10:36 ` Eli Zaretskii @ 2006-05-02 10:59 ` Aidan Kehoe 2006-05-02 19:26 ` Eli Zaretskii 2006-05-03 2:59 ` Kenichi Handa 1 sibling, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-02 10:59 UTC (permalink / raw) Cc: emacs-devel Ar an dara lá de mí Bealtaine, scríobh Eli Zaretskii: > Whether it is safe or not, I think it's certainly better to implement > the guts of decode-char in C, if it's gonna be called from > read_escape. If it’s only going to be called rarely (twice a file for non-byte-compiled files, at a liberal guess, never for byte-compiled files), and after decode-char is already loaded--both of which are the case--I don’t see the argument for that. > All those guts do is simple arithmetics, which will be much faster in C. > > Moreover, I think the fact that decode-char uses translation tables to > support unify-8859-on-*coding-mode (and thus might produce characters > other than mule-unicode-*) could even be a misfeature: do we really > want read_escape to produce Unicode or non-Unicode characters when it > sees \uNNNN, depending on the current user settings? This is not significantly different from the question “do we really want (decode-char 'ucs #xABCD) to produce Unicode or non-Unicode characters depending on the current user settings?”, since making string escapes inconsistent with the Unicode coding systems does not make any sense. And that question has already been answered. Cf. http://article.gmane.org/gmane.emacs.bugs/3422 . -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 10:59 ` Aidan Kehoe @ 2006-05-02 19:26 ` Eli Zaretskii 0 siblings, 0 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-02 19:26 UTC (permalink / raw) Cc: emacs-devel > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Tue, 2 May 2006 12:59:43 +0200 > Cc: emacs-devel@gnu.org > > > Whether it is safe or not, I think it's certainly better to implement > > the guts of decode-char in C, if it's gonna be called from > > read_escape. > > If it's only going to be called rarely (twice a file for non-byte-compiled > files, at a liberal guess, never for byte-compiled files), and after > decode-char is already loaded--both of which are the case--I don't see the > argument for that. And I don't see why we should assume anything for something as basic as a subroutine of readevalloop. It could be used to read anything, not just .el files. > > Moreover, I think the fact that decode-char uses translation tables to > > support unify-8859-on-*coding-mode (and thus might produce characters > > other than mule-unicode-*) could even be a misfeature: do we really > > want read_escape to produce Unicode or non-Unicode characters when it > > sees \uNNNN, depending on the current user settings? > > This is not significantly different from the question "do we really want > (decode-char 'ucs #xABCD) to produce Unicode or non-Unicode characters > depending on the current user settings?" Maybe it's the same question, but since you are proposing to have decode-char become part of routine reading of Lisp, this feature's impact becomes much more important to discuss. > since making string escapes inconsistent with the Unicode coding > systems does not make any sense. I'm not sure you are right; it should be discussed. > And that question has already been answered. Cf. > http://article.gmane.org/gmane.emacs.bugs/3422 Don't see any answers there about this, perhaps I'm too dumb. 
^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-02 10:36 ` Eli Zaretskii 2006-05-02 10:59 ` Aidan Kehoe @ 2006-05-03 2:59 ` Kenichi Handa 2006-05-03 8:47 ` Eli Zaretskii 1 sibling, 1 reply; 202+ messages in thread From: Kenichi Handa @ 2006-05-03 2:59 UTC (permalink / raw) Cc: kehoea, emacs-devel In article <ufyjsemrn.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes: >> From: Kenichi Handa <handa@m17n.org> >> Date: Tue, 02 May 2006 15:43:16 +0900 >> Cc: emacs-devel@gnu.org >> >> > + lisp_char = call2(intern("decode-char"), intern("ucs"), >> > + make_number(i)); >> > + >> >> First of all, is it safe to call Lisp program in >> read_escape? > Whether it is safe or not, I think it's certainly better to implement > the guts of decode-char in C, if it's gonna be called from > read_escape. All those guts do is simple arithmetics, which will be > much faster in C. I agree. > Moreover, I think the fact that decode-char uses translation tables to > support unify-8859-on-*coding-mode (and thus might produce characters > other than mule-unicode-*) could even be a misfeature: Decode-char doesn't support unify-8859-on-*coding-mode but supports utf-fragment-on-decoding and utf-translate-cjk-mode. > do we really want read_escape to produce Unicode or > non-Unicode characters when it sees \uNNNN, depending on > the current user settings? I think, at least, CJK characters should be decoded into one of CJK charsets because there's no other charsets. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 2:59 ` Kenichi Handa @ 2006-05-03 8:47 ` Eli Zaretskii 2006-05-03 14:21 ` Stefan Monnier 2006-05-04 1:26 ` Kenichi Handa 0 siblings, 2 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-03 8:47 UTC (permalink / raw) Cc: kehoea, emacs-devel > From: Kenichi Handa <handa@m17n.org> > CC: kehoea@parhasard.net, emacs-devel@gnu.org > Date: Wed, 03 May 2006 11:59:52 +0900 > > > Moreover, I think the fact that decode-char uses translation tables to > > support unify-8859-on-*coding-mode (and thus might produce characters > > other than mule-unicode-*) could even be a misfeature: > > Decode-char doesn't support unify-8859-on-*coding-mode but > supports utf-fragment-on-decoding and > utf-translate-cjk-mode. Sorry, I meant utf-fragment-on-decoding, which decodes Cyrillic and Greek into ISO 8859. (I always get confused and lost in the maze of those twisted *-on-decoding passages, all alike.) > > do we really want read_escape to produce Unicode or > > non-Unicode characters when it sees \uNNNN, depending on > > the current user settings? > > I think, at least, CJK characters should be decoded into one > of CJK charsets because there's no other charsets. Right, but what about Cyrillic and Greek? The merits and demerits of depending on utf-fragment-on-decoding are not clear when the Lisp reader is involved. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 8:47 ` Eli Zaretskii @ 2006-05-03 14:21 ` Stefan Monnier 2006-05-03 18:26 ` Eli Zaretskii 2006-05-04 1:33 ` Kenichi Handa 2006-05-04 1:26 ` Kenichi Handa 1 sibling, 2 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-03 14:21 UTC (permalink / raw) Cc: kehoea, emacs-devel, Kenichi Handa >> I think, at least, CJK characters should be decoded into one >> of CJK charsets because there's no other charsets. > Right, but what about Cyrillic and Greek? The merits and demerits of > depending on utf-fragment-on-decoding are not clear when the Lisp > reader is involved. I think we should treat them as much as possible consistently with the rest of the treatment of unicode chars. If we start down the path of "OK, we can do it like this for those chars but not these, oh and as for those ones over there, we'll do it yet some other way", I think we're headed for headaches with no real benefit. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 14:21 ` Stefan Monnier @ 2006-05-03 18:26 ` Eli Zaretskii 2006-05-03 21:12 ` Ken Raeburn 2006-05-04 14:17 ` Richard Stallman 2006-05-04 1:33 ` Kenichi Handa 1 sibling, 2 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-03 18:26 UTC (permalink / raw) Cc: kehoea, emacs-devel, handa > Cc: Kenichi Handa <handa@m17n.org>, kehoea@parhasard.net, > emacs-devel@gnu.org > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Wed, 03 May 2006 10:21:15 -0400 > > > Right, but what about Cyrillic and Greek? The merits and demerits of > > depending on utf-fragment-on-decoding are not clear when the Lisp > > reader is involved. > > I think we should treat them as much as possible consistently with the rest > of the treatment of unicode chars. IIRC, we don't support such a consistency when we load Lisp files, because we don't want loading and byte-compiling to depend on user settings. Of course, the same effect can also be achieved by binding utf-fragment-on-decoding etc. to appropriate values. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 18:26 ` Eli Zaretskii @ 2006-05-03 21:12 ` Ken Raeburn 2006-05-04 14:17 ` Richard Stallman 1 sibling, 0 replies; 202+ messages in thread From: Ken Raeburn @ 2006-05-03 21:12 UTC (permalink / raw) Cc: kehoea, handa, Stefan Monnier, emacs-devel On May 3, 2006, at 14:26, Eli Zaretskii wrote: > IIRC, we don't support such a consistency when we load Lisp files, > because we don't want loading and byte-compiling to depend on user > settings. Not sure if it matters here, but shouldn't eval-last-sexp in a user's text buffer follow the user's (buffer's) settings? Ken ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 18:26 ` Eli Zaretskii 2006-05-03 21:12 ` Ken Raeburn @ 2006-05-04 14:17 ` Richard Stallman 2006-05-04 16:41 ` Aidan Kehoe 1 sibling, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-04 14:17 UTC (permalink / raw) Cc: kehoea, handa, monnier, emacs-devel Regarding \u: the question is whether an Emacs escape for Unicode characters should be compatible with C string syntax for Unicode characters, or coherent with the Emacs \x escape. I think one relevant question is to what extent the C and Emacs Lisp string syntax are compatible in the first place. Emacs Lisp string syntax was largely based on C string syntax in 1984, but I don't know how C has developed since 1990. Can someone report on this question? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 14:17 ` Richard Stallman @ 2006-05-04 16:41 ` Aidan Kehoe 2006-05-05 10:39 ` Eli Zaretskii 2006-05-05 19:05 ` Richard Stallman 0 siblings, 2 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-05-04 16:41 UTC (permalink / raw) Cc: emacs-devel Ar an ceathrú lá de mí Bealtaine, scríobh Richard Stallman: > Regarding \u: the question is whether an Emacs escape for Unicode > characters should be compatible with C string syntax for Unicode > characters, or coherent with the Emacs \x escape. The thing with the Emacs \x escape is that anyone using it for characters outside of ASCII is asking for pain, and always has been. It has only ever been clearly defined for that character set; any existing code in the repository for other characters, for example, _will definitely_ break with the merging of the Unicode branch. Now, there is lots of code in 21.4’s source tree that uses the syntax for things that are conceptually numbers and not Emacs characters. That code is not broken, but it is bad style; that’s what the #x syntax is for. So when people have been using the variable-length syntax with a length greater than two, they are either writing buggy code, or using bad style. I’m not sure that merits emulation. > I think one relevant question is to what extent the C and Emacs Lisp > string syntax are compatible in the first place. Emacs Lisp string > syntax was largely based on C string syntax in 1984, but I don't know > how C has developed since 1990. Can someone report on this question? The \u syntax (with a fixed number of digits) came into wide use with Java in 1996. The necessity for the \U extension arose with progress towards version 3.0 of Unicode and its ~1.1 million available code points. That version of the standard was released in 1999; the C99 ISO standard for C of the same year included both \u and \U. Various other C-oriented programming languages have incorporated the syntax since. 
-- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
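For illustration only, and not part of the thread: a minimal C sketch of the C99 universal character names described above, where \u takes exactly four hex digits and \U exactly eight. It assumes a compiler whose execution character set is UTF-8 (the GCC default); the helper name `encoded_length` is made up for this example.

```c
#include <assert.h>
#include <string.h>

/* C99 universal character names in ordinary string literals.
   Assumption: the execution character set is UTF-8, so each literal
   holds the UTF-8 encoding of the named code point.  */
static const char euro[]  = "\u20AC";      /* U+20AC EURO SIGN */
static const char sha[]   = "\u0448";      /* U+0448 CYRILLIC SMALL LETTER SHA */
static const char alpha[] = "\U0001D6E2";  /* U+1D6E2, above #xFFFF, needs \U */

/* Hypothetical helper: number of bytes the compiler emitted for S,
   not counting the terminating NUL.  */
static size_t
encoded_length (const char *s)
{
  return strlen (s);
}
```

Under the UTF-8 assumption the three literals occupy three, two and four bytes respectively, which is one quick way to confirm that the compiler treated the escapes as code points rather than as raw bytes.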
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 16:41 ` Aidan Kehoe @ 2006-05-05 10:39 ` Eli Zaretskii 2006-05-05 16:35 ` Aidan Kehoe 2006-05-05 19:05 ` Richard Stallman 1 sibling, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-05-05 10:39 UTC (permalink / raw) Cc: rms, emacs-devel > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Thu, 4 May 2006 18:41:17 +0200 > Cc: , emacs-devel@gnu.org > > > I think one relevant question is to what extent the C and Emacs Lisp > > string syntax are compatible in the first place. Emacs Lisp string > > syntax was largely based on C string syntax in 1984, but I don't know > > how C has developed since 1990. Can someone report on this question? > > The \u syntax (with a fixed number of digits) came into wide use with Java > in 1996. The necessity for the \U extension arose with progress towards > version 3.0 of Unicode and its ~1.1 million available code points. That > version of the standard was released in 1999; the C99 ISO standard for C of > the same year included both \u and \U. Various other C-oriented programming > languages have incorporated the syntax since. I think Richard was asking for a simple summary of the current C string syntax, with special emphasis on the standard escapes. \u and \U are only part of the story. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 10:39 ` Eli Zaretskii @ 2006-05-05 16:35 ` Aidan Kehoe 0 siblings, 0 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-05-05 16:35 UTC (permalink / raw) Cc: rms, emacs-devel Ar an cúigiú lá de mí Bealtaine, scríobh Eli Zaretskii: > > > I think one relevant question is to what extent the C and Emacs Lisp > > > string syntax are compatible in the first place. Emacs Lisp string > > > syntax was largely based on C string syntax in 1984, but I don't > > > know how C has developed since 1990. Can someone report on this > > > question? > > > > The \u syntax (with a fixed number of digits) came into wide use with > > Java in 1996. The necessity for the \U extension arose with progress > > towards version 3.0 of Unicode and its ~1.1 million available code > > points. That version of the standard was released in 1999; the C99 ISO > > standard for C of the same year included both \u and \U. Various other > > C-oriented programming languages have incorporated the syntax since. > > I think Richard was asking for a simple summary of the current C > string syntax, with special emphasis on the standard escapes. \u and > \U are only part of the story. Well, I read it as him asking how C has developed since 1990 in its string syntax, and \u and \U are most of that story. Your parse is more reasonable; the question is not clear, though. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 16:41 ` Aidan Kehoe 2006-05-05 10:39 ` Eli Zaretskii @ 2006-05-05 19:05 ` Richard Stallman 2006-05-05 19:20 ` Aidan Kehoe 1 sibling, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-05 19:05 UTC (permalink / raw) Cc: emacs-devel

    > Regarding \u: the question is whether an Emacs escape for Unicode
    > characters should be compatible with C string syntax for Unicode
    > characters, or coherent with the Emacs \x escape.

    The thing with the Emacs \x escape is that anyone using it for
    characters outside of ASCII is asking for pain, and always has been.
    It has only ever been clearly defined for that character set; any
    existing code in the repository for other characters, for example,
    _will definitely_ break with the merging of the Unicode branch.

We are miscommunicating. Whether it is wise to use \x is not the question. The issue I am talking about is that of _coherence_ (parallelism of syntax) between \x and \u.

    > I think one relevant question is to what extent the C and Emacs Lisp
    > string syntax are compatible in the first place. Emacs Lisp string
    > syntax was largely based on C string syntax in 1984, but I don't know
    > how C has developed since 1990. Can someone report on this question?

    The \u syntax (with a fixed number of digits) came into wide use
    with Java in 1996. The necessity for the \U extension arose with
    progress towards version 3.0 of Unicode and its ~1.1 million
    available code points. That version of the standard was released in
    1999; the C99 ISO standard for C of the same year included both \u
    and \U. Various other C-oriented programming languages have
    incorporated the syntax since.

Thank you, but my question here is not about \u. Rather, it is about whether there are OTHER incompatibilities between Emacs Lisp and C string syntax.

I want to see that information before deciding what to do here.

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 19:05 ` Richard Stallman @ 2006-05-05 19:20 ` Aidan Kehoe 2006-05-05 19:57 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-05 19:20 UTC (permalink / raw) Cc: emacs-devel Ar an cúigiú lá de mí Bealtaine, scríobh Richard Stallman: > [...] The thing with the Emacs \x escape is that anyone using it for > characters outside of ASCII is asking for pain, and always has been. > It has only ever been clearly defined for that character set; any > existing code in the repository for other characters, for example, > _will definitely_ break with the merging of the Unicode branch. > > We are miscommunicating. Whether it is wise to use \x is not the > question. The issue I am talking about is that of _coherence_ > (parallelism of syntax) between \x and \u. Indeed. And one of the paragraphs you snipped indicated my doubts as to whether it is wise to be coherent with something that is either bad style or broken. > [...] Thank you, but my question here is not about \u. Rather, it is > about whether there are OTHER incompatibilities between Emacs Lisp and C > string syntax. > > I want to see that information before deciding what to do here. There aren’t, to my knowledge, C is a pretty conservative language. GCC and its conscientious approach to the standards is a big part of why that is so, as I understand it. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 19:20 ` Aidan Kehoe @ 2006-05-05 19:57 ` Aidan Kehoe 2006-05-06 14:25 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-05 19:57 UTC (permalink / raw) Cc: emacs-devel

Ar an cúigiú lá de mí Bealtaine, scríobh Aidan Kehoe:

 > > [...] Thank you, but my question here is not about \u. Rather, it is
 > > about whether there are OTHER incompatibilities between Emacs Lisp and C
 > > string syntax.
 > >
 > > I want to see that information before deciding what to do here.
 >
 > There aren’t, to my knowledge, C is a pretty conservative language. GCC
 > and its conscientious approach to the standards is a big part of why that
 > is so, as I understand it.

Sorry, that is false. There are other incompatibilities:

\d in Emacs Lisp string and character syntax gives \0177
\e in Emacs Lisp string and character syntax gives \033
\M-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x8000000
\S-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x2000000
\H-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x1000000
\A-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x400000
\s-<CHAR> in Emacs Lisp string and character syntax gives CHAR + #x800000
\C-<CHAR> and \^<CHAR> in Emacs Lisp string and character syntax give the
  control version of CHAR, which for non-ASCII characters is CHAR + #x4000000

All of these are incompatibilities on the Emacs Lisp side; except for the Unicode escapes, a C programmer can use any C escape desired in Emacs Lisp.

--
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 19:57 ` Aidan Kehoe @ 2006-05-06 14:25 ` Richard Stallman 2006-05-06 17:26 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-06 14:25 UTC (permalink / raw) Cc: emacs-devel All of these are incompatibilities on the Emacs Lisp side; except for the Unicode escapes, a C programmer can use any C escape desired in Emacs Lisp. That being so, I think it is useful to keep that true, and implement \u and \U in a way that is compatible with C. We could install this now if someone writes changes for etc/NEWS and the Lisp manual, as well as the code. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-06 14:25 ` Richard Stallman @ 2006-05-06 17:26 ` Aidan Kehoe 2006-05-07 5:01 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-06 17:26 UTC (permalink / raw) Cc: emacs-devel

Ar an séiú lá de mí Bealtaine, scríobh Richard Stallman:

 > All of these are incompatibilities on the Emacs Lisp side; except for
 > the Unicode escapes, a C programmer can use any C escape desired in
 > Emacs Lisp.
 >
 > That being so, I think it is useful to keep that true, and implement
 > \u and \U in a way that is compatible with C.
 >
 > We could install this now if someone writes changes for etc/NEWS and
 > the Lisp manual, as well as the code.

Okay. I’ve already signed papers; the patch below includes updates to the NEWS file, the code and the Lisp manual.

One mostly open question, which the below patch takes a clear stand on, is whether it is acceptable to call decode-char (which is implemented in Lisp) from the Lisp reader. I share Stefan Monnier’s judgement on this:

“I'd vote to keep the code in elisp. After all, it's there, it works, and as mentioned: there's no evidence that the decoding time of \u escapes is ever going to need to be fast. And it'll become fast in Emacs-unicode anyway, so it doesn't seem to be worth the trouble.”

I have no objection to implementing decode-char in C in general; it would mean that handle_one_event in xterm.c could be made much more robust, for example. It currently is the case that Unicode keysyms are handled inconsistently with the Unicode coding systems and that code points above #xFFFF are simply dropped; it doesn’t even try to convert them to Emacs characters. But integrating it into Emacs for the sake of this patch seems too much potential instability for too little benefit.

Another thing: if this patch is to be integrated, there is some Lisp in the source tree using \u in strings (incorrectly) that will need to be changed to use \\u.

etc/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* NEWS: Describe the Unicode string and character escape.

lispref/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* objects.texi (Character Type): Describe the Unicode character
	escape syntax; \uABCD or \U00ABCDEF specifies Unicode characters
	U+ABCD and U+ABCDEF respectively.

src/ChangeLog addition:

2006-05-06  Aidan Kehoe  <kehoea@parhasard.net>

	* lread.c (read_escape): Provide a Unicode character escape
	syntax; \u followed by exactly four or \U followed by exactly
	eight hex digits in a character or string is read as a Unicode
	character with that code point.

GNU Emacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi etc/NEWS

Index: etc/NEWS
===================================================================
RCS file: /sources/emacs/emacs/etc/NEWS,v
retrieving revision 1.1337
diff -u -u -r1.1337 NEWS
--- etc/NEWS	2 May 2006 01:47:57 -0000	1.1337
+++ etc/NEWS	6 May 2006 16:57:54 -0000
@@ -3772,6 +3772,13 @@
 been declared obsolete.
 
 +++
+*** New syntax: \uXXXX and \UXXXXXXXX specify Unicode code points in hex.
+Use "\u0428" to specify a string consisting of CYRILLIC CAPITAL LETTER SHA,
+or "\U0001D6E2" to specify one consisting of MATHEMATICAL ITALIC CAPITAL
+ALPHA (the latter is greater than #xFFFF and thus needs the longer
+syntax). Also available for characters.
+
++++
 ** Displaying warnings to the user.
 
 See the functions `warn' and `display-warning', or the Lisp Manual.
Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.53
diff -u -u -r1.53 objects.texi
--- lispref/objects.texi	1 May 2006 15:05:48 -0000	1.53
+++ lispref/objects.texi	6 May 2006 16:57:56 -0000
@@ -431,6 +431,20 @@
 bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
 @end ifnottex
 
+@cindex unicode character escape
+  Emacs provides a syntax for specifying characters by their Unicode code
+points. @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.). There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+  Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @cindex @samp{\} in character constant
 @cindex backslash in character constant
 @cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c	27 Feb 2006 02:04:35 -0000	1.350
+++ src/lread.c	6 May 2006 16:57:57 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u allows up to four hex digits, \U up to eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
         return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: Up to eight hex chars */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape. We only permit them in strings and characters,
+         not arbitrarily in the source code as in some other languages. */
+      {
+        int i = 0;
+        int count = 0;
+        Lisp_Object lisp_char;
+        while (++count <= unicode_hex_count)
+          {
+            c = READCHAR;
+            /* isdigit(), isalpha() may be locale-specific, which we don't
+               want. */
+            if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+            else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+            else
+              {
+                error ("Non-hex digit used for Unicode escape");
+                break;
+              }
+          }
+
+        lisp_char = call2(intern("decode-char"), intern("ucs"),
+                          make_number(i));
+
+        if (EQ(Qnil, lisp_char))
+          {
+            /* This is ugly and horrible and trashes the user's data. */
+            XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+                                       34 + 128, 46 + 128));
+            return i;
+          }
+        else
+          {
+            return XFASTINT (lisp_char);
+          }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);

--
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply [flat|nested] 202+ messages in thread
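For illustration only, and not part of the posted patch: a minimal standalone C sketch of the digit loop in read_escape, detached from the Lisp reader. It reads from a NUL-terminated string instead of READCHAR and returns -1 instead of signalling an error; the names `hex_digit_value` and `parse_unicode_escape` are made up for this example. Like the patch, it avoids the locale-sensitive ctype functions.

```c
#include <assert.h>

/* Return the value of hex digit C, or -1 if C is not a hex digit.
   Plain range comparisons keep the result independent of the locale.  */
static int
hex_digit_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Parse exactly NDIGITS hex digits (4 for \u, 8 for \U) from the
   start of S.  Return the code point, or -1 if S is too short or
   contains a non-hex character within the first NDIGITS positions.  */
static long long
parse_unicode_escape (const char *s, int ndigits)
{
  long long code = 0;

  assert (ndigits == 4 || ndigits == 8);
  for (int k = 0; k < ndigits; k++)
    {
      /* A terminating NUL is not a hex digit, so short input is
         rejected here before any read past the end of S.  */
      int v = hex_digit_value ((unsigned char) s[k]);
      if (v < 0)
        return -1;
      code = (code << 4) + v;
    }
  return code;
}
```

A caller would pass the text that follows the `\u` or `\U`, for example `parse_unicode_escape ("0428", 4)`. Mapping the resulting code point to an Emacs character is the separate step that the patch delegates to decode-char.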
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-06 17:26 ` Aidan Kehoe @ 2006-05-07 5:01 ` Richard Stallman 2006-05-07 6:38 ` Aidan Kehoe 2006-05-07 16:50 ` Aidan Kehoe 0 siblings, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-07 5:01 UTC (permalink / raw) Cc: emacs-devel

    Okay. I've already signed papers;

When was that? The only papers recorded in our file for you are for Gnus.

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 5:01 ` Richard Stallman @ 2006-05-07 6:38 ` Aidan Kehoe 2006-05-07 7:00 ` David Kastrup 2006-05-07 16:50 ` Aidan Kehoe 1 sibling, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-07 6:38 UTC (permalink / raw) Cc: emacs-devel

Ar an seachtú lá de mí Bealtaine, scríobh Richard Stallman:

 > Okay. I've already signed papers;
 >
 > When was that? The only papers recorded in our file you are for Gnus.

Those papers say:

1.(a) Developer hereby agrees to assign and does hereby assign to FSF Developer’s copyright in changes and/or enhancements to the suite of programs known as GNU Emacs (herein called the Program), including any accompanying documentation files and supporting files as well as the actual program code. These changes and/or enhancements are herein called the Works.

(b) The assignment of par. 1(a) above applies to all past and future works of Developer that constitute changes and enhancements to the Program.

[...]

Do you mean to say that agreement does not cover changes in C?

--
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 6:38 ` Aidan Kehoe @ 2006-05-07 7:00 ` David Kastrup 2006-05-07 7:15 ` Aidan Kehoe 0 siblings, 1 reply; 202+ messages in thread From: David Kastrup @ 2006-05-07 7:00 UTC (permalink / raw) Cc: rms, emacs-devel

Aidan Kehoe <kehoea@parhasard.net> writes:

> Ar an seachtú lá de mí Bealtaine, scríobh Richard Stallman:
>
> > Okay. I've already signed papers;
> >
> > When was that? The only papers recorded in our file you are for Gnus.
>
> Those papers say:
>
> 1.(a) Developer hereby agrees to assign and does hereby assign to FSF
> Developer’s copyright in changes and/or enhancements to the suite of programs
> known as GNU Emacs (herein called the Program), including any accompanying
> documentation files and supporting files as well as the actual program code.
> These changes and/or enhancements are herein called the Works.
>
> (b) The assignment of par. 1(a) above applies to all past and future works
> of Developer that constitute changes and enhancements to the Program.
>
> [...]
>
> Do you mean to say that agreement does not cover changes in C?

Oh, it would. But in the record on electronic file, your only listing is under "GNUS". Did you sign several assignments or just one? In either case, there probably has been some oversight by the copyright clerk, or your signed copy did not reach the FSF for some reason.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 7:00 ` David Kastrup @ 2006-05-07 7:15 ` Aidan Kehoe 0 siblings, 0 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-05-07 7:15 UTC (permalink / raw) Cc: rms, emacs-devel

Ar an seachtú lá de mí Bealtaine, scríobh David Kastrup:

 > > Those papers say:
 > >
 > > 1.(a) Developer hereby agrees to assign and does hereby assign to FSF
 > > Developer’s copyright in changes and/or enhancements to the suite of
 > > programs known as GNU Emacs (herein called the Program), including any
 > > accompanying documentation files and supporting files as well as the
 > > actual program code. These changes and/or enhancements are herein
 > > called the Works.
 > >
 > > (b) The assignment of par. 1(a) above applies to all past and future
 > > works of Developer that constitute changes and enhancements to the
 > > Program.
 > >
 > > [...]
 > >
 > > Do you mean to say that agreement does not cover changes in C?
 >
 > Oh, it would. But in the record on electronic file, your only listing
 > is under "GNUS". Did you sign several assignments or just one?

Just one.

 > In either case, there probably has been some oversight by the copyright
 > clerk, or your signed copy did not reach the FSF for some reason.

My signed copy certainly reached the FSF. I copied out the above text from the courtesy copy posted back by them.

--
Aidan Kehoe, http://www.parhasard.net/

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 5:01 ` Richard Stallman 2006-05-07 6:38 ` Aidan Kehoe @ 2006-05-07 16:50 ` Aidan Kehoe 2006-05-08 22:28 ` Richard Stallman 1 sibling, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-07 16:50 UTC (permalink / raw) Cc: emacs-devel On the seventh day of May, Richard Stallman wrote: > Okay. I’ve already signed papers; > > When was that? The only papers recorded in our file for you are for Gnus. To be clearer: I’ve signed a declaration of assignment for Gnus, and that declaration is headed “ASSIGNMENT - GNU Gnus” and contains the following text: 1.(a) Developer hereby agrees to assign and does hereby assign to FSF Developer’s copyright in changes and/or enhancements to the suite of programs known as GNU Emacs (herein called the Program), including any accompanying documentation files and supporting files as well as the actual program code. These changes and/or enhancements are herein called the Works. (b) The assignment of par. 1(a) above applies to all past and future works of Developer that constitute changes and enhancements to the Program. Now, if Gnus is not to be interpreted as one of the “suite of programs known as GNU Emacs,” then I need to sign a separate declaration of assignment for Gnus. In that event, please send me one, by email or by post; my current address is Wisbyerstr. 10c 10439 Berlin Germany Best regards - Aidan -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 16:50 ` Aidan Kehoe @ 2006-05-08 22:28 ` Richard Stallman 0 siblings, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-08 22:28 UTC (permalink / raw) Cc: emacs-devel Gnus IS part of GNU Emacs. So the question is whether your assignment covers only Gnus, or only GNU Emacs. It sounds now like it covers all of Emacs. I will have the clerk check it up. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 14:21 ` Stefan Monnier 2006-05-03 18:26 ` Eli Zaretskii @ 2006-05-04 1:33 ` Kenichi Handa 2006-05-04 8:23 ` Oliver Scholz 2006-05-04 16:32 ` Eli Zaretskii 1 sibling, 2 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-04 1:33 UTC (permalink / raw) Cc: kehoea, eliz, emacs-devel In article <87odyfnqcj.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> I think, at least, CJK characters should be decoded into one >>> of CJK charsets because there's no other charsets. >> Right, but what about Cyrillic and Greek? The merits and demerits of >> depending on utf-fragment-on-decoding are not clear when the Lisp >> reader is involved. > I think we should treat them as much as possible consistently with the rest > of the treatment of unicode chars. If we start down the path of "OK, we can > do it like this for those chars but not these, oh and as for those ones over > there, we'll do it yet some other way", I think we're headed for headaches > with no real benefit. I agree. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 1:33 ` Kenichi Handa @ 2006-05-04 8:23 ` Oliver Scholz 2006-05-04 11:57 ` Kim F. Storm 2006-05-04 16:32 ` Eli Zaretskii 1 sibling, 1 reply; 202+ messages in thread From: Oliver Scholz @ 2006-05-04 8:23 UTC (permalink / raw) For what it's worth, I made a stab at implementing \u analogous to \x---including a port of the core functionality of `decode-char' to C. As for the current discussion: I regard both e.g. \u3b1 and (decode-char 'ucs #x3b1) as a means to say "Give me that abstract character---the greek letter alpha---I don't care about your internal encoding, *just use your defaults*, but give me that character." So, effectively the respective functions should deal with fragmentation and the like. It would matter, for instance, if the fontset specifies different glyphs for the same abstract character depending on the charsets. But I see Eli's point. Ideally, the conversion (to ISO 8859-X) wouldn't take place when reading the string, but when it is displayed/inserted into a buffer. Logically, because that's when the difference between abstract character and internal representation should become effective. Practically, because: if the user loads a Library containing strings with \u escapes (or `decode-char' expressions eval'ed at load-time) and *then* customises the value of `utf-fragment-on-decoding', the change won't affect those characters. However, I believe that this is rather a minor obscurity than a bug; I don't believe that anybody would get bitten by this seriously. 
Oliver Here's the patch, only slightly tested: Index: src/lread.c =================================================================== RCS file: /cvsroot/emacs/emacs/src/lread.c,v retrieving revision 1.350 diff -u -r1.350 lread.c --- src/lread.c 27 Feb 2006 02:04:35 -0000 1.350 +++ src/lread.c 4 May 2006 08:00:53 -0000 @@ -1731,6 +1731,102 @@ return str[0]; } + +#define READ_HEX_ESCAPE(i, c) \ + while (1) \ + { \ + c = READCHAR; \ + if (c >= '0' && c <= '9') \ + { \ + i *= 16; \ + i += c - '0'; \ + } \ + else if ((c >= 'a' && c <= 'f') \ + || (c >= 'A' && c <= 'F')) \ + { \ + i *= 16; \ + if (c >= 'a' && c <= 'f') \ + i += c - 'a' + 10; \ + else \ + i += c - 'A' + 10; \ + } \ + else \ + { \ + UNREAD (c); \ + break; \ + } \ + } + + + +/* Return the internal character coresponding to an UCS code point.*/ + +int +ucs_to_internal (ucs) + int ucs; +{ + int c = 0; + Lisp_Object tmp_char; + + if (! EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-mode")))) + /* cf. `utf-lookup-subst-table-for-decode' */ + { + if (EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-lang-env")))) + call0 (intern ("utf-translate-cjk-load-tables")); + tmp_char = Fgethash (make_number (ucs), + Fget (intern ("utf-subst-table-for-decode"), + intern ("translation-hash-table")), + Qnil); + if (! EQ (Qnil, tmp_char)) + { + CHECK_NUMBER (tmp_char); + c = XFASTINT (tmp_char); + } + } + + if (c) + /* We found the character already in the translation hash table. + Do nothing. 
*/ + ; + else if (ucs < 160) + c = ucs; + else if (ucs < 256) + c = MAKE_CHAR (charset_latin_iso8859_1, ucs, 0); + else if (ucs < 0x2500) + { + ucs -= 0x0100; + c = MAKE_CHAR (charset_mule_unicode_0100_24ff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + else if (ucs < 0x3400) + { + ucs -= 0x2500; + c = MAKE_CHAR (charset_mule_unicode_2500_33ff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + else if ((ucs >= 0xE000) && (ucs < 0x10000)) + { + ucs -= 0xE000; + c = MAKE_CHAR (charset_mule_unicode_e000_ffff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + + if (c) + { + Lisp_Object vect = Fget (intern ("utf-translation-table-for-decode"), + intern ("translation-table")); + tmp_char = Faref (vect, make_number (c)); + if (! EQ (Qnil, tmp_char)) + return XFASTINT (tmp_char); + return c; + } + else error ("Invalid or unsupported UCS character: %x", ucs); +} + + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, store 1 into *BYTEREP. If the escape sequence forces multibyte, store 2 into *BYTEREP. @@ -1879,34 +1975,24 @@ /* A hex escape, as in ANSI C. */ { int i = 0; - while (1) - { - c = READCHAR; - if (c >= '0' && c <= '9') - { - i *= 16; - i += c - '0'; - } - else if ((c >= 'a' && c <= 'f') - || (c >= 'A' && c <= 'F')) - { - i *= 16; - if (c >= 'a' && c <= 'f') - i += c - 'a' + 10; - else - i += c - 'A' + 10; - } - else - { - UNREAD (c); - break; - } - } - + READ_HEX_ESCAPE (i, c); *byterep = 2; return i; } + case 'u': + /* A hexadecimal reference to an UCS character. */ + { + int i = 0; + Lisp_Object lisp_char; + + READ_HEX_ESCAPE (i, c); + *byterep = 2; + + return ucs_to_internal (i); + + } + default: if (BASE_LEADING_CODE_P (c)) c = read_multibyte (c, readcharfun); -- 15 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
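The digit accumulation that READ_HEX_ESCAPE performs in the patch above can be sketched as a standalone C function. This is only an illustration under stated assumptions, not the patch's code: the names `hex_digit_value` and `read_hex_run` are hypothetical, a plain string pointer stands in for READCHAR/UNREAD, and overflow of very long digit runs is not checked.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if C is not a hex digit.
   (Hypothetical helper, not part of lread.c.)  */
static int
hex_digit_value (int c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

/* Accumulate hex digits from S until the first non-digit, the way the
   macro loops until it has to UNREAD a character.  Store the number of
   characters consumed in *CONSUMED and return the accumulated value.
   (Hypothetical helper; a string stands in for the reader's stream.)  */
static unsigned long
read_hex_run (const char *s, size_t *consumed)
{
  unsigned long value = 0;
  size_t n = 0;

  assert (s != NULL && consumed != NULL);
  for (;;)
    {
      int d = hex_digit_value ((unsigned char) s[n]);
      if (d < 0)
        break;                  /* corresponds to UNREAD (c); break; */
      value = value * 16 + (unsigned long) d;
      n++;
    }
  *consumed = n;
  return value;
}
```

Because the run is greedy, any hex-looking letters that follow the intended digits are consumed as well: given the text `3b1beta`, this sketch reads `3b1be` and stops at `t`. The fixed four-or-eight-digit form proposed at the start of the thread avoids that ambiguity.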
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 8:23 ` Oliver Scholz @ 2006-05-04 11:57 ` Kim F. Storm 2006-05-04 12:18 ` Stefan Monnier 2006-05-04 13:07 ` Oliver Scholz 0 siblings, 2 replies; 202+ messages in thread From: Kim F. Storm @ 2006-05-04 11:57 UTC (permalink / raw) Cc: emacs-devel Oliver Scholz <alkibiades@gmx.de> writes: > Here's the patch, only slightly tested: > + > +/* Return the internal character coresponding to an UCS code point.*/ > + > +int > +ucs_to_internal (ucs) > + int ucs; > +{ > + if (! EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-mode")))) > + if (EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-lang-env")))) > + call0 (intern ("utf-translate-cjk-load-tables")); > + Fget (intern ("utf-subst-table-for-decode"), > + intern ("translation-hash-table")), > + Lisp_Object vect = Fget (intern ("utf-translation-table-for-decode"), > + intern ("translation-table")); > +} That's 7 lisp vars accessed from C - for decoding one character!?! How often does this happen? If it is only/primarily used for interactive use, I guess it doesn't matter. Otherwise, I think those vars should be declared in C, to avoid the overhead of interning them at run-time... -- Kim F. Storm <storm@cua.dk> http://www.cua.dk ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 11:57 ` Kim F. Storm @ 2006-05-04 12:18 ` Stefan Monnier 2006-05-04 12:21 ` Kim F. Storm 2006-05-04 16:31 ` Eli Zaretskii 2006-05-04 13:07 ` Oliver Scholz 1 sibling, 2 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-04 12:18 UTC (permalink / raw) Cc: emacs-devel, Oliver Scholz > That's 7 lisp vars accessed from C - for decoding one character!?! > How often does this happen? > If it is only/primarily used for interactive use, I guess it doesn't matter. > Otherwise, I think those vars should be declared in C, to avoid the overhead > of interning them at run-time... I'd vote to keep the code in elisp. After all, it's there, it works, and as mentioned: there's no evidence that the decoding time of \u escapes is ever going to need to be fast. And it'll become fast in Emacs-unicode anyway, so it doesn't seem to be worth the trouble. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 12:18 ` Stefan Monnier @ 2006-05-04 12:21 ` Kim F. Storm 2006-05-04 16:31 ` Eli Zaretskii 1 sibling, 0 replies; 202+ messages in thread From: Kim F. Storm @ 2006-05-04 12:21 UTC (permalink / raw) Cc: Oliver Scholz, emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> That's 7 lisp vars accessed from C - for decoding one character!?! > >> How often does this happen? > >> If it is only/primarily used for interactive use, I guess it doesn't matter. >> Otherwise, I think those vars should be declared in C, to avoid the overhead >> of interning them at run-time... > > I'd vote to keep the code in elisp. After all, it's there, it works, and as > mentioned: there's no evidence that the decoding time of \u escapes it ever > going to need to be fast. And it'll become fast in Emacs-unicode anyway, so > it doesn't seem to be worth the trouble. Ok. Pls. disregard my query. -- Kim F. Storm <storm@cua.dk> http://www.cua.dk ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 12:18 ` Stefan Monnier 2006-05-04 12:21 ` Kim F. Storm @ 2006-05-04 16:31 ` Eli Zaretskii 2006-05-04 21:40 ` Stefan Monnier 1 sibling, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-05-04 16:31 UTC (permalink / raw) Cc: alkibiades, emacs-devel, storm > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Thu, 04 May 2006 08:18:02 -0400 > Cc: emacs-devel@gnu.org, Oliver Scholz <alkibiades@gmx.de> > > I'd vote to keep the code in elisp. And I think it's ugly and hackish to call Lisp from within C code, when all that Lisp does is simple integer arithmetics. IIRC, `decode-char' was originally coded in Lisp because it was added at the last moment before some past release happened. That was cool as long as it was a rarely-used vehicle for converting Unicode codepoints to the Emacs internal representation, but it's certainly NOT cool, IMO, as part of the Lisp reader. > After all, it's there, it works, and as mentioned: there's no > evidence that the decoding time of \u escapes it ever going to need > to be fast. ??? inside the Lisp reader, everything needs to be fast, IMO. > And it'll become fast in Emacs-unicode anyway Which will be when? 5 years from now? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 16:31 ` Eli Zaretskii @ 2006-05-04 21:40 ` Stefan Monnier 2006-05-05 10:25 ` Eli Zaretskii 0 siblings, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-04 21:40 UTC (permalink / raw) Cc: alkibiades, emacs-devel, storm >> After all, it's there, it works, and as mentioned: there's no >> evidence that the decoding time of \u escapes it ever going to need >> to be fast. > ??? inside the Lisp reader, everything needs to be fast, IMO. Why? IMO the only things that need to be fast are those things whose performance has a visible impact. I see no evidence that there'll ever be a case where the speed with which we can read \u escapes will matter. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 21:40 ` Stefan Monnier @ 2006-05-05 10:25 ` Eli Zaretskii 2006-05-05 12:31 ` Oliver Scholz 2006-05-05 13:05 ` Stefan Monnier 0 siblings, 2 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-05 10:25 UTC (permalink / raw) Cc: alkibiades, emacs-devel > Cc: storm@cua.dk, emacs-devel@gnu.org, alkibiades@gmx.de > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Thu, 04 May 2006 17:40:28 -0400 > > >> After all, it's there, it works, and as mentioned: there's no > >> evidence that the decoding time of \u escapes it ever going to need > >> to be fast. > > > ??? inside the Lisp reader, everything needs to be fast, IMO. > > Why? Because the Lisp reader is the backbone of the Lisp interpreter. > IMO the only things that need to be fast are those things whose > performance has a visible impact. I see no evidence that there'll ever be > a case where the speed with which we can read \u escapes will matter. You don't need to see an evidence of a collapsing bridge to know that it must be several times stronger than any imaginable load that could ever be put on it. In other words, not everything is empirical; there's a thing called ``good engineering practice.'' Sorry for being overly didactic, I'm sure you know all that. I'm just amazed that such a fundamental issue needs evidence. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 10:25 ` Eli Zaretskii @ 2006-05-05 12:31 ` Oliver Scholz 2006-05-05 18:08 ` Stuart D. Herring 2006-05-05 13:05 ` Stefan Monnier 1 sibling, 1 reply; 202+ messages in thread From: Oliver Scholz @ 2006-05-05 12:31 UTC (permalink / raw) Eli Zaretskii <eliz@gnu.org> writes: >> Cc: storm@cua.dk, emacs-devel@gnu.org, alkibiades@gmx.de >> From: Stefan Monnier <monnier@iro.umontreal.ca> >> Date: Thu, 04 May 2006 17:40:28 -0400 >> >> >> After all, it's there, it works, and as mentioned: there's no >> >> evidence that the decoding time of \u escapes is ever going to need >> >> to be fast. >> >> > ??? inside the Lisp reader, everything needs to be fast, IMO. >> >> Why? > > Because the Lisp reader is the backbone of the Lisp interpreter. > >> IMO the only things that need to be fast are those things whose >> performance has a visible impact. I see no evidence that there'll ever be >> a case where the speed with which we can read \u escapes will matter. > > You don't need to see an evidence of a collapsing bridge to know that > it must be several times stronger than any imaginable load that could > ever be put on it. > > In other words, not everything is empirical; there's a thing called > ``good engineering practice.'' > > Sorry for being overly didactic, I'm sure you know all that. I'm just > amazed that such a fundamental issue needs evidence. For the sake of peace: my opinion probably doesn't matter much, but personally I believe that the changes necessary for a C implementation of `decode-char' are local enough to be safe. I wouldn't like to change anything in mule.el or utf-8.el---not even a defvar---, but maybe I could just make the necessary symbols available to C in syms_of_lread: Qutf_translate_cjk_mode = intern ("utf-translate-cjk-mode"); staticpro (&Qutf_translate_cjk_mode); [...] 
Qutf_subst_table_for_decode = intern ("utf-subst-table-for-decode"); staticpro (&Qutf_subst_table_for_decode); Qtranslation_hash_table = intern ("translation-hash-table"); staticpro (&Qutf_subst_table_for_decode); And then access them from my port of `decode-char's core functionality like this: SYMBOL_VALUE (Qutf_translate_cjk_mode) [...] Fget (Qutf_subst_table_for_decode, Qtranslation_hash_table) [Well, in fact I already did in my copy of Emacs.] Oliver -- 16 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 12:31 ` Oliver Scholz @ 2006-05-05 18:08 ` Stuart D. Herring 0 siblings, 0 replies; 202+ messages in thread From: Stuart D. Herring @ 2006-05-05 18:08 UTC (permalink / raw) Cc: emacs-devel > syms_of_lread: > > Qutf_translate_cjk_mode = intern ("utf-translate-cjk-mode"); > staticpro (&Qutf_translate_cjk_mode); > > [...] > > Qutf_subst_table_for_decode = intern ("utf-subst-table-for-decode"); > staticpro (&Qutf_subst_table_for_decode); > > Qtranslation_hash_table = intern ("translation-hash-table"); I'd suggest here: >- staticpro (&Qutf_subst_table_for_decode); >+ staticpro (&Qtranslation_hash_table); Davis -- This product is sold by volume, not by mass. If it appears too dense or too sparse, it is because mass-energy conversion has occurred during shipping. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 10:25 ` Eli Zaretskii 2006-05-05 12:31 ` Oliver Scholz @ 2006-05-05 13:05 ` Stefan Monnier 2006-05-05 17:23 ` Oliver Scholz 1 sibling, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-05 13:05 UTC (permalink / raw) Cc: alkibiades, emacs-devel >> > ??? inside the Lisp reader, everything needs to be fast, IMO. >> Why? > Because the Lisp reader is the backbone of the Lisp interpreter. >> IMO the only things that need to be fast are those things whose >> performance has a visible impact. I see no evidence that there'll ever be >> a case where the speed with which we can read \u escapes will matter. > You don't need to see an evidence of a collapsing bridge to know that > it must be several times stronger than any imaginable load that could > ever be put on it. We're talking performance here, not correctness. > In other words, not everything is empirical; there's a thing called > ``good engineering practice.'' And we're talking about a micro-optimization, not an algorithmic optimization, so the only good engineering principle I know in this domain is: don't micro-optimize before you know it's necessary. > Sorry for being overly didactic, I'm sure you know all that. I'm just > amazed that such a fundamental issue needs evidence. I don't need evidence to accept the C code version, but I need such evidence before I can accept "performance" as the motivation for the use of C code rather than elisp. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 13:05 ` Stefan Monnier @ 2006-05-05 17:23 ` Oliver Scholz 0 siblings, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-05 17:23 UTC (permalink / raw) [-- Attachment #1: Type: text/plain, Size: 817 bytes --] For what it's worth, I just tried the attached little stress test on an updated C port of `decode-char' in order to check whether it returns equivalent results. It does. (Well, except intentional differences like that `ucs_to_internal' throws an error where `decode-char' returns nil.) Basically the test runs through all positive integers up to MAX_CHAR and inserts an alist into a temp buffer with each car being the integer and each cdr being a character in the \u syntax (e.g. `?\u3b1'). It then reads that alist again and checks whether `decode-char' on its car is `eq' to its cdr. I tried it with and without `utf-translate-cjk-mode' and with and without `utf-fragment-on-decoding'. Since all tests succeed, ucs_to_internal and `decode-char' are functionally equivalent on all supported characters. 
The test: [-- Attachment #2: ucs-test.el --] [-- Type: application/emacs-lisp, Size: 1517 bytes --] [-- Attachment #3: Type: text/plain, Size: 20 bytes --] The updated patch: [-- Attachment #4: ucs-escapes.diff --] [-- Type: text/plain, Size: 6643 bytes --] Index: src/lread.c =================================================================== RCS file: /cvsroot/emacs/emacs/src/lread.c,v retrieving revision 1.350 diff -u -r1.350 lread.c --- src/lread.c 27 Feb 2006 02:04:35 -0000 1.350 +++ src/lread.c 5 May 2006 17:09:37 -0000 @@ -87,6 +87,9 @@ Lisp_Object Qbackquote, Qcomma, Qcomma_at, Qcomma_dot, Qfunction; Lisp_Object Qinhibit_file_name_operation; Lisp_Object Qeval_buffer_list, Veval_buffer_list; +Lisp_Object Qutf_translate_cjk_mode, Qutf_translate_cjk_lang_env, Qutf_translate_cjk_load_tables; +Lisp_Object Qutf_subst_table_for_decode, Qtranslation_hash_table; +Lisp_Object Qutf_translation_table_for_decode, Qtranslation_table; extern Lisp_Object Qevent_symbol_element_mask; extern Lisp_Object Qfile_exists_p; @@ -1731,6 +1734,110 @@ return str[0]; } + +#define READ_HEX_ESCAPE(i, c) \ + while (1) \ + { \ + c = READCHAR; \ + if (c >= '0' && c <= '9') \ + { \ + i *= 16; \ + i += c - '0'; \ + } \ + else if ((c >= 'a' && c <= 'f') \ + || (c >= 'A' && c <= 'F')) \ + { \ + i *= 16; \ + if (c >= 'a' && c <= 'f') \ + i += c - 'a' + 10; \ + else \ + i += c - 'A' + 10; \ + } \ + else \ + { \ + UNREAD (c); \ + break; \ + } \ + } + + + +/* Return the internal character coresponding to an UCS code point.*/ + +int +ucs_to_internal (ucs) + int ucs; +{ + int c = 0; + Lisp_Object tmp_char; + + if (! EQ (Qnil, SYMBOL_VALUE (Qutf_translate_cjk_mode))) + /* cf. 
`utf-lookup-subst-table-for-decode' */ + { + Lisp_Object hash; + + if (EQ (Qnil, SYMBOL_VALUE (Qutf_translate_cjk_lang_env))) + call0 (Qutf_translate_cjk_load_tables); + + hash = Fget (Qutf_subst_table_for_decode, Qtranslation_hash_table); + + if (HASH_TABLE_P (hash)) + { + tmp_char = Fgethash (make_number (ucs), hash, Qnil); + if (! EQ (Qnil, tmp_char)) + { + CHECK_NUMBER (tmp_char); + c = XFASTINT (tmp_char); + } + } + } + + if (c) + /* We found the character already in the translation hash table. + Do nothing. */ + ; + else if (ucs < 160) + c = ucs; + else if (ucs < 256) + c = MAKE_CHAR (charset_latin_iso8859_1, ucs, 0); + else if (ucs < 0x2500) + { + ucs -= 0x0100; + c = MAKE_CHAR (charset_mule_unicode_0100_24ff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + else if (ucs < 0x3400) + { + ucs -= 0x2500; + c = MAKE_CHAR (charset_mule_unicode_2500_33ff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + else if ((ucs >= 0xE000) && (ucs < 0x10000)) + { + ucs -= 0xE000; + c = MAKE_CHAR (charset_mule_unicode_e000_ffff, + ((ucs / 96) + 32), + ((ucs % 96) + 32)); + } + + if (c || ucs == 0) /* U+0000 is also a valid character. */ + { + Lisp_Object vect = Fget (Qutf_translation_table_for_decode, + Qtranslation_table); + if (CHAR_TABLE_P (vect)) + { + tmp_char = Faref (vect, make_number (c)); + if (! EQ (Qnil, tmp_char)) + return XFASTINT (tmp_char); + } + return c; + } + else error ("Invalid or unsupported UCS character: %x", ucs); +} + + /* Read a \-escape sequence, assuming we already read the `\'. If the escape sequence forces unibyte, store 1 into *BYTEREP. If the escape sequence forces multibyte, store 2 into *BYTEREP. @@ -1879,34 +1986,23 @@ /* A hex escape, as in ANSI C. 
*/ { int i = 0; - while (1) - { - c = READCHAR; - if (c >= '0' && c <= '9') - { - i *= 16; - i += c - '0'; - } - else if ((c >= 'a' && c <= 'f') - || (c >= 'A' && c <= 'F')) - { - i *= 16; - if (c >= 'a' && c <= 'f') - i += c - 'a' + 10; - else - i += c - 'A' + 10; - } - else - { - UNREAD (c); - break; - } - } - + READ_HEX_ESCAPE (i, c); *byterep = 2; return i; } + case 'u': + /* A hexadecimal reference to an UCS character. */ + { + int i = 0; + + READ_HEX_ESCAPE (i, c); + *byterep = 2; + + return ucs_to_internal (i); + + } + default: if (BASE_LEADING_CODE_P (c)) c = read_multibyte (c, readcharfun); @@ -4121,6 +4217,27 @@ Vloads_in_progress = Qnil; staticpro (&Vloads_in_progress); + + Qutf_translate_cjk_mode = intern ("utf-translate-cjk-mode"); + staticpro (&Qutf_translate_cjk_mode); + + Qutf_translate_cjk_lang_env = intern ("utf-translate-cjk-lang-env"); + staticpro (&Qutf_translate_cjk_lang_env); + + Qutf_translate_cjk_load_tables = intern ("utf-translate-cjk-load-tables"); + staticpro (&Qutf_translate_cjk_load_tables); + + Qutf_subst_table_for_decode = intern ("utf-subst-table-for-decode"); + staticpro (&Qutf_subst_table_for_decode); + + Qtranslation_hash_table = intern ("translation-hash-table"); + staticpro (&Qtranslation_hash_table); + + Qutf_translation_table_for_decode = intern ("utf-translation-table-for-decode"); + staticpro (&Qutf_translation_table_for_decode); + + Qtranslation_table = intern ("translation-table"); + staticpro (&Qtranslation_table); } /* arch-tag: a0d02733-0f96-4844-a659-9fd53c4f414d [-- Attachment #5: Type: text/plain, Size: 87 bytes --] Oliver -- 16 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
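The range dispatch and the two position codes computed by ucs_to_internal in the patch above can be worked through in isolation. The sketch below is an assumption-laden illustration, not Emacs code: `classify_ucs`, `struct ucs_position` and the `ucs_range` enumerators are hypothetical names, the translation-table lookups are left out, and the real MAKE_CHAR call is replaced by simply returning the two codes.

```c
#include <assert.h>

/* Which branch of the patch's if/else chain a code point falls into.
   (Hypothetical names, chosen to mirror the charset names in the patch.)  */
enum ucs_range
{
  UCS_BELOW_160,        /* below 160: the code is used as is */
  UCS_LATIN_1,          /* 160..255: charset_latin_iso8859_1 */
  UCS_0100_24FF,        /* charset_mule_unicode_0100_24ff */
  UCS_2500_33FF,        /* charset_mule_unicode_2500_33ff */
  UCS_E000_FFFF,        /* charset_mule_unicode_e000_ffff */
  UCS_UNSUPPORTED       /* no branch matches; the patch signals an error */
};

struct ucs_position
{
  enum ucs_range range;
  int code1;            /* first position code, or the code point itself */
  int code2;            /* second position code, 0 if unused */
};

/* Classify UCS and compute the position codes with the same arithmetic
   as the patch: offset from the range base, divided into rows of 96,
   each code shifted up by 32.  */
static struct ucs_position
classify_ucs (int ucs)
{
  struct ucs_position p = { UCS_UNSUPPORTED, 0, 0 };
  int base;

  assert (ucs >= 0);
  if (ucs < 160)
    {
      p.range = UCS_BELOW_160;  /* includes U+0000, which is valid */
      p.code1 = ucs;
      return p;
    }
  if (ucs < 256)
    {
      p.range = UCS_LATIN_1;
      p.code1 = ucs;
      return p;
    }

  if (ucs < 0x2500)
    {
      p.range = UCS_0100_24FF;
      base = 0x0100;
    }
  else if (ucs < 0x3400)
    {
      p.range = UCS_2500_33FF;
      base = 0x2500;
    }
  else if (ucs >= 0xE000 && ucs < 0x10000)
    {
      p.range = UCS_E000_FFFF;
      base = 0xE000;
    }
  else
    return p;                   /* 0x3400..0xDFFF and above 0xFFFF */

  p.code1 = (ucs - base) / 96 + 32;
  p.code2 = (ucs - base) % 96 + 32;
  return p;
}
```

For U+03B1 (Greek alpha) the offset is 0x3B1 - 0x100 = 689, so the codes are 689 / 96 + 32 = 39 and 689 % 96 + 32 = 49. The largest offset in the first range, 0x24FF - 0x100 = 9215, yields 127 and 127, so both codes stay within 32..127. Code points from 0x3400 to 0xDFFF fall through every branch, which is why the patch depends on the CJK substitution table for them.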
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 11:57 ` Kim F. Storm 2006-05-04 12:18 ` Stefan Monnier @ 2006-05-04 13:07 ` Oliver Scholz 1 sibling, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-04 13:07 UTC (permalink / raw) storm@cua.dk (Kim F. Storm) writes: > Oliver Scholz <alkibiades@gmx.de> writes: > >> Here's the patch, only slightly tested: > >> + >> +/* Return the internal character coresponding to an UCS code point.*/ >> + >> +int >> +ucs_to_internal (ucs) >> + int ucs; >> +{ >> + if (! EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-mode")))) >> + if (EQ (Qnil, SYMBOL_VALUE (intern ("utf-translate-cjk-lang-env")))) >> + call0 (intern ("utf-translate-cjk-load-tables")); >> + Fget (intern ("utf-subst-table-for-decode"), >> + intern ("translation-hash-table")), >> + Lisp_Object vect = Fget (intern ("utf-translation-table-for-decode"), >> + intern ("translation-table")); >> +} > > That's 7 lisp vars accessed from C - for decoding one character!?! Nearly inevitable, if you want to DTRT with CJK. > How often does this happen? Every time a character specified with \u is decoded. The call0, however, probably just once per Emacs session. > If it is only/primarily used for interactive use, I guess it doesn't matter. > Otherwise, I think those vars should be declared in C, to avoid the overhead > of interning them at run-time... I tend to agree; but that is probably too intrusive. Oliver -- 15 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 1:33 ` Kenichi Handa 2006-05-04 8:23 ` Oliver Scholz @ 2006-05-04 16:32 ` Eli Zaretskii 2006-05-04 20:55 ` Aidan Kehoe 2006-05-05 19:05 ` Richard Stallman 1 sibling, 2 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-04 16:32 UTC (permalink / raw) Cc: kehoea, monnier, emacs-devel > From: Kenichi Handa <handa@m17n.org> > Date: Thu, 04 May 2006 10:33:55 +0900 > Cc: kehoea@parhasard.net, eliz@gnu.org, emacs-devel@gnu.org > > > I think we should treat them as much as possible consistently with the rest > > of the treatment of unicode chars. If we start down the path of "OK, we can > > do it like this for those chars but not these, oh and as for those ones over > > there, we'll do it yet some other way", I think we're headed for headaches > > with no real benefit. > > I agree. What happens when a Lisp file is byte-compiled--do we want the result to depend on the local settings? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 16:32 ` Eli Zaretskii @ 2006-05-04 20:55 ` Aidan Kehoe 2006-05-05 9:33 ` Oliver Scholz 2006-05-05 19:05 ` Richard Stallman 1 sibling, 1 reply; 202+ messages in thread From: Aidan Kehoe @ 2006-05-04 20:55 UTC (permalink / raw) Cc: emacs-devel On the fourth day of May, Eli Zaretskii wrote: > > > I think we should treat them as much as possible consistently with > > > the rest of the treatment of unicode chars. If we start down the > > > path of "OK, we can do it like this for those chars but not these, oh > > > and as for those ones over there, we'll do it yet some other way", I > > > think we're headed for headaches with no real benefit. > > > > I agree. > > What happens when a Lisp file is byte-compiled--do we want the result > to depend on the local settings? It does currently, to the extent of local settings preventing successful compilation. Cf. this code (on Unix): (let ((our-test-file-name "/tmp/testing-byte-compile.el")) (let ((coding-system-for-write 'iso-8859-1)) (set-buffer (get-buffer-create our-test-file-name)) (insert (concat ";; -*- coding: utf-8 -*-\n\n" "(require 'cl)\n\n" "(defun describe-our-string ()\n" " (let ((our-char ?" (format "%c%c%c" ?\345 ?\215 ?\227) "))\n" " (message (format \"\%c maps to \%s\n\" our-char " "(split-char our-char)))))\n")) (write-file our-test-file-name nil) (kill-buffer (current-buffer))) (utf-translate-cjk-mode 1) (byte-compile-file our-test-file-name) (load-file (concat our-test-file-name "c")) (describe-our-string) (delete-file (concat our-test-file-name "c")) (utf-translate-cjk-mode 0) ;; The following byte compilation fails for me; error ;; Compiling file /tmp/testing-byte-compile.el at Thu May 4 22:49:00 2006 ;; testing-byte-compile.el:4:1:Error: Invalid read syntax: "?" 
;; (byte-compile-file our-test-file-name) (load-file (concat our-test-file-name "c")) (describe-our-string)) -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
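The three octal bytes \345 \215 \227 that the test above writes into the file form one three-byte UTF-8 sequence. A minimal C sketch of the decoding arithmetic follows; `decode_utf8_3` and `check_example` are hypothetical names introduced only for this illustration, and the function handles the three-byte form alone rather than UTF-8 in general.

```c
#include <assert.h>

/* Decode one three-byte UTF-8 sequence of the form
   1110xxxx 10xxxxxx 10xxxxxx.  Return the code point, or -1 if the
   bytes do not have that form, encode a value below U+0800 (overlong),
   or encode a surrogate.  (Hypothetical helper, not Emacs code.)  */
static long
decode_utf8_3 (unsigned char b0, unsigned char b1, unsigned char b2)
{
  long cp;

  if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
    return -1;

  /* 4 payload bits from the lead byte, 6 from each trailing byte.  */
  cp = ((long) (b0 & 0x0F) << 12)
       | ((long) (b1 & 0x3F) << 6)
       | (long) (b2 & 0x3F);

  if (cp < 0x800)
    return -1;                  /* overlong encoding */
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return -1;                  /* surrogates are not characters */
  return cp;
}

/* The bytes from the test file: octal 345 215 227 is hex E5 8D 97.  */
static void
check_example (void)
{
  assert (decode_utf8_3 (0345, 0215, 0227) == 0x5357);
}
```

The result, U+5357, lies in the CJK Unified Ideographs block (U+4E00 to U+9FFF). That is outside every range the mule-unicode charsets in the patches cover, which is consistent with the read failing once `utf-translate-cjk-mode` is switched off.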
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 20:55 ` Aidan Kehoe @ 2006-05-05 9:33 ` Oliver Scholz 2006-05-05 10:02 ` Oliver Scholz ` (2 more replies) 0 siblings, 3 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-05 9:33 UTC (permalink / raw) Aidan Kehoe <kehoea@parhasard.net> writes: > On the fourth day of May, Eli Zaretskii wrote: [...] > > What happens when a Lisp file is byte-compiled--do we want the result > > to depend on the local settings? Oy! This might be a bit more serious than what I called a "minor obscurity". I guess you have a similar problem when the source file is encoded in either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and Thai?). Unless I am much mistaken, the encoding of the characters in the .elc file would also depend on the value of `utf-fragment-on-decoding'. A difference might be that this is much more obvious in the case of an ISO 8859-[57] encoded file, and that it is more obscure and more likely to cause puzzlement in the case of a couple of letters specified with \u. I have no opinion on how serious that is. One might say that this is just one of the glitches of emacs-mule. Or maybe not. I don't know. At least I don't see a proper solution to this. > It does currently, to the extent of local settings preventing successful > compilation. Cf. this code (on Unix): [...] > (insert (concat > ";; -*- coding: utf-8 -*-\n\n" > "(require 'cl)\n\n" > "(defun describe-our-string ()\n" > " (let ((our-char ?" > (format "%c%c%c" ?\345 ?\215 ?\227) [...] > (utf-translate-cjk-mode 0) [...] > (byte-compile-file our-test-file-name) [...] I am afraid that is not relevant here. This just tells Emacs not to deal with UTF-8 encoded CJK characters and then tells it to deal with such a character. Oliver -- 16 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 9:33 ` Oliver Scholz @ 2006-05-05 10:02 ` Oliver Scholz 2006-05-05 18:33 ` Aidan Kehoe 2006-05-06 14:24 ` Richard Stallman 2 siblings, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-05 10:02 UTC (permalink / raw) Correction. Oliver Scholz <alkibiades@gmx.de> writes: > I guess, you have a similar problem when the source file is encoded in > either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and > Thai?). I meant to say: a source file in a UCS encoding containing characters from the range of ISO 8859-[57]. > Unless I am much mistaken, the encoding of the characters in the > .elc file would also depend on the value of > `utf-fragment-on-decoding'. Oliver -- 16 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 9:33 ` Oliver Scholz 2006-05-05 10:02 ` Oliver Scholz @ 2006-05-05 18:33 ` Aidan Kehoe 2006-05-05 18:42 ` Oliver Scholz 2006-05-05 21:37 ` Eli Zaretskii 2006-05-06 14:24 ` Richard Stallman 2 siblings, 2 replies; 202+ messages in thread From: Aidan Kehoe @ 2006-05-05 18:33 UTC (permalink / raw) Cc: emacs-devel On the fifth day of May, Oliver Scholz wrote: > > > What happens when a Lisp file is byte-compiled--do we want the result > > > to depend on the local settings? > > [...] > > > It does currently, to the extent of local settings preventing successful > > compilation. Cf. this code (on Unix): > > [...] > > (insert (concat > > ";; -*- coding: utf-8 -*-\n\n" > > "(require 'cl)\n\n" > > "(defun describe-our-string ()\n" > > " (let ((our-char ?" > > (format "%c%c%c" ?\345 ?\215 ?\227) > [...] > > (utf-translate-cjk-mode 0) > [...] > > (byte-compile-file our-test-file-name) > [...] > > I am afraid that is not relevant here. This just tells Emacs to not > deal with UTF-8 encoded CJK characters and then tells it to deal with > such a character. It byte-compiles a file, changes a local setting, and byte-compiles the file again with a different result. That is relevant to Eli’s question. -- Aidan Kehoe, http://www.parhasard.net/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 18:33 ` Aidan Kehoe @ 2006-05-05 18:42 ` Oliver Scholz 2006-05-05 21:37 ` Eli Zaretskii 1 sibling, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-05 18:42 UTC (permalink / raw) Aidan Kehoe <kehoea@parhasard.net> writes: > On the fifth day of May, Oliver Scholz wrote: > > > > > What happens when a Lisp file is byte-compiled--do we want the result > > > > to depend on the local settings? > > > > [...] > > > > > It does currently, to the extent of local settings preventing successful > > > compilation. Cf. this code (on Unix): > > > > [...] > > > (insert (concat > > > ";; -*- coding: utf-8 -*-\n\n" > > > "(require 'cl)\n\n" > > > "(defun describe-our-string ()\n" > > > " (let ((our-char ?" > > > (format "%c%c%c" ?\345 ?\215 ?\227) > > [...] > > > (utf-translate-cjk-mode 0) > > [...] > > > (byte-compile-file our-test-file-name) > > [...] > > > > I am afraid that is not relevant here. This just tells Emacs to not > > deal with UTF-8 encoded CJK characters and then tells it to deal with > > such a character. > > It byte-compiles a file, changes a local setting, and byte-compiles the file > again with a different result. That is relevant to Eli’s question. Sure, and I can put (eval-after-load "bytecomp" '(fset 'byte-compile-file (lambda (&rest ignore) (error "lirum larum")))) into my .emacs and byte-compiling will also yield different results depending on local settings. I guess that would also be relevant here. Oliver -- 16 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 18:33 ` Aidan Kehoe 2006-05-05 18:42 ` Oliver Scholz @ 2006-05-05 21:37 ` Eli Zaretskii 1 sibling, 0 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-05 21:37 UTC (permalink / raw) Cc: emacs-devel, alkibiades > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Fri, 5 May 2006 20:33:56 +0200 > Cc: emacs-devel@gnu.org > > > I am afraid that is not relevant here. This just tells Emacs to not > > deal with UTF-8 encoded CJK characters and then tells it to deal with > > such a character. > > It byte-compiles a file, changes a local setting, and byte-compiles the file > again with a different result. That is relevant to Eli's question. It's not necessarily relevant, because I didn't mean theoretical exercises, I meant normal byte-compiling of Lisp files which just happen to have \u escapes in them. Such files usually won't be encoded in some arbitrary encoding. Use of 8-bit \nnn characters is also discouraged due to the ambiguity of their interpretation. Emacs gives us enough rope to hang ourselves, but that doesn't mean we should actually do that whenever we have a few moments of free time ;-) ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 9:33 ` Oliver Scholz 2006-05-05 10:02 ` Oliver Scholz 2006-05-05 18:33 ` Aidan Kehoe @ 2006-05-06 14:24 ` Richard Stallman 2006-05-06 15:01 ` Oliver Scholz [not found] ` <877j4z5had.fsf@gmx.de> 2 siblings, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-06 14:24 UTC (permalink / raw) Cc: emacs-devel I guess, you have a similar problem when the source file is encoded in either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and Thai?). Unless I am much mistaken, the encoding of the characters in the .elc file would also depend on the value of `utf-fragment-on-decoding'. Are you talking about how the compiler would write the .elc file? Or are you talking about how the .elc file would be interpreted? If it is the latter, I don't think so. Fload will disregard this variable because Fload does not do decoding in the usual way. The compiler should output the file in the representation that Fload will read, and it should do so by binding any relevant variables. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-06 14:24 ` Richard Stallman @ 2006-05-06 15:01 ` Oliver Scholz [not found] ` <877j4z5had.fsf@gmx.de> 1 sibling, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-06 15:01 UTC (permalink / raw) Cc: Richard Stallman Richard Stallman <rms@gnu.org> writes: > I guess, you have a similar problem when the source file is encoded in > either ISO 8859-5 or ISO 8859-7 (btw., what about Hebrew, Arabic and > Thai?). Unless I am much mistaken, the encoding of the characters in > the .elc file would also depend on the value of > `utf-fragment-on-decoding'. [Note the correction in a follow-up of mine: I made a mistake in phrasing that paragraph. I am actually talking about Elisp source files encoded in UTF-8 (or another UCS encoding) that contain characters from the repertoire of ISO 8859-5 or ISO 8859-7. There's a similar case with source files encoded in some of the ISO 8859 encodings and `unify-8859-on-decoding-mode', though.] > Are you talking about how the compiler would write the .elc file? > Or are you talking about how the .elc file would be interpreted? It's the former. Meanwhile I have tested it. After some more thought I think that the case of a UTF-8 encoded source file containing characters from the Greek or Cyrillic repertoires is in fact entirely analogous to what would happen with \u. In other words: that particular bug is already there. Oliver, who still thinks that \u and \U are really ugly. -- 17 Floréal an 214 de la Révolution Liberté, Egalité, Fraternité! ^ permalink raw reply [flat|nested] 202+ messages in thread
[parent not found: <877j4z5had.fsf@gmx.de>]
* Re: [PATCH] Unicode Lisp reader escapes [not found] ` <877j4z5had.fsf@gmx.de> @ 2006-05-07 5:00 ` Richard Stallman 2006-05-07 12:38 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-07 5:00 UTC (permalink / raw) Cc: emacs-devel, handa > Are you talking about how the compiler would write the .elc file? > Or are you talking about how the .elc file would be interpreted? It's the former. Meanwhile I have tested it. After some more thought I think that the case of a UTF-8 encoded source file containing characters from the Greek or Cyrillic repertoires is in fact entirely analogous to what would happen with \u. In other words: that particular bug is already there. Handa-san, could you please investigate the bug? Oliver, could you email Handa-san a description of the bug, in the form which already exists? Can you find a test case which fails in the present development sources? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 5:00 ` Richard Stallman @ 2006-05-07 12:38 ` Kenichi Handa 2006-05-07 21:26 ` Oliver Scholz 2006-05-08 7:36 ` Richard Stallman 0 siblings, 2 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-07 12:38 UTC (permalink / raw) Cc: emacs-devel, alkibiades In article <E1FcbO2-0002U6-0r@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: >> Are you talking about how the compiler would write the .elc file? >> Or are you talking about how the .elc file would be interpreted? > It's the former. Meanwhile I have tested it. After some more thought I > think that the case of a UTF-8 encoded source file containing > characters from the Greek or Cyrillic repertoires is in fact entirely > analogous to what would happen with \u. > In other words: that particular bug is already there. > Handa-san, could you please investigate the bug? > Oliver, could you email Handa-san a description of the > bug, in the form which already exists? Can you find a test case > which fails in the present development sources? When you byte-compile an x.el file, the file is first decoded. How it is decoded depends on many things, and thus, of course, the resulting x.elc files differ. If you say that is a bug, I think there's no way to fix it. A very simple test case is this: (progn (let ((str "(setq x \"\300\300\")\n") (coding-system-for-write 'no-conversion)) (write-region str nil "~/test1.el") (write-region str nil "~/test2.el")) (set-language-environment "Latin-1") (byte-compile-file "~/test1.el") (set-language-environment "Japanese") (byte-compile-file "~/test2.el")) Although the source files are exactly the same, the resulting test1.elc contains a string of two Latin-1 characters whereas test2.elc contains a string of a single Japanese character. I hope I misunderstand what you claim here. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
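The ambiguity in the test case above can be seen without byte-compiling anything. The following is only a sketch, assuming a current Unicode-based Emacs rather than the 2006 emacs-mule sources under discussion; the helper `count-decoded-chars' is a hypothetical name, not something from the thread. It decodes the same two raw bytes under two coding systems, which is roughly what happens when an untagged file is read under a Latin-1 versus a Japanese language environment.

```elisp
;; Sketch only: `count-decoded-chars' is a hypothetical helper, not
;; something from the thread or from Emacs itself.
(defun count-decoded-chars (bytes coding-system)
  "Return how many characters BYTES yields when decoded as CODING-SYSTEM."
  (length (decode-coding-string bytes coding-system)))

;; "\300\300" is a unibyte string holding the raw bytes 0xC0 0xC0.
(count-decoded-chars "\300\300" 'iso-latin-1) ; => 2, one character per byte
(count-decoded-chars "\300\300" 'euc-jp)      ; => 1, one two-byte character
```

The bytes on disk are identical in both calls; only the decoder's interpretation differs, which is why the compiled output can differ too.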
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 12:38 ` Kenichi Handa @ 2006-05-07 21:26 ` Oliver Scholz 2006-05-08 1:14 ` Kenichi Handa 2006-05-08 22:29 ` Richard Stallman 2006-05-08 7:36 ` Richard Stallman 1 sibling, 2 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-07 21:26 UTC (permalink / raw) Cc: emacs-devel, rms, alkibiades

Kenichi Handa <handa@m17n.org> writes:

> In article <E1FcbO2-0002U6-0r@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
[...]
> When you byte-compile an x.el file, the file is first
> decoded. How it is decoded depends on many things,
> and thus, of course, the resulting x.elc files differ.

Yes, that's what I meant.

> If you say that is a bug, I think there's no way to fix it.
>
> A very simple test case is this:
>
> (progn
>   (let ((str "(setq x \"\300\300\")\n")
>         (coding-system-for-write 'no-conversion))
>     (write-region str nil "~/test1.el")
>     (write-region str nil "~/test2.el"))
>   (set-language-environment "Latin-1")
>   (byte-compile-file "~/test1.el")
>   (set-language-environment "Japanese")
>   (byte-compile-file "~/test2.el"))

That's not exactly what I meant. This happens basically because Emacs has no indication on how to decode that file properly. Here's a test case for what I had in mind:

(let ((str1 (format "\
;; -*- coding: utf-8 -*-
\(defvar my-string \"The Greek letter alpha: %c\")"
                    (decode-char 'ucs #x3B1)))
      (str2 (format "\
;; -*- coding: iso-8859-7 -*-
\(defvar my-string \"The Greek letter alpha: %c\")"
                    (decode-char 'ucs #x3B1))))
  (let ((coding-system-for-write 'utf-8))
    (write-region str1 nil "~/fragment-test-1.el")
    (write-region str1 nil "~/fragment-test-2.el"))
  (let ((coding-system-for-write 'iso-8859-7))
    (write-region str2 nil "~/unify-test-1.el")
    (write-region str2 nil "~/unify-test-2.el"))
  (unify-8859-on-decoding-mode -1)
  (byte-compile-file "~/unify-test-1.el")    ; ch. 2913 from greek-iso8859-7
  (unify-8859-on-decoding-mode 1)
  (byte-compile-file "~/unify-test-2.el")    ; ch. 332721 from mule-unicode-0100-24ff
  ;; Assuming `utf-fragment-on-decoding' is nil.
  (byte-compile-file "~/fragment-test-1.el") ; ch. 332721 from mule-unicode-0100-24ff
  ;; AFAICS there is no way to change the settings associated with
  ;; `utf-fragment-on-decoding' programmatically. However, the
  ;; following (taken from the variable's `defcustom' declaration)
  ;; should have the same effect as customizing it.
  (progn
    (define-translation-table 'utf-translation-table-for-decode
      utf-fragmentation-table)
    (unless (eq (get 'utf-translation-table-for-encode 'translation-table)
                ucs-mule-to-mule-unicode)
      (define-translation-table 'utf-translation-table-for-encode
        utf-defragmentation-table)))
  (byte-compile-file "~/fragment-test-2.el") ; ch. 2913 from greek-iso8859-7
  )

As Richard wrote, the fix would be to change the settings to their default, unless the files set a specific variable. But given the work this would require and given that the value of changing the defaults is IMO somewhat dubious, you could as well just document it in etc/PROBLEMS.

Oliver -- Oliver Scholz 18 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 21:26 ` Oliver Scholz @ 2006-05-08 1:14 ` Kenichi Handa 2006-05-08 22:29 ` Richard Stallman 1 sibling, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-08 1:14 UTC (permalink / raw) Cc: emacs-devel, rms, alkibiades In article <87irohfrx1.fsf@gmx.de>, Oliver Scholz <alkibiades@gmx.de> writes: >> (progn >> (let ((str "(setq x \"\300\300\")\n") >> (coding-system-for-write 'no-conversion)) >> (write-region str nil "~/test1.el") >> (write-region str nil "~/test2.el")) >> (set-language-environment "Latin-1") >> (byte-compile-file "~/test1.el") >> (set-language-environment "Japanese") >> (byte-compile-file "~/test2.el")) > That's not exactly what I meant. This happens basically because Emacs > has no indication on how to decode that file properly. Here's a test > case for what I had in mind: The underlying problem is the same. In your test case too, even if you add coding: tags, the exact decoding varies depending on many other things, and thus the resulting *.elc files differ. [...] > As Richard wrote, the fix would be to change the settings to > their default, unless the files set a specific variable. Then you'll get different results in these two cases: (1) visit *.el and M-x eval-current-buffer (2) byte-compile *.el and load *.elc. I think that is more like a bug. > But given the work this would require and given that the value of > changing the defaults is IMO somewhat dubious, you could as well > just document it in etc/PROBLEMS. I agree that is the best solution. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 21:26 ` Oliver Scholz 2006-05-08 1:14 ` Kenichi Handa @ 2006-05-08 22:29 ` Richard Stallman 2006-05-09 3:42 ` Eli Zaretskii 2006-05-09 5:13 ` Kenichi Handa 1 sibling, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-08 22:29 UTC (permalink / raw) Cc: alkibiades, emacs-devel, handa As Richard wrote, the fix would be to change the settings to their default, unless the files set a specific variable. But given the work this would require and given that the value of changing the defaults is IMO somewhat dubious, you could as well just document it in etc/PROBLEMS. We seem to be talking about two variables here: unify-8859-on-decoding-mode and utf-fragment-on-decoding. Are there any others involved? I do not know what those variables mean. Do they affect the choice of coding system? Or do they take effect by altering the meaning of a given coding system? If it is the former, the Lisp source file can defend against this problem by specifying coding in the -*- line. We tell people to do this in Lisp source files. If it is the latter, there are two possible solutions: 1. to make the compiler bind these variables to their default values. 2. to tell people that all Lisp files for which this is relevant should specify these variables explicitly. If it is just those two variables, I think #1 is easy and preferable. Are there any other variables for which this arises? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 22:29 ` Richard Stallman @ 2006-05-09 3:42 ` Eli Zaretskii 2006-05-09 20:41 ` Richard Stallman 2006-05-09 5:13 ` Kenichi Handa 1 sibling, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-05-09 3:42 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades > From: Richard Stallman <rms@gnu.org> > Date: Mon, 08 May 2006 18:29:13 -0400 > Cc: alkibiades@gmx.de, emacs-devel@gnu.org, handa@m17n.org > > We seem to be talking about two variables here: > unify-8859-on-decoding-mode and utf-fragment-on-decoding. Are there > any others involved? > > I do not know what those variables mean. Do they affect the > choice of coding system? Or do they take effect by altering > the meaning of a given coding system? They select the target character set. When Emacs decodes text with Latin or Cyrillic or Greek characters, it could produce either Unicode charset or one of the ISO 8859 charsets. These variables control that. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-09 3:42 ` Eli Zaretskii @ 2006-05-09 20:41 ` Richard Stallman 2006-05-09 21:03 ` Stefan Monnier 2006-05-10 3:33 ` Eli Zaretskii 0 siblings, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-09 20:41 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades > I do not know what those variables mean. Do they affect the > choice of coding system? Or do they take effect by altering > the meaning of a given coding system? They select the target character set. When Emacs decodes text with Latin or Cyrillic or Greek characters, it could produce either Unicode charset or one of the ISO 8859 charsets. These variables control that. I cannot determine clearly, from your response, the answer to my questions. Do these variables affect the choice of coding system? Or do they take effect by altering the meaning of a given coding system? I think perhaps you are saying it is the latter, but I am not sure. If it is the latter, perhaps the best solution is to say that every Lisp file should specify these variables, in the -*- line or local variables list, if they affect it. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-09 20:41 ` Richard Stallman @ 2006-05-09 21:03 ` Stefan Monnier 2006-05-10 3:33 ` Eli Zaretskii 1 sibling, 0 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-09 21:03 UTC (permalink / raw) Cc: alkibiades, Eli Zaretskii, handa, emacs-devel >> I do not know what those variables mean. Do they affect the >> choice of coding system? Or do they take effect by altering >> the meaning of a given coding system? > They select the target character set. When Emacs decodes text with > Latin or Cyrillic or Greek characters, it could produce either Unicode > charset or one of the ISO 8859 charsets. These variables control > that. > I cannot determine clearly, from your response, the answer to my > questions. Do these variables affect the choice of coding system? Or > do they take effect by altering the meaning of a given coding system? > I think perhaps you are saying it is the latter, but I am not sure. It is the latter (except that they actually affect pretty much all coding systems). > If it is the latter, perhaps the best solution is to say > that every Lisp file should specify these variables, in the -*- > line or local variables list, if they affect it. But currently those settings are global AFAICT. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-09 20:41 ` Richard Stallman 2006-05-09 21:03 ` Stefan Monnier @ 2006-05-10 3:33 ` Eli Zaretskii 1 sibling, 0 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-05-10 3:33 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades > From: Richard Stallman <rms@gnu.org> > CC: alkibiades@gmx.de, alkibiades@gmx.de, emacs-devel@gnu.org, > handa@m17n.org > Date: Tue, 09 May 2006 16:41:29 -0400 > > > I do not know what those variables mean. Do they affect the > > choice of coding system? Or do they take effect by altering > > the meaning of a given coding system? > > They select the target character set. When Emacs decodes text with > Latin or Cyrillic or Greek characters, it could produce either Unicode > charset or one of the ISO 8859 charsets. These variables control > that. > > I cannot determine clearly, from your response, the answer to my > questions. Do these variables affect the choice of coding system? Or > do they take effect by altering the meaning of a given coding system? > > I think perhaps you are saying it is the latter, but I am not sure. It's certainly not the former. I didn't say it's the latter because ``the meaning of a coding system'' is something I cannot define clearly. Instead, I described what is the precise effect of using these variables. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 22:29 ` Richard Stallman 2006-05-09 3:42 ` Eli Zaretskii @ 2006-05-09 5:13 ` Kenichi Handa 2006-05-10 3:20 ` Richard Stallman 1 sibling, 1 reply; 202+ messages in thread From: Kenichi Handa @ 2006-05-09 5:13 UTC (permalink / raw) Cc: emacs-devel, alkibiades In article <E1FdEE5-0008RE-Tp@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > We seem to be talking about two variables here: > unify-8859-on-decoding-mode and utf-fragment-on-decoding. Their roles are as Eli wrote. > Are there any others involved? utf-translate-cjk-mode also plays a role in decoding utf-*. > I do not know what those variables mean. Do they affect the > choice of coding system? Or do they take effect by altering > the meaning of a given coding system? > If it is the former, the Lisp source file can defend against this > problem by specifying coding in the -*- line. We tell people to do this > in Lisp source files. > If it is the latter, there are two possible solutions: > 1. to make the compiler bind these variables to their default values. > 2. to tell people that all Lisp files for which this is relevant > should specify these variables explicitly. The latter. > If it is just those two variables, I think #1 is easy and preferable. > Are there any other variables for which this arises? Just setting those variables doesn't work; they should be customized. In addition, the default value of utf-translate-cjk-mode is t, and which CJK charsets the Han characters of Unicode are decoded into depends on these: (1) current-language-environment (2) utf-translate-cjk-unicode-range (which also should be customized to take effect), (3) utf-translate-cjk-charsets (4) the contents of the hash table ucs-unicode-to-mule-cjk (a user can freely reflect one's preference on how to decode Unicode characters by modifying this hash table). --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-09 5:13 ` Kenichi Handa @ 2006-05-10 3:20 ` Richard Stallman 2006-05-10 5:37 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-10 3:20 UTC (permalink / raw) Cc: emacs-devel, alkibiades In addition, the default value of utf-translate-cjk-mode is t, and which CJK charsets the Han characters of Unicode are decoded into depends on these: (1) current-language-environment What effect does this have? (Aside from the choice of coding system, that is.) (4) the contents of the hash table ucs-unicode-to-mule-cjk (a user can freely reflect one's preference on how to decode Unicode characters by modifying this hash table). Could you tell me some examples for how users are really expected to use this? Overall: With so many different variables that might affect the reading of these characters, it is just too inconvenient for every file to specify them all. So I think we need a new feature to make that easy to do. Here's one idea. Add a new "variable" `buffer-coding' which is analogous to `coding'. Whereas `coding' specifies the encoding in the file, `buffer-coding' specifies the in-buffer encoding to produce in the buffer. Its value could be a list or plist, which would specify the values of all these many variables. What do you think? If you think this is a good idea, could you try designing the details? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 3:20 ` Richard Stallman @ 2006-05-10 5:37 ` Kenichi Handa 2006-05-10 7:22 ` Stefan Monnier ` (2 more replies) 0 siblings, 3 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-10 5:37 UTC (permalink / raw) Cc: alkibiades, emacs-devel In article <E1FdfFt-0006ux-Pm@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > In addition, the default value of > utf-translate-cjk-mode is t, and which CJK charsets the Han > characters of Unicode are decoded into depends on these: > (1) current-language-environment > What effect does this have? (Aside from the choice of coding system, > that is.) Some Han characters in Unicode can be decoded into several CJK charsets (e.g. chinese-gb2312, chinese-big5-1, japanese-jisx0208). current-language-environment decides which of them to use. > (4) the contents of the hash table ucs-unicode-to-mule-cjk > (a user can freely reflect one's preference on how to decode > Unicode characters by modifying this hash table). > Could you tell me some examples for how users are really expected > to use this? I don't know a concrete example, but I can imagine this. U+9AD9 is a variant of U+9AD8, but japanese-jisx0208 contains only the latter. Actually, none of the legacy CJK charsets contains U+9AD9. But, as it is just a variant of U+9AD8, just for reading, one may want to decode it into japanese-jisx0208. In such a case, one can simply do this: (puthash #x9AD9 ?高 ucs-unicode-to-mule-cjk) > Overall: > With so many different variables that might affect the reading of > these characters, it is just too inconvenient for every file to > specify them all. So I think we need a new feature to make that easy > to do. > Here's one idea. > Add a new "variable" `buffer-coding' which is analogous to `coding'. > Whereas `coding' specifies the encoding in the file, `buffer-coding' > specifies the in-buffer encoding to produce in the buffer. Its value > could be a list or plist, which would specify the values of all these > many variables. > What do you think? If you think this is a good idea, could > you try designing the details? No, it's an incredibly hard and heavy task. When you read utf-8.el and ucs-tables.el, you'll soon realize that. I believe it's just a waste of time to work on such a thing. We have already done lots of workarounds for workarounds for workarounds for not using Unicode internally, but there's a limit. I believe no one is pleased by producing the same *.elc in such a situation. Please accept this problem as a bad feature (not a bug), and write something in etc/PROBLEMS. If not, please decide to shift to emacs-unicode just now. It's the right thing to solve this problem. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
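The `puthash' override described in this message can be tried in isolation. The following is a minimal sketch that uses a fresh, hypothetical table (`my-variant-table') and a hypothetical lookup helper (`my-lookup-variant') instead of the real `ucs-unicode-to-mule-cjk', so it shows only the shape of the idea and says nothing about how the decoder consults its table: map the variant U+9AD9 to U+9AD8, and fall back to the code point itself when no override is registered.

```elisp
;; Hypothetical stand-in for a decoding override table; this is not
;; the real `ucs-unicode-to-mule-cjk'.
(defvar my-variant-table (make-hash-table :test 'eql)
  "Maps a Unicode code point to the character it should be read as.")

(defun my-lookup-variant (code-point)
  "Return the override for CODE-POINT, or CODE-POINT itself if none."
  (gethash code-point my-variant-table code-point))

;; Treat the variant U+9AD9 as U+9AD8.
(puthash #x9AD9 #x9AD8 my-variant-table)

(my-lookup-variant #x9AD9) ; => #x9AD8
(my-lookup-variant #x0041) ; => #x41, no override registered
```

Because such a table is ordinary mutable state in the running session, two users compiling the same file with different table contents would get different results, which is the reproducibility concern raised in the thread.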
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 5:37 ` Kenichi Handa @ 2006-05-10 7:22 ` Stefan Monnier 2006-05-11 3:45 ` Richard Stallman 2006-05-11 3:44 ` Richard Stallman 2006-05-11 3:44 ` Richard Stallman 2 siblings, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-10 7:22 UTC (permalink / raw) Cc: emacs-devel, rms, alkibiades >> What do you think? If you think this is a good idea, could >> you try designing the details? > No, it's an incredibly hard and heavy task. Agreed. It's just a lot of work for very little benefit: people have lived with this problem for a while now and haven't found the workarounds to be serious (basically: don't use utf-8 for those files and don't use unify-8859-on-decoding if you manipulate such files). Such a "feature" would only be an ugly workaround anyway. As for a real fix: fixing it is what emacs-unicode is all about. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 7:22 ` Stefan Monnier @ 2006-05-11 3:45 ` Richard Stallman 2006-05-11 12:41 ` Stefan Monnier 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-11 3:45 UTC (permalink / raw) Cc: alkibiades, emacs-devel, handa It's just a lot of work for very little benefit: people have lived with this problem for a while now and haven't found the workarounds to be serious (basically: don't use utf-8 for those files and don't use unify-8859-on-decoding if you manipulate such files). I don't follow. Are you saying that this problem only occurs for utf-8 encoding? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 3:45 ` Richard Stallman @ 2006-05-11 12:41 ` Stefan Monnier 2006-05-11 12:51 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-11 12:41 UTC (permalink / raw) Cc: alkibiades, emacs-devel, handa > It's just a lot of work for very little benefit: people have lived > with this problem for a while now and haven't found the workarounds to > be serious (basically: don't use utf-8 for those files and don't use > unify-8859-on-decoding if you manipulate such files). > I don't follow. Are you saying that this problem only occurs > for utf-8 encoding? IIRC such tables are used either during encoding to unicode (i.e. if you save as utf-8) or upon decoding (but only if you've enabled unify-8859-on-decoding). My memory is fuzzy, tho. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 12:41 ` Stefan Monnier @ 2006-05-11 12:51 ` Kenichi Handa 2006-05-11 16:46 ` Stefan Monnier 0 siblings, 1 reply; 202+ messages in thread From: Kenichi Handa @ 2006-05-11 12:51 UTC (permalink / raw) Cc: emacs-devel, rms, alkibiades In article <87lkt8ybui.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: >> It's just a lot of work for very little benefit: people have lived >> with this problem for a while now and haven't found the workarounds to >> be serious (basically: don't use utf-8 for those files and don't use >> unify-8859-on-decoding if you manipulate such files). >> I don't follow. Are you saying that this problem only occurs >> for utf-8 encoding? > IIRC such tables are used either during encoding to unicode (i.e. if you > save as utf-8) or upon decoding (but only if you've enabled > unify-8859-on-decoding). My memory is fuzzy, tho. unify-8859-on-decoding-mode affects iso-8859-* coding systems. If it is on, characters in a file of those coding systems are decoded into iso-8859-1 or mule-unicode-0100-24FF. That's the meaning of "unify 8859". --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 12:51 ` Kenichi Handa @ 2006-05-11 16:46 ` Stefan Monnier 0 siblings, 0 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-11 16:46 UTC (permalink / raw) Cc: emacs-devel, rms, alkibiades >>> It's just a lot of work for very little benefit: people have lived >>> with this problem for a while now and haven't found the workarounds to >>> be serious (basically: don't use utf-8 for those files and don't use >>> unify-8859-on-decoding if you manipulate such files). >>> I don't follow. Are you saying that this problem only occurs >>> for utf-8 encoding? >> IIRC such tables are used either during encoding to unicode (i.e. if you >> save as utf-8) or upon decoding (but only if you've enabled >> unify-8859-on-decoding). My memory is fuzzy, tho. > unify-8859-on-decoding-mode affects iso-8859-* coding > systems. If it is on, characters in a file of those coding > systems are decoded into iso-8859-1 or > mule-unicode-0100-24FF. That's the meaning "unify 8859". We seem to be in violent agreement. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 5:37 ` Kenichi Handa 2006-05-10 7:22 ` Stefan Monnier @ 2006-05-11 3:44 ` Richard Stallman 2006-05-11 3:44 ` Richard Stallman 2 siblings, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-11 3:44 UTC (permalink / raw) Cc: alkibiades, emacs-devel Some Han characters in Unicode can be decoded into several CJK charsets (e.g. chinese-gb2312, chinese-big5-1, japanese-jisx0208). current-language-environment decides which of them to use. Through what mechanism does current-language-environment control this decision? Can we make a new variable to control this, and have each language environment set that new variable accordingly? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-10 5:37 ` Kenichi Handa 2006-05-10 7:22 ` Stefan Monnier 2006-05-11 3:44 ` Richard Stallman @ 2006-05-11 3:44 ` Richard Stallman 2006-05-11 7:31 ` Kenichi Handa 2006-05-11 9:44 ` Oliver Scholz 2 siblings, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-11 3:44 UTC (permalink / raw) Cc: alkibiades, emacs-devel > Add a new "variable" `buffer-coding' which is analogous to `coding'. > Whereas `coding' specifies the encoding in the file, `buffer-coding' > specifies the in-buffer encoding to produce in the buffer. Its value > could be a list or plist, which would specify the values of all these > many variables. > What do you think? If you think this is a good idea, could > you try designing the details? No, it's an incredibly hard and heavy task. I am surprised you think so, and this means there is some sort of misunderstanding between us. You've listed around 6 variables that affect the decoding. So it seems to me that if we make a convenient way for each Lisp file to specify those 6 variables, we solve the problem. It looks easy to me. If you think it is difficult, could you explain where the difficulty is? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 3:44 ` Richard Stallman @ 2006-05-11 7:31 ` Kenichi Handa 2006-05-12 4:14 ` Richard Stallman 2006-05-11 9:44 ` Oliver Scholz 1 sibling, 1 reply; 202+ messages in thread From: Kenichi Handa @ 2006-05-11 7:31 UTC (permalink / raw) Cc: emacs-devel, alkibiades In article <E1Fe26h-0007re-Mx@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > You've listed around 6 variables that affect the decoding. So it > seems to me that if we make a convenient way for each Lisp file to > specify those 6 variables, we solve the problem. It looks easy to me. > If you think it is difficult, could you explain where the difficulty > is? I don't know a convenient way to specify values of huge char-tables and hash-tables in each file. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 7:31 ` Kenichi Handa @ 2006-05-12 4:14 ` Richard Stallman 2006-05-12 5:26 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-12 4:14 UTC (permalink / raw) Cc: emacs-devel, alkibiades I don't know a convenient way to specify values of huge char-tables and hash-tables in each file. Obviously we find another way to specify the information. Please try to find a solution; don't give up just because it is nontrivial. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-12 4:14 ` Richard Stallman @ 2006-05-12 5:26 ` Kenichi Handa 2006-05-13 4:52 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Kenichi Handa @ 2006-05-12 5:26 UTC (permalink / raw) Cc: alkibiades, emacs-devel In article <E1FeP3C-0002Jp-JW@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > I don't know a convenient way to specify values of huge > char-tables and hash-tables in each file. > Obviously we find another way to specify the information. > Please try to find a solution; don't give up just because it is > nontrivial. At least you now understand it's not trivial. Why do you think it's worth doing at this stage even if it requires nontrivial work? How about just asking users to use the emacs-mule coding system for *.el files if they want them decoded the same way, independently of the various settings in effect when byte-compiling? Such *.elc files are still loadable by emacs-unicode. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-12 5:26 ` Kenichi Handa @ 2006-05-13 4:52 ` Richard Stallman 2006-05-13 13:25 ` Stefan Monnier 2006-05-15 5:13 ` Kenichi Handa 0 siblings, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-13 4:52 UTC (permalink / raw) Cc: alkibiades, emacs-devel At least you now understand it's not trivial. Why do you think it's worth doing at this stage even if it requires nontrivial work? Because this is a serious cause of unreliability. It is a bug, or something pretty close to a bug. How about just asking users to use emacs-mule coding system for *.el files if they want them decoded the same way independent of various settings on byte-compiling? Maybe that is a good enough solution. Does this solution solve the whole problem? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-13 4:52 ` Richard Stallman @ 2006-05-13 13:25 ` Stefan Monnier 2006-05-13 20:41 ` Richard Stallman 2006-05-15 5:13 ` Kenichi Handa 1 sibling, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-13 13:25 UTC (permalink / raw) Cc: alkibiades, emacs-devel, Kenichi Handa > At least you now understand it's not trivial. Why do you > think it's worth doing at this stage even if it requires > nontrivial work? > Because this is a serious cause of unreliability. I don't see why you'd think so. Could you expand? Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-13 13:25 ` Stefan Monnier @ 2006-05-13 20:41 ` Richard Stallman 2006-05-14 13:32 ` Stefan Monnier 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-13 20:41 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades > Because this is a serious cause of unreliability. I don't see why you'd think so. Could you expand? If Lisp files get executed and compiled in different ways according to the user's settings, this is unreliability of a very bad kind. Handa says that telling people "don't use utf-8" solves the problem. If that is a good solution, I think the problem is solved. Does everyone agree that that solution works? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-13 20:41 ` Richard Stallman @ 2006-05-14 13:32 ` Stefan Monnier 2006-05-14 23:29 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-14 13:32 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades >> Because this is a serious cause of unreliability. > I don't see why you'd think so. Could you expand? > If Lisp files get executed and compiled in different ways > according to the user's settings, this is unreliability of a > very bad kind. In theory I agree. But the problem is fixed in emacs-unicode, there are known workarounds in Emacs-CVS, and fixing it in Emacs-CVS is going to be difficult. > Handa says that telling people "don't use utf-8" solves the problem. Additionally to "don't use unify-8859-on-decoding" which causes similar problems (which we already bumped into a few years ago when we included unify-8859-on-decoding) with iso8859 chars and coding systems like iso-2022. > If that is a good solution, I think the problem is solved. OK, good. > Does everyone agree that that solution works? Don't know about everyone, but at least I do, Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-14 13:32 ` Stefan Monnier @ 2006-05-14 23:29 ` Richard Stallman 2006-05-15 0:55 ` Stefan Monnier 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-14 23:29 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades > Handa says that telling people "don't use utf-8" solves the problem. Additionally to "don't use unify-8859-on-decoding" which causes similar problems (which we already bumped into a few years ago when we included unify-8859-on-decoding) with iso8859 chars and coding systems like iso-2022. There is a way for a Lisp file to specify a coding system which isn't utf-8. Is there a way for a Lisp file to specify that unify-8859-on-decoding should not be used when reading it? If not, maybe we should make one. Here's one idea: if the -*- line specifies `coding' and specifies the mode `emacs-lisp' then force unify-8859-on-decoding to nil for that file. That idea has the advantage that most of the Lisp files where this issue might arise won't need any change in order to be assured of DTRT. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-14 23:29 ` Richard Stallman @ 2006-05-15 0:55 ` Stefan Monnier 2006-05-15 2:49 ` Oliver Scholz 2006-05-15 20:37 ` Richard Stallman 0 siblings, 2 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-15 0:55 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades >> Handa says that telling people "don't use utf-8" solves the problem. > Additionally to "don't use unify-8859-on-decoding" which causes > similar problems (which we already bumped into a few years ago when we > included unify-8859-on-decoding) with iso8859 chars and coding systems > like iso-2022. > There is a way for a Lisp file to specify a coding system which isn't > utf-8. Is there a way for a Lisp file to specify that > unify-8859-on-decoding should not be used when reading it? > If not, maybe we should make one. > Here's one idea: if the -*- line specifies `coding' and specifies > the mode `emacs-lisp' then force unify-8859-on-decoding to nil > for that file. Forcing it to nil for a particular file is maybe too much work to implement compared to the benefit. Maybe an easier solution is to add a file-local variable `no-8859-unification' such that if that file is loaded in an Emacs which is configured to use unify-8859-on-decoding it signals an error. It could then be added to files like ucs-tables.el. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 0:55 ` Stefan Monnier @ 2006-05-15 2:49 ` Oliver Scholz 2006-05-15 3:27 ` Stefan Monnier 2006-05-15 20:37 ` Richard Stallman 2006-05-15 20:37 ` Richard Stallman 1 sibling, 2 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-15 2:49 UTC (permalink / raw) Cc: emacs-devel, rms, handa, alkibiades Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> Handa says that telling people "don't use utf-8" solves the problem. >> Additionally to "don't use unify-8859-on-decoding" which causes >> similar problems (which we already bumped into a few years ago when we >> included unify-8859-on-decoding) with iso8859 chars and coding systems >> like iso-2022. > >> There is a way for a Lisp file to specify a coding system which isn't >> utf-8. Is there a way for a Lisp file to specify that >> unify-8859-on-decoding should not be used when reading it? > >> If not, maybe we should make one. > >> Here's one idea: if the -*- line specifies `coding' and specifies >> the mode `emacs-lisp' then force unify-8859-on-decoding to nil >> for that file. Besides the work already mentioned, this would also require to turn unify-8859-on-decoding-mode into a buffer-local minor mode. Which would require to make the necessary translation tables somehow (!) buffer-local. > Forcing it to nil for a particular file is maybe too much work to implement > compared to th benefit. > Maybe an easier solution is to add a file-local variable > `no-8859-unification' such that if that file is loaded in an Emacs which > is configured to use unify-8859-on-decoding it signals an error. > > It could then be added to files like ucs-tables.el. [Nitpick: ucs-tables.el is encoded in ISO 2022. Most of Emacs' files containing m18n characters are, AFAIK. I don't know the reason. Maybe because it's 7bit, but still ASCII compatible.] 
How about just issuing a warning with the warning message containing a description of the effects and of what to do to change the settings? e.g.:

    (when (and (memq (coding-system-base buffer-file-coding-system)
                     '(mule-utf-8 utf-7 mule-utf-16 ; ...
                       mule-utf-16be-with-signature))
               utf-fragment-on-decoding ; default is nil
               (let ((charsets (find-charset-region (point-min) (point-max))))
                 (or (memq 'greek-iso8859-7 charsets)
                     (memq 'cyrillic-iso8859-5 charsets))))
      (warn "You have enabled ... but this source file contains
    characters from ... Emacs has ... This might or might not be what
    you want ... To restore the defaults do ... bla bla ...
    ... you might want to use `emacs-mule' as coding system for Emacs Lisp
    source files ..."))

And similar for the other cases. [FWIW, I think that `emacs-mule'---as Handa suggested---is a perfectly valid file encoding for Emacs Lisp source files. Since it is, by definition, unambiguous w.r.t. the specified charsets, emacs-mule has none of the problems we are discussing. Of course, Emacs is probably the only text editor that can deal with emacs-mule, but that would hardly matter for Elisp sources. I can think only of two drawbacks: 1. You can't simply insert or attach such files to mail or usenet postings. You have to zip, tar, base64 etc. them first. 2. Specifying particular charsets might exactly *not* be what an author wants. -- Though, the only way to deal with the latter would be to modify the Lisp printer for writing *.elc files so that it escapes non-ascii characters wherever possible with the new \u syntax. This would be another solution to the problem we are discussing.] Oliver -- Oliver Scholz 26 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 2:49 ` Oliver Scholz @ 2006-05-15 3:27 ` Stefan Monnier 2006-05-15 10:20 ` Oliver Scholz 2006-05-15 20:37 ` Richard Stallman 1 sibling, 1 reply; 202+ messages in thread From: Stefan Monnier @ 2006-05-15 3:27 UTC (permalink / raw) Cc: handa, rms, emacs-devel >> Forcing it to nil for a particular file is maybe too much work to implement >> compared to th benefit. >> Maybe an easier solution is to add a file-local variable >> `no-8859-unification' such that if that file is loaded in an Emacs which >> is configured to use unify-8859-on-decoding it signals an error. >> >> It could then be added to files like ucs-tables.el. > [Nitpick: ucs-tables.el is encoded in ISO 2022. Most of Emacs' files > containing m18n characters are, AFAIK. I don't know the reason. Maybe > because it's 7bit, but still ASCII compatible. ] I'm not sure I understand the nitpick: - the reason most files use iso-2022 is because it was the only mildly standard generic encoding well supported by Emacs (utf-8 is slowly getting there, but Emacs-CVS's support for it is still behind). - ucs-tables.el, if saved as utf-8, would not do the same any more: it relies on the various "equivalent" 8859 chars to be distinguished (as is done in iso-2022, and as can't be done in utf-8). That's also why opening it with unify-8859-on-decoding is wrong: you're not looking at the right code any more because you basically get what you'd get if it had been saved in a unified encoding such as utf-8. > How about just issuing a warning with the warning message containing a > description of the effects and of what to do to change the settings? > (warn "You have enabled ... but this source file contains > characters from ... Emacs has ... This might or might not be what > you want ... To restore the defaults do ... bla bla ... > ... 
you might want to use `emacs-mule' as coding system for Emacs Lisp > source files ...")) I'm actually not sure if using emacs-mule instead of iso-2022 helps. It depends on whether or not unify-8859-on-decoding is also applied to emacs-mule "decoding". > Though, the only way to deal with the latter would be to modify the > Lisp printer for writing *.elc files so that it escapes non-ascii > characters whereever possible with the new \u syntax. This would be > another solution to the problem we are discussing.] This would break the compilation of ucs-tables.el. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
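[Illustration of Stefan's point about ucs-tables.el. This is a hedged sketch in Python, not an excerpt from that file. The table contents and the helper save_as_utf8 are hypothetical; they only model what happens when a table that distinguishes "equivalent" 8859 characters is stored in a unified encoding.]

```python
# Hypothetical charset-aware table: (charset, code) -> Unicode code point.
charset_to_ucs = {
    ("latin-iso8859-1", 0xE9): 0x00E9,  # e-acute as a Latin-1 character
    ("latin-iso8859-2", 0xE9): 0x00E9,  # e-acute as a Latin-2 character
    ("latin-iso8859-2", 0xB3): 0x0142,  # l-with-stroke, present only in Latin-2
}


def save_as_utf8(table):
    """Re-key the table by Unicode character, as a unified encoding would."""
    unified = {}
    for (_charset, _code), ucs in table.items():
        unified[chr(ucs)] = ucs
    return unified


unified = save_as_utf8(charset_to_ucs)

# Both e-acute variants encode to the same UTF-8 bytes, so the file on
# disk can no longer say which charset each one came from.
latin1_bytes = b"\xe9".decode("iso8859_1").encode("utf-8")
latin2_bytes = b"\xe9".decode("iso8859_2").encode("utf-8")
print(latin1_bytes == latin2_bytes)
print(len(charset_to_ucs), "entries before,", len(unified), "after")
```

The two distinct keys for e-acute collapse into one, which is why code relying on the distinction cannot survive a round trip through utf-8 or through unification on decoding.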
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 3:27 ` Stefan Monnier @ 2006-05-15 10:20 ` Oliver Scholz 2006-05-15 11:12 ` Oliver Scholz 0 siblings, 1 reply; 202+ messages in thread From: Oliver Scholz @ 2006-05-15 10:20 UTC (permalink / raw) Cc: emacs-devel, rms, handa, Oliver Scholz Stefan Monnier <monnier@iro.umontreal.ca> writes: > I'm not sure I understand the nitpick: [...] That is entirely my mistake. It was late when I wrote the message and my mind was occupied with utf-8 and `utf-fragment-on-decoding'. So I misunderstood you as implying that ucs-tables.el was encoded in UTF-8. > I'm actually not sure if using emacs-mule instead of iso-2022 helps. > It depends on whether or not unify-8859-on-decoding is also applied to > emacs-mule "decoding". It doesn't. decode_coding_emacs_mule in coding.c doesn't refer to Vstandard_translation_table_for_decode at all, which would be necessary for unification. >> Though, the only way to deal with the latter would be to modify the >> Lisp printer for writing *.elc files so that it escapes non-ascii >> characters wherever possible with the new \u syntax. This would be >> another solution to the problem we are discussing.] > > This would break the compilation of ucs-tables.el. Ah, of course, I had not thought about that. Well, there would have to be an exception. I am not saying that this idea of mine is a good idea, though, because I don't know how hairy it is to implement this. IIRC `encode-char' and `decode-char' are not entirely symmetric, that is, there are characters that `encode-char' can encode, but `decode-char' can't decode. IIRC. But it would be the solution that DTRT from the user's point of view. And it *could* be less hairy than any of the other options discussed here, save "use emacs mule!" and "warn/throw an error/document the problem", of course. Oliver -- Oliver Scholz 26 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M.
^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 10:20 ` Oliver Scholz @ 2006-05-15 11:12 ` Oliver Scholz 0 siblings, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-15 11:12 UTC (permalink / raw) Cc: emacs-devel, Stefan Monnier, handa, rms Oliver Scholz <alkibiades@gmx.de> writes: [...] >>> Though, the only way to deal with the latter would be to modify the >>> Lisp printer for writing *.elc files so that it escapes non-ascii >>> characters whereever possible with the new \u syntax. This would be >>> another solution to the problem we are discussing.] [...] > But it would be the solution that DTRT from the user's point of > view. Scrap that. Again, I was only thinking about UCS fragmentation. Sorry. It would *not* DTRT for ISO 8859 encoded files if unification on decoding is OFF (as is the default), since that would unify everything. Oliver -- Oliver Scholz 26 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 2:49 ` Oliver Scholz 2006-05-15 3:27 ` Stefan Monnier @ 2006-05-15 20:37 ` Richard Stallman 2006-05-16 9:49 ` Oliver Scholz 1 sibling, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-15 20:37 UTC (permalink / raw) Cc: emacs-devel, monnier, handa, alkibiades >> Here's one idea: if the -*- line specifies `coding' and specifies >> the mode `emacs-lisp' then force unify-8859-on-decoding to nil >> for that file. Besides the work already mentioned, this would also require to turn unify-8859-on-decoding-mode into a buffer-local minor mode. That is not the only possible implementation mechanism. The commands that read and write the buffer could change it temporarily and change it back. However, it seems like a really bad thing to have a minor mode that CAN'T be buffer-local. Why can't it be? What is the difficulty? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 20:37 ` Richard Stallman @ 2006-05-16 9:49 ` Oliver Scholz 2006-05-16 11:16 ` Kim F. Storm 2006-05-17 3:45 ` Richard Stallman 0 siblings, 2 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-16 9:49 UTC (permalink / raw) Cc: emacs-devel, monnier, handa, Oliver Scholz Richard Stallman <rms@gnu.org> writes: > >> Here's one idea: if the -*- line specifies `coding' and specifies > >> the mode `emacs-lisp' then force unify-8859-on-decoding to nil > >> for that file. > > Besides the work already mentioned, this would also require to turn > unify-8859-on-decoding-mode into a buffer-local minor mode. > > That is not the only possible implementation mechanism. The commands > that read and write the buffer could change it temporarily and change > it back. > > However, it seems like a really bad thing to have a minor mode > that CAN'T be buffer-local. Why can't it be? What is the difficulty? Well, as already mentioned, unification and fragmentation are implemented by means of translation tables. Unification, for instance, for non-CCL decodings happens by means of modifying the parent of the char table in the variable `standard-translation-table-for-decode'. This is accessed as Vstandard_translation_table_for_decode in the various decode_coding_XXX functions, for instance decode_coding_iso2022, which affects many of the ISO 8859 coding systems. I have no idea whether it is simple to make this variable buffer local or not. But, well, it's certainly intrusive to change such things at the very heart and core of Emacs' decoding/encoding apparatus. (And I'd like to second Kenichi Handa here: you'd might like to change to Unicode Emacs *now* rather than making this kind of modification. The Emacs Unicode branch is in sync with the current HEAD. 
Weeding out the remaining coding issues means possibly not much more work and possibly *not* much more destabilization than some of the modifications we are discussing here.) As for CCL-based coding systems, it is even a bit more difficult. CCL coding systems do the translation table lookup in the CCL program (with the CCL command `translate-character'). A named translation table is *not* stored in a variable; it is stored in a `translation-table' symbol property of the translation table's name. The translation table relevant for unification in CCL decoding is `ucs-translation-table-for-decode' (AFAICS only the Cyrillic encodings make use of this). The translation table relevant for fragmentation of UCS coding systems is `utf-translation-table-for-decode'. You'd have to find a way to make *that* buffer local. As for being bad ... no, I don't think that it is bad that those minor modes are global. They are a means to tune some details of Emacs' internal handling of coding systems. *Internal* is the key point here. It is nothing that conceptually relates to a particular file. (This whole issue is something users should IMO not concern themselves with. The benefit of changing the defaults is IMO dubious, anyways. I expect that "unify on decoding of ISO 8859-*" and "fragmentation of UCS" will mostly be abused for dealing with glyph issues -- i.e. something that should be dealt with by adjusting the fontset.) We are discussing a *very* special case here; it affects only Emacs Lisp source files, because compilation of those, so to say, "freezes" the particular settings for unification/fragmentation in the *.elc file. Oliver -- Oliver Scholz 27 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
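[Illustration of the parent-table mechanism Oliver describes for non-CCL decoding. This is a toy model in Python, not the C implementation. The CharTable class, the translate helper and the single table entry are hypothetical; the only thing taken from the message is the idea that unification works by changing the parent of one shared translation table.]

```python
class CharTable:
    """Toy char-table: explicit entries, falling back to an optional parent."""

    def __init__(self, parent=None):
        self.entries = {}
        self.parent = parent

    def lookup(self, char):
        # Walk up the parent chain until an entry is found.
        table = self
        while table is not None:
            if char in table.entries:
                return table.entries[char]
            table = table.parent
        return None


def translate(char, table):
    """Return the translated character, or the input if no entry applies."""
    mapped = table.lookup(char)
    return char if mapped is None else mapped


# Stand-in for the table that holds the unification mappings.
unification = CharTable()
unification.entries[("latin-iso8859-2", 0xE9)] = ("latin-iso8859-1", 0xE9)

# Stand-in for the single, shared standard decode table.
standard_table = CharTable()

e_acute = ("latin-iso8859-2", 0xE9)
before = translate(e_acute, standard_table)

standard_table.parent = unification  # models turning unification on
after = translate(e_acute, standard_table)

standard_table.parent = None  # models turning it off again
restored = translate(e_acute, standard_table)

print(before, after, restored)
```

Because every decoder in this model consults the same standard_table object, flipping its parent changes the result for all callers at once. That shared state is what a per-file or buffer-local setting would have to work around.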
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 9:49 ` Oliver Scholz @ 2006-05-16 11:16 ` Kim F. Storm 2006-05-16 11:39 ` Romain Francoise ` (2 more replies) 2006-05-17 3:45 ` Richard Stallman 1 sibling, 3 replies; 202+ messages in thread From: Kim F. Storm @ 2006-05-16 11:16 UTC (permalink / raw) Cc: handa, rms, monnier, emacs-devel Oliver Scholz <alkibiades@gmx.de> writes: > We are discussing a *very* special case here; it affects only Emacs > Lisp source files, because compilation of those, so to say, "freezes" > the particular settings for unification/fragmentation in the *.elc > file. I really wonder why this has suddenly become such a big issue. In practice, these things have worked fine for ages, so why bother _now_ when we should focus on finalizing the release of 22.1 ? IIRC, the current issue was raised because someone suggested to add \u and \U for unicode to the Lisp reader -- something we have also lived without for ages. I would suggest to leave the entire issue for after the release! Then everything will be solved the right way, as we will migrate to unicode internally for 23.x. -- Kim F. Storm <storm@cua.dk> http://www.cua.dk ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 11:16 ` Kim F. Storm @ 2006-05-16 11:39 ` Romain Francoise 2006-05-16 11:58 ` Oliver Scholz 2006-05-17 3:45 ` Richard Stallman 2 siblings, 0 replies; 202+ messages in thread From: Romain Francoise @ 2006-05-16 11:39 UTC (permalink / raw) Cc: handa, emacs-devel, rms, monnier, Oliver Scholz storm@cua.dk (Kim F. Storm) writes: > I would suggest to leave the entire issue for after the release! I concur. -- Romain Francoise <romain@orebokech.com> | The sea! the sea! the open it's a miracle -- http://orebokech.com/ | sea! The blue, the fresh, the | ever free! --Bryan W. Procter ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 11:16 ` Kim F. Storm 2006-05-16 11:39 ` Romain Francoise @ 2006-05-16 11:58 ` Oliver Scholz 2006-05-16 14:24 ` Kim F. Storm ` (2 more replies) 2006-05-17 3:45 ` Richard Stallman 2 siblings, 3 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-16 11:58 UTC (permalink / raw) Cc: handa, emacs-devel, rms, monnier, Oliver Scholz storm@cua.dk (Kim F. Storm) writes: > Oliver Scholz <alkibiades@gmx.de> writes: > >> We are discussing a *very* special case here; it affects only Emacs >> Lisp source files, because compilation of those, so to say, "freezes" >> the particular settings for unification/fragmentation in the *.elc >> file. > > I really wonder why this has suddenly become such a big issue. > > In practice, these things have worked fine for ages, so why bother _now_ > when we should focus on finalizing the release of 22.1 ? Unification and UCS fragmentation are new in Emacs 22. [...] > I would suggest to leave the entire issue for after the release! I agree, in principle. IIRC, I was the first one here to suggest to just document the issue and be done with it. But documenting it would be a good idea. Something along the lines: "When using non-ASCII characters in Emacs Lisp source files, beware that compilation "freezes" some of your current settings for character unification and/or fragmentation. This might exactly be what you want. But if you compile Emacs Lisp files with the intention to give the compiled files to other users, you should make sure that the following settings are at their default value: ..." Oliver -- Oliver Scholz 27 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 11:58 ` Oliver Scholz @ 2006-05-16 14:24 ` Kim F. Storm 2006-05-17 3:45 ` Richard Stallman 2006-05-17 15:15 ` Stefan Monnier 2 siblings, 0 replies; 202+ messages in thread From: Kim F. Storm @ 2006-05-16 14:24 UTC (permalink / raw) Cc: emacs-devel, rms, monnier, handa Oliver Scholz <alkibiades@gmx.de> writes: > storm@cua.dk (Kim F. Storm) writes: > >> Oliver Scholz <alkibiades@gmx.de> writes: >> >>> We are discussing a *very* special case here; it affects only Emacs >>> Lisp source files, because compilation of those, so to say, "freezes" >>> the particular settings for unification/fragmentation in the *.elc >>> file. >> >> I really wonder why this has suddenly become such a big issue. >> >> In practice, these things have worked fine for ages, so why bother _now_ >> when we should focus on finalizing the release of 22.1 ? > > Unification and UCS fragmentation are new in Emacs 22. But Emacs 22 already has a large user community, and I don't recall anyone actually complaining about it! > > [...] >> I would suggest to leave the entire issue for after the release! > > I agree, in principle. IIRC, I was the first one here to suggest to > just document the issue and be done with it. But documenting it would > be a good idea. Indeed. > Something along the lines: "When using non-ASCII > characters in Emacs Lisp source files, beware that compilation > "freezes" some of your current settings for character unification > and/or fragmentation. This might exactly be what you want. But if you > compile Emacs Lisp files with the intention to give the compiled files > to other users, you should make sure that the following settings are > at their default value: ..." Is this a problem for any of the Lisp files included in CVS Emacs? Then it could be a problem wrt distributing pre-built versions.
Otherwise, I still think it is better left alone -- and documented as you suggested (or simply advise against distributing such files to anybody and let them compile the files themselves). Could the byte-compiler warn if it encounters a non-ascii character that may cause problems? -- Kim F. Storm <storm@cua.dk> http://www.cua.dk ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 11:58 ` Oliver Scholz 2006-05-16 14:24 ` Kim F. Storm @ 2006-05-17 3:45 ` Richard Stallman 2006-05-17 8:37 ` Oliver Scholz ` (2 more replies) 2006-05-17 15:15 ` Stefan Monnier 2 siblings, 3 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-17 3:45 UTC (permalink / raw) Cc: alkibiades, handa, emacs-devel, monnier, storm I agree, in principle. IIRC, I was the first one here to suggest to just document the issue and be done with it. But documenting it would be a good idea. Something along the lines: "When using non-ASCII characters in Emacs Lisp source files, beware that compilation "freezes" some of your current settings for character unification and/or fragmentation. I want to fix this bug, not document it. As far as I can see, people are overestimating the difficulty of fixing it, by focusing on certain approaches (which are difficult) rather than looking for the ways that are easy. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-17 3:45 ` Richard Stallman @ 2006-05-17 8:37 ` Oliver Scholz 2006-05-17 20:09 ` Richard Stallman 2006-05-17 12:37 ` Oliver Scholz 2006-05-18 1:09 ` Kenichi Handa 2 siblings, 1 reply; 202+ messages in thread From: Oliver Scholz @ 2006-05-17 8:37 UTC (permalink / raw) Cc: storm, emacs-devel, monnier, handa, Oliver Scholz Richard Stallman <rms@gnu.org> writes: [...] > I want to fix this bug, not document it. > > As far as I can see, people are overestimating the difficulty of > fixing it, by focusing on certain approaches (which are difficult) > rather than looking for the ways that are easy. Well, the solution that is both the easiest and the *cleanest* is to remove `unify-8859-on-decoding-mode' and `utf-fragment-on-decoding'. I am not kidding. I don't see the need for having those. Oliver -- Oliver Scholz 28 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-17 8:37 ` Oliver Scholz @ 2006-05-17 20:09 ` Richard Stallman 0 siblings, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-17 20:09 UTC (permalink / raw) Cc: storm, emacs-devel, monnier, handa, alkibiades Well, the solution that is both the easiest and the *cleanest* is to remove `unify-8859-on-decoding-mode' and `utf-fragment-on-decoding'. It looks like `utf-fragment-on-decoding' is not relevant to the issue of making Lisp files encoded in iso-2022 reliable. So we can forget about that. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17 3:45 ` Richard Stallman
  2006-05-17 8:37 ` Oliver Scholz
@ 2006-05-17 12:37 ` Oliver Scholz
  2006-05-19 2:05 ` Richard Stallman
  2006-05-18 1:09 ` Kenichi Handa
  2 siblings, 1 reply; 202+ messages in thread
From: Oliver Scholz @ 2006-05-17 12:37 UTC (permalink / raw)
  Cc: storm, emacs-devel, monnier, handa, Oliver Scholz

Growing a bit tired of this discussion, I hacked a kludge that might
do what you want. It introduces a variable
`byte-compile-no-char-translation' that is meant to be put into the
Local Variables section of an Emacs Lisp source file in order to
inhibit the effects of `utf-fragment-on-decoding' and
`unify-8859-on-decoding-mode'.

In other words: This patch deals only with the issues that *I* can
understand. I seem to recall that Handa also mentioned some effects of
certain CJK language environments.

It is *absolutely vital* that Kenichi Handa reviews this patch. I am
not entirely sure whether this breaks something or not.

With my patch, in decode_coding_iso2022 looking up characters in
Vstandard_translation_table_for_decode is inhibited entirely if
`byte-compile-no-char-translation' is non-nil. This might be wrong.
Vstandard_translation_table_for_decode is not empty by default. I
guess instead of inhibiting its use one could just temporarily set its
parent at about the same place. But maybe this is unnecessary.

decode_coding_sjis_big5 refers to
Vstandard_translation_table_for_decode, too. I did not modify it,
though, thus introducing a possible inconsistency. The reason is that
I don't understand CJK issues and I don't understand this encoding.

Note: Even with the remaining issues weeded out, IMNSHO this patch is
worse than the two other solutions

(1) Tell users to use emacs-mule. Or:

(2) Remove `unify-8859-on-decoding-mode' and
    `utf-fragment-on-decoding'.

The reasoning goes as follows:

Check: Are `unify-8859-on-decoding-mode' and
`utf-fragment-on-decoding' useful options?

If no: Remove them, since they cause only trouble.

If yes: then a user who set them will want them for all affected
characters. The choice for unification/fragmentation should not be the
choice of the programmer of the Lisp package; it should be the choice
of the user. (To quote a future user, complaining on gnu-emacs-help:
"The heck! Why do I have only hollow boxes for my Greek characters
after byte compilation??? It's all fine in the source file!!!")

Exception: In the event that the particular choice of charsets is
important for a Lisp Package: Use `emacs-mule'!

    Oliver

Index: lisp/files.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/files.el,v
retrieving revision 1.836
diff -u -r1.836 files.el
--- lisp/files.el 16 May 2006 18:33:31 -0000 1.836
+++ lisp/files.el 17 May 2006 12:08:43 -0000
@@ -2361,6 +2361,7 @@
     (left-margin . integerp) ;; C source code
     (no-update-autoloads . booleanp)
     (tab-width . integerp) ;; C source code
+    (byte-compile-no-char-translation . booleanp) ;; C source code
     (truncate-lines . booleanp))) ;; C source code
 
 (put 'c-set-style 'safe-local-eval-function t)
Index: lisp/emacs-lisp/bytecomp.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/emacs-lisp/bytecomp.el,v
retrieving revision 2.185
diff -u -r2.185 bytecomp.el
--- lisp/emacs-lisp/bytecomp.el 16 May 2006 10:05:09 -0000 2.185
+++ lisp/emacs-lisp/bytecomp.el 17 May 2006 12:08:45 -0000
@@ -1673,6 +1673,14 @@
             (enable-local-eval nil))
         ;; Arg of t means don't alter enable-local-variables.
         (normal-mode t)
+
+        ;; KLUDGE: `byte-compile-no-char-translation' should affect
+        ;; how characters are decoded. But at this point decoding
+        ;; already happened. So we insert the file contents again.
+        (when byte-compile-no-char-translation
+          (erase-buffer)
+          (insert-file-contents filename))
+
+
         (setq filename buffer-file-name))
       ;; Set the default directory, in case an eval-when-compile uses it.
       (setq default-directory (file-name-directory filename)))
Index: src/coding.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/coding.c,v
retrieving revision 1.336
diff -u -r1.336 coding.c
--- src/coding.c 8 May 2006 05:25:02 -0000 1.336
+++ src/coding.c 17 May 2006 12:08:50 -0000
@@ -405,6 +405,15 @@
 
 Lisp_Object Qcoding_system_p, Qcoding_system_error;
 
+/* This variable is meant to turn off character translation during byte
+   compilation. */
+
+Lisp_Object Vbyte_compile_no_char_translation;
+
+Lisp_Object empty_translation_table;
+Lisp_Object Qucs_translation_table_for_decode, Qutf_translation_table_for_decode;
+Lisp_Object Qunify_8859_on_decoding_mode, Qutf_fragment_on_decoding;
+
 /* Coding system emacs-mule and raw-text are for converting only
    end-of-line format. */
 Lisp_Object Qemacs_mule, Qraw_text;
@@ -1849,7 +1858,7 @@
   else
     {
       translation_table = coding->translation_table_for_decode;
-      if (NILP (translation_table))
+      if (NILP (translation_table) && NILP (Vbyte_compile_no_char_translation))
         translation_table = Vstandard_translation_table_for_decode;
     }
 
@@ -4938,8 +4947,48 @@
           dst_bytes--;
           extra = coding->spec.ccl.cr_carryover;
         }
-      ccl_coding_driver (coding, source, destination + extra,
-                         src_bytes, dst_bytes, 0);
+
+      /* KLUDGE: Inhibit unification and/or fragmentation. This is
+         meant for byte compiling Emacs Lisp source files. For CCL
+         based coding systems it has to be done here, because we want
+         it only for decoding. We temporarily swap the affected
+         translation tables in Vtranslation_table_vector with an empty
+         translation table. */
+      if (! NILP (Vbyte_compile_no_char_translation)
+          && (! NILP (SYMBOL_VALUE (Qunify_8859_on_decoding_mode))
+              || ! NILP (SYMBOL_VALUE (Qutf_fragment_on_decoding))))
+        {
+          if (NILP (empty_translation_table))
+            {
+              empty_translation_table =
+                call0 (intern ("make-translation-table"));
+            }
+
+          Lisp_Object ucs_tt = Fget (Qucs_translation_table_for_decode, Qtranslation_table);
+          Lisp_Object ucs_id = Fget (Qucs_translation_table_for_decode, Qtranslation_table_id);
+
+          Lisp_Object utf_tt = Fget (Qutf_translation_table_for_decode, Qtranslation_table);
+          Lisp_Object utf_id = Fget (Qutf_translation_table_for_decode, Qtranslation_table_id);
+
+          /* Should this be `unwind-protect'ed? */
+
+          Faset (Vtranslation_table_vector, ucs_id, Fcons (Qucs_translation_table_for_decode,
+                                                           empty_translation_table));
+          Faset (Vtranslation_table_vector, utf_id, Fcons (Qutf_translation_table_for_decode,
+                                                           empty_translation_table));
+
+          ccl_coding_driver (coding, source, destination + extra,
+                             src_bytes, dst_bytes, 0);
+
+          Faset (Vtranslation_table_vector, ucs_id, Fcons (Qucs_translation_table_for_decode,
+                                                           ucs_tt));
+          Faset (Vtranslation_table_vector, utf_id, Fcons (Qutf_translation_table_for_decode,
+                                                           utf_tt));
+
+        }
+      else ccl_coding_driver (coding, source, destination + extra,
+                              src_bytes, dst_bytes, 0);
+
+
       if (coding->eol_type != CODING_EOL_LF)
         {
           coding->produced += extra;
@@ -7852,6 +7901,34 @@
   defsubr (&Sset_coding_priority_internal);
   defsubr (&Sdefine_coding_system_internal);
 
+  DEFVAR_LISP ("byte-compile-no-char-translation", &Vbyte_compile_no_char_translation,
+               doc: /* Don't translate characters during byte compilation.
+
+Options like `utf-fragment-on-decoding' or the minor mode
+`unify-8859-on-decoding-mode' modify the way Emacs maps file encodings
+to mule charsets. Since *.elc files are encoded in emacs-mule, such
+settings are preserved in the compiled file. If this variable is
+non-nil, Emacs uses the default mule charsets.
+
+You can set this variable in the local variables section of a file. */);
+  Vbyte_compile_no_char_translation = Qnil;
+
+  empty_translation_table = Qnil;
+  staticpro (&empty_translation_table);
+
+  Qucs_translation_table_for_decode = intern ("ucs-translation-table-for-decode");
+  staticpro (&Qucs_translation_table_for_decode);
+
+  Qutf_translation_table_for_decode = intern ("utf-translation-table-for-decode");
+  staticpro (&Qutf_translation_table_for_decode);
+
+  Qunify_8859_on_decoding_mode = intern ("unify-8859-on-decoding-mode");
+  staticpro (&Qunify_8859_on_decoding_mode);
+
+  Qutf_fragment_on_decoding = intern ("utf-fragment-on-decoding");
+  staticpro (&Qutf_fragment_on_decoding);
+
+
   DEFVAR_LISP ("coding-system-list", &Vcoding_system_list,
                doc: /* List of coding systems.

--
Oliver Scholz               28 Floréal an 214 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.

^ permalink raw reply [flat|nested] 202+ messages in thread
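The patch above asks whether the temporary table swap around ccl_coding_driver should be `unwind-protect'ed. The concern can be illustrated outside Emacs with a small Python sketch; everything here (`tables_disabled`, `decode`, the list-of-dicts table vector) is a made-up stand-in for Vtranslation_table_vector and the CCL driver, not Emacs code. The point is only that the restore step sits in a cleanup clause, so the user's tables come back even if decoding fails halfway.

```python
from contextlib import contextmanager


@contextmanager
def tables_disabled(table_vector, indices, empty_table):
    """Temporarily replace the entries at `indices` with `empty_table`."""
    saved = {i: table_vector[i] for i in indices}
    for i in indices:
        table_vector[i] = empty_table
    try:
        yield table_vector
    finally:
        # Runs on normal exit and on error alike, like unwind-protect.
        for i, table in saved.items():
            table_vector[i] = table


def decode(text, table_vector):
    """Apply each translation table in turn (stand-in for the decoder)."""
    for table in table_vector:
        text = "".join(table.get(ch, ch) for ch in text)
    return text


tables = [{"a": "A"}, {"b": "B"}]

with tables_disabled(tables, [0, 1], {}):
    inside = decode("abc", tables)    # no translation while swapped out
outside = decode("abc", tables)       # tables are back afterwards

print(inside, outside)
```

Without the cleanup clause, an error raised between the swap and the restore would leave the empty tables installed for every later decoding operation in the session.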
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-17 12:37 ` Oliver Scholz @ 2006-05-19 2:05 ` Richard Stallman 2006-05-19 8:47 ` Oliver Scholz 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-19 2:05 UTC (permalink / raw) Cc: storm, emacs-devel, monnier, handa, alkibiades The C code you wrote to implement byte-compile-no-char-translation might be the right C-level feature. We need Handa to check that, as you said. It should only affect unify-8859-on-decoding-mode, since that's the only one that's relevant to iso-2022. (We decided to give up on stabilizing utf-8 in this way, because too many different variables affect the behavior of utf-8.) However, the right place to set the variable is in find-auto-coding, not in the compiler. Therefore, the variable's name should be changed, since it won't be specific to compilation. It could be stabilize-iso-2022. I can see three possible ways for Lisp files to set this variable: 1. Explicitly. You should specify the variable in the -*- line or the Local Variables list if it matters. 2. Automatically. Whenever a file specifies Emacs-Lisp mode and coding iso-2022, it gets set to t. 3. Both. It gets set to t automatically, but a file can explicitly specify nil. I have another, further suggestion. Rename the variable to unify-8859-on-decoding-mode, and reimplement the function unify-8859-on-decoding-mode to work just by setting the variable. That would be an improvement, since it would mean you can set the mode just by setting the variable. ^ permalink raw reply [flat|nested] 202+ messages in thread
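The third of the three ways listed above ("Both") amounts to an automatic default that an explicit file setting can override. A minimal Python sketch of that decision, under stated assumptions: `stabilize_iso_2022` is a hypothetical function named after the variable name suggested above, and the prefix test for "iso-2022 family" is a deliberately crude placeholder, not how Emacs classifies coding systems.

```python
# Assumed, incomplete test for "an iso-2022 based coding system".
ISO_2022_PREFIXES = ("iso-2022", "iso-8859")


def stabilize_iso_2022(major_mode, coding, explicit=None):
    """Return whether translation should be inhibited for a file.

    explicit: True/False if the file sets the variable itself, else None.
    """
    if explicit is not None:
        # Way 1 (explicit) always wins, including an explicit nil.
        return explicit
    # Way 2 (automatic): Emacs Lisp mode plus an iso-2022 family coding.
    return major_mode == "emacs-lisp" and coding.startswith(ISO_2022_PREFIXES)


print(stabilize_iso_2022("emacs-lisp", "iso-8859-1"))
print(stabilize_iso_2022("emacs-lisp", "iso-8859-1", explicit=False))
print(stabilize_iso_2022("text", "iso-8859-1"))
```

Keeping the explicit value as a three-state argument (set to true, set to false, not set) is what distinguishes option 3 from option 2: a file can opt out of the automatic default.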
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-19 2:05 ` Richard Stallman @ 2006-05-19 8:47 ` Oliver Scholz 0 siblings, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-19 8:47 UTC (permalink / raw) Cc: storm, emacs-devel, monnier, handa, Oliver Scholz Richard Stallman <rms@gnu.org> writes: > The C code you wrote to implement byte-compile-no-char-translation > might be the right C-level feature. We need Handa to check that, as > you said. No need to, anymore. Only those two issues you mentioned are new with my patch: the limitation to byte compilation and the attempt to fix UTF-8. Since you said you don't want either, and since Handa said, Venable_character_translation can be used, my patch is meaningless. Putting it into find-auto-coding is indeed much better, since it avoids any inconsistencies in character encoding between the visited *.el file and the byte compiled file. Oliver -- Oliver Scholz 30 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-17 3:45 ` Richard Stallman
  2006-05-17 8:37 ` Oliver Scholz
  2006-05-17 12:37 ` Oliver Scholz
@ 2006-05-18 1:09 ` Kenichi Handa
  2006-05-21 0:57 ` Richard Stallman
  2 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-18 1:09 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

In article <E1FgCyb-0001Uq-Pk@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     I agree, in principle. IIRC, I was the first one here to suggest to
>     just document the issue and be done with it. But documenting it would
>     be a good idea. Something along the lines: "When using non-ASCII
>     characters in Emacs Lisp source files, beware that compilation
>     "freezes" some of your current settings for character unification
>     and/or fragmentation.

> I want to fix this bug, not document it.

I'm confused. You wrote:

> Handa says that telling people "don't use utf-8" solves the problem.
> If that is a good solution, I think the problem is solved.

> Does everyone agree that that solution works?

So, I thought you accepted that kind of solution; i.e. documenting
the potential problem about decoding and the way to avoid the
ambiguity if one has a problematic *.el file.

There are two ways to avoid it.

(1) use emacs-mule coding system

(2) use one of iso-2022 based coding systems (they include
    iso-8859-X) with setting enable-character-translation to nil
    in "Local Variables:" section.

(1) works now. (2) doesn't work now but is easy to make work,
as I wrote in the previous mail.

Do you want something more?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-18 1:09 ` Kenichi Handa
@ 2006-05-21 0:57 ` Richard Stallman
  2006-05-22 1:33 ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-21 0:57 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

    (1) use emacs-mule coding system

    (2) use one of iso-2022 based coding systems (they include
        iso-8859-X) with setting enable-character-translation to nil
        in "Local Variables" section.

    (1) works now. (2) doesn't work now but easy to make it
    work as I wrote in the previous mail.

People have pointed out disadvantages of (1).

Maybe (2) is a good solution. I want to check, though. It would turn
off *all* character translation. We need to verify that this is ok.

Supposing that unify-8859-on-decoding-mode is off, and you read a file
in an iso-2022 coding system. What character translation is done, or
might be done, and in what cases?

In this code,

    (defun ucs-fragment-8859 (for-encode for-decode)
      "Undo the unification done by `ucs-unify-8859'.
    With prefix arg, undo unification on encoding only, i.e. don't undo
    unification on input operations."
      (when for-decode
        ;; Don't Unify 8859 on decoding.
        ;; For non-CCL coding systems (e.g. iso-latin-2).
        (set-char-table-parent standard-translation-table-for-decode nil)

we turn off the parent of standard-translation-table-for-decode.
But what else might standard-translation-table-for-decode do
for some of these coding systems?

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-21 0:57 ` Richard Stallman
@ 2006-05-22 1:33 ` Kenichi Handa
  2006-05-22 15:12 ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-05-22 1:33 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

In article <E1FhcG5-0002nB-VN@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     (1) use emacs-mule coding system
>     (2) use one of iso-2022 based coding systems (they include
>         iso-8859-X) with setting enable-character-translation to nil
>         in "Local Variables" section.

>     (1) works now. (2) doesn't work now but easy to make it
>     work as I wrote in the previous mail.

> People have pointed out disadvantages of (1).

I don't think that is a big problem because it seems that
it's very rare to handle *.el file by some tool other than
Emacs.

> Maybe (2) is a good solution. I want to check, though. It would turn
> off *all* character translation. We need to verify that this is ok.

I believe it is ok.

> Supposing that unify-8859-on-decoding-mode is off, and you read a file
> in an iso-2022 coding system. What character translation is done, or
> might be done, and in what cases?

> In this code,

>     (defun ucs-fragment-8859 (for-encode for-decode)
>       "Undo the unification done by `ucs-unify-8859'.
>     With prefix arg, undo unification on encoding only, i.e. don't undo
>     unification on input operations."
>       (when for-decode
>         ;; Don't Unify 8859 on decoding.
>         ;; For non-CCL coding systems (e.g. iso-latin-2).
>         (set-char-table-parent standard-translation-table-for-decode nil)

> we turn off the parent of standard-translation-table-for-decode.
> But what else might standard-translation-table-for-decode do
> for some of these coding systems?

standard-translation-table-for-decode is for reflecting any
user preferences on decoding. So, it can do anything. If
one hates SOFT HYPHEN (U+00AD), he can map it to `-'.

The default value of standard-translation-table-for-decode
is not nil. It contains a mapping for
JISX0208.1978->JISX0208.1980 and JISX0201->ASCII. But, this
is to compensate for an encoding used in Japan in very old
times, and even if Emacs reads a *.el file in such an
encoding, on writing, the new encoding is used. That means
that the mapping is not used when the file is read next
time.

So, disabling character translation on reading an iso-2022
*.el file effectively stabilizes the byte-compiling of the
file without any actual problem.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply [flat|nested] 202+ messages in thread
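The SOFT HYPHEN example above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not a description of the Emacs decoder: `decode` is a made-up function, the translation table is a plain dict keyed by code point, and `enable_translation` stands in for enable-character-translation. It shows why the result of reading the same bytes differs between users when translation is on and agrees when it is off.

```python
def decode(raw, translation_table=None, enable_translation=True):
    """Decode Latin-1 bytes, then optionally apply a user translation table."""
    text = raw.decode("latin-1")
    if enable_translation and translation_table:
        # str.translate accepts a dict mapping code points to replacements.
        text = text.translate(translation_table)
    return text


raw = b"co\xadop"                  # "co", SOFT HYPHEN (0xAD), "op"

no_preference = {}                 # user A: default settings
hates_soft_hyphen = {0x00AD: "-"}  # user B: maps U+00AD to "-"

# With translation enabled, the two users read different text.
print(repr(decode(raw, no_preference)))
print(repr(decode(raw, hates_soft_hyphen)))

# With translation disabled, both read exactly what the file contains.
print(repr(decode(raw, hates_soft_hyphen, enable_translation=False)))
```

A byte compiler that reads its input through the first path bakes one user's preference into the compiled file, which is the instability the thread is trying to remove.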
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-22 1:33 ` Kenichi Handa
@ 2006-05-22 15:12 ` Richard Stallman
  2006-05-23 1:05 ` Kenichi Handa
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-22 15:12 UTC (permalink / raw)
  Cc: emacs-devel, monnier, storm, alkibiades

    So, disabling character translation on reading an iso-2022
    *.el file effectively stabilizes the byte-compiling of the
    file without any actual problem.

Ok, I am convinced that disabling character translation is a good
solution _mechanism_. The remaining question is what user interface
to use. That is, how should Emacs determine that it should set
enable-character-translation to nil for these files?

One obvious way is an explicit specification of the variable
enable-character-translation. But that would be cumbersome to use.

Another way is that specification of coding: together with mode:
emacs-lisp could do this automatically.

Another way is that you could specify coding: in a special way,
perhaps with ! at the end of the coding system name.

What do you think?

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-22 15:12 ` Richard Stallman
@ 2006-05-23 1:05 ` Kenichi Handa
  2006-05-23 5:18 ` Juri Linkov
  2006-05-24 2:17 ` Richard Stallman
  0 siblings, 2 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-05-23 1:05 UTC (permalink / raw)
  Cc: alkibiades, storm, monnier, emacs-devel

In article <E1FiC4w-00075E-2u@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     So, disabling character translation on reading an iso-2022
>     *.el file effectively stabilizes the byte-compiling of the
>     file without any actual problem.

> Ok, I am convinced that disabling character translation is a good
> solution _mechanism_. The remaining question is what user interface
> to use. That is, how should Emacs determine that it should set
> enable-character-translation to nil for these files?

> One obvious way is an explicit specification of the variable
> enable-character-translation. But that would be cumbersome to use.

At least, this should work for people who don't mind the
cumbersomeness.

> Another way is that specification of coding: together with mode:
> emacs-lisp could do this automatically.

I object to this because it's an incompatible change that
should be avoided at this stage. In addition, we then have
to invent some way to "enable" normal character translation.

> Another way is that you could specify coding: in a special way,
> perhaps with ! at the end of the coding system name.

I don't know if that is aesthetically good, but at least
it's a quite handy way.

;; xxx.el -- Do XXX. -*- coding: latin-1!; -*-

So, I'd like to implement both the 1st and 3rd method. Ok?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-23 1:05 ` Kenichi Handa @ 2006-05-23 5:18 ` Juri Linkov 2006-05-24 2:18 ` Richard Stallman 2006-05-24 2:17 ` Richard Stallman 1 sibling, 1 reply; 202+ messages in thread From: Juri Linkov @ 2006-05-23 5:18 UTC (permalink / raw) Cc: storm, emacs-devel, rms, monnier, alkibiades >> Another way is that you could specify coding: in a special way, >> perhaps with ! at the end of the coding system name. > > I don't know if that is aesthetically good, but at least > it's a quite handy way. > > ;; xxx.el -- Do XXX. -*- coding: latin-1!; -*- Such a notation is not self-evident. What about ;; xxx.el -- Do XXX. -*- coding: latin-1; translation: no -*- or ;; xxx.el -- Do XXX. -*- coding: latin-1; char-trans: no -*- -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-23 5:18 ` Juri Linkov @ 2006-05-24 2:18 ` Richard Stallman 2006-06-02 6:49 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-24 2:18 UTC (permalink / raw) Cc: alkibiades, emacs-devel, monnier, storm, handa > ;; xxx.el -- Do XXX. -*- coding: latin-1!; -*- Such a notation is not self-evident. What about ;; xxx.el -- Do XXX. -*- coding: latin-1; translation: no -*- It would be useful to support the latter as well, but I think we want something terse for this. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-24 2:18 ` Richard Stallman
@ 2006-06-02 6:49 ` Kenichi Handa
  2006-06-02 8:00 ` Kim F. Storm
  ` (2 more replies)
  0 siblings, 3 replies; 202+ messages in thread
From: Kenichi Handa @ 2006-06-02 6:49 UTC (permalink / raw)
  Cc: juri, storm, emacs-devel, monnier, alkibiades

In article <E1Fiix4-0006Tz-RZ@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>> ;; xxx.el -- Do XXX. -*- coding: latin-1!; -*-

> Such a notation is not self-evident. What about

> ;; xxx.el -- Do XXX. -*- coding: latin-1; translation: no -*-

> It would be useful to support the latter as well,
> but I think we want something terse for this.

I've just installed these changes.

(1) Accept something like "latin-1!" as value of coding: in
    header and local variables section.

(2) Accept "char-trans: VAL" in header and local variables section.
    I think "translation: VAL" is too ambiguous.

(3) Accept "enable-character-translation: VAL" in local
    variables section.

In which NEWS section should that information go?
Previously we simply had "* Changes in Emacs XX.YY", but now
we have these sections:

  * Installation Changes in Emacs 22.1
  * Startup Changes in Emacs 22.1
  * Incompatible Editing Changes in Emacs 22.1
  * Editing Changes in Emacs 22.1
  * New Modes and Packages in Emacs 22.1
  * Changes in Specialized Modes and Packages in Emacs 22.1
  * Changes in Emacs 22.1 on non-free operating systems
  * Incompatible Lisp Changes in Emacs 22.1
  * Lisp Changes in Emacs 22.1
  * New Packages for Lisp Programming in Emacs 22.1

It seems that this change is "Editing Changes", but I'm not
sure we can declare it incompatible or not. Previously, if
a file has "coding: latin-1!", it is treated as an invalid
coding specification. In that sense, this change is
incompatible, but...

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply [flat|nested] 202+ messages in thread
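The three notations listed in the message above can be read off a first line with a small parser. The following Python sketch only illustrates the notations as described at this point in the thread; it is not the code that was installed. `parse_header` is a hypothetical name, and treating "nil" and "no" as the false values is an assumption drawn from the examples quoted in the thread.

```python
import re


def parse_header(first_line):
    """Return (coding, translate) from a '-*- ... -*-' line.

    translate is True or False when the line says so, otherwise None.
    """
    match = re.search(r"-\*-(.*?)-\*-", first_line)
    if not match:
        return None, None
    coding, translate = None, None
    for item in match.group(1).split(";"):
        if ":" not in item:
            continue
        key, value = (part.strip() for part in item.split(":", 1))
        if key == "coding":
            if value.endswith("!"):
                # Terse form: trailing "!" disables character translation.
                coding, translate = value[:-1], False
            else:
                coding = value
        elif key in ("char-trans", "enable-character-translation"):
            # Assumed false values: "nil" and "no".
            translate = value not in ("nil", "no")
    return coding, translate


print(parse_header(";; xxx.el -- Do XXX. -*- coding: latin-1!; -*-"))
print(parse_header(";;; -*- coding:utf-8; enable-character-translation:t -*-"))
```

In this sketch a later item on the line overrides an earlier one, so "coding: latin-1!; char-trans: t" would end up with translation enabled; the thread does not say how such a conflict should be resolved.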
* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02 6:49 ` Kenichi Handa
@ 2006-06-02 8:00 ` Kim F. Storm
  2006-06-02 9:27 ` Juri Linkov
  2006-06-02 22:39 ` Richard Stallman
  2 siblings, 0 replies; 202+ messages in thread
From: Kim F. Storm @ 2006-06-02 8:00 UTC (permalink / raw)
  Cc: juri, alkibiades, rms, monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> It seems that this change is "Editing Changes", but I'm not
> sure we can declare it incompatible or not. Previously, if
> a file has "coding: latin-1!", it is treated as an invalid
> coding specification. In that sense, this change is
> incompatible, but...

The change is not incompatible in the sense that it breaks
existing _valid_ coding specs.

The section

** Multilingual Environment (Mule) changes:

seems appropriate??

--
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-02 6:49 ` Kenichi Handa 2006-06-02 8:00 ` Kim F. Storm @ 2006-06-02 9:27 ` Juri Linkov 2006-06-02 10:50 ` Eli Zaretskii ` (2 more replies) 2006-06-02 22:39 ` Richard Stallman 2 siblings, 3 replies; 202+ messages in thread From: Juri Linkov @ 2006-06-02 9:27 UTC (permalink / raw) Cc: storm, emacs-devel, rms, monnier, alkibiades > (2) Accept "char-trans: VAL" in header and local variables section. > I think "translation: VAL" is too ambiguous. > (3) Accept "enable-character-translation: VAL" in local > variables section. I think using different names in the first line and in the local variables section is not good. Users may move these settings between these two places in the same file, and incompatible names are the source of inconvenience and confusion. Since the variable name is `enable-character-translation', there should be no problem in using it in the first line as well. -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-02 9:27 ` Juri Linkov @ 2006-06-02 10:50 ` Eli Zaretskii 2006-06-02 11:39 ` Kenichi Handa 2006-06-02 22:39 ` Richard Stallman 2 siblings, 0 replies; 202+ messages in thread From: Eli Zaretskii @ 2006-06-02 10:50 UTC (permalink / raw) Cc: alkibiades, emacs-devel > From: Juri Linkov <juri@jurta.org> > Date: Fri, 02 Jun 2006 12:27:01 +0300 > Cc: storm@cua.dk, emacs-devel@gnu.org, rms@gnu.org, monnier@iro.umontreal.ca, > alkibiades@gmx.de > > > (2) Accept "char-trans: VAL" in header and local variables section. > > I think "translation: VAL" is too ambiguous. > > (3) Accept "enable-character-translation: VAL" in local > > variables section. > > I think using different names in the first line and in the local > variables section is not good. But we already do that, e.g. with coding: in the header vs coding-system in local vars. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02 9:27 ` Juri Linkov
  2006-06-02 10:50 ` Eli Zaretskii
@ 2006-06-02 11:39 ` Kenichi Handa
  2006-06-02 12:12 ` Juri Linkov
  2006-06-02 22:39 ` Richard Stallman
  2 siblings, 1 reply; 202+ messages in thread
From: Kenichi Handa @ 2006-06-02 11:39 UTC (permalink / raw)
  Cc: storm, emacs-devel, rms, monnier, alkibiades

In article <87ac8vor6u.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

>> (2) Accept "char-trans: VAL" in header and local variables section.
>>     I think "translation: VAL" is too ambiguous.
>> (3) Accept "enable-character-translation: VAL" in local
>>     variables section.

> I think using different names in the first line and in the local
> variables section is not good. Users may move these settings
> between these two places in the same file, and incompatible names
> are the source of inconvenience and confusion. Since the variable name
> is `enable-character-translation', there should be no problem in using it
> in the first line as well.

But, wasn't it you who proposed "char-trans"?

Eli Zaretskii <eliz@gnu.org> writes:

> But we already do that, e.g. with coding: in the header vs
> coding-system in local vars.

No, AFAIK we only accept "coding" in local vars. Anyway,
the situation of the coding: tag is a little bit different from
the enable-character-translation case because we don't have a
variable for the former.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes
  2006-06-02 11:39 ` Kenichi Handa
@ 2006-06-02 12:12 ` Juri Linkov
  0 siblings, 0 replies; 202+ messages in thread
From: Juri Linkov @ 2006-06-02 12:12 UTC (permalink / raw)
  Cc: storm, emacs-devel, rms, monnier, alkibiades

>>> (2) Accept "char-trans: VAL" in header and local variables section.
>>>     I think "translation: VAL" is too ambiguous.
>>> (3) Accept "enable-character-translation: VAL" in local
>>>     variables section.
>
>> I think using different names in the first line and in the local
>> variables section is not good. Users may move these settings
>> between these two places in the same file, and incompatible names
>> are the source of inconvenience and confusion. Since the variable name
>> is `enable-character-translation', there should be no problem in using it
>> in the first line as well.
>
> But, wasn't it you who proposed "char-trans"?

I was unaware of the variable `enable-character-translation'.
It's documented nowhere. Since such a variable really exists,
I think its name is ideal both for the local variables section
and the first line. For cases when its name is too long to fit
on the first line, IIUC there is already an alternative syntax with
the trailing !, which you just installed, for those who want to use
terse syntax.

--
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-02 9:27 ` Juri Linkov 2006-06-02 10:50 ` Eli Zaretskii 2006-06-02 11:39 ` Kenichi Handa @ 2006-06-02 22:39 ` Richard Stallman 2006-06-03 6:42 ` Juri Linkov 2 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-06-02 22:39 UTC (permalink / raw) Cc: alkibiades, emacs-devel, monnier, storm, handa I think using different names in the first line and in the local variables section is not good. I am not sure. Users may move these settings between these two places in the same file, and incompatible names are the source of inconvenience and confusion. Since the variable name is `enable-character-translation', there should be no problem in using it in the first line as well. No, that name is too long for the first line. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-02 22:39 ` Richard Stallman @ 2006-06-03 6:42 ` Juri Linkov 2006-06-04 2:23 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Juri Linkov @ 2006-06-03 6:42 UTC (permalink / raw) Cc: alkibiades, emacs-devel, monnier, storm, handa > I think using different names in the first line and in the local > variables section is not good. > > I am not sure. When the first line has enough space, there is no reason to disallow using the same variable name that is allowed in the local variables section. Example: ;;; -*- coding:utf-8; enable-character-translation:t -*- -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-03 6:42 ` Juri Linkov @ 2006-06-04 2:23 ` Richard Stallman 2006-06-05 7:24 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-06-04 2:23 UTC (permalink / raw) Cc: handa, emacs-devel, monnier, storm, alkibiades When the first line has enough space, there is no reason to disallow using the same variable name that is allowed in the local variables section. Example: ;;; -*- coding:utf-8; enable-character-translation:t -*- I agree that the name enable-character-translation ought to work if used in the first line. Does it fail now? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-04 2:23 ` Richard Stallman @ 2006-06-05 7:24 ` Kenichi Handa 2006-06-05 21:31 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Kenichi Handa @ 2006-06-05 7:24 UTC (permalink / raw) Cc: juri, alkibiades, storm, monnier, emacs-devel In article <E1FmiHR-00027r-Li@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > When the first line has enough space, there is no reason to disallow > using the same variable name that is allowed in the local variables section. > Example: > ;;; -*- coding:utf-8; enable-character-translation:t -*- > I agree that the name enable-character-translation ought to work > if used in the first line. > Does it fail now? As I've just installed a fix, it doesn't fail now . So, the remaining problem is whether or not we should allow the short name "char-trans" for it. I tend to agree with Juri that we don't need it because we can use "CODING!" notation. If you agree with deleting it too, I'll install a proper change soon. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-05 7:24 ` Kenichi Handa @ 2006-06-05 21:31 ` Richard Stallman 2006-06-07 1:24 ` Kenichi Handa 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-06-05 21:31 UTC (permalink / raw) Cc: juri, alkibiades, storm, monnier, emacs-devel As I've just installed a fix, it doesn't fail now . So, the remaining problem is whether or not we should allow the short name "char-trans" for it. I tend to agree with Juri that we don't need it because we can use "CODING!" notation. If you agree with deleting it too, I'll install a proper change soon. Ok. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-05 21:31 ` Richard Stallman @ 2006-06-07 1:24 ` Kenichi Handa 0 siblings, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-06-07 1:24 UTC (permalink / raw) Cc: juri, storm, emacs-devel, monnier, alkibiades In article <E1FnMfX-0006Id-7e@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > As I've just installed a fix, it doesn't fail now . So, the > remaining problem is whether or not we should allow the > short name "char-trans" for it. I tend to agree with Juri > that we don't need it because we can use "CODING!" notation. > If you agree with deleting it too, I'll install a proper > change soon. > Ok. I've just installed a change for not handling the short-name "char-trans". storm@cua.dk (Kim F. Storm) writes: > The change is not incompatible in the sense that it breaks > existing _valid_ coding specs. > The section > ** Multilingual Environment (Mule) changes: > seems appropriate?? Ok, I've just added this in that section. *** You can disable character translation for a file using the -*- construct. Include `enable-character-translation: nil' inside the -*-...-*- to disable any character translation that may be performed by various global and per-coding-system translation tables. You can also specify it in a local variable list at the end of the file. As a shortcut, instead of using this long variable name, you can append the character "!" to the end of the coding-system name specified in the -*- construct or in a local variable list. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
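The three spellings described in the NEWS entry above might look roughly like this in a Lisp source file. This is only an illustrative fragment: the coding system utf-8 is a placeholder chosen for the example, not something the thread prescribes, and a real file would use just one of the three forms.

```elisp
;; Form 1: long variable name on the first line of the file.
;;; -*- coding: utf-8; enable-character-translation: nil -*-

;; Form 2: shorthand, "!" appended to the coding-system name.
;;; -*- coding: utf-8! -*-

;; Form 3: local variables list at the end of the file.
;; Local Variables:
;; coding: utf-8
;; enable-character-translation: nil
;; End:
```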
* Re: [PATCH] Unicode Lisp reader escapes 2006-06-02 6:49 ` Kenichi Handa 2006-06-02 8:00 ` Kim F. Storm 2006-06-02 9:27 ` Juri Linkov @ 2006-06-02 22:39 ` Richard Stallman 2 siblings, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-06-02 22:39 UTC (permalink / raw) Cc: juri, storm, emacs-devel, monnier, alkibiades In which NEWS section should that information go? I think it belongs in * Editing Changes in Emacs 22.1 It seems that this change is "Editing Changes", but I'm not sure whether we can declare it incompatible or not. Previously, if a file had "coding: latin-1!", it was treated as an invalid coding specification. In that sense, this change is incompatible, but... Giving a meaning to something that formerly was invalid is not incompatible. ^ permalink raw reply [flat|nested] 202+ messages in thread
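To make the "coding: latin-1!" notation concrete, here is a minimal sketch of how such a value could be split into a coding-system name and a translation flag. The function name `example-split-coding-spec` is hypothetical and this is not the code Handa installed; it only illustrates the rule that a trailing "!" means "do not translate".

```elisp
(defun example-split-coding-spec (spec)
  "Split SPEC, the text after \"coding:\", into (CODING . TRANSLATE).
CODING is a symbol; TRANSLATE is nil when SPEC ends in \"!\", else t.
Hypothetical helper for illustration only."
  (let ((len (length spec)))
    (if (and (> len 1) (eq (aref spec (1- len)) ?!))
        ;; Trailing "!": strip it and report that translation is disabled.
        (cons (intern (substring spec 0 -1)) nil)
      ;; No "!": keep the name as is, translation stays enabled.
      (cons (intern spec) t))))

(example-split-coding-spec "latin-1!") ; coding latin-1, translation off
(example-split-coding-spec "utf-8")    ; coding utf-8, translation on
```

Because a name ending in "!" was not a valid coding specification before, accepting it does not change the meaning of any previously valid file, which is the point made above.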
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-23 1:05 ` Kenichi Handa 2006-05-23 5:18 ` Juri Linkov @ 2006-05-24 2:17 ` Richard Stallman 1 sibling, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-24 2:17 UTC (permalink / raw) Cc: alkibiades, storm, monnier, emacs-devel So, I'd like to implement both the 1st and 3rd method. Ok? Ok, please do. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 11:58 ` Oliver Scholz 2006-05-16 14:24 ` Kim F. Storm 2006-05-17 3:45 ` Richard Stallman @ 2006-05-17 15:15 ` Stefan Monnier 2 siblings, 0 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-17 15:15 UTC (permalink / raw) Cc: emacs-devel, rms, handa, Kim F. Storm > Unification and UCS fragmentation are new in Emacs 22. unify-8859-on-decoding-mode was introduced in Emacs-21.3. Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 11:16 ` Kim F. Storm 2006-05-16 11:39 ` Romain Francoise 2006-05-16 11:58 ` Oliver Scholz @ 2006-05-17 3:45 ` Richard Stallman 2 siblings, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-17 3:45 UTC (permalink / raw) Cc: emacs-devel, monnier, handa, alkibiades IIRC, the current issue was raised because someone suggested to add \u and \U for unicode to the Lisp reader -- something we have also lived without for ages. That suggestion led me to recognize that we have a problem, but isn't logically related to the problem. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-16 9:49 ` Oliver Scholz 2006-05-16 11:16 ` Kim F. Storm @ 2006-05-17 3:45 ` Richard Stallman 2006-05-17 8:53 ` Oliver Scholz 1 sibling, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-17 3:45 UTC (permalink / raw) Cc: emacs-devel, monnier, handa, alkibiades Well, as already mentioned, unification and fragmentation are implemented by means of translation tables. Unification, for instance, for non-CCL decodings happens by means of modifying the parent of the char table in the variable `standard-translation-table-for-decode'. This suggests another implementation: make two such tables, one for unification and one not for unification, and the mode can choose which one gets used. I'd like to understand a little more of the current design. The translation table that unification alters is the parent of the one in `standard-translation-table-for-decode'. What is the purpose of making these child maps? I have no idea whether it is simple to make this variable buffer local or not. But, well, it's certainly intrusive to change such things at the very heart and core of Emacs' decoding/encoding apparatus. This kind of change is not intrusive at all. It ought to be pretty trivial. The translation table relevant for unification in CCL decoding is `ucs-translation-table-for-decode' (AFAICS only the cyrillic encodings make use of this). The translation table relevant for fragmentation of UCS coding systems is `utf-translation-table-for-decode'. You'd have to find a way to make *that* buffer local. It looks very easy to make the choice of table dynamic for the Cyrillic coding systems. What does "fragmentation" mean? I do not recall seeing that term in this context. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-17 3:45 ` Richard Stallman @ 2006-05-17 8:53 ` Oliver Scholz 2006-05-17 20:09 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Oliver Scholz @ 2006-05-17 8:53 UTC (permalink / raw) Cc: emacs-devel, monnier, handa, Oliver Scholz Richard Stallman <rms@gnu.org> writes: [...] > What does "fragmentation" mean? I do not recall seeing that term > in this context. It's the opposite of unification. In this context it can mean two different things: 1. Undo the effects of `unify-8859-on-decoding' mode. That is, when decoding some encodings like Cyrillic or some ISO 8859 encodings, decode them to characters from appropriate mule charsets (e.g. `greek-iso8859-7') rather than to characters from the charset `mule-unicode-0100-24ff'. This is the default. 2. When decoding UCS encodings like UTF-8, decode characters from certain repertoires, e.g. "Greek", to different mule charsets like `greek-iso8859-7'. The default is to decode them all to characters from `mule-unicode-0100-24ff'. A user can turn this behaviour on by customizing `utf-fragment-on-decoding'. Oliver -- Oliver Scholz 28 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-17 8:53 ` Oliver Scholz @ 2006-05-17 20:09 ` Richard Stallman 2006-05-18 9:12 ` Oliver Scholz 0 siblings, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-17 20:09 UTC (permalink / raw) Cc: emacs-devel, monnier, handa, alkibiades > What does "fragmentation" mean? I do not recall seeing that term > in this context. It's the opposite of unification. In this context it can mean two different things: Thanks. 1. Undo the effects of `unify-8859-on-decoding' mode. That is, when decoding some encodings like Cyrillic or some ISO 8859 encodings, decode them to characters from appropriate mule charsets (e.g. `greek-iso8859-7') rather than to characters from the charset `mule-unicode-0100-24ff'. This is the default. Don't you mean "Don't perform the actions of `unify-8859-on-decoding' mode?" 2. When decoding UCS encodings like UTF-8, decode characters from certain repertoires, e.g. "Greek", to different mule charsets like `greek-iso8859-7'. The default is to decode them all to characters from `mule-unicode-0100-24ff'. A user can turn this behaviour on by customizing `utf-fragment-on-decoding'. That one isn't relevant to the problem we need to solve now. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-17 20:09 ` Richard Stallman @ 2006-05-18 9:12 ` Oliver Scholz 0 siblings, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-18 9:12 UTC (permalink / raw) Cc: emacs-devel, monnier, handa, Oliver Scholz Richard Stallman <rms@gnu.org> writes: > > What does "fragmentation" mean? I do not recall seeing that term > > in this context. > > It's the opposite of unification. In this context it can mean two > different things: > > Thanks. > > 1. Undo the effects of `unify-8859-on-decoding' mode. That is, when > decoding some encodings like Cyrillic or some ISO 8859 encodings, > decode them to characters from appropriate mule charsets (e.g. > `greek-iso8859-7') rather than to characters from the charset > `mule-unicode-0100-24ff'. This is the default. > > Don't you mean "Don't perform the actions of `unify-8859-on-decoding' > mode?" Both actually. This is what Emacs does by default. But there's also a function `ucs-fragment-8859' that, when called with its second argument non-nil, reverses the effects of unification on decoding. This is how `unify-8859-on-decoding-mode' is implemented: (define-minor-mode unify-8859-on-decoding-mode "Set up translation-tables for unifying ISO 8859 characters on decoding. [...]" :group 'mule :global t :init-value nil (if unify-8859-on-decoding-mode (ucs-unify-8859 nil t) (ucs-fragment-8859 nil t))) > 2. When decoding UCS encodings like UTF-8, decode characters from > certain repertoires, e.g. "Greek", to different mule charsets like > `greek-iso8859-7'. The default is to decode them all to characters > from `mule-unicode-0100-24ff'. A user can turn this behaviour on by > customizing `utf-fragment-on-decoding'. > > That one isn't relevant to the problem we need to solve now. Sorry, I was not aware that this decision was definitive. Oliver -- Oliver Scholz 29 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. 
^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 0:55 ` Stefan Monnier 2006-05-15 2:49 ` Oliver Scholz @ 2006-05-15 20:37 ` Richard Stallman 1 sibling, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-15 20:37 UTC (permalink / raw) Cc: emacs-devel, handa, alkibiades > Here's one idea: if the -*- line specifies `coding' and specifies > the mode `emacs-lisp' then force unify-8859-on-decoding to nil > for that file. Forcing it to nil for a particular file is maybe too much work to implement compared to the benefit. What makes it hard? Maybe an easier solution is to add a file-local variable `no-8859-unification' such that if that file is loaded in an Emacs which is configured to use unify-8859-on-decoding it signals an error. Why is it much harder to switch to the nil mode than to signal an error? ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-13 4:52 ` Richard Stallman 2006-05-13 13:25 ` Stefan Monnier @ 2006-05-15 5:13 ` Kenichi Handa 2006-05-15 8:06 ` Kim F. Storm ` (2 more replies) 1 sibling, 3 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-15 5:13 UTC (permalink / raw) Cc: emacs-devel, alkibiades In article <E1Fem6n-0002Qv-9c@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > How about just asking users to use emacs-mule coding system > for *.el files if they want them decoded the same way > independent of various settings on byte-compiling? > Maybe that is a good enough solution. Does this solution > solve the whole problem? Yes, as far as I know. > Handa says that telling people "don't use utf-8" solves the problem. That's NOT what I said. I said "use emacs-mule". The other coding systems are affected by unify-8859-on-decoding-mode, and also by users' settings of standard-translation-table-for-decode. > There is a way for a Lisp file to specify a coding system which isn't > utf-8. Is there a way for a Lisp file to specify that > unify-8859-on-decoding should not be used when reading it? No. > If not, maybe we should make one. But, as emacs-mule is not affected by unify-8859-on-decoding, we don't have to invent it as long as we advise people to use emacs-mule in a problematic case. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 5:13 ` Kenichi Handa @ 2006-05-15 8:06 ` Kim F. Storm 2006-05-15 9:04 ` Andreas Schwab 2006-05-15 20:38 ` Richard Stallman 2006-05-15 14:08 ` Stefan Monnier 2006-05-15 20:37 ` Richard Stallman 2 siblings, 2 replies; 202+ messages in thread From: Kim F. Storm @ 2006-05-15 8:06 UTC (permalink / raw) Cc: alkibiades, rms, emacs-devel Kenichi Handa <handa@m17n.org> writes: > But, as emacs-mule is not affected by > unify-8859-on-decoding, we don't have to invent it as long > as we suggest people to use emacs-mule in a problematic > case. So why not _always_ use emacs-mule for .elc files (both on write and on load)? Could it be a problem with existing *.elc files? -- Kim F. Storm <storm@cua.dk> http://www.cua.dk ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 8:06 ` Kim F. Storm @ 2006-05-15 9:04 ` Andreas Schwab 2006-05-15 20:38 ` Richard Stallman 1 sibling, 0 replies; 202+ messages in thread From: Andreas Schwab @ 2006-05-15 9:04 UTC (permalink / raw) Cc: alkibiades, emacs-devel, rms, Kenichi Handa storm@cua.dk (Kim F. Storm) writes: > Kenichi Handa <handa@m17n.org> writes: > >> But, as emacs-mule is not affected by >> unify-8859-on-decoding, we don't have to invent it as long >> as we suggest people to use emacs-mule in a problematic >> case. > > So why not _always_ use emacs-mule for .elc files (both on write > and on load)? Don't we already? (setq file-coding-system-alist '(("\\.elc\\'" . (emacs-mule . emacs-mule)) ...) Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 202+ messages in thread
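The `file-coding-system-alist` entry quoted above pairs a file-name regexp with a (DECODING . ENCODING) cons. As a rough sketch of how such an alist is consulted, the hypothetical helper below looks up a file name in a local copy of that one entry; it does not touch the real variable, whose contents differ between Emacs versions.

```elisp
(defvar example-coding-alist
  '(("\\.elc\\'" . (emacs-mule . emacs-mule)))
  "Illustrative copy of the single entry quoted in the message above.")

(defun example-coding-for-file (filename alist)
  "Return the (DECODING . ENCODING) pair of the first ALIST entry
whose regexp matches FILENAME, or nil if none matches.
Hypothetical helper; the symbols are treated as plain data."
  ;; `assoc-default' calls (string-match REGEXP FILENAME) for each entry.
  (assoc-default filename alist #'string-match))

(example-coding-for-file "foo.elc" example-coding-alist)
(example-coding-for-file "foo.txt" example-coding-alist)
```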
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 8:06 ` Kim F. Storm 2006-05-15 9:04 ` Andreas Schwab @ 2006-05-15 20:38 ` Richard Stallman 1 sibling, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-15 20:38 UTC (permalink / raw) Cc: alkibiades, emacs-devel, handa > But, as emacs-mule is not affected by > unify-8859-on-decoding, we don't have to invent it as long > as we suggest people to use emacs-mule in a problematic > case. So why not _always_ use emacs-mule for .elc files (both on write and on load)? .elc files do not undergo decoding. The issue is about compilation and loading of Lisp source files. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 5:13 ` Kenichi Handa 2006-05-15 8:06 ` Kim F. Storm @ 2006-05-15 14:08 ` Stefan Monnier 2006-05-15 20:37 ` Richard Stallman 2 siblings, 0 replies; 202+ messages in thread From: Stefan Monnier @ 2006-05-15 14:08 UTC (permalink / raw) Cc: alkibiades, rms, emacs-devel > That's NOT what I said. I said "use emacs-mule". The other coding > systems are affected by unify-8859-on-decoding-mode, and also by users > setting of standard-translation-table-for-decode. Ah, that's good, thanks, Stefan ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 5:13 ` Kenichi Handa 2006-05-15 8:06 ` Kim F. Storm 2006-05-15 14:08 ` Stefan Monnier @ 2006-05-15 20:37 ` Richard Stallman 2006-05-16 10:07 ` Oliver Scholz 2006-05-18 0:31 ` Kenichi Handa 2 siblings, 2 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-15 20:37 UTC (permalink / raw) Cc: emacs-devel, alkibiades > Handa says that telling people "don't use utf-8" solves the problem. That's NOT what I said. I said "use emacs-mule". The other coding systems are affected by unify-8859-on-decoding-mode, and also by users' settings of standard-translation-table-for-decode. Ok, I stand corrected. However, people have pointed out that there are practical drawbacks to using emacs-mule, and that iso-2022 is more convenient. Let's see if we can arrange for iso-2022 to work properly. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 20:37 ` Richard Stallman @ 2006-05-16 10:07 ` Oliver Scholz 2006-05-18 0:31 ` Kenichi Handa 1 sibling, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-16 10:07 UTC (permalink / raw) Cc: alkibiades, emacs-devel, Kenichi Handa Richard Stallman <rms@gnu.org> writes: > > Handa says that telling people "don't use utf-8" solves the problem. > > That's NOT what I said. I said "use emacs-mule". The > other coding systems are affected by > unify-8859-on-decoding-mode, and also by users' settings of > standard-translation-table-for-decode. > > Ok, I stand corrected. > > However, people have pointed out that there are practical drawbacks > to using emacs-mule, and that iso-2022 is more convenient. > Let's see if we can arrange for iso-2022 to work properly. The same here. decode_coding_iso2022 (which is also responsible for some ISO 8859 encodings) refers to Vstandard_translation_table_for_decode. The practical drawbacks *I* mentioned are basically the same with ISO 2022-7bit. (Disclaimer: I don't really understand ISO 2022. I am not even sure whether this particular ISO standard specifies an encoding (character set + transfer encoding) or rather a standard *for* specifying encodings.) I see that ISO-2022-JP-2 is, thanks to Kenichi Handa, a registered IANA encoding. (But that is probably not the same as ISO 2022-7bit?) But that means only that you are not, strictly speaking, violating a standard if you use it in mail or news. In practice, however, I very much doubt that outside of Japan there are any editors, mail clients or news clients other than Emacs that are able to deal with it. I don't know whether being "8 bit clean" is still an issue for networking connections today. If it is, then ISO-2022-7bit might have an advantage for files in a CVS repository. But that's pretty much the only advantage in practice. Oliver -- Oliver Scholz 27 Floréal an 214 de la Révolution Ostendstr. 
61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-15 20:37 ` Richard Stallman 2006-05-16 10:07 ` Oliver Scholz @ 2006-05-18 0:31 ` Kenichi Handa 1 sibling, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-18 0:31 UTC (permalink / raw) Cc: emacs-devel, alkibiades In article <E1Ffjop-0000Wx-9d@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: >> Handa says that telling people "don't use utf-8" solves the problem. > That's NOT what I said. I said "use emacs-mule". The > other coding systems are affected by > unify-8859-on-decoding-mode, and also by users' settings of > standard-translation-table-for-decode. > Ok, I stand corrected. > However, people have pointed out that there are practical drawbacks > to using emacs-mule, and that iso-2022 is more convenient. > Let's see if we can arrange for iso-2022 to work properly. For iso-2022 based coding-systems, the situation is simpler than the utf-* case. As we already have the variable `enable-character-translation', just by making it a local variable of the buffer where a file is being read, and setting it to nil, we can read a file in a constant way. The only hack we need is to detect that variable in the "Local Variables:" section before decoding starts (perhaps in set-auto-coding). --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
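As a rough illustration of the idea of reading a file with `enable-character-translation' set to nil, the sketch below binds the variable dynamically around `insert-file-contents'. This is a simplification: the proposal above is to make the variable buffer-local and to detect it from the file's own "Local Variables:" section, which this sketch does not attempt. The function name is hypothetical.

```elisp
(defun example-read-without-translation (file coding)
  "Return the text of FILE decoded with CODING, with character
translation tables disabled while decoding.
Hypothetical helper; uses a dynamic binding instead of the
buffer-local setting discussed in the thread."
  (with-temp-buffer
    (let ((coding-system-for-read coding)      ; force the coding system
          (enable-character-translation nil))  ; skip translation tables
      (insert-file-contents file))
    (buffer-string)))
```

A call such as (example-read-without-translation "foo.el" 'iso-2022-7bit) would then be expected to decode the file the same way regardless of the user's translation-table settings, assuming the coding system honours the variable.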
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-11 3:44 ` Richard Stallman 2006-05-11 7:31 ` Kenichi Handa @ 2006-05-11 9:44 ` Oliver Scholz 1 sibling, 0 replies; 202+ messages in thread From: Oliver Scholz @ 2006-05-11 9:44 UTC (permalink / raw) Cc: alkibiades, emacs-devel, Kenichi Handa Richard Stallman <rms@gnu.org> writes: > > Add a new "variable" `buffer-coding' which is analogous to `coding'. > > Whereas `coding' specifies the encoding in the file, `buffer-coding' > > specifies the in-buffer encoding to produce in the buffer. Its value > > could be a list or plist, which would specify the values of all these > > many variables. > > > What do you think? If you think this is a good idea, could > > you try designing the details? > > No, it's an incredibly hard and heavy task. > > I am surprised you think so, and this means there is some sort of > misunderstanding between us. > > You've listed around 6 variables that affect the decoding. So it > seems to me that if we make a convenient way for each Lisp file to > specify those 6 variables, we solve the problem. It looks easy to me. Yes, I think there is a misunderstanding. It is not the value of those variables that affects decoding. Rather, changing the value of those variables via their corresponding minor mode functions or via customize initialises translation tables (char tables and arrays) and in some cases adjusts coding systems to use those tables. See `ucs-unify-8859' and its counterpart `ucs-fragment-8859' for an example. In most, if not all affected cases, binding variables to another value would have no effect whatsoever. In some cases like `utf-fragment-on-decoding' you'd first have to write functions to programmatically cause the associated effect. In fact, it might be easier (and even safer) to just change the encoding of *.elc files from emacs-mule to utf-8. Then there may be follow-on problems. 
For instance, Handa-san mentioned an example of how a user could have reason to modify the affected translation tables. Are users supposed to do that? (I'd argue, they should rather change the fontset.) If yes, you'd need to make sure that such changes are preserved after swapping and/or redefining all those translation tables during byte compilation. Maybe they are anyway, but we are talking about a lot of mule code here. Finally, users might encounter *either* behaviour in a way that makes them think it is a bug. If byte compilation is modified the way you propose, then what some users will probably just see is that the glyphs of some characters coming from a byte compiled file differ from what they specified in their .emacs. That will come as a surprise to them and investigating it is not exactly easy, if you are not familiar with Emacs' internal handling of characters. So it might be a good idea to document even the fix of the bug we are discussing in etc/PROBLEMS, because it *is* a design problem of `emacs-mule'. (And as Handa-san mentioned, things like that are the actual reason that changing the internal encoding to UTF-8 is a worthwhile enterprise in the first place.) Oliver -- Oliver Scholz 22 Floréal an 214 de la Révolution Ostendstr. 61 Liberté, Egalité, Fraternité! 60314 Frankfurt a. M. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-07 12:38 ` Kenichi Handa 2006-05-07 21:26 ` Oliver Scholz @ 2006-05-08 7:36 ` Richard Stallman 2006-05-08 7:50 ` Kenichi Handa 1 sibling, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-08 7:36 UTC (permalink / raw) Cc: alkibiades, emacs-devel (set-language-environment "Latin-1") (byte-compile-file "~/test1.el") (set-language-environment "Japanese") (byte-compile-file "~/test2.el")) Although the source files are exactly the same, the resulting test1.elc contains a string of two Latin-1 characters whereas the test2.elc contains a string of single Japanese character. Is the difference due solely to the choice of coding system for decoding the file? That heuristic choice of coding system depends on lots of things, but Lisp files can prevent variation by specifying -*-coding-system:...;-*-. When the file does that, does that eliminate the problem? Anyway, is the specific problem I asked you to look at a matter of choice of coding system? (I don't know the details, since I don't know what that variable does--I just know it relates to Mule.) ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-08 7:36 ` Richard Stallman @ 2006-05-08 7:50 ` Kenichi Handa 0 siblings, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-08 7:50 UTC (permalink / raw) Cc: emacs-devel, alkibiades In article <E1Fd0IY-0007sF-KE@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > (set-language-environment "Latin-1") > (byte-compile-file "~/test1.el") > (set-language-environment "Japanese") > (byte-compile-file "~/test2.el")) > Although the source files are exactly the same, the > resulting test1.elc contains a string of two Latin-1 > characters whereas the test2.elc contains a string of single > Japanese character. > Is the difference due solely to the choice of coding system for > decoding the file? Yes, in the above example. > That heuristic choice of coding system depends on > lots of things, but Lisp files can prevent variation by specifying > -*-coding-system:...;-*-. > When the file does that, does that eliminate the problem? Yes, in the above example. > Anyway, is the specific problem I asked you to look at a matter of > choice of coding system? (I don't know the details, since I don't > know what that variable does--I just know it relates to Mule.) "A matter of choice of coding system" is just one of the problems. Even if a coding system is deterministically chosen, there are several options that control the decoding of utf-*. And, binding all of them to the default values while byte-compiling leads to another problem as I wrote in the previous mail. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
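The effect described in this exchange, where the same bytes become two Latin-1 characters or one Japanese character depending on the coding system chosen, can be reproduced in isolation. The sketch below is an assumption-laden illustration: the two bytes #xA4 #xA2 are picked for the example (the contents of the test files in the thread are not given), and the helper name is hypothetical.

```elisp
(defun example-decoded-length (bytes coding)
  "Return the number of characters obtained by decoding the unibyte
string BYTES with coding system CODING.  Hypothetical helper."
  (length (decode-coding-string bytes coding)))

;; The same two bytes, decoded two ways.
(example-decoded-length "\244\242" 'iso-latin-1) ; two Latin-1 characters
(example-decoded-length "\244\242" 'euc-jp)      ; one Japanese character
```

An explicit coding cookie in the file removes this particular source of variation, as confirmed above; it does not address the separate translation-table options that affect utf-* decoding.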
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-04 16:32 ` Eli Zaretskii 2006-05-04 20:55 ` Aidan Kehoe @ 2006-05-05 19:05 ` Richard Stallman 2006-05-05 21:43 ` Eli Zaretskii 1 sibling, 1 reply; 202+ messages in thread From: Richard Stallman @ 2006-05-05 19:05 UTC (permalink / raw) Cc: kehoea, emacs-devel, monnier, handa What happens when a Lisp file is byte-compiled--do we want the result to depend on the local settings? No, we want that to depend only on local variables of the file itself. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 19:05 ` Richard Stallman @ 2006-05-05 21:43 ` Eli Zaretskii 2006-05-06 14:25 ` Richard Stallman 0 siblings, 1 reply; 202+ messages in thread From: Eli Zaretskii @ 2006-05-05 21:43 UTC (permalink / raw) Cc: kehoea, emacs-devel, monnier, handa > From: Richard Stallman <rms@gnu.org> > CC: handa@m17n.org, kehoea@parhasard.net, monnier@iro.umontreal.ca, > emacs-devel@gnu.org > Date: Fri, 05 May 2006 15:05:30 -0400 > > What happens when a Lisp file is byte-compiled--do we want the result > to depend on the local settings? > > No, we want that to depend only on local variables of the file itself. So I think this means that if we eventually decide to use decode-char, we should at least explicitly bind utf-fragment-on-decoding so as to disable this translation. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-05 21:43 ` Eli Zaretskii @ 2006-05-06 14:25 ` Richard Stallman 0 siblings, 0 replies; 202+ messages in thread From: Richard Stallman @ 2006-05-06 14:25 UTC (permalink / raw) Cc: kehoea, handa, monnier, emacs-devel So I think this means that if we eventually decide to use decode-char, we should at least explicitly bind utf-fragment-on-decoding so as to disable this translation. Yes. The byte compiler should bind all variables that users are likely to change that would affect the way a file is turned into Lisp code. ^ permalink raw reply [flat|nested] 202+ messages in thread
* Re: [PATCH] Unicode Lisp reader escapes 2006-05-03 8:47 ` Eli Zaretskii 2006-05-03 14:21 ` Stefan Monnier @ 2006-05-04 1:26 ` Kenichi Handa 1 sibling, 0 replies; 202+ messages in thread From: Kenichi Handa @ 2006-05-04 1:26 UTC (permalink / raw) Cc: kehoea, emacs-devel In article <uy7xjcx5s.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes: >> > do we really want read_escape to produce Unicode or >> > non-Unicode characters when it sees \uNNNN, depending on >> > the current user settings? >> >> I think, at least, CJK characters should be decoded into one >> of CJK charsets because there's no other charsets. > Right, but what about Cyrillic and Greek? The merits and demerits of > depending on utf-fragment-on-decoding are not clear when the Lisp > reader is involved. I don't see any strong reason for not following utf-fragment-on-decoding in read_escape, leaving aside the question of the usefulness of this option. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 202+ messages in thread
[parent not found: <E1FaJ0b-0008G8-8u@monty-python.gnu.org>]
* Re: [PATCH] Unicode Lisp reader escapes
  [not found] <E1FaJ0b-0008G8-8u@monty-python.gnu.org>
@ 2006-04-30 21:16 ` Jonathan Yavner
  2006-05-01 18:32 ` Richard Stallman
  0 siblings, 1 reply; 202+ messages in thread
From: Jonathan Yavner @ 2006-04-30 21:16 UTC

Richard wrote:
> I don't like having both \u and \U--it is ugly.
> I think it would be better to put an explicit terminator into
> the construct. Perhaps #. So you would write "\u123#As I walked"

That would be nonstandard. Standards are better, even if ugly.

http://ftp.python.org/doc/ref/strings.html
http://www.gnu.org/software/coreutils/manual/html_chapter/coreutils_15.html
et cetera, et cetera
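The terminator proposal quoted above can be sketched in C as follows. This is only an illustration, not code from the patch or from lread.c: the function names are hypothetical, and the cap of eight digits is an assumption added to avoid overflow. It also assumes the reason a terminator is wanted is that, in "\u123As I walked", the letter 'A' is itself a hex digit, so a variable-length escape cannot tell where the number ends without one.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical reader for the proposed form: "\u", one to eight hex
   digits, then an explicit '#' terminator.
   On success stores the code point in *out and returns the number of
   characters consumed; returns 0 if the input does not match. */
size_t parse_terminated_escape(const char *s, unsigned long *out)
{
    size_t i = 2;
    unsigned long cp = 0;

    assert(s != NULL && out != NULL);
    if (s[0] != '\\' || s[1] != 'u')
        return 0;

    /* Consume hex digits greedily; the NUL terminator is not a hex
       digit, so the loop never reads past the end of the string. */
    while (hex_digit_value(s[i]) >= 0) {
        if (i - 2 >= 8)
            return 0;               /* assumed limit: at most 8 digits */
        cp = (cp << 4) | (unsigned long)hex_digit_value(s[i]);
        i++;
    }

    /* At least one digit is required, and the '#' must follow. */
    if (i == 2 || s[i] != '#')
        return 0;

    *out = cp;
    return i + 1;
}
```

With this sketch, "\u123#As I walked" yields code point 0x123 and the text resumes at "As I walked", whereas "\u123As I walked" is rejected because the greedy digit scan swallows the 'A' and then finds no '#'.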
* Re: [PATCH] Unicode Lisp reader escapes
  2006-04-30 21:16 ` Jonathan Yavner
@ 2006-05-01 18:32 ` Richard Stallman
  2006-05-02 5:03 ` Jonathan Yavner
  0 siblings, 1 reply; 202+ messages in thread
From: Richard Stallman @ 2006-05-01 18:32 UTC
Cc: emacs-devel

    > I think it would be better to put an explicit terminator into
    > the construct. Perhaps #. So you would write "\u123#As I walked"

    That would be nonstandard. Standards are better, even if ugly.

What standard are you talking about?
* Re: [PATCH] Unicode Lisp reader escapes
  2006-05-01 18:32 ` Richard Stallman
@ 2006-05-02 5:03 ` Jonathan Yavner
  0 siblings, 0 replies; 202+ messages in thread
From: Jonathan Yavner @ 2006-05-02 5:03 UTC
Cc: emacs-devel

RMS wrote:
> JYavner wrote:
>> Standards are better, even if ugly.
> What standard are you talking about?

ISO standard C99, also known as WG14/N1124:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

This document contains the text:
> 6.4.3 Universal character names
> Syntax
>   universal-character-name:
>     \u hex-quad
>     \U hex-quad hex-quad

This syntax is also used in Java, Perl, Python, etc. The main place
where it *doesn't* seem to work is gcc -- its "C99 status" document
says \u and \U are supposed to be working, but I get screwy results.
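As a rough illustration of the fixed-width grammar quoted above (\u followed by one hex-quad, \U followed by two), a reader for it might look like the C sketch below. The function names are hypothetical and this is not the read_escape code from the patch; it only shows that, because the digit count is fixed, no terminator is needed and any text after the escape is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: parse "\uXXXX" (exactly four hex digits) or
   "\UXXXXXXXX" (exactly eight) at the start of s.
   On success stores the code point in *out and returns the number of
   characters consumed (6 or 10); returns 0 on malformed input. */
size_t parse_ucn(const char *s, unsigned long *out)
{
    size_t digits, i;
    unsigned long cp = 0;

    assert(s != NULL && out != NULL);
    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        digits = 4;                 /* \u hex-quad */
    else if (s[1] == 'U')
        digits = 8;                 /* \U hex-quad hex-quad */
    else
        return 0;

    for (i = 0; i < digits; i++) {
        int v = hex_value(s[2 + i]);
        if (v < 0)
            return 0;               /* too few digits; also stops at NUL */
        cp = (cp << 4) | (unsigned long)v;
    }

    *out = cp;
    return 2 + digits;
}
```

For example, parse_ucn applied to "\u0448As I walked" consumes six characters and yields 0x0448, leaving "As I walked" for the caller even though 'A' is a hex digit.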