* [acm@muc.de: Re: Inadequate documentation of silly characters on screen.] @ 2009-11-18 19:12 Alan Mackenzie 2009-11-19 1:27 ` Fwd: Re: Inadequate documentation of silly characters on screen Stefan Monnier 0 siblings, 1 reply; 96+ messages in thread From: Alan Mackenzie @ 2009-11-18 19:12 UTC (permalink / raw) To: emacs-devel Hi, Emacs! This is the message I meant to CC: to emacs-devel. It looks serious. ----- Forwarded message from Alan Mackenzie <acm@muc.de> ----- Date: Wed, 18 Nov 2009 11:04:53 +0000 From: Alan Mackenzie <acm@muc.de> To: Miles Bader <miles@gnu.org> Subject: Re: Inadequate documentation of silly characters on screen. Hi, again, Miles! On Wed, Nov 18, 2009 at 06:40:53PM +0900, Miles Bader wrote: > Alan Mackenzie <acm@muc.de> writes: > > Once again, I'm getting silly characters on the screen. In *scratch*, > > where I've written "ñ", what gets displayed is "\361". It may have > > happened when I upgraded to Emacs 23. > Does it happen with "emacs -Q"? > How do you "write" ñ (do you use an input method? Type it on your keyboard...?)? Of the good and the bad representations, if I do "C-x =" on each, I get this: Char: ñ (241, #o361, #xf1, file #xF1) Char: \361 (4194289, #o17777761, #x3ffff1, raw-byte) This sequence reproduces the bug: M-: (setq nl "\n") M-: (aset nl 0 ?ñ) M-: (insert nl) So it looks a bit like the `aset' invocation is doing damage, by doing sign extension rather than zero filling. > Do you use X emacs, emacs in a tty, etc.? If tty emacs, which type of > terminal do you use? Linux tty. > -Miles -- Alan Mackenzie (Nuremberg, Germany). ----- End forwarded message ----- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-18 19:12 [acm@muc.de: Re: Inadequate documentation of silly characters on screen.] Alan Mackenzie @ 2009-11-19 1:27 ` Stefan Monnier 2009-11-19 8:20 ` Alan Mackenzie 0 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 1:27 UTC (permalink / raw) To: Alan Mackenzie; +Cc: emacs-devel > This is the message I meant to CC: to emacs-devel. It looks serious. The integer 241 is used to represent the char ?ñ, but it's also used for many other things, one of them being to represent the byte 241 (tho such a byte can also be represented as the integer 4194289). Now strings come in two flavors: multibyte (i.e. sequences of chars) and unibyte (i.e. sequences of bytes). So when you do: M-: (setq nl "\n") M-: (aset nl 0 ?ñ) M-: (insert nl) The `aset' part may do two different things depending on whether `nl' is unibyte or multibyte: it will either insert the char ?ñ or the byte 241. In the above code the "\n" is taken as a unibyte string, tho I'm not sure why we made this arbitrary choice. If you give us more context (i.e. more of the real code where the problem shows up), maybe we can tell you how to avoid it. Usually, I recommend staying away from `aset' on strings for various reasons, and it seems that it also helps avoid those tricky issues (tho it doesn't protect you from them completely). Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 1:27 ` Fwd: Re: Inadequate documentation of silly characters on screen Stefan Monnier @ 2009-11-19 8:20 ` Alan Mackenzie 2009-11-19 8:50 ` Miles Bader ` (2 more replies) 0 siblings, 3 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 8:20 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel Morning, Stefan! On Wed, Nov 18, 2009 at 08:27:24PM -0500, Stefan Monnier wrote: > The integer 241 is used to represent the char ?ñ, but it's also used for > many other things, one of them being to represent the byte 241 (tho such > a byte can also be represented as the integer 4194289). > Now strings come in two flavors: multibyte (i.e. sequences of chars) and > unibyte (i.e. sequences of bytes). So when you do: > M-: (setq nl "\n") > M-: (aset nl 0 ?ñ) > M-: (insert nl) > The `aset' part may do two different things depending on whether `nl' is > unibyte or multibyte: it will either insert the char ?ñ or the byte 241. > In the above code the "\n" is taken as a unibyte string, tho I'm not > sure why we made this arbitrary choice. The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets displayed - when I do M-: (aset nl 0 ?ñ), I get "2289 (#o4361, #x8f1)" (Emacs 22.3) "241 (#o361, #xf1)" (Emacs 23.1) displayed in the echo area. So my `aset' invocation is trying to write a multibyte ?ñ into a unibyte ?\n, and gets truncated from #x8f1 to #xf1 in the process. Surely this behaviour in Emacs 23.1 is a bug? Shouldn't we fix it before the pretest? How about interpreting "\n" and friends as multibyte or unibyte according to the prevailing flavour? > If you give us more context (i.e. more of the real code where the > problem shows up), maybe we can tell you how to avoid it. OK. I have my own routine to display regexps. As a first step, I translate \n -> ñ (and \t, \r, \f similarly). This is how: (defun translate-rnt (regexp) "REGEXP is a string. 
Translate any \t \n \r and \f characters to weird non-ASCII printable characters: \t to Î (206, \xCE), \n to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3). The original string is modified." (let (ch pos) (while (setq pos (string-match "[\t\n\r\f]" regexp)) (setq ch (aref regexp pos)) (aset regexp pos ; <=================== (cond ((eq ch ?\t) ?Î) ((eq ch ?\n) ?ñ) ((eq ch ?\r) ?®) (t ?£)))) regexp)) > Usually, I recommend staying away from `aset' on strings for various > reasons, and it seems that it also helps avoid those tricky issues (tho > it doesn't protect you from them completely). Again, surely this is a bug? These tricky issues should be dealt with in the lisp interpreter in a way that lisp hackers don't have to worry about. Why do we have both unibyte and multibyte? Is there any reason not to remove unibyte altogether (though obviously not for 23.2). What was the change between 22.3 and 23.1 that broke my code? Would it, perhaps, be a good idea to reconsider that change? > Stefan -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-19 8:20 ` Alan Mackenzie @ 2009-11-19 8:50 ` Miles Bader 2009-11-19 10:16 ` Fwd: " Andreas Schwab 2009-11-19 14:08 ` Stefan Monnier 2 siblings, 0 replies; 96+ messages in thread From: Miles Bader @ 2009-11-19 8:50 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Stefan Monnier, emacs-devel Alan Mackenzie <acm@muc.de> writes: > Why do we have both unibyte and multibyte? Is there any reason > not to remove unibyte altogether (though obviously not for 23.2). For certain rare cases, it's useful for efficiency reasons, but maybe it should never be the default. -Miles -- Opposition, n. In politics the party that prevents the Government from running amok by hamstringing it. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 8:20 ` Alan Mackenzie 2009-11-19 8:50 ` Miles Bader @ 2009-11-19 10:16 ` Andreas Schwab 2009-11-19 12:21 ` Alan Mackenzie 2009-11-19 13:21 ` Jason Rumney 2 siblings, 2 replies; 96+ messages in thread From: Andreas Schwab @ 2009-11-19 10:16 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Stefan Monnier, emacs-devel Alan Mackenzie <acm@muc.de> writes: > So my `aset' invocation is trying to write a multibyte ?ñ into a > unibyte ?\n, and gets truncated from #x8f1 to #xf1 in the process. Nothing gets truncated. In Emacs 23 ?ñ is simply the number 241, whereas in Emacs 22 it is the number 2289. You can put 2289 in a string in Emacs 23, but there is no defined unicode character with that value. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 10:16 ` Fwd: " Andreas Schwab @ 2009-11-19 12:21 ` Alan Mackenzie 2009-11-19 13:21 ` Jason Rumney 1 sibling, 0 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 12:21 UTC (permalink / raw) To: Andreas Schwab; +Cc: Stefan Monnier, emacs-devel Hi, Andreas, On Thu, Nov 19, 2009 at 11:16:03AM +0100, Andreas Schwab wrote: > Alan Mackenzie <acm@muc.de> writes: > > So my `aset' invocation is trying to write a multibyte ?ñ into a > > unibyte ?\n, and gets truncated from #x8f1 to #xf1 in the process. > Nothing gets truncated. In Emacs 23 ?ñ is simply the number 241, > whereas in Emacs 22 it is the number 2289. You can put 2289 in a string > in Emacs 23, but there is no defined unicode character with that value. Ah, thanks! So when I do M-: (setq nl "\n") M-: (aset nl 0 ?ñ) M-: (insert nl) , after the `aset', the string nl correctly contains one character, which is the single byte #xf1. The bug happens in `insert', where something is interpreting the byte #xf1 as the signed integer #xfffff.....ffff1. Delving into the bowels of Emacs, I find this in character.h: #define STRING_CHAR_AND_LENGTH(p, len, actual_len) \ (!((p)[0] & 0x80) \ ? ((actual_len) = 1, (p)[0]) \ : ! ((p)[0] & 0x20) \ ? ((actual_len) = 2, \ (((((p)[0] & 0x1F) << 6) \ | ((p)[1] & 0x3F)) \ + (((unsigned char) (p)[0]) < 0xC2 ? 0x3FFF80 : 0))) \ : ! ((p)[0] & 0x10) \ ? ((actual_len) = 3, \ ((((p)[0] & 0x0F) << 12) \ | (((p)[1] & 0x3F) << 6) \ | ((p)[2] & 0x3F))) \ : string_char ((p), NULL, &actual_len)) #xf1 drops through all this nonsense to string_char (in character.c). It drops through to this case: else if (! (*p & 0x08)) { c = ((((p)[0] & 0xF) << 18) | (((p)[1] & 0x3F) << 12) | (((p)[2] & 0x3F) << 6) | ((p)[3] & 0x3F)); p += 4; } , where it obviously becomes silly. At least, I think that's where it ends up. This isn't the most maintainable piece of code in Emacs. 
So, if ISO-8859-1 characters are now represented as single bytes in Emacs, what test for multibyticity should STRING_CHAR_AND_LENGTH be using? > Andreas. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 10:16 ` Fwd: " Andreas Schwab 2009-11-19 12:21 ` Alan Mackenzie @ 2009-11-19 13:21 ` Jason Rumney 2009-11-19 13:35 ` Stefan Monnier 2009-11-19 14:18 ` Alan Mackenzie 1 sibling, 2 replies; 96+ messages in thread From: Jason Rumney @ 2009-11-19 13:21 UTC (permalink / raw) To: Andreas Schwab; +Cc: Alan Mackenzie, Stefan Monnier, emacs-devel Andreas Schwab <schwab@linux-m68k.org> writes: > Nothing gets truncated. In Emacs 23 ?ñ is simply the number 241, > whereas in Emacs 22 is it the number 2289. You can put 2289 in a string > in Emacs 23, but there is no defined unicode character with that value. The bug here is likely that setting a character in a unibyte string to a value between 160 and 255 does not result in an automatic conversion to multibyte. That was correct in 22.3, since values in that range were raw binary bytes outside of any character set, but in 23.1 they correspond to valid Latin-1 codepoints. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 13:21 ` Jason Rumney @ 2009-11-19 13:35 ` Stefan Monnier 2009-11-19 14:18 ` Alan Mackenzie 1 sibling, 0 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 13:35 UTC (permalink / raw) To: Jason Rumney; +Cc: Alan Mackenzie, Andreas Schwab, emacs-devel >> Nothing gets truncated. In Emacs 23 ?ñ is simply the number 241, >> whereas in Emacs 22 is it the number 2289. You can put 2289 in a string >> in Emacs 23, but there is no defined unicode character with that value. > The bug here is likely that setting a character in a unibyte string to a > value between 160 and 255 does not result in an automatic conversion to > multibyte. That was correct in 22.3, since values in that range were > raw binary bytes outside of any character set, but in 23.1 they correspond > to valid Latin-1 codepoints. If you think of unibyte strings as sequences of bytes, it makes perfect sense to not automatically convert them to multibyte strings, since a sequence of bytes cannot hold the character ñ, only the byte 241. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 13:21 ` Jason Rumney 2009-11-19 13:35 ` Stefan Monnier @ 2009-11-19 14:18 ` Alan Mackenzie 2009-11-19 14:58 ` Jason Rumney 2009-11-19 15:30 ` Stefan Monnier 1 sibling, 2 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 14:18 UTC (permalink / raw) To: Jason Rumney; +Cc: Andreas Schwab, Stefan Monnier, emacs-devel On Thu, Nov 19, 2009 at 09:21:41PM +0800, Jason Rumney wrote: > Andreas Schwab <schwab@linux-m68k.org> writes: > > Nothing gets truncated. In Emacs 23 ?ñ is simply the number 241, > > whereas in Emacs 22 it is the number 2289. You can put 2289 in a > > string in Emacs 23, but there is no defined unicode character with > > that value. > The bug here is likely that setting a character in a unibyte string to > a value between 160 and 255 does not result in an automatic conversion > to multibyte. That was correct in 22.3, since values in that range > were raw binary bytes outside of any character set, but in 23.1 they > correspond to valid Latin-1 codepoints. Putting point over the \361 and doing C-x = shows the character is Char: \361 (4194289, #o17777761, #x3ffff1, raw-byte) The actual character in the string is ñ (#xf1). Going through all the motions, here is what I think is happening: the \361 is put there by `insert'. insert calls general_insert_function, calls insert_from_string (via a function pointer), calls insert_from_string_1, calls copy_text. At this stage, I'm assuming to_multibyte (the screen buffer, in some form) is TRUE, and from_multibyte (a string holding the single character #xf1) is FALSE. We thus execute this code in copy_text: else { unsigned char *initial_to_addr = to_addr; /* Convert single-byte to multibyte. */ while (nbytes > 0) { int c = *from_addr++; <============================== if (c >= 0200) { c = unibyte_char_to_multibyte (c); to_addr += CHAR_STRING (c, to_addr); nbytes--; } else /* Special case for speed. 
*/ *to_addr++ = c, nbytes--; } return to_addr - initial_to_addr; } At the indicated line, c is a SIGNED integer, therefore will get the value 0xfffffff1, not 0xf1. copy_text then invokes the macro unibyte_char_to_multibyte (-15), at which point there's no point going any further. At least, that's my guess as to what's happening. A fix would be to change the declaration of "int c" to "unsigned int c". I'm going to try that now. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 14:18 ` Alan Mackenzie @ 2009-11-19 14:58 ` Jason Rumney 2009-11-19 15:42 ` Alan Mackenzie 2009-11-19 15:30 ` Stefan Monnier 1 sibling, 1 reply; 96+ messages in thread From: Jason Rumney @ 2009-11-19 14:58 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Andreas Schwab, Stefan Monnier, emacs-devel Alan Mackenzie <acm@muc.de> writes: > At the indicated line, c is a SIGNED integer, therefore will get > the value 0xfffffff1, not 0xf1. Surely 0xf1 is the same, regardless of whether the integer is signed or unsigned. Since \361 == \xf1, I don't think this is a bug where the value is accidentally being corrupted, but one where the character is deliberately being assigned to its corresponding raw-byte codepoint. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 14:58 ` Jason Rumney @ 2009-11-19 15:42 ` Alan Mackenzie 2009-11-19 19:39 ` Eli Zaretskii 0 siblings, 1 reply; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 15:42 UTC (permalink / raw) To: Jason Rumney; +Cc: Andreas Schwab, Stefan Monnier, emacs-devel On Thu, Nov 19, 2009 at 10:58:36PM +0800, Jason Rumney wrote: > Alan Mackenzie <acm@muc.de> writes: > > At the indicated line, c is a SIGNED integer, therefore will get > > the value 0xfffffff1, not 0xf1. > Surely 0xf1 is the same, regardless of whether the integer is signed > or unsigned. Yes it is. Sorry - I just tried it out. It depends only on the signedness of the char on the RHS of the assignment. Nevertheless, I think the bug is caused by something along these lines. > Since \361 == \xf1, I don't think this is a bug where the value is > accidentally being corrupted, but one where the character is > deliberately being assigned to its corresponding raw-byte codepoint. It's getting the value -15, at least to 23 places of ones-complement. In the sequence (aset nl 0 ?ñ) (insert nl) , the character that comes out isn't the one that went in. That is a bug. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:42 ` Alan Mackenzie @ 2009-11-19 19:39 ` Eli Zaretskii 0 siblings, 0 replies; 96+ messages in thread From: Eli Zaretskii @ 2009-11-19 19:39 UTC (permalink / raw) To: Alan Mackenzie; +Cc: emacs-devel, schwab, monnier, jasonr > Date: Thu, 19 Nov 2009 15:42:31 +0000 > From: Alan Mackenzie <acm@muc.de> > Cc: Andreas Schwab <schwab@linux-m68k.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org > > In the sequence > > (aset nl 0 ?ñ) > (insert nl) > > , the character that comes out isn't the one that went in. That is a > bug. No, it isn't. You inserted 241 and got it back. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 14:18 ` Alan Mackenzie 2009-11-19 14:58 ` Jason Rumney @ 2009-11-19 15:30 ` Stefan Monnier 2009-11-19 15:58 ` Alan Mackenzie 1 sibling, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 15:30 UTC (permalink / raw) To: Alan Mackenzie; +Cc: emacs-devel, Andreas Schwab, Jason Rumney > The actual character in the string is ñ (#xf1). No: the string does not contain any characters, only bytes, because it's a unibyte string. So it contains the byte 241, not the character ñ. The byte 241 can be inserted in multibyte strings and buffers because it is also a char of code 4194289 (which gets displayed as \361). Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:30 ` Stefan Monnier @ 2009-11-19 15:58 ` Alan Mackenzie 2009-11-19 16:06 ` Andreas Schwab ` (4 more replies) 0 siblings, 5 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 15:58 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel, Andreas Schwab, Jason Rumney Hi, Stefan, On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote: > > The actual character in the string is ñ (#xf1). > No: the string does not contain any characters, only bytes, because it's > a unibyte string. I'm thinking from the lisp viewpoint. The string is a data structure which contains characters. I really don't want to have to think about the difference between "chars" and "bytes" when I'm hacking lisp. If I do, then the abstraction "string" is broken. > So it contains the byte 241, not the character ñ. That is then a bug. I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)". > The byte 241 can be inserted in multibyte strings and buffers because > it is also a char of code 4194289 (which gets displayed as \361). Hang on a mo'! How can the byte 241 "be" a char of code 4194289? This is some strange usage of the word "be" that I wasn't previously aware of. ;-) At this point, would you please just agree with me that when I do (setq nl "\n") (aset nl 0 ?ñ) (insert nl) , what should appear on the screen should be "ñ", NOT "\361"? Thanks! > Stefan -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:58 ` Alan Mackenzie @ 2009-11-19 16:06 ` Andreas Schwab 2009-11-19 16:47 ` Aidan Kehoe ` (3 subsequent siblings) 4 siblings, 0 replies; 96+ messages in thread From: Andreas Schwab @ 2009-11-19 16:06 UTC (permalink / raw) To: Alan Mackenzie; +Cc: emacs-devel, Stefan Monnier, Jason Rumney Alan Mackenzie <acm@muc.de> writes: > I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)". Those expressions are entirely identical, indistinguishable. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:58 ` Alan Mackenzie 2009-11-19 16:06 ` Andreas Schwab @ 2009-11-19 16:47 ` Aidan Kehoe 2009-11-19 17:29 ` Alan Mackenzie ` (2 more replies) 2009-11-19 16:55 ` David Kastrup ` (2 subsequent siblings) 4 siblings, 3 replies; 96+ messages in thread From: Aidan Kehoe @ 2009-11-19 16:47 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Jason Rumney, Andreas Schwab, Stefan Monnier, emacs-devel Ar an naoú lá déag de mí na Samhain, scríobh Alan Mackenzie: > Hi, Stefan, > > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote: > > > The actual character in the string is ñ (#xf1). > > > No: the string does not contain any characters, only bytes, because it's > > a unibyte string. > > I'm thinking from the lisp viewpoint. The string is a data structure > I really don't want to have to think about > the difference between "chars" and "bytes" when I'm hacking lisp. If I > do, then the abstraction "string" is broken. For some context on this, that’s how it works in XEmacs; we’ve never had problems with it; we seem to avoid an entire class of programming errors that GNU Emacs developers deal with on a regular basis. Tangentially, for those that like the unibyte/multibyte distinction, to my knowledge the editor does not have any way of representing “an octet with numeric value < #x7f to be treated with byte semantics, not character semantics”, which seems arbitrary to me. 
For example: ;; Both the decoded sequences are illegal in UTF-16: (split-char (car (append (decode-coding-string "\xd8\x00\x00\x7f" 'utf-16-be) nil))) => (ascii 127) (split-char (car (append (decode-coding-string "\xd8\x00\x00\x80" 'utf-16-be) nil))) => (eight-bit-control 128) -- “Apart from the nine-banded armadillo, man is the only natural host of Mycobacterium leprae, although it can be grown in the footpads of mice.” -- Kumar & Clark, Clinical Medicine, summarising improbable leprosy research ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 16:47 ` Aidan Kehoe @ 2009-11-19 17:29 ` Alan Mackenzie 2009-11-19 18:21 ` Aidan Kehoe 2009-11-20 2:43 ` Stephen J. Turnbull 1 sibling, 2 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 17:29 UTC (permalink / raw) To: Aidan Kehoe; +Cc: Jason Rumney, Andreas Schwab, Stefan Monnier, emacs-devel On Thu, Nov 19, 2009 at 04:47:09PM +0000, Aidan Kehoe wrote: > Ar an naoú lá déag de mí na Samhain, scríobh Alan Mackenzie: > > Hi, Stefan, > > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote: > > > > The actual character in the string is ñ (#xf1). > > > No: the string does not contain any characters, only bytes, because it's > > > a unibyte string. > > I'm thinking from the lisp viewpoint. The string is a data structure > > I really don't want to have to think about > > the difference between "chars" and "bytes" when I'm hacking lisp. If I > > do, then the abstraction "string" is broken. > For some context on this, that’s how it works in XEmacs; we’ve never had > problems with it, we seem to avoid an entire class of programming errors > that GNU Emacs developers deal with on a regular basis. In XEmacs, characters and integers are distinct types. That causes extra work having to convert between them, both mentally and in writing code. It is not that the GNU Emacs way is wrong, it just has a bug at the moment. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 17:29 ` Alan Mackenzie @ 2009-11-19 18:21 ` Aidan Kehoe 2009-11-20 2:43 ` Stephen J. Turnbull 1 sibling, 0 replies; 96+ messages in thread From: Aidan Kehoe @ 2009-11-19 18:21 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Jason Rumney, Andreas Schwab, Stefan Monnier, emacs-devel Ar an naoú lá déag de mí na Samhain, scríobh Alan Mackenzie: > On Thu, Nov 19, 2009 at 04:47:09PM +0000, Aidan Kehoe wrote: > > > Ar an naoú lá déag de mí na Samhain, scríobh Alan Mackenzie: > > > > Hi, Stefan, > > > > [...] I really don't want to have to think about the difference > > > between "chars" and "bytes" when I'm hacking lisp. If I do, then the > > > abstraction "string" is broken. > > > For some context on this, that’s how it works in XEmacs; we’ve > > never had problems with it, we seem to avoid an entire class of > > programming errors that GNU Emacs developers deal with on a regular > > basis. > > In XEmacs, characters and integers are distinct types. That causes > extra work having to convert between them, both mentally and in writing > code. Certainly--that’s orthogonal to the issue at hand, though, it involves some of the same things but is distinct. XEmacs could have implemented the unibyte-string/multibyte-string Lisp distinction and kept the type distinction between characters and integers; we didn’t, though. (Or maybe it was just that the Mule version that we based our code on didn’t have it.) > It is not that the GNU Emacs way is wrong, it just has a bug at > the moment. As far as I can see it’s an old design decision. -- “Apart from the nine-banded armadillo, man is the only natural host of Mycobacterium leprae, although it can be grown in the footpads of mice.” -- Kumar & Clark, Clinical Medicine, summarising improbable leprosy research ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 17:29 ` Alan Mackenzie 2009-11-19 18:21 ` Aidan Kehoe @ 2009-11-20 2:43 ` Stephen J. Turnbull 1 sibling, 0 replies; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-20 2:43 UTC (permalink / raw) To: Alan Mackenzie Cc: Aidan Kehoe, emacs-devel, Andreas Schwab, Stefan Monnier, Jason Rumney Alan Mackenzie writes: > In XEmacs, characters and integers are distinct types. That causes > extra work having to convert between them, both mentally and in writing > code. Why do you have to convert? The only time you need to worry about the integer values of characters is (1) when implementing a coding system and (2) when dealing with control characters which do not have consistent names or graphic representations (mostly the C1 set, but there are areas in C0 as well -- quick, what's the name of \034?) When do you need to do either? > It is not that the GNU Emacs way is wrong, it just has a bug at the > moment. I agree that equating the character type to the integer type is not "wrong". It's a tradeoff which we make differently from Emacs: Emacs prefers code that is shorter and easier to write, XEmacs prefers code that may be longer (ie, uses explicit conversions where necessary) but is easier to debug because it signals errors earlier (ie, when a function receives an object of the wrong type rather than when a user observes incorrect display). However, I think that allowing a given array of bytes to change type from unibyte to multibyte and back is just insane. Either the types should be different and immutable (as in Python) or there should be only one representation (multibyte) as in XEmacs. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 16:47 ` Aidan Kehoe 2009-11-19 17:29 ` Alan Mackenzie @ 2009-11-19 19:45 ` Eli Zaretskii 2009-11-19 20:07 ` Eli Zaretskii 2009-11-19 19:55 ` Stefan Monnier 2 siblings, 1 reply; 96+ messages in thread From: Eli Zaretskii @ 2009-11-19 19:45 UTC (permalink / raw) To: Aidan Kehoe; +Cc: acm, emacs-devel, schwab, monnier, jasonr > From: Aidan Kehoe <kehoea@parhasard.net> > Date: Thu, 19 Nov 2009 16:47:09 +0000 > Cc: Jason Rumney <jasonr@gnu.org>, Andreas Schwab <schwab@linux-m68k.org>, > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org > > Tangentally, for those that like the unibyte/multibyte distinction, to my > knowledge the editor does not have any way of representing “an octet with > numeric value < #x7f to be treated with byte semantics, not character > semantics” Emacs 23 does have a way of representing raw bytes, and it distinguishes between them and characters. See the ELisp manual (I mentioned the node earlier in this thread). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 19:45 ` Eli Zaretskii @ 2009-11-19 20:07 ` Eli Zaretskii 0 siblings, 0 replies; 96+ messages in thread From: Eli Zaretskii @ 2009-11-19 20:07 UTC (permalink / raw) To: kehoea, acm, emacs-devel, schwab, monnier, jasonr > Date: Thu, 19 Nov 2009 21:45:02 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: acm@muc.de, emacs-devel@gnu.org, schwab@linux-m68k.org, > monnier@iro.umontreal.ca, jasonr@gnu.org > > > From: Aidan Kehoe <kehoea@parhasard.net> > > Date: Thu, 19 Nov 2009 16:47:09 +0000 > > Cc: Jason Rumney <jasonr@gnu.org>, Andreas Schwab <schwab@linux-m68k.org>, > > Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org > > > > Tangentally, for those that like the unibyte/multibyte distinction, to my > > knowledge the editor does not have any way of representing “an octet with > > numeric value < #x7f to be treated with byte semantics, not character > > semantics” > > Emacs 23 does have a way of representing raw bytes, and it > distinguishes between them and characters. See the ELisp manual (I > mentioned the node earlier in this thread). This is of course true, but for bytes > #x7f, not < #x7f. Sorry. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 16:47 ` Aidan Kehoe 2009-11-19 17:29 ` Alan Mackenzie 2009-11-19 19:45 ` Eli Zaretskii @ 2009-11-19 19:55 ` Stefan Monnier 2009-11-20 3:13 ` Stephen J. Turnbull 2 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 19:55 UTC (permalink / raw) To: Aidan Kehoe; +Cc: Alan Mackenzie, Jason Rumney, Andreas Schwab, emacs-devel >> I'm thinking from the lisp viewpoint. The string is a data structure >> I really don't want to have to think about >> the difference between "chars" and "bytes" when I'm hacking lisp. If I >> do, then the abstraction "string" is broken. > For some context on this, that’s how it works in XEmacs; we’ve never had > problems with it, we seem to avoid an entire class of programming errors > that GNU Emacs developers deal with on a regular basis. Indeed XEmacs does not represent chars as integers, and that can eliminate several sources of problems. Note that this problem is new in Emacs-23, since in Emacs-22 (and in XEmacs, IIUC), there was no character whose integer value was between 127 and 256, so there was no ambiguity. AFAIK most of the programming errors we've had to deal with over the years (i.e. in Emacs-20, 21, 22) had to do with incorrect (or missing) encoding/decoding and most of those errors existed just as much on XEmacs because there's no way to fix them right in the infrastructure code (tho XEmacs may have managed to hide them better by detecting the lack of encoding/decoding and guessing an appropriate coding-system instead). > Tangentally, for those that like the unibyte/multibyte distinction, to my > knowledge the editor does not have any way of representing “an octet with > numeric value < #x7f to be treated with byte semantics, not character > semantics”, which seems arbitrary to me. For example: Indeed. 
It hasn't bitten us hard yet, mostly because (luckily) there are very few coding systems which use chars 0-127 in ways incompatible with ASCII. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 19:55 ` Stefan Monnier @ 2009-11-20 3:13 ` Stephen J. Turnbull 0 siblings, 0 replies; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-20 3:13 UTC (permalink / raw) To: Stefan Monnier Cc: Aidan Kehoe, Alan Mackenzie, emacs-devel, Andreas Schwab, Jason Rumney Stefan Monnier writes: > Indeed XEmacs does not represent chars as integers, and that can > eliminate several sources of problems. Note that this problem is new in > Emacs-23, since in Emacs-22 (and in XEmacs, IIUC), there was no > character whose integer value was between 127 and 256, so there was no > ambiguity. In XEmacs: (char-int-p 241) => t (int-char 241) => ?ñ No problems with this that I can recall, except a few people with code that did (set-face-font 'default "-*-*-*-*-*-*-*-*-*-*-*-*-iso8859-2") and expected `(insert (int-char 241))' to display `ń' instead of `ñ'. (For the non-Mule-implementers, this hack works without Mule but won't work in Mule because Mule matches those two trailing fields to the character's charset, and 241 corresponds to a Latin-1 character, so a "-*-*-*-*-*-*-*-*-*-*-*-*-iso8859-1" font from the set associated with the default face will be used.) For this reason, using char-int and int-char in XEmacs is generally a bug unless you want to examine the internal coding system; you almost always want to use make-char. (Of course for ASCII values it's an accepted idiom, but still a bad habit.) > AFAIK most of the programming errors we've had to deal with over the > years (i.e.
in Emacs-20, 21, 22) had to do with incorrect (or missing) > encoding/decoding and most of those errors existed just as much on > XEmacs I don't think that's true; AFAIK we have *no* recorded instances of the \201 bug, while that regression persisted in GNU Emacs (albeit a patched version, at first) from at the latest 1992 until just a few years ago. I think it got fixed in Mule (ie, all paths into or out of a text object got a coding stage) before that was integrated into XEmacs or Emacs, and the regression when Mule was integrated into Emacs was caused by the performance hack, "text object as unibyte". > because there's no way to fix them right in the infrastructure code > (tho XEmacs may have managed to hide them better by detecting the > lack of encoding/decoding and guessing an appropriate coding-system > instead). I don't know of any such guessing. When the user asks us to, we guess on input, just as you do, but once we've got text in internal format, there is no more guessing to be done. Emacs will encounter the need to guess because you support "text object as unibyte". Vive la difference technical! ;-) ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:58 ` Alan Mackenzie 2009-11-19 16:06 ` Andreas Schwab 2009-11-19 16:47 ` Aidan Kehoe @ 2009-11-19 16:55 ` David Kastrup 2009-11-19 18:08 ` Alan Mackenzie 2009-11-19 19:43 ` Eli Zaretskii 2009-11-19 20:02 ` Stefan Monnier 4 siblings, 1 reply; 96+ messages in thread From: David Kastrup @ 2009-11-19 16:55 UTC (permalink / raw) To: emacs-devel Alan Mackenzie <acm@muc.de> writes: > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote: >> > The actual character in the string is ñ (#x3f). > >> No: the string does not contain any characters, only bytes, because >> it's a unibyte string. > > I'm thinking from the lisp viewpoint. The string is a data structure > which contains characters. I really don't want to have to think about > the difference between "chars" and "bytes" when I'm hacking lisp. If > I do, then the abstraction "string" is broken. > >> So it contains the byte 241, not the character ñ. > > That is then a bug. I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)". Huh? ?ñ is the Emacs code point of ñ. Which is pretty much identical to the Unicode code point in Emacs 23. >> The byte 241 can be inserted in multibyte strings and buffers because >> it is also a char of code 4194289 (which gets displayed as \361). > > Hang on a mo'! How can the byte 241 "be" a char of code 4194289? > This is some strange usage of the word "be" that I wasn't previously > aware of. ;-) Emacs encodes most of its things in utf-8. A Unicode code point is an integer. You can encode it in different encodings, resulting in different byte streams. Inside of a byte stream encoded in utf-8, the isolated byte 241 does not correspond to a Unicode character. It is not valid utf-8. When Emacs reads a file supposedly in utf-8, it wants to represent _all_ possible byte streams in order to be able to save unchanged data unmolested. 
So it encodes the entity "illegal isolated byte 241 in an utf-8 document" with the character code 4194289 which has a representation in Emacs' internal variant of utf-8, but is outside of the range of Unicode. > At this point, would you please just agree with me that when I do > > (setq nl "\n") > (aset nl 0 ?ñ) > (insert nl) > > , what should appear on the screen should be "ñ", NOT "\361"? Thanks! You assume that ?ñ is a character. But in Emacs, it is an integer, a Unicode code point in Emacs 23. As long as there is something like a unibyte string, there is no way to distinguish the character 241 and the byte 241 except when Emacs is told explicitly. Because Emacs has no separate "character" data type. -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
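[The unibyte/multibyte split described in this message can be checked directly in *scratch* (Emacs 23 or later); a minimal sketch of both behaviours, using `copy-sequence' so that `aset' modifies a fresh string rather than a literal:]

```elisp
;; Unibyte case: "\n" contains only ASCII, so the literal is a unibyte
;; string, and `aset' stores the *byte* 241.  Inserting it into a
;; multibyte buffer promotes that stray byte to the raw-byte
;; character 4194289, which is displayed as \361.
(setq nl (copy-sequence "\n"))
(aset nl 0 ?ñ)
(multibyte-string-p nl)        ; => nil

;; Multibyte case: force the string to multibyte first, and `aset'
;; stores the *character* ?ñ, so (insert nl) displays "ñ".
(setq nl (string-to-multibyte (copy-sequence "\n")))
(aset nl 0 ?ñ)
(multibyte-string-p nl)        ; => t
```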
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 16:55 ` David Kastrup @ 2009-11-19 18:08 ` Alan Mackenzie 2009-11-19 19:25 ` Davis Herring ` (2 more replies) 0 siblings, 3 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 18:08 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel Hi, David! On Thu, Nov 19, 2009 at 05:55:10PM +0100, David Kastrup wrote: > Alan Mackenzie <acm@muc.de> writes: > > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote: > >> > The actual character in the string is ñ (#x3f). > >> No: the string does not contain any characters, only bytes, because > >> it's a unibyte string. > > I'm thinking from the lisp viewpoint. The string is a data > > structure which contains characters. I really don't want to have to > > think about the difference between "chars" and "bytes" when I'm > > hacking lisp. If I do, then the abstraction "string" is broken. > >> So it contains the byte 241, not the character ñ. > > That is then a bug. I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)". > Huh? ?ñ is the Emacs code point of ñ. Which is pretty much identical > to the Unicode code point in Emacs 23. No, you (all of you) are missing the point. That point is that if an Emacs Lisp hacker writes "?ñ", it should work, regardless of what "codepoint" it has, what "bytes" represent it, whether those "bytes" are coded with a different codepoint, or what have you. All of that stuff is uninteresting. If it gets interesting, like now, it is because it is buggy. > >> The byte 241 can be inserted in multibyte strings and buffers > >> because it is also a char of code 4194289 (which gets displayed as > >> \361). OK. Surely displaying it as "\361" is a bug? Should it not display as "\17777761". If it did, it would have saved half of my ranting. > > Hang on a mo'! How can the byte 241 "be" a char of code 4194289? > > This is some strange usage of the word "be" that I wasn't previously > > aware of. 
;-) > Emacs encodes most of its things in utf-8. A Unicode code point is an > integer. You can encode it in different encodings, resulting in > different byte streams. Inside of a byte stream encoded in utf-8, the > isolated byte 241 does not correspond to a Unicode character. It is not > valid utf-8. When Emacs reads a file supposedly in utf-8, it wants to > represent _all_ possible byte streams in order to be able to save > unchanged data unmolested. That's a good explanation - it's sort of like < in html. Thanks. > So it encodes the entity "illegal isolated byte 241 in an utf-8 > document" with the character code 4194289 which has a representation in > Emacs' internal variant of utf-8, but is outside of the range of > Unicode. So, how did the character "ñ" get turned into the illegal byte #xf1? Is that the bug? > > At this point, would you please just agree with me that when I do > > (setq nl "\n") > > (aset nl 0 ?ñ) > > (insert nl) > > , what should appear on the screen should be "ñ", NOT "\361"? Thanks! > You assume that ?ñ is a character. I do indeed. It is self-evident. Now, would you too please just agree that when I execute the three forms above, "ñ" should appear? The identical argument applies to "ä". They are characters used in writing weird European languages like Spanish and German. Emacs should not have difficulty with them. It is a standard Emacs idiom that ?x (or ?\x) is the integer representing the character x. Indeed (unlike in XEmacs), characters ARE integers. Why does this not work for, e.g., ISO-8859-1? > But in Emacs, it is an integer, a Unicode code point in Emacs 23. That sounds like the sort of argument one might read on gnu-misc-discuss. ;-) Sorry. Are you saying that Emacs is converting "?ñ" and "?ä" into the wrong integers? > As long as there is something like a unibyte string, there is no way > to distinguish the character 241 and the byte 241 except when Emacs is > told explicitly.
What is the correct Emacs internal representation for "ñ" and "ä"? They surely cannot share internal representations with other (non-)characters? > Because Emacs has no separate "character" data type. For which I am thankful. > -- > David Kastrup -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 18:08 ` Alan Mackenzie @ 2009-11-19 19:25 ` Davis Herring 2009-11-19 21:25 ` Alan Mackenzie 2009-11-19 19:52 ` Eli Zaretskii 2009-11-19 20:05 ` Stefan Monnier 2 siblings, 1 reply; 96+ messages in thread From: Davis Herring @ 2009-11-19 19:25 UTC (permalink / raw) To: Alan Mackenzie; +Cc: David Kastrup, emacs-devel [I end up having to say the same thing several times here; I thought it preferable to omitting any of Alan's questions or any aspect of the problem. It's not meant to be a rant.] > No, you (all of you) are missing the point. That point is that if an > Emacs Lisp hacker writes "?ñ", it should work, regardless of > what "codepoint" it has, what "bytes" represent it, whether those > "bytes" are coded with a different codepoint, or what have you. All of > that stuff is uninteresting. If it gets interesting, like now, it is > because it is buggy. When you wrote ?ñ, it did work -- that character has the Unicode (and Emacs 23) code point 241, so that two-character token is entirely equivalent to the token "241" in Emacs source. (This is independent of the encoding of the source file: the same two characters might be represented by many different octet sequences in the source file, but you always get 241 as the value (which is a code point and is distinct from octet sequences anyway).) But you didn't insert that object! You forced it into a (perhaps surprisingly: unibyte) string, which interpreted its argument (the integer 241) as a raw byte value, because that's what unibyte strings contain. When you then inserted the string, Emacs transformed it into a (somewhat artificial) character whose meaning is "this was really the byte 241, which, since it corresponds to no UTF-8 character, must merely be reproduced literally on disk" and whose Emacs code point is 4194289. 
(That integer looks like it could be derived from 241 by sign-extension for the convenience of Emacs hackers; the connection is unimportant to the user.) > OK. Surely displaying it as "\361" is a bug? Should it not display as > "\17777761". If it did, it would have saved half of my ranting. No: characters are displayed according to their meaning, not their internal code point. As it happens, this character's whole meaning is "the byte #o361", so that's what's displayed. > So, how did the character "ñ" get turned into the illegal byte #xf1? Is > that the bug? By its use in `aset' in a unibyte context (determined entirely by the target string). >> You assume that ?ñ is a character. > > I do indeed. It is self evident. Its characterness is determined by context, because (as you know) Emacs has no distinct character type. So, in the isolation of English prose, we have no way of telling whether ?ñ "is" a character or an integer, any more than we can guess about 241. (We can guess about the writer's desires, but not about the real effects.) > Now, would you too please just agree that when I execute the three forms > above, and "ñ" should appear? That's Stefan's point: should common string literals generate multibyte strings (so as to change the meaning, not of the string, but of `aset', to what you want)? Maybe: one could also address the issue by disallowing `aset' on unibyte strings (or strings entirely) and introducing `aset-unibyte' (and perhaps `aset-multibyte') so that the argument interpretation (and the O(n) nature of the latter) would be made clear to the programmer. Maybe the doc-string for `aset' should just bear a really loud warning. It bears more consideration than merely "yes" to your question, as reasonable as it seems. > What is the correct Emacs internal representation for "ñ" and "ä"? They > surely cannot share internal representations with other > (non-)characters? 
They have the unique internal representation as (mostly) Unicode code points (integers) 241 and 228, which happen to be identical to the representations of bytes of those values (which interpretation prevails in a unibyte context). Davis -- This product is sold by volume, not by mass. If it appears too dense or too sparse, it is because mass-energy conversion has occurred during shipping. ^ permalink raw reply [flat|nested] 96+ messages in thread
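[The point that 241 and 4194289 are two different Lisp objects with two different meanings can be made concrete; a sketch using `unibyte-string' and `string-to-multibyte', both present in Emacs 23:]

```elisp
;; "ñ" contains a non-ASCII char, so the literal is multibyte:
(aref "ñ" 0)          ; => 241  (the character ñ)
(char-to-string 241)  ; => "ñ"

;; The same numeric value taken as a *byte*, then promoted to a
;; multibyte context, becomes the raw-byte character instead:
(aref (string-to-multibyte (unibyte-string 241)) 0)
;; => 4194289  (displayed as \361)
```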
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 19:25 ` Davis Herring @ 2009-11-19 21:25 ` Alan Mackenzie 2009-11-19 22:31 ` David Kastrup 2009-11-20 8:48 ` Fwd: Re: Inadequate documentation of silly characters on screen Eli Zaretskii 0 siblings, 2 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 21:25 UTC (permalink / raw) To: Davis Herring; +Cc: David Kastrup, emacs-devel Hi, Davis, always good to hear from you! On Thu, Nov 19, 2009 at 11:25:05AM -0800, Davis Herring wrote: > [I end up having to say the same thing several times here; I thought it > preferable to omitting any of Alan's questions or any aspect of the > problem. It's not meant to be a rant.] > > No, you (all of you) are missing the point. That point is that if an > > Emacs Lisp hacker writes "?ñ", it should work, regardless of > > what "codepoint" it has, what "bytes" represent it, whether those > > "bytes" are coded with a different codepoint, or what have you. All of > > that stuff is uninteresting. If it gets interesting, like now, it is > > because it is buggy. > When you wrote ?ñ, it did work -- that character has the Unicode (and > Emacs 23) code point 241, so that two-character token is entirely > equivalent to the token "241" in Emacs source. (This is independent of > the encoding of the source file: the same two characters might be > represented by many different octet sequences in the source file, but you > always get 241 as the value (which is a code point and is distinct from > octet sequences anyway).) OK - so what's happening is that ?ñ is unambiguously 241. But Emacs cannot say whether that is unibyte 241 or multibyte 241, which it encodes as 4194289. Despite not knowing, Emacs is determined never to confuse a 4194289 type of 241 with a 241 type of 241. So, despite the fact that the character 4194289 probably originated as a unibyte ?ñ, it prints it uglily on the screen as "\361". > But you didn't insert that object! 
You forced it into a (perhaps > surprisingly: unibyte) string, which interpreted its argument (the integer > 241) as a raw byte value, because that's what unibyte strings contain. > When you then inserted the string, Emacs transformed it into a (somewhat > artificial) character whose meaning is "this was really the byte 241, > which, since it corresponds to no UTF-8 character, must merely be > reproduced literally on disk" and whose Emacs code point is 4194289. > (That integer looks like it could be derived from 241 by sign-extension > for the convenience of Emacs hackers; the connection is unimportant to the > user.) Why couldn't Emacs have simply displayed the character as "ñ"? Why does it have to enforce its internal dirty linen on an unsuspecting hacker? > > OK. Surely displaying it as "\361" is a bug? Should it not display > > as "\17777761". If it did, it would have saved half of my ranting. > No: characters are displayed according to their meaning, not their > internal code point. As it happens, this character's whole meaning is > "the byte #o361", so that's what's displayed. That meaning is an artificial one imposed by Emacs itself. Is there any pressing reason to distinguish 4194289 from 241 when displaying them as characters on a screen? > > So, how did the character "ñ" get turned into the illegal byte #xf1? > > Is that the bug? > By its use in `aset' in a unibyte context (determined entirely by the > target string). > >> You assume that ?ñ is a character. > > I do indeed. It is self evident. > Its characterness is determined by context, because (as you know) Emacs > has no distinct character type. So, in the isolation of English prose, we > have no way of telling whether ?ñ "is" a character or an integer, any more > than we can guess about 241. (We can guess about the writer's desires, > but not about the real effects.) > > Now, would you too please just agree that when I execute the three > > forms above, and "ñ" should appear? 
> That's Stefan's point: should common string literals generate multibyte > strings (so as to change the meaning, not of the string, but of `aset', > to what you want)? Lisp is a high-level language. It should do the Right Thing in its representation of low-level concepts, and shouldn't bug its users with these things. The situation is like having a text document with some characters in ISO-8859-1 and some in UTF-8. Chaos. I stick with one of these character sets for my personal stuff. > Maybe: one could also address the issue by disallowing `aset' on > unibyte strings (or strings entirely) and introducing `aset-unibyte' > (and perhaps `aset-multibyte') so that the argument interpretation (and > the O(n) nature of the latter) would be made clear to the programmer. No. The problem should be solved by deciding on one single character set visible to lisp hackers, and sticking to it rigidly. At least, that's my humble opinion as one of the Emacs hackers least well informed on the matter. ;-( > Maybe the doc-string for `aset' should just bear a really loud warning. Yes. But it's not really `aset' which is the liability. It's "?". > It bears more consideration than merely "yes" to your question, as > reasonable as it seems. > > What is the correct Emacs internal representation for "ñ" and "ä"? They > > surely cannot share internal representations with other > > (non-)characters? > They have the unique internal representation as (mostly) Unicode code > points (integers) 241 and 228, which happen to be identical to the > representations of bytes of those values (which interpretation prevails in > a unibyte context). Sorry, what the heck is "the byte with value 241"? Does this concept have any meaning, any utility beyond the Machiavellian one of confusing me? How would one use "the byte with value 241", and why does it need to be kept distinct from "ñ"? > Davis -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 21:25 ` Alan Mackenzie @ 2009-11-19 22:31 ` David Kastrup 2009-11-21 22:52 ` Richard Stallman 2009-11-20 8:48 ` Fwd: Re: Inadequate documentation of silly characters on screen Eli Zaretskii 1 sibling, 1 reply; 96+ messages in thread From: David Kastrup @ 2009-11-19 22:31 UTC (permalink / raw) To: emacs-devel Alan Mackenzie <acm@muc.de> writes: > OK - so what's happening is that ?ñ is unambiguously 241. But Emacs > cannot say whether that is unibyte 241 or multibyte 241, which it > encodes as 4194289. Despite not knowing, Emacs is determined never to > confuse a 4194289 type of 241 with a 241 type of 241. So, despite the > fact that the character 4194289 probably originated as a unibyte ?ñ, ?ñ is the code point of a character. Unibyte strings contain bytes, not characters. ?ñ is a confusing way of writing 241 in the context of unibyte, just like '\n' may be a confusing way of writing 10 in the context of number bases. > Why couldn't Emacs have simply displayed the character as "ñ"? Because there is no character with a byte representation of 241. You are apparently demanding that Emacs display this "wild byte" as if it were really encoded in latin-1. What is so special about latin-1? Latin-1 characters have a byte representation in utf-8, but it is not 241. > Why does it have to enforce its internal dirty linen on an > unsuspecting hacker? It doesn't. And since we are talking about a non-character isolated byte, Emacs displays it as a non-character isolated byte rather than throwing it out on the terminal and confusing the user with whatever the terminal may make of it. > That meaning is an artificial one imposed by Emacs itself. Is there > any pressing reason to distinguish 4194289 from 241 when displaying > them as characters on a screen? 
4194289 is the Emacs code point for "invalid raw byte with value 241", 241 is the Emacs code point for "Unicode character 241, part of latin-1 plane". If you throw them to encode-region, the resulting unibyte string will contain 241 for the first, but whatever external representation is proper for the specified encoding for the second. If you encode to latin-1, the distinction will get lost. If you encode to other encodings, it won't. > Sorry, what the heck is "the byte with value 241"? Does this concept > have any meaning, any utility beyond the machiavellian one of > confusing me? How would one use "the byte with value 241", and why > does it need to be kept distinct from "ñ"? You can use Emacs to load an executable, change some string inside of it (make sure that it contains the same number of bytes afterwards!) and save, and everything you did not edit is the same. That's a very fine thing. To have this work, Emacs needs an internal representation for "byte with code x that is not valid as part of a character". -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
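[The encode-region behaviour described in this message is observable from Lisp; a sketch, where the octal escapes in the comments are what `prin1' shows for the resulting unibyte strings:]

```elisp
;; The character ?ñ is encoded according to the coding system:
(encode-coding-string (char-to-string ?ñ) 'latin-1) ; => "\361"      (one byte)
(encode-coding-string (char-to-string ?ñ) 'utf-8)   ; => "\303\261"  (two bytes)

;; The raw byte (char 4194289) passes through encoding unchanged,
;; which is what lets Emacs save unedited binary data unmolested:
(encode-coding-string (string-to-multibyte (unibyte-string 241)) 'utf-8)
;; => "\361"
```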
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 22:31 ` David Kastrup @ 2009-11-21 22:52 ` Richard Stallman 2009-11-23 2:08 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stefan Monnier 0 siblings, 1 reply; 96+ messages in thread From: Richard Stallman @ 2009-11-21 22:52 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > Why couldn't Emacs have simply displayed the character as "ñ"? Because there is no character with a byte representation of 241. You are apparently demanding that Emacs display this "wild byte" as if it were really encoded in latin-1. Latin-1 or Unicode. The Unicode code point for ñ is 241. (aref "ñ" 0) returns 241, which is 361 in octal. So if there is a character \361, it seems that ought to be the same as ñ. Basically, it isn't clear that \361 is a byte rather than a character, and what difference that ought to make, and what you should do if you want to turn it from a byte into a character. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-21 22:52 ` Richard Stallman @ 2009-11-23 2:08 ` Stefan Monnier 2009-11-23 20:38 ` Richard Stallman 0 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-23 2:08 UTC (permalink / raw) To: rms; +Cc: David Kastrup, emacs-devel > Basically, it isn't clear that \361 is a byte rather than a character, > and what difference that ought to make, and what you should do > if you want to turn it from a byte into a character. So how do you suggest we represent the byte 241? Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-23 2:08 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stefan Monnier @ 2009-11-23 20:38 ` Richard Stallman 2009-11-23 21:34 ` Per Starbäck 2009-11-24 1:28 ` Displaying bytes Stefan Monnier 0 siblings, 2 replies; 96+ messages in thread From: Richard Stallman @ 2009-11-23 20:38 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel > Basically, it isn't clear that \361 is a byte rather than a character, > and what difference that ought to make, and what you should do > if you want to turn it from a byte into a character. So how do you suggest we represent the byte 241? No better way jumps into my mind. But maybe we could figure out some way to make the current way easier to understand. For instance, C-u C-x = on \224 says

        character: (4194196, #o17777624, #x3fff94)
preferred charset: tis620-2533 (TIS620.2533)
       code point: 0x94
           syntax: w 	which means: word
      buffer code: #x94
        file code: #x94 (encoded by coding system no-conversion)
          display: not encodable for terminal

Character code properties: customize what to show
[back]

Perhaps it should say,

        character: Stray byte (4194196, #o17777624, #x3fff94)

What are the situations where a user is likely to see these stray bytes? When visiting a binary file, of course; but in that situation, nobody will be surprised or disappointed. So what are the other cases, and what might the user really want instead? Does it mean the user probably wants to do M-x decode-coding-region? If so, can we find a way to give the user that hint? When I click on tis620-2533 in that output, I get this

Character set: tis620-2533
TIS620.2533
Number of contained characters: 256
ASCII compatible.
Code space: [0 255]
[back]

which is totally unhelpful. What is this character set's main purpose? Does it exist specifically for stray non-ASCII bytes? If so, saying so here would help.
If not -- if it has some other purpose -- then it would be good to explain both purposes here. Also, if it exists for these stray non-ASCII bytes, why does it have 256 chars in it? There are only 128 possible stray non-ASCII bytes. (It is also not clear to me what "ASCII compatible" means in this context.) ^ permalink raw reply [flat|nested] 96+ messages in thread
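[One concrete answer to "what might the user really want instead" is to re-decode the stray bytes with the coding system they were really written in; a sketch, assuming the bytes are actually latin-1:]

```elisp
;; From Lisp, on a single stray byte:
(decode-coding-string (unibyte-string 241) 'latin-1)  ; => "ñ"

;; Interactively, on the whole buffer after a mis-detection:
;;   C-x h  M-x decode-coding-region RET latin-1 RET
```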
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-23 20:38 ` Richard Stallman @ 2009-11-23 21:34 ` Per Starbäck 2009-11-24 22:47 ` Richard Stallman 2009-11-24 1:28 ` Displaying bytes Stefan Monnier 1 sibling, 1 reply; 96+ messages in thread From: Per Starbäck @ 2009-11-23 21:34 UTC (permalink / raw) To: rms; +Cc: dak, Stefan Monnier, emacs-devel 2009/11/23 Richard Stallman <rms@gnu.org>: > What are the situations where a user is likely to see these stray > bytes. When visiting a binary file, of course; but in that situation, > nobody will be surprised or disappointed. So what are the other > cases, Sometimes when Emacs can't guess the coding system.

$ od -c euro.txt
0000000   T   h   a   t       c   o   s   t   s     200   1   7   .  \n
0000020
$ emacs euro.txt

This is really a windows-1252 file and the strange character is supposed to be a Euro sign. For me, with no particular setup to make Emacs expect windows-1252 files that shows in emacs as "That costs \20017." with raw-text-unix. > and what might the user really want instead? Does it mean the > user probably wants to do M-x decode-coding-region? If so, can we find a way > to give the user that hint? In that case revert-buffer-with-coding-system. Ideally I'd like Emacs to ask directly when opening the file in such a case, if it can't determine anything better than raw-bytes. At least if the mode (like text-mode here) indicates that it shouldn't be a binary file. ^ permalink raw reply [flat|nested] 96+ messages in thread
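[Per's euro-sign byte can be recovered by hand once the right coding system is known; a sketch, assuming the file really is windows-1252, where byte #x80 is the Euro sign:]

```elisp
(decode-coding-string (unibyte-string #x80) 'windows-1252)  ; => "€"

;; Or re-read the whole visited file with the right coding system:
;;   M-x revert-buffer-with-coding-system RET windows-1252 RET
```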
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-23 21:34 ` Per Starbäck @ 2009-11-24 22:47 ` Richard Stallman 2009-11-25 1:33 ` Kenichi Handa 0 siblings, 1 reply; 96+ messages in thread From: Richard Stallman @ 2009-11-24 22:47 UTC (permalink / raw) To: Per Starbäck; +Cc: dak, monnier, emacs-devel $ od -c euro.txt 0000000 T h a t c o s t s 200 1 7 . \n 0000020 $ emacs euro.txt This is really a windows-1252 file and the strange character is supposed to be a Euro sign. For me, with no particular setup to make Emacs expect windows-1252 files that shows in emacs as "That costs \20017." with raw-text-unix. Why doesn't Emacs guess right, in this case? Could we make it guess right by changing the coding system priorities? If so, should we change the default priorities? It may be that a different set of priorities would cause similar problems in some other cases and the current defaults are the best. But if we have not looked at the question in several years, it would be worth studying it now. In that case revert-buffer-with-coding-system. Ideally I'd like Emacs to ask directly when opening the file in such a case, if it can't determine anything better than raw-bytes. Maybe so. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-24 22:47 ` Richard Stallman @ 2009-11-25 1:33 ` Kenichi Handa 2009-11-25 2:29 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier ` (3 more replies) 0 siblings, 4 replies; 96+ messages in thread From: Kenichi Handa @ 2009-11-25 1:33 UTC (permalink / raw) To: rms; +Cc: per.starback, dak, monnier, emacs-devel In article <E1ND4AD-0003Yg-Cc@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > $ od -c euro.txt > 0000000 T h a t c o s t s 200 1 7 . \n > 0000020 > $ emacs euro.txt > This is really a windows-1252 file and the strange character is > supposed to be a Euro sign. > For me, with no particular setup to make Emacs expect windows-1252 > files that shows in emacs as > "That costs \20017." with raw-text-unix. > Why doesn't Emacs guess right, in this case? Because some other coding system of the same coding-category of windows-1252 (coding-category-charset) has the higher priority and that coding system doesn't contain code \200. > Could we make it guess right by changing the coding system > priorities? Yes. > If so, should we change the default priorities? I'm not sure. As it seems that windows-1252 is a superset of iso-8859-1, it may be ok to give windows-1252 the higher priority. What do iso-8859-1 users think? The better thing is to allow registering multiple coding systems in one coding-category, but I'm not sure I have time to work on it. > It may be that a different set of priorities would cause similar > problems in some other cases and the current defaults are the best. > But if we have not looked at the question in several years, it would > be worth studying it now. > In that case revert-buffer-with-coding-system. Ideally I'd like Emacs > to ask directly when opening the file > in such a case, if it can't determine anything better than raw-bytes. > Maybe so.
For that, it seems that adding that facility in after-insert-file-set-coding is good. Here's a sample patch. The actual change should give more information to a user.

--- mule.el.~1.294.~	2009-11-17 11:42:45.000000000 +0900
+++ mule.el	2009-11-25 10:17:49.000000000 +0900
@@ -1893,7 +1893,18 @@
 	   coding-system-for-read
 	   (not (eq coding-system-for-read 'auto-save-coding)))
       (setq buffer-file-coding-system-explicit
-	    (cons coding-system-for-read nil)))
+	    (cons coding-system-for-read nil))
+      (when (and last-coding-system-used
+		 (eq (coding-system-base last-coding-system-used) 'raw-text))
+	;; Give a chance of decoding by some coding system.
+	(let ((coding-system (read-coding-system "Actual coding system: ")))
+	  (if coding-system
+	      (save-restriction
+		(narrow-to-region (point) (+ (point) inserted))
+		(let ((modified (buffer-modified-p)))
+		  (decode-coding-region (point-min) (point-max) coding-system)
+		  (setq inserted (- (point-max) (point-min)))
+		  (set-buffer-modified-p modified)))))))
     (if last-coding-system-used
 	(let ((coding-system
 	       (find-new-buffer-file-coding-system last-coding-system-used)))

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-25 1:33 ` Kenichi Handa @ 2009-11-25 2:29 ` Stefan Monnier 2009-11-25 2:50 ` Lennart Borgman 2009-11-25 6:25 ` Stephen J. Turnbull 2009-11-25 5:40 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Ulrich Mueller ` (2 subsequent siblings) 3 siblings, 2 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-25 2:29 UTC (permalink / raw) To: Kenichi Handa; +Cc: per.starback, dak, rms, emacs-devel >> If so, should we change the default priorities? > I'm not sure. As it seems that windows-1252 is a superset of > iso-8859-1, it may be ok to give windows-1252 the higher priority. > How do iso-8859-1 users think? The problem with windows-1252 is that all files are valid in that coding-system. So it's OK if there's a really high chance of encountering such files, but otherwise it leads to many misdetections. > For that, it seems that adding that facility in > after-insert-file-set-coding is good. Here's a sample patch. The > actual change should give more information to a user. Maybe we could try that. But I really dislike adding a user-prompt in the middle of some operation that might be performed as part of something "unrelated". And indeed the actual change may need to give a lot more information, mostly displaying the buffer without which the user cannot make a good guess. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-25 2:29 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier @ 2009-11-25 2:50 ` Lennart Borgman 2009-11-25 6:25 ` Stephen J. Turnbull 1 sibling, 0 replies; 96+ messages in thread From: Lennart Borgman @ 2009-11-25 2:50 UTC (permalink / raw) To: Stefan Monnier; +Cc: per.starback, dak, emacs-devel, rms, Kenichi Handa On Wed, Nov 25, 2009 at 3:29 AM, Stefan Monnier <monnier@iro.umontreal.ca> wrote: >>> If so, should we change the default priorities? >> I'm not sure. As it seems that windows-1252 is a superset of >> iso-8859-1, it may be ok to give windows-1252 the higher priority. >> How do iso-8859-1 users think? > > The problem with windows-1252 is that all files are valid in that > coding-system. So it's OK if there's a really high chance of > encountering such files, but otherwise it leads to many misdetections. > >> For that, it seems that adding that facility in >> after-insert-file-set-coding is good. Here's a sample patch. The >> actual change should give more information to a user. > > Maybe we could try that. But I really dislike adding a user-prompt in > the middle of some operation that might be performed as part of > something "unrelated". And indeed the actual change may need to give > a lot more information, mostly displaying the buffer without which the > user cannot make a good guess. Maybe it is better to read in the file in the buffer with a best guess and add a hook that is run the first time the buffer is shown in a window with some notification to the user of the problem? Then of course also provide enough hints to make it as easy to change coding system in that situation to the relevant alternatives. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-25 2:29 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier 2009-11-25 2:50 ` Lennart Borgman @ 2009-11-25 6:25 ` Stephen J. Turnbull 1 sibling, 0 replies; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-25 6:25 UTC (permalink / raw) To: Stefan Monnier; +Cc: per.starback, dak, emacs-devel, rms, Kenichi Handa Stefan Monnier writes: > The problem with windows-1252 is that all files are valid in that > coding-system. Well, *pedantically* that's true of any ISO 8859 coding system too, since ISO 8859 doesn't specify what might appear in C1 at all. In practice for 1252 1. The only C0 controls you'll commonly see are \t, \r, and \n. 2. The set of C1 controls that are defined is limited IIRC (but Microsoft does go around changing it without warning, so I could be wrong by now ;-). 3. It's line-oriented text (even if long-lines): you'll very probably see \r and \n only as \r\n, you might see only \n and no \r, and you'll not see "random" use of \r or \n. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-25 1:33 ` Kenichi Handa 2009-11-25 2:29 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier @ 2009-11-25 5:40 ` Ulrich Mueller 2009-11-26 22:59 ` Displaying bytes Reiner Steib 2009-11-25 5:59 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stephen J. Turnbull 2009-11-29 16:01 ` Richard Stallman 3 siblings, 1 reply; 96+ messages in thread From: Ulrich Mueller @ 2009-11-25 5:40 UTC (permalink / raw) To: Kenichi Handa; +Cc: per.starback, dak, emacs-devel, rms, monnier >>>>> On Wed, 25 Nov 2009, Kenichi Handa wrote: >> If so, should we change the default priorities? > I'm not sure. As it seems that windows-1252 is a superset of > iso-8859-1, it may be ok to give windows-1252 the higher priority. Please don't. I wonder why one would even *think* of changing Emacs's default to a Microsoft proprietary "code page". :-( > How do iso-8859-1 users think? Seems to me that use of iso-8859-* is much more widespread on *nix systems. I think the current default priorities are perfectly fine. Ulrich ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-25 5:40 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Ulrich Mueller @ 2009-11-26 22:59 ` Reiner Steib 2009-11-27 0:16 ` Ulrich Mueller ` (2 more replies) 0 siblings, 3 replies; 96+ messages in thread From: Reiner Steib @ 2009-11-26 22:59 UTC (permalink / raw) To: Ulrich Mueller; +Cc: emacs-devel, Kenichi Handa On Wed, Nov 25 2009, Ulrich Mueller wrote: >>>>>> On Wed, 25 Nov 2009, Kenichi Handa wrote: > >>> If so, should we change the default priorities? > >> I'm not sure. As it seems that windows-1252 is a superset of >> iso-8859-1, It is, yes. >> it may be ok to give windows-1252 the higher priority. > > Please don't. > > I wonder why one would even *think* of changing Emacs's default to a > Microsoft proprietary "code page". :-( Just because it has "windows" in its name? IIRC it is registered at IANA. >> How do iso-8859-1 users think? > > Seems to me that use of iso-8859-* is much more widespread on *nix > systems. As far as I understand, an iso-8859-1 user won't notice any difference. Only if the file is _not_ iso-8859-1 and "fits" in windows-1252 (e.g. it uses one of the few chars that make the difference). We have done something similar (see `mm-charset-override-alist') in Gnus for displaying mis-labelled articles. > I think the current default priorities are perfectly fine. Bye, Reiner. -- ,,, (o o) ---ooO-(_)-Ooo--- | PGP key available | http://rsteib.home.pages.de/ ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-26 22:59 ` Displaying bytes Reiner Steib @ 2009-11-27 0:16 ` Ulrich Mueller 2009-11-27 1:41 ` Stefan Monnier 2009-11-27 4:14 ` Stephen J. Turnbull 2 siblings, 0 replies; 96+ messages in thread From: Ulrich Mueller @ 2009-11-27 0:16 UTC (permalink / raw) To: Reiner Steib; +Cc: emacs-devel, Kenichi Handa >>>>> On Thu, 26 Nov 2009, Reiner Steib wrote: >>> I'm not sure. As it seems that windows-1252 is a superset of >>> iso-8859-1, > It is, yes. They are identical, except for the range from 0x80 to 0x9f, where ISO-8859-1 assigns control characters [1]. Look into the log file of an xterm (TERM=xterm-8bit) and you'll see them. >> I wonder why one would even *think* of changing Emacs's default to >> a Microsoft proprietary "code page". :-( > Just because it has "windows" in its name? IIRC it is registered at > IANA. Yes, in the "vendor" range [2], together with all variants of EBCDIC that ever existed. ;-) Whereas ISO-8859-1 is an official ISO standard. Ulrich [1] ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT [2] http://www.iana.org/assignments/character-sets ^ permalink raw reply [flat|nested] 96+ messages in thread
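[Editor's note: the 0x80-0x9F difference Ulrich describes is easy to observe from Lisp; for example:]

```elisp
;; Byte #x80 decodes to the Euro sign under windows-1252, but to the
;; C1 control character U+0080 under iso-8859-1:
(decode-coding-string "\200" 'windows-1252) ; => "€"
(decode-coding-string "\200" 'iso-8859-1)   ; a C1 control, not "€"
```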
* Re: Displaying bytes 2009-11-26 22:59 ` Displaying bytes Reiner Steib 2009-11-27 0:16 ` Ulrich Mueller @ 2009-11-27 1:41 ` Stefan Monnier 2009-11-27 4:14 ` Stephen J. Turnbull 2 siblings, 0 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-27 1:41 UTC (permalink / raw) To: Reiner Steib; +Cc: Ulrich Mueller, Kenichi Handa, emacs-devel >> Seems to me that use of iso-8859-* is much more widespread on *nix >> systems. > As far as I understand, an iso-8859-1 user won't notice any > difference. They'll notice a difference when opening a file that's neither latin-1 nor windows-1252 but which happens to fall within the range of windows-1252 (which is the case for most non-latin1 files). > We have done something similar (see `mm-charset-override-alist') in > Gnus for displaying mis-labelled articles. It's very different: it's perfectly OK to treat a latin-1 message or file as if it were windows-1252. It'll almost always DTRT. The problem is when we have to guess the coding-system, in which case checking windows-1252 instead of latin-1 will give you more false positives. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-26 22:59 ` Displaying bytes Reiner Steib 2009-11-27 0:16 ` Ulrich Mueller 2009-11-27 1:41 ` Stefan Monnier @ 2009-11-27 4:14 ` Stephen J. Turnbull 2 siblings, 0 replies; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-27 4:14 UTC (permalink / raw) To: Reiner Steib; +Cc: Ulrich Mueller, Kenichi Handa, emacs-devel Reiner Steib writes: > Just because it has "windows" in its name? IIRC it is registered at > IANA. Not because of the name. Because the registration at IANA does not define it, the last time I looked. It merely is a placeholder for an internal Microsoft page that Microsoft updates at its convenience (and has done, for example when adding the EURO SIGN). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-25 1:33 ` Kenichi Handa 2009-11-25 2:29 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier 2009-11-25 5:40 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Ulrich Mueller @ 2009-11-25 5:59 ` Stephen J. Turnbull 2009-11-25 8:16 ` Kenichi Handa 2009-11-29 16:01 ` Richard Stallman 3 siblings, 1 reply; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-25 5:59 UTC (permalink / raw) To: Kenichi Handa; +Cc: per.starback, dak, emacs-devel, rms, monnier Kenichi Handa writes: > In article <E1ND4AD-0003Yg-Cc@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > > If so, should we change the default priorities? > > I'm not sure. As it seems that windows-1252 is a superset of > iso-8859-1, it may be ok to give windows-1252 the higher priority. > How do iso-8859-1 users think? Why not make a Windows-12xx coding-category? If you don't want to advertise what it is, you could call it "ascii8" or "pseudo-ascii" or something like that. (Wouldn't some of the obsolete Vietnamese standards fit this too? Ie, 0-0177 are the same as ISO-646, and 0200-0377 are used for the alternate script?) If you don't make a separate coding category for that, I don't like the change, myself. Windows-12xx character sets are proprietary in the sense that last I looked, the IANA registry for Windows-12xx coded character sets pointed to internal Microsoft documents, and made no promises about changes to those documents. As far as I know, Microsoft added the EURO SIGN to Windows-1252 simply by editing that internal page. There was no indication of the history of such changes on the IANA page. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-25 5:59 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stephen J. Turnbull @ 2009-11-25 8:16 ` Kenichi Handa 0 siblings, 0 replies; 96+ messages in thread From: Kenichi Handa @ 2009-11-25 8:16 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: per.starback, dak, emacs-devel, rms, monnier In article <87fx835elh.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes: > I'm not sure. As it seems that windows-1252 is a superset of > iso-8859-1, it may be ok to give windows-1252 the higher priority. > How do iso-8859-1 users think? > Why not make a Windows-12xx coding-category? If you don't want to > advertise what it is, you could call it "ascii8" or "pseudo-ascii" or > something like that. Ah! A coding-category of a coding-system is automatically determined by :coding-type arg (and by some other arg depending on :coding-type) of define-coding-system. And iso-8859-x and windows-12xx are exactly the same in this aspect; i.e. both :coding-type is `charset' which means the coding system is for decoding/encoding charsets in :charset-list. Perhaps it is good to add one more coding-category `charset8' to which such coding-systems that handle a single byte charset containing many 0x80..0x9F area code are classified. > (Wouldn't some of the obsolete Vietnamese > standards fit this too? Ie, 0-0177 are the same as ISO-646, and > 0200-0377 are used for the alternate script?) Do you mean such coding-systems as vietnamese-tcvn and vietnamese-viscii? Although their 0x00-0x1F are not the same as ASCII, yes, they can be classified into `charset8' category. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-25 1:33 ` Kenichi Handa ` (2 preceding siblings ...) 2009-11-25 5:59 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Stephen J. Turnbull @ 2009-11-29 16:01 ` Richard Stallman 2009-11-29 16:31 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier 2009-11-29 22:19 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Kim F. Storm 3 siblings, 2 replies; 96+ messages in thread From: Richard Stallman @ 2009-11-29 16:01 UTC (permalink / raw) To: Kenichi Handa; +Cc: per.starback, dak, monnier, emacs-devel We don't want to raise the priority of windows-1252 because it would cause many other encodings not to be recognized. If it turns out that windows-1252 files are the main cause of 8-bit-control characters in the buffer, here's another idea. If visiting a file gives you some 8-bit-control characters, ask the user "Is this file encoded in Windows encoding (windows-1252)?" and do so if she says yes. Here's another idea. We could employ some heuristics to see if the distribution of those characters seems typical for the way those characters are used. For instance, some of the punctuation characters (the ones that represent quotation marks) should always have whitespace or punctuation on at least one side. Also, there should be no ASCII control characters other than whitespace. Maybe more specific heuristics can be developed. These could be used as conditions for recognizing the file as windows-1252. If these heuristics are strong enough, they could reject nearly all false matches, provided the file is long enough. (A minimum length could be part of the conditions.) Then we could increase the priority of windows-1252 without the bad side effect of using it when it is not intended. This is ad-hoc, and not elegant. 
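[Editor's note: purely as an illustration of the heuristics Richard proposes — nothing like this exists in Emacs, and the function name and thresholds are invented:]

```elisp
;; Toy sketch: reject windows-1252 if the raw bytes include any of the
;; five code points windows-1252 leaves undefined, and require each
;; "smart quote" byte to touch whitespace on at least one side.
(defun my-plausible-windows-1252-p (bytes)
  "Return non-nil if unibyte string BYTES could plausibly be windows-1252."
  (let ((ok t))
    (dotimes (i (length bytes))
      (let ((b (aref bytes i)))
        (cond
         ;; These bytes are undefined in windows-1252:
         ((memq b '(#x81 #x8D #x8F #x90 #x9D)) (setq ok nil))
         ;; Left/right double quotes (#x93, #x94) should have whitespace
         ;; on at least one side, or sit at a string boundary:
         ((memq b '(#x93 #x94))
          (unless (or (zerop i) (= i (1- (length bytes)))
                      (memq (aref bytes (1- i)) '(?\s ?\t ?\n))
                      (memq (aref bytes (1+ i)) '(?\s ?\t ?\n)))
            (setq ok nil))))))
    ok))
```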
But the problem is important enough in practice that an ad-hoc solution is justified if it works well. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-29 16:01 ` Richard Stallman @ 2009-11-29 16:31 ` Stefan Monnier 2009-11-29 22:01 ` Juri Linkov 2009-11-29 22:19 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Kim F. Storm 1 sibling, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-29 16:31 UTC (permalink / raw) To: rms; +Cc: per.starback, dak, emacs-devel, Kenichi Handa > If it turns out that windows-1252 files are the main cause of > 8-bit-control characters in the buffer, here's another idea. It may be the case for some users, but it probably isn't the case in general. It's clearly not the case for me (I only/mostly see such characters in Gnus when I receive email that is improperly labelled, where I'm happy to see them so that I can complain to their originator). > Here's another idea. We could employ some heuristics to see if the > distribution of those characters seems typical for the way those > characters are used. For instance, some of the punctuation characters Using such heuristics might be a good idea in general to automatically detect which encoding is used, or which language is used. As time passes, it becomes less and less important for coding-systems in my experience (utf-8 and utf-16 seem to slowly take over and we already auto-detect them well). Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-29 16:31 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier @ 2009-11-29 22:01 ` Juri Linkov 2009-11-30 6:05 ` tomas 0 siblings, 1 reply; 96+ messages in thread From: Juri Linkov @ 2009-11-29 22:01 UTC (permalink / raw) To: Stefan Monnier; +Cc: per.starback, dak, rms, Kenichi Handa, emacs-devel >> Here's another idea. We could employ some heuristics to see if the >> distribution of those characters seems typical for the way those >> characters are used. For instance, some of the punctuation characters > > Using such heuristics might be a good idea in general to automatically > detect which encoding is used, or which language is used. Unicad (http://www.emacswiki.org/emacs/Unicad) uses statistical models to auto-detect windows-1252 and many many other coding systems (auto-detecting windows-1252 is not advertised on the main page, but actually can be observed in source code). The theory is described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html I hope sometime this will be added to Emacs. -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-29 22:01 ` Juri Linkov @ 2009-11-30 6:05 ` tomas 2009-11-30 12:09 ` Andreas Schwab 0 siblings, 1 reply; 96+ messages in thread From: tomas @ 2009-11-30 6:05 UTC (permalink / raw) To: Juri Linkov Cc: dak, rms, Kenichi Handa, per.starback, emacs-devel, Stefan Monnier -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mon, Nov 30, 2009 at 12:01:29AM +0200, Juri Linkov wrote: [...] > Unicad (http://www.emacswiki.org/emacs/Unicad) uses statistical models > to auto-detect windows-1252 and many many other coding systems > (auto-detecting windows-1252 is not advertised on the main page, > but actually can be observed in source code). The theory is described > at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html > I hope sometime this will be added to Emacs. It looks theoretically quite neat. I hope this too -- the current heuristics are often at a loss. Ironically, the cited page at mozilla doesn't display correctly in my browser (of all things mozilla!). Setting to auto-detect guesses UTF-8 whereas it's latin-1 -- as correctly advertised in the headers :-) (yes, it's off-topic and it's most probably some misconfiguration on my side, but I thought some might savour the irony). But I also feel that we need more systematic heuristics. I'll give Unicad a try. Regards - -- tomás -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFLE2CwBcgs9XrR2kYRAsCxAJ0cyKl6hp5jN4+N7ogimn354z9+lgCdHAqW REqc68ZeDEqG7eXi7d/HFLU= =efXE -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-30 6:05 ` tomas @ 2009-11-30 12:09 ` Andreas Schwab 2009-11-30 12:39 ` tomas 0 siblings, 1 reply; 96+ messages in thread From: Andreas Schwab @ 2009-11-30 12:09 UTC (permalink / raw) To: tomas Cc: dak, rms, Kenichi Handa, per.starback, emacs-devel, Juri Linkov, Stefan Monnier tomas@tuxteam.de writes: > Ironically, the cited page at mozilla doesn't display correctly in my > browser (of all things mozilla!). Setting to auto-detect guesses UTF-8 > whereas it's latin-1 -- as correctly advertised in the headers :-) The HTML header claims UTF-8, as does the HTTP header. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly 2009-11-30 12:09 ` Andreas Schwab @ 2009-11-30 12:39 ` tomas 0 siblings, 0 replies; 96+ messages in thread From: tomas @ 2009-11-30 12:39 UTC (permalink / raw) To: emacs-devel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mon, Nov 30, 2009 at 01:09:55PM +0100, Andreas Schwab wrote: > tomas@tuxteam.de writes: > > > Ironically, the cited page at mozilla doesn't display correctly in my > > browser (of all things mozilla!). Setting to auto-detect guesses UTF-8 > > whereas it's latin-1 -- as correctly advertised in the headers :-) > > The HTML header claims UTF-8, as does the HTTP header. I stand corrected. I did put too much belief on what Mozilla told me in the "page info" blurb. Note to self: don't believe what web browsers tell you. Grumble. This makes the irony even better ;-) Thanks - -- tomás -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFLE70eBcgs9XrR2kYRAlUPAJ4n5x+aaGoYGmbANgY/SXlOFF1ETACdFa2j TZxfwsMyxnzqI7MI/9+HTPM= =JXqN -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-29 16:01 ` Richard Stallman 2009-11-29 16:31 ` Displaying bytes (was: Inadequate documentation of silly Stefan Monnier @ 2009-11-29 22:19 ` Kim F. Storm 2009-11-30 1:42 ` Stephen J. Turnbull 1 sibling, 1 reply; 96+ messages in thread From: Kim F. Storm @ 2009-11-29 22:19 UTC (permalink / raw) To: rms; +Cc: per.starback, dak, emacs-devel, monnier, Kenichi Handa Richard Stallman <rms@gnu.org> writes: > We don't want to raise the priority of windows-1252 because it would > cause many other encodings not to be recognized. > > If it turns out that windows-1252 files are the main cause of > 8-bit-control characters in the buffer, here's another idea. Sorry I haven't followed the entire thread, but here's an idea: A Windows-1252 file most likely originated on Windoze, so what about only raising the priority when the file has CRNL line endings? -- Kim F. Storm <storm@cua.dk> http://www.cua.dk ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.) 2009-11-29 22:19 ` Displaying bytes (was: Inadequate documentation of silly characters on screen.) Kim F. Storm @ 2009-11-30 1:42 ` Stephen J. Turnbull 0 siblings, 0 replies; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-30 1:42 UTC (permalink / raw) To: Kim F. Storm; +Cc: dak, rms, Kenichi Handa, per.starback, emacs-devel, monnier Kim F. Storm writes: > A Windows-1252 file most likely originated on Windoze, so what about > only raising the priority when the file has CRNL line endings? That turns out not to be true in my experience. There are a lot of European people of my acquaintance who started using 1252 when it had the EURO SIGN (Microsoft put it in well before Euros were in circulation IIRC) and ISO-8859-15 had not yet been published. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-23 20:38 ` Richard Stallman 2009-11-23 21:34 ` Per Starbäck @ 2009-11-24 1:28 ` Stefan Monnier 2009-11-24 22:47 ` Richard Stallman 2009-11-24 22:47 ` Richard Stallman 1 sibling, 2 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-24 1:28 UTC (permalink / raw) To: rms; +Cc: dak, emacs-devel > For instance, C-u C-x = on \224 says > character: (4194196, #o17777624, #x3fff94) > preferred charset: tis620-2533 (TIS620.2533) > code point: 0x94 > syntax: w which means: word > buffer code: #x94 > file code: #x94 (encoded by coding system no-conversion) > display: not encodable for terminal Here C-u C-x = tells me: character: ¡ (4194209, #o17777641, #x3fffa1) preferred charset: eight-bit (Raw bytes 128-255) code point: 0xA1 syntax: w which means: word buffer code: #xA1 file code: not encodable by coding system utf-8-unix display: no font available I don't know why you see this "tis620" stuff. > Perhaps it should say, > character: Stray byte (4194196, #o17777624, #x3fff94) We could do that indeed. > What are the situations where a user is likely to see these stray > bytes. There pretty much shouldn't be any in multibyte buffers. > When visiting a binary file, of course; but in that situation, > nobody will be surprised or disappointed. And presumably for binary files, the buffer will be unibyte. > (It is also not clear to me what "ASCII compatible" means in this > context.) It means that the lower 128 chars coincide with those of ASCII. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
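[Editor's note: the raw-byte character codes quoted above can be produced directly; the #x3fff94 figure matches the thread's C-x = output:]

```elisp
;; The byte \224 (#x94) becomes the raw-byte character #x3fff94
;; (4194196) when promoted into a multibyte string:
(aref (string-to-multibyte "\224") 0) ; => 4194196
```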
* Re: Displaying bytes 2009-11-24 1:28 ` Displaying bytes Stefan Monnier @ 2009-11-24 22:47 ` Richard Stallman 2009-11-25 2:18 ` Stefan Monnier 2009-11-24 22:47 ` Richard Stallman 1 sibling, 1 reply; 96+ messages in thread From: Richard Stallman @ 2009-11-24 22:47 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel I don't know why you see this "tis620" stuff. How strange this discrepancy. I have a few changes that are not installed, but not in anything relevant here. I last updated source code on Nov 18. Here's what apparently defines that character set, in mule-conf.el:

(define-charset 'tis620-2533
  "TIS620.2533"
  :short-name "TIS620.2533"
  :ascii-compatible-p t
  :code-space [0 255]
  :superset '(ascii eight-bit-control (thai-tis620 . 128)))

I don't entirely understand define-charset, but it seems plausible that this gives the observed results. Is this absent in your source? Anyway, please don't overlook the other suggestions in my message for how to make things clearer. > What are the situations where a user is likely to see these stray > bytes. There pretty much shouldn't be any in multibyte buffers. Would it be good to ask people to send bug reports when these stray byte characters appear in multibyte buffers? ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-24 22:47 ` Richard Stallman @ 2009-11-25 2:18 ` Stefan Monnier 2009-11-26 6:24 ` Richard Stallman 0 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-25 2:18 UTC (permalink / raw) To: rms; +Cc: dak, emacs-devel > I don't know why you see this "tis620" stuff. > How strange this discrepancy. I have a few changes that are not > installed, but not in anything relevant here. I last updated > source code on Nov 18. Oh wait, I now see: you get `tis620' for chars between 128 and 160 (i.e. eight-bit-control), and `eight-bit' for chars between 160 and 256. > Here's what apparently defines that character set, in mule-conf.el: > (define-charset 'tis620-2533 > "TIS620.2533" > :short-name "TIS620.2533" > :ascii-compatible-p t > :code-space [0 255] > :superset '(ascii eight-bit-control (thai-tis620 . 128))) Looks like the eight-bit-control here is part of the problem. > Anyway, please don't overlook the other suggestions in my message > for how to make things clearer. Of course. >> What are the situations where a user is likely to see these stray >> bytes. > There pretty much shouldn't be any in multibyte buffers. > Would it be good to ask people to send bug reports when these > stray byte characters appear in multibyte buffers? No, these chars can appear in cases where Emacs does the right thing. I.e. sometimes they reflect bugs, but often they just reflect "pilot errors" or corrupted data completely outside the control of Emacs. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-25 2:18 ` Stefan Monnier @ 2009-11-26 6:24 ` Richard Stallman 2009-11-26 8:59 ` David Kastrup 2009-11-26 14:57 ` Stefan Monnier 0 siblings, 2 replies; 96+ messages in thread From: Richard Stallman @ 2009-11-26 6:24 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel > There pretty much shouldn't be any in multibyte buffers. > Would it be good to ask people to send bug reports when these > stray byte characters appear in multibyte buffers? No, these chars can appear in cases where Emacs does the right thing. I.e. sometimes they reflect bugs, but often they just reflect "pilot errors" or corrupted data completely outside the control of Emacs. If it is nearly always due to a bug, a user error, or bad data, perhaps it would be good to display a diagnostic after file commands that put them in the buffer. Perhaps pop up a buffer explaining what these mean and what to do about them. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-26 6:24 ` Richard Stallman @ 2009-11-26 8:59 ` David Kastrup 2009-11-26 14:57 ` Stefan Monnier 1 sibling, 0 replies; 96+ messages in thread From: David Kastrup @ 2009-11-26 8:59 UTC (permalink / raw) To: emacs-devel Richard Stallman <rms@gnu.org> writes: > > There pretty much shouldn't be any in multibyte buffers. > > > Would it be good to ask people to send bug reports when these > > stray byte characters appear in multibyte buffers? > > No, these chars can appear in cases where Emacs does the right thing. > I.e. sometimes they reflect bugs, but often they just reflect "pilot > errors" or corrupted data completely outside the control of Emacs. > > If it is nearly always due to a bug, a user error, or bad data, > perhaps it would be good to display a diagnostic after file commands > that put them in the buffer. Perhaps pop up a buffer explaining what > these mean and what to do about them. The encoding indicator in the mode line could get warning-face, and the respective pop-up help mention "buffer contains undecodable bytes." -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-26 6:24 ` Richard Stallman 2009-11-26 8:59 ` David Kastrup @ 2009-11-26 14:57 ` Stefan Monnier 2009-11-26 16:28 ` Lennart Borgman 2009-11-27 6:36 ` Richard Stallman 1 sibling, 2 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-26 14:57 UTC (permalink / raw) To: rms; +Cc: dak, emacs-devel > If it is nearly always due to a bug, a user error, or bad data, > perhaps it would be good to display a diagnostic after file commands > that put them in the buffer. Perhaps pop up a buffer explaining what > these mean and what to do about them. If someone wants to take a stab at it, that's fine by me, but it looks way too difficult for me. The origin of the problem can be so diverse that it'll be difficult to come up with instructions that will be useful and will not confuse a significant part of the user population. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-26 14:57 ` Stefan Monnier @ 2009-11-26 16:28 ` Lennart Borgman 2009-11-27 6:36 ` Richard Stallman 1 sibling, 0 replies; 96+ messages in thread From: Lennart Borgman @ 2009-11-26 16:28 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, rms, emacs-devel On Thu, Nov 26, 2009 at 3:57 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote: >> If it is nearly always due to a bug, a user error, or bad data, >> perhaps it would be good to display a diagnostic after file commands >> that put them in the buffer. Perhaps pop up a buffer explaining what >> these mean and what to do about them. > > If someone wants to take a stab at it, that's fine by me, but it looks > way too difficult for me. The origin of the problem can be so diverse > that it'll be difficult to come up with instructions that will be useful > and will not confuse a significant part of the user population. Is that not good enough instructions to put up? ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-26 14:57 ` Stefan Monnier 2009-11-26 16:28 ` Lennart Borgman @ 2009-11-27 6:36 ` Richard Stallman 1 sibling, 0 replies; 96+ messages in thread From: Richard Stallman @ 2009-11-27 6:36 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel If someone wants to take a stab at it, that's fine by me, but it looks way too difficult for me. The origin of the problem can be so diverse that it'll be difficult to come up with instrcutions that will be useful and will not confuse a significant part of the user population. How about: It's possible Emacs guessed the wrong coding system to decode the file. [advice on how to check that, and how to specify a different coding system] If these strange characters are due to bad data in a file you visited, just try not to let them worry you. If you think they appeared due to a bug in Emacs, please send a bug report using M-x report-emacs-bug. If they appear for some other reason not mentioned above, please consider its absence from this message to be a bug in Emacs, and please send a bug report using M-x report-emacs-bug. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Displaying bytes 2009-11-24 1:28 ` Displaying bytes Stefan Monnier 2009-11-24 22:47 ` Richard Stallman @ 2009-11-24 22:47 ` Richard Stallman 1 sibling, 0 replies; 96+ messages in thread From: Richard Stallman @ 2009-11-24 22:47 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel It means that the lower 128 chars coincide with those of ASCII. We could make that more self-explanatory in the buffer. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 21:25 ` Alan Mackenzie 2009-11-19 22:31 ` David Kastrup @ 2009-11-20 8:48 ` Eli Zaretskii 1 sibling, 0 replies; 96+ messages in thread From: Eli Zaretskii @ 2009-11-20 8:48 UTC (permalink / raw) To: Alan Mackenzie; +Cc: dak, emacs-devel > Date: Thu, 19 Nov 2009 21:25:50 +0000 > From: Alan Mackenzie <acm@muc.de> > Cc: David Kastrup <dak@gnu.org>, emacs-devel@gnu.org > > Why couldn't Emacs have simply displayed the character as "ñ"? Because Emacs does not interpret raw bytes as human-readable characters, by design. You could set unibyte-display-via-language-environment to get it displayed as "ñ", but that's only a display setting, it doesn't change the basic fact that Emacs is _not_ treating 241 in a unibyte string as a character. ^ permalink raw reply [flat|nested] 96+ messages in thread
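[A minimal sketch of the display setting Eli mentions. It assumes a Latin-1 language environment; it changes only how raw bytes are rendered, not how Emacs classifies them:]

```elisp
;; Display-only setting: raw bytes 128-255 in unibyte text are shown
;; using the current language environment instead of octal escapes.
(setq unibyte-display-via-language-environment t)
;; After this, a stray byte #xF1 is rendered as "ñ" (in a Latin-1
;; environment) rather than as \361, but C-x = still reports it as
;; a raw byte, exactly as Eli describes.
```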
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 18:08 ` Alan Mackenzie 2009-11-19 19:25 ` Davis Herring @ 2009-11-19 19:52 ` Eli Zaretskii 2009-11-19 20:53 ` Alan Mackenzie 2009-11-19 20:05 ` Stefan Monnier 2 siblings, 1 reply; 96+ messages in thread From: Eli Zaretskii @ 2009-11-19 19:52 UTC (permalink / raw) To: Alan Mackenzie; +Cc: dak, emacs-devel > Date: Thu, 19 Nov 2009 18:08:48 +0000 > From: Alan Mackenzie <acm@muc.de> > Cc: emacs-devel@gnu.org > > No, you (all of you) are missing the point. That point is that if an > Emacs Lisp hacker writes "?ñ", it should work, regardless of > what "codepoint" it has, what "bytes" represent it, whether those > "bytes" are coded with a different codepoint, or what have you. No can do, as long as we support both unibyte and multibyte buffers and strings. > OK. Surely displaying it as "\361" is a bug? It's no more a bug than this: M-: ?a RET => 97 If `a' can be represented as 97, then why cannot \361 be represented as 4194289? > So, how did the character "ñ" get turned into the illegal byte #xf1? It did so because you used aset to put it into a unibyte string. > Are you saying that Emacs is converting "?ñ" and "?ä" into the wrong > integers? Emacs can convert it into 2 distinct integer representations. It decides which one by the context. And you just happened to give it the wrong context. > What is the correct Emacs internal representation for "ñ" and "ä"? That depends on whether they will be put into a multibyte string/buffer or a unibyte one. > > Because Emacs has no separate "character" data type. > > For which I am thankful. Then please understand that there's no bug here. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 19:52 ` Eli Zaretskii @ 2009-11-19 20:53 ` Alan Mackenzie 2009-11-19 22:16 ` David Kastrup 0 siblings, 1 reply; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 20:53 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, emacs-devel Hi, Eli! On Thu, Nov 19, 2009 at 09:52:20PM +0200, Eli Zaretskii wrote: > > Date: Thu, 19 Nov 2009 18:08:48 +0000 > > From: Alan Mackenzie <acm@muc.de> > > Cc: emacs-devel@gnu.org > > No, you (all of you) are missing the point. That point is that if an > > Emacs Lisp hacker writes "?ñ", it should work, regardless of what > > "codepoint" it has, what "bytes" represent it, whether those "bytes" > > are coded with a different codepoint, or what have you. > No can do, as long as we support both unibyte and multibyte buffers > and strings. This seems to be the big thing. That ?ñ has no unique meaning. The current situation violates the description on the elisp page "Basic Char Syntax", which describes the situation as I understood it up until half an hour ago. > > OK. Surely displaying it as "\361" is a bug? > If `a' can be represented as 97, then why cannot \361 be represented > as 4194289? ROFLMAO. If this weren't true, you couldn't invent it. ;-) > > So, how did the character "ñ" get turned into the illegal byte #xf1? > It did so because you used aset to put it into a unibyte string. So, what should I have done to achieve the desired effect? How should I modify "(aset nl 0 ?ü)" so that it does the Right Thing? > > Are you saying that Emacs is converting "?ñ" and "?ä" into the wrong > > integers? > Emacs can convert it into 2 distinct integer representations. It > decides which one by the context. And you just happened to give it > the wrong context. OK, I understand that now, thanks. > > > Because Emacs has no separate "character" data type. > > For which I am thankful. > Then please understand that there's no bug here. Oh, I disagree with that. 
But, whatever.... -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
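[One answer to Alan's question here, sketched from the explanations elsewhere in the thread (untested against 23.1 specifically): make the string multibyte before storing a non-ASCII character, so `aset' stores the character rather than the raw byte.]

```elisp
;; Convert the unibyte literal to multibyte first; aset then stores
;; the character ñ (code 241), not the raw byte 241.
(let ((nl (string-to-multibyte "\n")))
  (aset nl 0 ?ñ)
  (insert nl))   ; should insert "ñ", not "\361"
```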
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 20:53 ` Alan Mackenzie @ 2009-11-19 22:16 ` David Kastrup 2009-11-20 8:55 ` Eli Zaretskii 0 siblings, 1 reply; 96+ messages in thread From: David Kastrup @ 2009-11-19 22:16 UTC (permalink / raw) To: emacs-devel Alan Mackenzie <acm@muc.de> writes: > Hi, Eli! > > On Thu, Nov 19, 2009 at 09:52:20PM +0200, Eli Zaretskii wrote: >> > Date: Thu, 19 Nov 2009 18:08:48 +0000 >> > From: Alan Mackenzie <acm@muc.de> >> > Cc: emacs-devel@gnu.org > >> > No, you (all of you) are missing the point. That point is that if an >> > Emacs Lisp hacker writes "?ñ", it should work, regardless of what >> > "codepoint" it has, what "bytes" represent it, whether those "bytes" >> > are coded with a different codepoint, or what have you. > >> No can do, as long as we support both unibyte and multibyte buffers >> and strings. > > This seems to be the big thing. That ?ñ has no unique meaning. Wrong. It means the character code of the character ñ in Emacs' internal encoding. > The current situation violates the description on the elisp page > "Basic Char Syntax", which describes the situation as I understood it > up until half an hour ago. Hm? 2.3.3.1 Basic Char Syntax ......................... Since characters are really integers, the printed representation of a character is a decimal number. This is also a possible read syntax for a character, but writing characters that way in Lisp programs is not clear programming. You should _always_ use the special read syntax formats that Emacs Lisp provides for characters. These syntax formats start with a question mark. This makes very very very clear that we are talking about an integer here. Not that the higher node does not also mention this: 2.3.3 Character Type -------------------- A "character" in Emacs Lisp is nothing more than an integer. In other words, characters are represented by their character codes. 
For example, the character `A' is represented as the integer 65. >> > OK. Surely displaying it as "\361" is a bug? > >> If `a' can be represented as 97, then why cannot \361 be represented >> as 4194289? > > ROFLMAO. If this weren't true, you couldn't invent it. ;-) Since raw bytes above 127 are not legal utf-8 sequences and we want some character representation for them, and since character codes 128 to 255 are already valid Unicode codepoints, the obvious solution is to use numbers that aren't valid Unicode codepoints. One could have chosen -128 to -255 for example. Except that we don't have a natural algorithm for encoding those in a superset of utf-8. >> > So, how did the character "ñ" get turned into the illegal byte >> > #xf1? > >> It did so because you used aset to put it into a unibyte string. > > So, what should I have done to achieve the desired effect? How should > I modify "(aset nl 0 ?ü)" so that it does the Right Thing? Using aset on strings is crude. If it were up to me, I would not allow this operation at all. >> > Are you saying that Emacs is converting "?ñ" and "?ä" into the >> > wrong integers? > >> Emacs can convert it into 2 distinct integer representations. It >> decides which one by the context. And you just happened to give it >> the wrong context. > > OK, I understand that now, thanks. Too bad that it's wrong. ?ñ is the integer that is Emacs' internal character code for ñ. A single integer representation, only different on Emacsen with different internal character codes. If you want to produce an actual string from it, use char-to-string. -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
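[A sketch of the `char-to-string' route David recommends; for a non-ASCII character the result is a multibyte string, so no byte/char ambiguity can arise:]

```elisp
(char-to-string ?ñ)                        ; => "ñ" (one-char string)
(multibyte-string-p (char-to-string ?ñ))   ; => t for non-ASCII chars
```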
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 22:16 ` David Kastrup @ 2009-11-20 8:55 ` Eli Zaretskii 0 siblings, 0 replies; 96+ messages in thread From: Eli Zaretskii @ 2009-11-20 8:55 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Date: Thu, 19 Nov 2009 23:16:24 +0100 > > >> > Are you saying that Emacs is converting "?ñ" and "?ä" into the > >> > wrong integers? > > > >> Emacs can convert it into 2 distinct integer representations. It > >> decides which one by the context. And you just happened to give it > >> the wrong context. > > > > OK, I understand that now, thanks. > > Too bad that it's wrong. ?ñ is the integer that is Emacs' internal > character code for ñ. What I wrote was not about ?ñ itself (which is indeed just an integer 241 in Emacs 23), but about the two possibilities to convert it to the internal representation when it is inserted into a string (or a buffer, for that matter). One possibility is to convert it to a UTF-8 encoding of the Latin-1 character ñ, the other is to convert to the (extended) UTF-8 encoding of a character whose codepoint is 4194289. ^ permalink raw reply [flat|nested] 96+ messages in thread
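[The two conversion paths Eli describes can be made explicit. A sketch, assuming Emacs 23 semantics; both expressions yield one-character multibyte strings:]

```elisp
;; Byte path: 241 kept as a raw byte, i.e. the character 4194289,
;; which displays as \361.
(string-to-multibyte (unibyte-string 241))
;; Char path: 241 interpreted as the Latin-1 character ñ.
(string 241)   ; => "ñ"
```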
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 18:08 ` Alan Mackenzie 2009-11-19 19:25 ` Davis Herring 2009-11-19 19:52 ` Eli Zaretskii @ 2009-11-19 20:05 ` Stefan Monnier 2009-11-19 21:27 ` Alan Mackenzie 2 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 20:05 UTC (permalink / raw) To: Alan Mackenzie; +Cc: David Kastrup, emacs-devel > OK. Surely displaying it as "\361" is a bug? Should it not display as > "\17777761". If it did, it would have saved half of my ranting. Hmm.. I lost you here. How would it have helped you? Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 20:05 ` Stefan Monnier @ 2009-11-19 21:27 ` Alan Mackenzie 0 siblings, 0 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 21:27 UTC (permalink / raw) To: Stefan Monnier; +Cc: David Kastrup, emacs-devel Hi, Stefan! On Thu, Nov 19, 2009 at 03:05:59PM -0500, Stefan Monnier wrote: > > OK. Surely displaying it as "\361" is a bug? Should it not display as > > "\17777761". If it did, it would have saved half of my ranting. > Hmm.. I lost you here. How would it have helped you? I wouldn't have wasted an hour trying to sort out what was apparently wrong with the coding systems. > Stefan -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:58 ` Alan Mackenzie ` (2 preceding siblings ...) 2009-11-19 16:55 ` David Kastrup @ 2009-11-19 19:43 ` Eli Zaretskii 2009-11-19 21:57 ` Alan Mackenzie 2009-11-19 20:02 ` Stefan Monnier 4 siblings, 1 reply; 96+ messages in thread From: Eli Zaretskii @ 2009-11-19 19:43 UTC (permalink / raw) To: Alan Mackenzie; +Cc: jasonr, schwab, monnier, emacs-devel > Date: Thu, 19 Nov 2009 15:58:48 +0000 > From: Alan Mackenzie <acm@muc.de> > Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@linux-m68k.org>, > Jason Rumney <jasonr@gnu.org> > > > No: the string does not contain any characters, only bytes, because it's > > a unibyte string. > > I'm thinking from the lisp viewpoint. The string is a data structure > which contains characters. I really don't want to have to think about > the difference between "chars" and "bytes" when I'm hacking lisp. If I > do, then the abstraction "string" is broken. No, it isn't. Emacs supports unibyte strings and multibyte strings. The latter hold characters, but the former hold raw bytes. See "(elisp) Text Representations". > > The byte 241 can be inserted in multibyte strings and buffers because > > it is also a char of code 4194289 (which gets displayed as \361). > > Hang on a mo'! How can the byte 241 "be" a char of code 4194289? This > is some strange usage of the word "be" that I wasn't previously aware > of. ;-) That's how Emacs 23 represents raw bytes in multibyte buffers and strings. > At this point, would you please just agree with me that when I do > > (setq nl "\n") > (aset nl 0 ?ñ) > (insert nl) > > , what should appear on the screen should be "ñ", NOT "\361"? No, I don't agree. If you want to get a human-readable text string, don't use aset; use string operations instead. ^ permalink raw reply [flat|nested] 96+ messages in thread
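[The distinction underlying this exchange can be checked directly; a sketch of why the original example misbehaves: an ASCII-only literal starts out unibyte, a non-ASCII literal (read from multibyte source) does not.]

```elisp
(multibyte-string-p "\n")   ; => nil: "\n" is a unibyte string
(multibyte-string-p "ñ")    ; => t: this literal holds a character
```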
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 19:43 ` Eli Zaretskii @ 2009-11-19 21:57 ` Alan Mackenzie 2009-11-19 23:10 ` Stefan Monnier 0 siblings, 1 reply; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 21:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, schwab, monnier, jasonr Hi, Eli! On Thu, Nov 19, 2009 at 09:43:29PM +0200, Eli Zaretskii wrote: > > Date: Thu, 19 Nov 2009 15:58:48 +0000 > > From: Alan Mackenzie <acm@muc.de> > > Cc: emacs-devel@gnu.org, Andreas Schwab <schwab@linux-m68k.org>, > > Jason Rumney <jasonr@gnu.org> > > > No: the string does not contain any characters, only bytes, because > > > it's a unibyte string. > > I'm thinking from the lisp viewpoint. The string is a data structure > > which contains characters. I really don't want to have to think > > about the difference between "chars" and "bytes" when I'm hacking > > lisp. If I do, then the abstraction "string" is broken. > No, it isn't. Emacs supports unibyte strings and multibyte strings. > The latter hold characters, but the former hold raw bytes. See > "(elisp) Text Representations". The abstraction is broken. It is broken because it isn't abstract - its users have to think about the way characters are represented. In an effective abstraction, a user could just write "ñ" or ?ñ and rely on the underlying mechanisms to work. Instead of the abstraction "string", we have two grossly inferior abstractions, "unibyte string" and "multibyte string". Please suggest to me the correct elisp to "replace the zeroth character of an existing string with Spanish n-twiddle". If this is impossible to write, or it's grossly larger than the buggy "(aset nl 0 ?ñ)", that's a demonstration of the breakage. > > > The byte 241 can be inserted in multibyte strings and buffers > > > because it is also a char of code 4194289 (which gets displayed as > > > \361). > > Hang on a mo'! How can the byte 241 "be" a char of code 4194289? 
This > > is some strange usage of the word "be" that I wasn't previously aware > > of. ;-) > That's how Emacs 23 represents raw bytes in multibyte buffers and > strings. Why is it necessary to distinguish between 'A' and 65? Surely they're both just 0x41? I'm missing something here. > > At this point, would you please just agree with me that when I do > > (setq nl "\n") > > (aset nl 0 ?ñ) > > (insert nl) > > , what should appear on the screen should be "ñ", NOT "\361"? > No, I don't agree. If you want to get a human-readable text string, > don't use aset; use string operations instead. There aren't any. `store-substring' will fail if the bits-and-bytes representation of the new bit differ in size from the old bit, thus surely isn't any better than `aset'. At least `aset' tries to convert to multibyte. I don't imagine anybody here would hold that the current state of strings is ideal. I'm still trying to piece together what the essence of the problem is. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 21:57 ` Alan Mackenzie @ 2009-11-19 23:10 ` Stefan Monnier 0 siblings, 0 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 23:10 UTC (permalink / raw) To: Alan Mackenzie; +Cc: Eli Zaretskii, emacs-devel, schwab, jasonr > The abstraction is broken. It is broken because it isn't abstract - its > users have to think about the way characters are represented. In an > effective abstraction, a user could just write "ñ" or ?ñ and rely on the > underlying mechanisms to work. > Instead of the abstraction "string", we have two grossly inferior > abstractions, "unibyte string" and "multibyte string". No: the abstraction "multibyte string" is what you call "a string", it's absolutely identical. The only problem is that there's one tiny but significant unsupported spot: when you write a string constant you may think it's a multibyte string, but Emacs may disagree. The abstraction "unibyte string" is what you might call "a byte array". It doesn't have much to do with your idea of a string. > Please suggest to me the correct elisp to "replace the zeroth character > of an existing string with Spanish n-twiddle". For a unibyte string, it's impossible since "Spanish n-twiddle" is not a byte. For multibyte strings, `aset' will work dandy (tho inefficiently of course because we're talking about a string, not an array). > If this is impossible to write, or it's grossly larger than the buggy > "(aset nl 0 ?ñ)", that's a demonstration of the breakage. Except the breakage is elsewhere: you expect `nl' to be a multibyte string (i.e. "a string" in your mind), whereas Emacs tricked you earlier and `nl' is really a byte array. > Why is it necessary to distinguish between 'A' and 65? It's not usually. Because in almost all coding systems, the character A is represented by the byte 65. >> No, I don't agree. 
If you want to get a human-readable text string, >> don't use aset; use string operations instead. > There aren't any. Of course there are: substring+concat. > I don't imagine anybody here would hold that the current state of strings > is ideal. I'm still trying to piece together what the essence of the > problem is. The essence is that "\n" is not what you think of as a string: it's a byte array instead. And Emacs managed to do enough magic to trick you into thinking until now that it's just like a string. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
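[The substring+concat approach Stefan mentions, applied to Alan's example. A sketch; `concat' with any multibyte argument yields a multibyte result, so the byte/char ambiguity never arises:]

```elisp
(let ((nl "\n"))
  (setq nl (concat "ñ" (substring nl 1)))  ; replace char 0 with ñ
  (insert nl))                             ; inserts "ñ"
```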
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:58 ` Alan Mackenzie ` (3 preceding siblings ...) 2009-11-19 19:43 ` Eli Zaretskii @ 2009-11-19 20:02 ` Stefan Monnier 4 siblings, 0 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 20:02 UTC (permalink / raw) To: Alan Mackenzie; +Cc: emacs-devel, Andreas Schwab, Jason Rumney >> No: the string does not contain any characters, only bytes, because it's >> a unibyte string. > I'm thinking from the lisp viewpoint. So am I. Lisp also manipulates bytes sometimes. What happens is that you're working mostly on a major mode, so you mostly never deal with processes and files, so basically your whole world is (or should be) multibyte and you never want to bump into a byte. > I really don't want to have to think about the difference between > "chars" and "bytes" when I'm hacking lisp. When you write code that gets an email message via a connection to an IMAP server, you have no choice but to care about the distinction between the sequence of bytes you receive and the sequence of chars&images you want to turn it into. That's true for any language, Elisp included. > If I do, then the abstraction "string" is broken. Not sure in which way. >> So it contains the byte 241, not the character ñ. > That is then a bug. I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)". ?ñ = 241 = #xf1 = #o361 There is absolutely no difference between the two expressions once they've been read: the reader turns ?ñ into the integer 241. >> The byte 241 can be inserted in multibyte strings and buffers because >> it is also a char of code 4194289 (which gets displayed as \361). > Hang on a mo'! How can the byte 241 "be" a char of code 4194289? This > is some strange usage of the word "be" that I wasn't previously aware > of. ;-) Agreed. 
> At this point, would you please just agree with me that when I do > (setq nl "\n") > (aset nl 0 ?ñ) > (insert nl) > , what should appear on the screen should be "ñ", NOT "\361"? Thanks! I have already agreed. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 8:20 ` Alan Mackenzie 2009-11-19 8:50 ` Miles Bader 2009-11-19 10:16 ` Fwd: " Andreas Schwab @ 2009-11-19 14:08 ` Stefan Monnier 2009-11-19 14:50 ` Jason Rumney 2009-11-19 17:08 ` Fwd: " Alan Mackenzie 2 siblings, 2 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 14:08 UTC (permalink / raw) To: Alan Mackenzie; +Cc: emacs-devel > The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets > displayed There are many differences that cause it to work completely differently: > - when I do M-: (aset nl 0 ?ñ), I get > "2289 (#o4361, #x8f1)" (Emacs 22.3) > "241 (#o361, #xf1)" (Emacs 23.1) ?ñ = 2289 in Emacs-22 ?ñ = 241 in Emacs-23 So in Emacs-22, there is no possible confusion for this char with a byte. So when you do the `aset', Emacs-22 converts the unibyte string nl to multibyte, whereas Emacs-23 doesn't. From then on, in Emacs-22 your example is all multibyte, so there's no surprise. Now if in Emacs-22 you do instead (aset nl 0 241), where 241 in Emacs-22 is not a valid char and can hence only be a byte, then aset leaves the string as unibyte and we end up with the same nl as in Emacs-23. But if you then (insert nl), Emacs-22 will probably end up inserting a ñ in your buffer, because Emacs-22 performs a decoding step using your language environment when inserting a unibyte string into a multibyte buffer (this used to be helpful for code that didn't know enough about Mule to set up coding systems properly, which is why it was done, but nowadays it was just hiding bugs and encouraging sloppiness in coding so we removed it). > fix it before the pretest? How about interpreting "\n" and friends as > multibyte or unibyte according to the prevailing flavour? I'm not sure what that means. But maybe "\n" should be multibyte, yes. >> If you give us more context (i.e. more of the real code where the >> problem show up), maybe we can tell you how to avoid it. > OK. 
I have my own routine to display regexps. As a first step, I > translate \n -> ñ, (and \t, \r, \f similarly). This is how: > (defun translate-rnt (regexp) > "REGEXP is a string. Translate any \t \n \r and \f characters > to weird non-ASCII printable characters: \t to Î (206, \xCE), \n > to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3). > The original string is modified." > (let (ch pos) > (while (setq pos (string-match "[\t\n\r\f]" regexp)) > (setq ch (aref regexp pos)) > (aset regexp pos ; <=================== > (cond ((eq ch ?\t) ?Î) > ((eq ch ?\n) ?ñ) > ((eq ch ?\r) ?®) > (t ?£)))) > regexp)) Each one of those `aset' (when performed according to your wishes) would change the byte-size of the string, so it would internally require copying the whole string each time: aset on (multibyte) strings is very inefficient (compared to what most people expect, not necessarily compared to other operations). I'd recommend you use higher-level operations since they'll work just as well and are less susceptible to such problems: (replace-regexp-in-string "[\t\n\r\f]" (lambda (s) (or (cdr (assoc s '(("\t" . "Î") ("\n" . "ñ") ("\r" . "®")))) "£")) regexp) > Why do we have both unibyte and multibyte? Is there any reason > not to remove unibyte altogether (though obviously not for 23.2). Because bytes and chars are different, so we have strings of bytes and strings of chars. The problem with it is not their combined existence, but the fact that they are not different enough. Many people don't understand the difference between chars and bytes, but even more people can't figure out which Elisp operation returns a unibyte string and which a multibyte string, and that for a "good" reason: it's very difficult to predict. Emacs-23 tries to help in this in the following ways: - `string' always builds a multibyte string now, so if you want a unibyte string, you need to use the new `unibyte-string' function. 
- we don't automatically perform encoding/decoding conversions between the two forms, so we hide the difference a bit less. We should probably move towards making all string immediates multibyte and add a new syntax for unibyte immediates. > What was the change between 22.3 and 23.1 that broke my code? Mostly: the change to unibyte internal representation which made 241 (and other byte values) ambiguous since it can also be interpreted now as a character value. > Would it, perhaps, be a good idea to reconsider that change? I think you'll understand that reverting to the emacs-mule (iso-2022-based) internal representation is not really on the table ;-) Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
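[The two Emacs 23 constructors Stefan contrasts above, sketched side by side:]

```elisp
(multibyte-string-p (string ?ñ))   ; => t: `string' now builds multibyte
(unibyte-string 241)               ; explicit byte array holding #xF1
```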
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 14:08 ` Stefan Monnier @ 2009-11-19 14:50 ` Jason Rumney 2009-11-19 15:27 ` Stefan Monnier 2009-11-19 17:08 ` Fwd: " Alan Mackenzie 1 sibling, 1 reply; 96+ messages in thread From: Jason Rumney @ 2009-11-19 14:50 UTC (permalink / raw) To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: > We should probably move towards making all string immediates multibyte > and add a new syntax for unibyte immediates. Also, make it an error to try to put a multibyte character in a unibyte string rather than automatically converting the string to multibyte or silently truncating to 8 bit or whatever Emacs does now. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 14:50 ` Jason Rumney @ 2009-11-19 15:27 ` Stefan Monnier 2009-11-19 23:12 ` Miles Bader 0 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-19 15:27 UTC (permalink / raw) To: Jason Rumney; +Cc: Alan Mackenzie, emacs-devel >> We should probably moved towards making all string immediates multibyte >> and add a new syntax to unibyte immediates. > Also, make it an error to try to put a multibyte character in a unibyte > string rather than automatically converting the string to multibyte or Yes. Currently, we need this conversion specifically because many strings start as unibyte even though they really should start right away as multibyte. This said, `aset' in multibyte strings is still evil and unnecessary. > silently truncating to 8 bit or whatever Emacs does now. I don't think Emacs-23 does such silent truncations any more, tho there might be some such checks that we still haven't installed. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-19 15:27 ` Stefan Monnier @ 2009-11-19 23:12 ` Miles Bader 2009-11-20 2:16 ` Stefan Monnier 2009-11-20 3:37 ` Stephen J. Turnbull 0 siblings, 2 replies; 96+ messages in thread From: Miles Bader @ 2009-11-19 23:12 UTC (permalink / raw) To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel, Jason Rumney Stefan Monnier <monnier@iro.umontreal.ca> writes: > many strings start as unibyte even though they really should start > right away as multibyte. That seems the fundamental problem here. It seems better to make unibyte strings something that can only be created with some explicit operation. -Miles -- "Suppose we've chosen the wrong god. Every time we go to church we're just making him madder and madder." -- Homer Simpson ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-19 23:12 ` Miles Bader @ 2009-11-20 2:16 ` Stefan Monnier 2009-11-20 3:37 ` Stephen J. Turnbull 1 sibling, 0 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-20 2:16 UTC (permalink / raw) To: Miles Bader; +Cc: Alan Mackenzie, emacs-devel, Jason Rumney >> many strings start as unibyte even though they really should start >> right away as multibyte. > That seems the fundamental problem here. > It seems better to make unibyte strings something that can only be > created with some explicit operation. Agreed. As I said earlier in this thread: We should probably move towards making all string immediates multibyte and add a new syntax for unibyte immediates. -- Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-19 23:12 ` Miles Bader 2009-11-20 2:16 ` Stefan Monnier @ 2009-11-20 3:37 ` Stephen J. Turnbull 2009-11-20 4:30 ` Stefan Monnier 1 sibling, 1 reply; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-20 3:37 UTC (permalink / raw) To: Miles Bader; +Cc: Alan Mackenzie, Jason Rumney, Stefan Monnier, emacs-devel Miles Bader writes: > Stefan Monnier <monnier@iro.umontreal.ca> writes: > > many strings start as unibyte even though they really should start > > right away as multibyte. > > That seems the fundamental problem here. > > It seems better to make unibyte strings something that can only be > created with some explicit operation. I don't see why you *need* them at all. Both pre-Emacs-integration Mule and XEmacs do fine with a multibyte representation for binary. Nobody has complained about performance of stream operations since Kyle Jones and Hrvoje Niksic bitched and we did some measurements in 1998 or so. It turns out that (as you'd expect) multibyte stream operations (except Boyer-Moore, which takes no performance hit :-) are about 50% slower because the representation is about 50% bigger. But this is rarely noticeable to users. The noticeable performance problems turned out to be a problem with Unix interfaces, not multibyte. The performance problem is in array operations, since (without caching) finding a particular character position is O(position). If you want to turn Emacs into an engine for general network programming and the like, yes, it would be good to have a separate unibyte type. This is what Python does, but Emacs would not have to go through the agony of switching from a unibyte representation for human-readable text to a multibyte representation the way Python does for Python 3. In that case, Emacs should not create them without an explicit operation, and there should be a separate notation such as #b"this is a unibyte string" (although #b may already be taken?)
for literals. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-20 3:37 ` Stephen J. Turnbull @ 2009-11-20 4:30 ` Stefan Monnier 2009-11-20 7:18 ` Stephen J. Turnbull 0 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-20 4:30 UTC (permalink / raw) To: Stephen J. Turnbull Cc: Alan Mackenzie, emacs-devel, Jason Rumney, Miles Bader > I don't see why you *need* them at all. We don't need the unibyte representation. But we do need to distinguish bytes and chars, encoded strings from non-encoded strings, etc... What representation is used for them is secondary, but using different representations for the two cases doesn't seem to be a source of problems. The source of problems is that inherited history where we mixed the unibyte and multibyte objects and tried to pretend they were just one and the same thing and that conversion between them can be done automatically. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
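[Editorial note: the byte/char distinction Stefan draws can be made concrete. A sketch; 4194289 = #x3ffff1 is the raw-byte pseudo-character already shown by `C-x =' earlier in the thread.]

```elisp
(char-to-string ?ñ)        ; one-character multibyte string "ñ"
(unibyte-string 241)       ; one-byte unibyte string
;; Promoting the byte does not turn it into ?ñ; it becomes the
;; raw-byte pseudo-character instead:
(aref (string-to-multibyte (unibyte-string 241)) 0)  ; => 4194289
```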
* Re: Inadequate documentation of silly characters on screen. 2009-11-20 4:30 ` Stefan Monnier @ 2009-11-20 7:18 ` Stephen J. Turnbull 2009-11-20 14:16 ` Stefan Monnier 0 siblings, 1 reply; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-20 7:18 UTC (permalink / raw) To: Stefan Monnier; +Cc: Miles Bader, Alan Mackenzie, Jason Rumney, emacs-devel Stefan Monnier writes: > What representation is used for them is secondary, but using different > representations for the two cases doesn't seem to be a source > of problems. The source of problems is that inherited history where we > mixed the unibyte and multibyte objects and tried to pretend they were > just one and the same thing and that conversion between them can be > done automatically. Er, they *were* one and the same thing because of string-as-unibyte and friends. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-20 7:18 ` Stephen J. Turnbull @ 2009-11-20 14:16 ` Stefan Monnier 2009-11-21 4:13 ` Stephen J. Turnbull 0 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-20 14:16 UTC (permalink / raw) To: Stephen J. Turnbull Cc: Miles Bader, Alan Mackenzie, Jason Rumney, emacs-devel >> What representation is used for them is secondary, but using different >> representations for the two cases doesn't seem to be a source >> of problems. The source of problems is that inherited history where we >> mixed the unibyte and multibyte objects and tried to pretend they were >> just one and the same thing and that conversion between them can be >> done automatically. > Er, they *were* one and the same thing because of string-as-unibyte > and friends. string-as-unibyte returns a new string, so no: they were not the same. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
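[Editorial note: that distinctness is easy to check. A sketch assuming Emacs 23's utf-8-based internal representation.]

```elisp
(let* ((s (string ?ñ))              ; multibyte, length 1
       (u (string-as-unibyte s)))   ; the internal bytes, as a new string
  (list (eq s u)      ; => nil: a distinct object
        (length u)))  ; => 2: ñ is the bytes #xC3 #xB1 internally
```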
* Re: Inadequate documentation of silly characters on screen. 2009-11-20 14:16 ` Stefan Monnier @ 2009-11-21 4:13 ` Stephen J. Turnbull 2009-11-21 5:24 ` Stefan Monnier 0 siblings, 1 reply; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-21 4:13 UTC (permalink / raw) To: Stefan Monnier; +Cc: Alan Mackenzie, Jason Rumney, emacs-devel, Miles Bader Stefan Monnier writes: > string-as-unibyte returns a new string, so no: they were not the same. Sorry, `toggle-enable-multibyte-characters' was what I had in mind. So, yes, they *were* *indeed* the same. YHBT (it wasn't intentional). I dunno, de gustibus non est disputandum and all that, but this idea of having an in-band representation for raw bytes in a multibyte string sounds to me like more trouble than it's worth. I think it would be much better to serve (eg) AUCTeX's needs with a special coding system that grabs some unlikely-to-be-used private code space and puts the bytes there. That puts the responsibility for dealing with such perversity[1] on the people who have some idea what they're dealing with, not unsuspecting CC Mode maintainers who won't be using that coding system. And it should be either an error to (aset string pos 241) (sorry Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ). I favor the former, because what Alan is doing screws Spanish-speaking users AFAICS. OTOH, the latter extends naturally if you have plans to add support for fixed-width Unicode buffers (UTF-16 and UTF-32). Vive la différence techniquement! Footnotes: [1] In the sense of "the world is perverse", I'm not blaming AUCTeX or TeX for this! ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 4:13 ` Stephen J. Turnbull @ 2009-11-21 5:24 ` Stefan Monnier 2009-11-21 6:42 ` Stephen J. Turnbull 0 siblings, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-21 5:24 UTC (permalink / raw) To: Stephen J. Turnbull Cc: Alan Mackenzie, Jason Rumney, emacs-devel, Miles Bader > Sorry, `toggle-enable-multibyte-characters' was what I had in mind. > So, yes, they *were* *indeed* the same. YHBT (it wasn't intentional). Oh, yes, *that* one. I haven't yet managed to run a useful Emacs instance with an "assert (BEG == Z);" at the entrance to this nasty function, but I keep hoping I'll get there. > I dunno, de gustibus non est disputandum and all that, but this idea > of having an in-band representation for raw bytes in a multibyte > string sounds to me like more trouble than it's worth. I think it > would be much better to serve (eg) AUCTeX's needs with a special > coding system that grabs some unlikely-to-be-used private code space > and puts the bytes there. That puts the responsibility for dealing > with such perversity[1] on the people who have some idea what they're > dealing with, not unsuspecting CC Mode maintainers who won't be using > that coding system. I don't know what you mean. The eight-bit "chars" were introduced to make sure that decoding+reencoding will always return the exact same byte-sequence, no matter what coding-system was used (i.e. even if the byte-sequence is invalid for that coding-system). Dunno how XEmacs handles it. > And it should be either an error to (aset string pos 241) (sorry > Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ). I > favor the former, because what Alan is doing screws Spanish-speaking > users AFAICS. OTOH, the latter extends naturally if you have plans to > add support for fixed-width Unicode buffers (UTF-16 and UTF-32). I understand this even less.
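[Editorial note: the round-trip property Stefan describes can be exercised directly with the standard coding primitives. A sketch.]

```elisp
;; A stray byte 241 is invalid utf-8, yet decoding must not lose it:
(let* ((bytes   (unibyte-string ?a 241 ?b))
       (decoded (decode-coding-string bytes 'utf-8)))
  (list (aref decoded 1)   ; the eight-bit char 4194289, not ?ñ
        ;; re-encoding restores the exact original byte sequence:
        (equal (encode-coding-string decoded 'utf-8) bytes)))
```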
I think XEmacs's fundamental tradeoffs are subtly different but lead to very far-reaching consequences, and for that reason it's difficult for us to take a step back and understand the other point of view. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 5:24 ` Stefan Monnier @ 2009-11-21 6:42 ` Stephen J. Turnbull 2009-11-21 6:49 ` Stefan Monnier 2009-11-21 12:33 ` David Kastrup 0 siblings, 2 replies; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-21 6:42 UTC (permalink / raw) To: Stefan Monnier; +Cc: Miles Bader, Alan Mackenzie, emacs-devel, Jason Rumney Stefan Monnier writes: > I don't know what you mean. The eight-bit "chars" were introduced to > make sure that decoding+reencoding will always return the exact same > byte-sequence, no matter what coding-system was used (i.e. even if the > byte-sequence is invaldi for that coding-system). Dunno how XEmacs > handles it. Honestly, it currently doesn't, or doesn't very well, despite some work by Aidan. However, I think a well-behaved platform should by default error (something derived from invalid-state, in XEmacs's error hierarchy) in such a case; normally this means corruption in the file. There are special cases like utf8latex whose error messages give you a certain number of octets without respecting character boundaries; I agree there is need to handle this case. What Python 3 (PEP 383) does is provide a family of coding system variants which use invalid Unicode surrogates to encode "raw bytes" for situations where the user asks you to proceed despite invalid octet sequences for the coding system; since Emacs's internal code is UTF-8, any Unicode surrogate is invalid and could be used for this purpose. This would make non-Emacs apps barf errors on such Emacs autosaves, but they'll probably barf on the source file, too. > > And it should be either an error to (aset string pos 241) (sorry > > Alan!) or 241 should be implicitly interpreted as Latin-1 (ie, ?ñ). I > > favor the former, because what Alan is doing screws Spanish-speaking > > users AFAICS. OTOH, the latter extends naturally if you have plans to > > add support for fixed-width Unicode buffers (UTF-16 and UTF-32). 
> > I understand this even less. There's a typo in the expr above, should be "multibyte-string". The proposed treatment of 241 is due to the fact that it is currently illegal in multibyte strings AIUI. Re the bit about Spanish-speakers: AIUI, Alan is translating multiline strings to oneline strings by using an unusual graphic character. But it's only unusual in non-Spanish cases; Spanish-speakers may very well want to include comments like "¡I wanna write this comment in Español!" which would presumably get unfolded to "¡I wanna write this comment in Espa\nol!" Not very nice. Re widechar buffers: the codes for Latin-1 characters in UTF-16 and UTF-32 are just zero-padded extensions of the unibyte codes. I'm pretty sure it's this kind of thing that Ben had in mind when he originally designed the XEmacs version of the Mule internal encoding to make (= (char-int ?ñ) 241) true in all versions of XEmacs. > I think XEmacs's fundamental tradeoffs are subtly different but > lead to very far-reaching consequences, Indeed, but I'm not talking about XEmacs, except for comparison of techniques. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 6:42 ` Stephen J. Turnbull @ 2009-11-21 6:49 ` Stefan Monnier 2009-11-21 7:27 ` Stephen J. Turnbull 2009-11-21 12:33 ` David Kastrup 1 sibling, 1 reply; 96+ messages in thread From: Stefan Monnier @ 2009-11-21 6:49 UTC (permalink / raw) To: Stephen J. Turnbull Cc: Miles Bader, Alan Mackenzie, emacs-devel, Jason Rumney > There's a typo in the expr above, should be "multibyte-string". The > proposed treatment of 241 is due to the fact that it is currently > illegal in multibyte strings AIUI. 241 is perfectly valid in multibyte strings (as well as in unibyte-strings). Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 6:49 ` Stefan Monnier @ 2009-11-21 7:27 ` Stephen J. Turnbull 2009-11-23 1:58 ` Stefan Monnier 0 siblings, 1 reply; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-21 7:27 UTC (permalink / raw) To: Stefan Monnier; +Cc: Alan Mackenzie, emacs-devel Stefan Monnier writes: > 241 is perfectly valid in multibyte strings (as well as in > unibyte-strings). OK, so "invalid" was up to Emacs 22, then? So the problem is that because characters are integers and vice versa, there's no way for the user to let Emacs duck-type multibyte vs unibyte strings for him. If he cares, he needs to check. If he doesn't care, eventually Emacs will punish him for his lapse. I suppose subst-char-in-string is similarly useless for Alan's purpose, then? What he really needs to use is something like (replace-in-string str "\n" "ñ") right? ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 7:27 ` Stephen J. Turnbull @ 2009-11-23 1:58 ` Stefan Monnier 0 siblings, 0 replies; 96+ messages in thread From: Stefan Monnier @ 2009-11-23 1:58 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Alan Mackenzie, emacs-devel > So the problem is that because characters are integers and vice versa, > there's no way for the user to let Emacs duck-type multibyte vs > unibyte strings for him. If he cares, he needs to check. If he > doesn't care, eventually Emacs will punish him for his lapse. > I suppose subst-char-in-string is similarly useless for Alan's > purpose, then? What he really needs to use is something like > (replace-in-string str "\n" "ñ") > right? Pretty much yes. When chars come within strings, the multibyteness of the string indicates what the string elements are (chars or bytes), so as long as you only manipulate strings, Emacs is able to DTRT. As soon as you manipulate actual chars, the ambiguity between chars and bytes for values [128..255] can bite you unless you're careful about how you use them (e.g. about the multibyteness of the strings with which you combine them). That's where `aset' bites. I hate `aset' on strings because it has side-effects (obviously) and because strings aren't vectors so you can't guarantee the expected efficiency, but neither are the source of the problem here. So indeed subst-char-in-string suffers similarly. Stefan ^ permalink raw reply [flat|nested] 96+ messages in thread
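[Editorial note: concretely, the difference looks as follows. `replace-in-string' above is the XEmacs name; the GNU Emacs equivalent is `replace-regexp-in-string'. A sketch.]

```elisp
;; `subst-char-in-string' copies the string and then `aset's into the
;; copy, so it inherits the char/byte ambiguity: on a unibyte input,
;; ?ñ lands as the raw byte 241.
(subst-char-in-string ?\n ?ñ "a\nb")

;; A string-level replacement lets Emacs pick the representation: the
;; multibyte literal "ñ" forces a multibyte result.
(replace-regexp-in-string "\n" "ñ" "a\nb")   ; => "añb"
```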
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 6:42 ` Stephen J. Turnbull 2009-11-21 6:49 ` Stefan Monnier @ 2009-11-21 12:33 ` David Kastrup 2009-11-21 13:55 ` Stephen J. Turnbull 1 sibling, 1 reply; 96+ messages in thread From: David Kastrup @ 2009-11-21 12:33 UTC (permalink / raw) To: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > Stefan Monnier writes: > > > I don't know what you mean. The eight-bit "chars" were introduced > > to make sure that decoding+reencoding will always return the exact > > same byte-sequence, no matter what coding-system was used > > (i.e. even if the byte-sequence is invalid for that coding-system). > > Dunno how XEmacs handles it. > > Honestly, it currently doesn't, or doesn't very well, despite some > work by Aidan. But we don't need to make this a problem for _Emacs_. > However, I think a well-behaved platform should by default error > (something derived from invalid-state, in XEmacs's error hierarchy) in > such a case; normally this means corruption in the file. We take care that it does not mean corruption. And more often it means that you might have been loading with the wrong encoding (people do that all the time). If you edit some innocent ASCII part and save again, you won't appreciate changes all across the file elsewhere in parts you did not touch or see on-screen. Sometimes there is no "right encoding". If I load an executable or an image file with tag strings and change one string in overwrite mode, I want to be able to save again. Compiled Elisp files contain binary strings as well. There may be source files with binary blobs in them, there may be files with parts in different encodings and so on. > There are special cases like utf8latex whose error messages give you a > certain number of octets without respecting character boundaries; I > agree there is need to handle this case. Forget about the TeX problem: that is a red herring.
It is just one case where irreversible corruption is not the right answer. In fact, I know of no case where irreversible corruption is the right answer. "Don't touch what you don't understand" is a good rationale. For XEmacs, following this rationale would currently require erroring out. And I actually recommend that you do so: you will learn the hard way that users like the Emacs solution of "don't touch what you don't understand", namely having artificial code points for losslessly representing the parts Emacs does not understand in a particular encoding, better. > What Python 3 (PEP 383) does is provide a family of coding system > variants which use invalid Unicode surrogates to encode "raw bytes" > for situations where the user asks you to proceed despite invalid > octet sequences for the coding system; since Emacs's internal code is > UTF-8, any Unicode surrogate is invalid and could be used for this > purpose. This would make non-Emacs apps barf errors on such Emacs > autosaves, but they'll probably barf on the source file, too. We currently _have_ such a scheme in place. We just use different Unicode-invalid code points. > There's a typo in the expr above, should be "multibyte-string". The > proposed treatment of 241 is due to the fact that it is currently > illegal in multibyte strings AIUI. It is a perfectly valid character ñ in multibyte strings, but not represented by its single-byte/latin-1 equivalent. > Re widechar buffers: the codes for Latin-1 characters in UTF-16 and > UTF-32 are just zero-padded extensions of the unibyte codes. I think you may be muddling characters and their byte sequence representations. At least I can't read much sense into this statement otherwise. -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 12:33 ` David Kastrup @ 2009-11-21 13:55 ` Stephen J. Turnbull 2009-11-21 14:36 ` David Kastrup 0 siblings, 1 reply; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-21 13:55 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup writes: > > However, I think a well-behaved platform should by default error > > (something derived from invalid-state, in XEmacs's error hierarchy) in > > such a case; normally this means corruption in the file. > > We take care that it does not mean corruption. I meant pre-existing corruption, like your pre-existing disposition to bash XEmacs. Please take it elsewhere; it doesn't belong on Emacs channels. (Of course I'd prefer not to see it on XEmacs channels either, but at least it wouldn't be entirely off-topic there.) > And more often it means that you might have been loading with the > wrong encoding (people do that all the time). If you edit some > innocent ASCII part You can't do that if the file is not in a buffer because the encoding error aborted the conversion. Aborting the conversion is what the Unicode Consortium requires, too, IIRC: errors in UTF-8 (or any other UTF for that matter) are considered *fatal* by the standard. Exactly what that means is up to the application to decide. One plausible approach would be to do what you do now, but make the buffer read-only. > Sometimes there is no "right encoding". So what? The point is that there certainly are *wrong* encodings, namely ones that will result in corruption if you try to save the file in that encoding. There are usually many "usable" encodings (binary is always available, for example). Some will be preferred by users, and that will be reflected in coding system precedence. But when faced with ambiguity, it is best to refuse to guess. > We currently _have_ [a scheme for encoding invalid sequences of > code units] in place. 
We just use different Unicode-invalid code > points [from Python]. Conceded. I realized that later; the important difference is that Python only uses that scheme when explicitly requested. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 13:55 ` Stephen J. Turnbull @ 2009-11-21 14:36 ` David Kastrup 2009-11-21 17:53 ` Stephen J. Turnbull 0 siblings, 1 reply; 96+ messages in thread From: David Kastrup @ 2009-11-21 14:36 UTC (permalink / raw) To: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > David Kastrup writes: > > > > However, I think a well-behaved platform should by default error > > > (something derived from invalid-state, in XEmacs's error > > > hierarchy) in such a case; normally this means corruption in the > > > file. > > > > We take care that it does not mean corruption. > > I meant pre-existing corruption [...] That interpretation is not the business of the editor. It may decide to give a warning, but refusing to work at all does not increase its usefulness. > > And more often it means that you might have been loading with the > > wrong encoding (people do that all the time). If you edit some > > innocent ASCII part > > You can't do that if the file is not in a buffer because the encoding > error aborted the conversion. Not being able to do what I want is not a particularly enticing feature. > Aborting the conversion is what the Unicode Consortium requires, too, > IIRC: An editor is not the same as a validator. It's not its business to decide what files I should be allowed to work with. > errors in UTF-8 (or any other UTF for that matter) are considered > *fatal* by the standard. Exactly what that means is up to the > application to decide. One plausible approach would be to do what you > do now, but make the buffer read-only. Making the buffer read-only is a reasonable thing to do if it can't possibly be written back unchanged. For example, if I load a file in latin-1 and insert a few non-latin-1 characters. In this case Emacs should not just silently write the file in utf-8 because that changes the encoding of some preexisting characters. 
The situation is different if I load a pure ASCII file: in that case, the utf-8 decision is feasible when compatible with the environment. > > Sometimes there is no "right encoding". > > So what? The point is that there certainly are *wrong* encodings, > namely ones that will result in corruption if you try to save the file > in that encoding. But we have a fair amount of encodings (those without escape characters IIRC) which don't imply corruption when saving. And that is a good feature for an editor. For example, when working with version control systems, you want minimal diffs. Encoding systems with escape characters are not good for that. I would strongly advise against Emacs picking any escape-character based encoding (or otherwise non-byte-stream-preserving) automatically. Less breakage is always a good thing. > But when faced with ambiguity, it is best to refuse to guess. You don't need to guess if you just preserve the byte sequence. That makes it somebody else's problem. The GNU utilities have always made it a point to work with arbitrary input without insisting on it being "sensible". Historically, most Unix utilities just crashed when you fed them arbitrary garbage. They have taken a lesson from GNU nowadays. And I consider it a good lesson. > > We currently _have_ [a scheme for encoding invalid sequences of > > code units] in place. We just use different Unicode-invalid code > > points [from Python]. > > Conceded. I realized that later; the important difference is that > Python only uses that scheme when explicitly requested. All in all, it is nobody else's business what encoding Emacs uses for internal purposes. Making Emacs preserve byte streams means that the user has to worry less, not more, about what Emacs might be able to work with. The Emacs 23 internal encoding does a better job not getting into the hair of users with encoding issues than Emacs 22 did, because of a better correspondence with external encodings. 
But ideally, the user should not have to worry about the difference. -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 14:36 ` David Kastrup @ 2009-11-21 17:53 ` Stephen J. Turnbull 2009-11-21 23:30 ` David Kastrup 0 siblings, 1 reply; 96+ messages in thread From: Stephen J. Turnbull @ 2009-11-21 17:53 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > I meant pre-existing corruption [...] > > That interpretation is not the business of the editor. Precisely my point. The editor has *no* way to interpret at the point of encountering the invalid sequence, and therefore it should *stop* and ask the user what to do. That doesn't mean it should throw away the data, but it sure does mean that it should not continue as though there is valid data in the buffer. Emacs is welcome to do that, but I am sure you will get bug reports about it. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 17:53 ` Stephen J. Turnbull @ 2009-11-21 23:30 ` David Kastrup 2009-11-22 1:27 ` Sebastian Rose 0 siblings, 1 reply; 96+ messages in thread From: David Kastrup @ 2009-11-21 23:30 UTC (permalink / raw) To: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > David Kastrup writes: > > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > > I meant pre-existing corruption [...] > > > > That interpretation is not the business of the editor. > > Precisely my point. The editor has *no* way to interpret at the point > of encountering the invalid sequence, and therefore it should *stop* > and ask the user what to do. That doesn't mean it should throw away > the data, but it sure does mean that it should not continue as though > there is valid data in the buffer. > > Emacs is welcome to do that, but I am sure you will get bug reports > about it. Why would we get a bug report about Emacs saving a file changed only in the locations that the user actually edited? People might complain when Emacs does not recognize some encoding properly, but they certainly will not demand that Emacs should stop working altogether. -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-21 23:30 ` David Kastrup @ 2009-11-22 1:27 ` Sebastian Rose 2009-11-22 8:06 ` David Kastrup 0 siblings, 1 reply; 96+ messages in thread From: Sebastian Rose @ 2009-11-22 1:27 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup <dak@gnu.org> writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > >> David Kastrup writes: >> > "Stephen J. Turnbull" <stephen@xemacs.org> writes: >> >> > > I meant pre-existing corruption [...] >> > >> > That interpretation is not the business of the editor. >> >> Precisely my point. The editor has *no* way to interpret at the point >> of encountering the invalid sequence, and therefore it should *stop* >> and ask the user what to do. That doesn't mean it should throw away >> the data, but it sure does mean that it should not continue as though >> there is valid data in the buffer. >> >> Emacs is welcome to do that, but I am sure you will get bug reports >> about it. > > Why would we get a bug report about Emacs saving a file changed only in > the locations that the user actually edited? > > People might complain when Emacs does not recognize some encoding > properly, but they certainly will not demand that Emacs should stop > working altogether. People do indeed complain on the emacs-orgmode mailing list and I can reproduce their problems. You may read the details here: http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html `M-x recode-file-name' doesn't work either. I guess this is related? Best wishes Sebastian ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-22 1:27 ` Sebastian Rose @ 2009-11-22 8:06 ` David Kastrup 2009-11-22 23:52 ` Sebastian Rose 0 siblings, 1 reply; 96+ messages in thread From: David Kastrup @ 2009-11-22 8:06 UTC (permalink / raw) To: emacs-devel Sebastian Rose <sebastian_rose@gmx.de> writes: > David Kastrup <dak@gnu.org> writes: >> "Stephen J. Turnbull" <stephen@xemacs.org> writes: >> >>> David Kastrup writes: >>> > "Stephen J. Turnbull" <stephen@xemacs.org> writes: >>> >>> > > I meant pre-existing corruption [...] >>> > >>> > That interpretation is not the business of the editor. >>> >>> Precisely my point. The editor has *no* way to interpret at the point >>> of encountering the invalid sequence, and therefore it should *stop* >>> and ask the user what to do. That doesn't mean it should throw away >>> the data, but it sure does mean that it should not continue as though >>> there is valid data in the buffer. >>> >>> Emacs is welcome to do that, but I am sure you will get bug reports >>> about it. >> >> Why would we get a bug report about Emacs saving a file changed only in >> the locations that the user actually edited? >> >> People might complain when Emacs does not recognize some encoding >> properly, but they certainly will not demand that Emacs should stop >> working altogether. > > > People do indeed complain on the emacs-orgmode mailing list and I can > reproduce their problems. What meaning of "indeed" are you using here? This is a complaint about Emacs _not_ faithfully replicating a byte pattern that it expects to be in a particular encoding. > http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html > > I guess this is related? It is related, but it bolsters rather than defeats my argument. People don't _like_ Emacs to cop out altogether. -- David Kastrup ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Inadequate documentation of silly characters on screen. 2009-11-22 8:06 ` David Kastrup @ 2009-11-22 23:52 ` Sebastian Rose 0 siblings, 0 replies; 96+ messages in thread From: Sebastian Rose @ 2009-11-22 23:52 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup <dak@gnu.org> writes: > Sebastian Rose <sebastian_rose@gmx.de> writes: > >> David Kastrup <dak@gnu.org> writes: >>> "Stephen J. Turnbull" <stephen@xemacs.org> writes: >>> >>>> David Kastrup writes: >>>> > "Stephen J. Turnbull" <stephen@xemacs.org> writes: >>>> >>>> > > I meant pre-existing corruption [...] >>>> > >>>> > That interpretation is not the business of the editor. >>>> >>>> Precisely my point. The editor has *no* way to interpret at the point >>>> of encountering the invalid sequence, and therefore it should *stop* >>>> and ask the user what to do. That doesn't mean it should throw away >>>> the data, but it sure does mean that it should not continue as though >>>> there is valid data in the buffer. >>>> >>>> Emacs is welcome to do that, but I am sure you will get bug reports >>>> about it. >>> >>> Why would we get a bug report about Emacs saving a file changed only in >>> the locations that the user actually edited? >>> >>> People might complain when Emacs does not recognize some encoding >>> properly, but they certainly will not demand that Emacs should stop >>> working altogether. >> >> >> People do indeed complain on the emacs-orgmode mailing list and I can >> reproduce their problems. > > What meaning of "indeed" are you using here? This is a complaint about > Emacs _not_ faithfully replicating a byte pattern that it expects to be > in a particular encoding. > >> http://www.mail-archive.com/emacs-orgmode@gnu.org/msg19778.html >> >> I guess this is related? > > It is related, but it bolsters rather than defeats my argument. > > People don't _like_ Emacs to cop out altogether. Sorry David. This was not meant as an argument. 
It was more a question, because I was a bit unsure whether this was related (I did not follow the thread that closely). And in that case, the OP reported that Emacs indeed refused to work, in that it didn't want to save the file (which I cannot fully reproduce). I didn't mean to hijack this thread though. Thanks for your answer anyway Sebastian ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: Fwd: Re: Inadequate documentation of silly characters on screen. 2009-11-19 14:08 ` Stefan Monnier 2009-11-19 14:50 ` Jason Rumney @ 2009-11-19 17:08 ` Alan Mackenzie 1 sibling, 0 replies; 96+ messages in thread From: Alan Mackenzie @ 2009-11-19 17:08 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel

Hi, Stefan,

On Thu, Nov 19, 2009 at 09:08:29AM -0500, Stefan Monnier wrote:

> >> If you give us more context (i.e. more of the real code where the
> >> problem shows up), maybe we can tell you how to avoid it.

> > OK.  I have my own routine to display regexps.  As a first step, I
> > translate \n -> ñ, (and \t, \r, \f similarly).  This is how:

> > (defun translate-rnt (regexp)
> >   "REGEXP is a string.  Translate any \t \n \r and \f characters
> > to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
> > to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
> > The original string is modified."
> >   (let (ch pos)
> >     (while (setq pos (string-match "[\t\n\r\f]" regexp))
> >       (setq ch (aref regexp pos))
> >       (aset regexp pos                 ; <===================
> >             (cond ((eq ch ?\t) ?Î)
> >                   ((eq ch ?\n) ?ñ)
> >                   ((eq ch ?\r) ?®)
> >                   (t ?£))))
> >     regexp))

> Each one of those `aset' (when performed according to your wishes) would
> change the byte-size of the string, so it would internally require
> copying the whole string each time: aset on (multibyte) strings is very
> inefficient (compared to what most people expect, not necessarily
> compared to other operations).  I'd recommend you use higher-level
> operations since they'll work just as well and are less susceptible to
> such problems:

>     (replace-regexp-in-string "[\t\n\r\f]"
>                               (lambda (s)
>                                 (or (cdr (assoc s '(("\t" . "Î")
>                                                     ("\n" . "ñ")
>                                                     ("\r" . "®"))))
>                                     "£"))
>                               regexp)

That works 100%.  Even in Emacs 23 ;-).  Thanks!

> Stefan

-- Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply [flat|nested] 96+ messages in thread
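For readers hitting the same raw-byte symptom, the unibyte/multibyte distinction Stefan describes can be probed directly from `M-:'. The following is a minimal sketch using the standard `multibyte-string-p' and `string-to-multibyte' primitives; it is an illustration added here, not code from the thread:

```elisp
;; A literal string containing only ASCII characters is unibyte, so
;; storing a non-ASCII char into it stores the raw byte 241, which
;; displays as \361.
(multibyte-string-p "\n")             ; => nil

;; Forcing a multibyte copy first makes `aset' store the char ?ñ
;; instead of a raw byte (at the cost of reallocating the string,
;; as Stefan notes above).
(let ((nl (string-to-multibyte "\n")))
  (aset nl 0 ?ñ)
  nl)                                 ; => "ñ"
```

This only sidesteps the symptom for a single string; for the general case the non-mutating `replace-regexp-in-string' approach quoted above avoids the question of string flavor entirely.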