* coding tags and utf-16
From: Werner LEMBERG @ 2005-12-21 8:00 UTC

There is a serious problem with coding tags and utf-16 encodings of
any flavour: Emacs simply can't recognize the tag.  This is a
non-trivial problem.  Right now I'm working on a groff preprocessor
which tries to handle this.  I'm doing the following to find the tag
in an encoding-independent way:

  . Check whether the file starts with the BOM (Byte Order Mark) --
    this is one of the following byte sequences:

      UTF-8:  0xEFBBBF
      UTF-16: 0xFEFF or 0xFFFE

    Skip it.

  . Ignore zero bytes while looking for the -*- coding: ... -*-
    stuff.

This heuristic algorithm might not give correct results in all cases
but it should be sufficiently reliable for normal use.


    Werner
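[Editorial illustration] The two steps in the message above can be sketched roughly as follows.  This is only an illustration in Python, not the groff preprocessor's code: the function name find_coding_tag, the max_lines limit and the tag regex are assumptions made for the example.

```python
import re

# BOMs listed in the message: UTF-8 first, then the two UTF-16 byte orders.
BOMS = (b"\xef\xbb\xbf", b"\xfe\xff", b"\xff\xfe")

# Emacs-style tag such as "-*- coding: latin-1 -*-" (assumed pattern).
TAG_RE = re.compile(rb"-\*-.*?coding:\s*([-\w.]+).*?-\*-")


def find_coding_tag(data: bytes, max_lines: int = 2):
    """Return the coding tag value found near the start of data, or None."""
    # Step 1: skip a leading BOM, if any.
    for bom in BOMS:
        if data.startswith(bom):
            data = data[len(bom):]
            break
    # Step 2: dropping zero bytes turns ASCII-range UTF-16 text into ASCII.
    data = data.replace(b"\x00", b"")
    # Only look at the first few lines (the limit is an assumption).
    head = b"\n".join(data.split(b"\n")[:max_lines])
    match = TAG_RE.search(head)
    return match.group(1).decode("ascii") if match else None
```

The same call works for plain ASCII input, for UTF-8 with a BOM, and for either UTF-16 byte order, because all of them reduce to the same ASCII bytes once the BOM and the zero bytes are gone.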
* Re: coding tags and utf-16
From: Werner LEMBERG @ 2005-12-23 23:43 UTC

> There is a serious problem with coding tags and utf-16 encodings of
> any flavour: Emacs simply can't recognize the tag.  [...]

Surprisingly, I saw no response on the list which either means that my
mail hasn't come through, nobody is interested in this problem, or
that it is a non-issue.

In case it won't get fixed I suggest adding it to the TODO list,
together with a note in the emacs manual that coding tags don't work
with utf-16 encoding flavours.


    Werner


> This is a non-trivial problem.  Right now I'm working on a groff
> preprocessor which tries to handle this.  I'm doing the following to
> find the tag in an encoding-independent way:
>
>   . Check whether the file starts with the BOM (Byte Order Mark) --
>     this is one of the following byte sequences:
>
>       UTF-8:  0xEFBBBF
>       UTF-16: 0xFEFF or 0xFFFE
>
>     Skip it.
>
>   . Ignore zero bytes while looking for the -*- coding: ... -*-
>     stuff.
>
> This heuristic algorithm might not give correct results in all cases
> but it should be sufficiently reliable for normal use.
* Re: coding tags and utf-16
From: Richard M. Stallman @ 2005-12-24 16:32 UTC
Cc: emacs-devel

    Surprisingly, I saw no response on the list which either means
    that my mail hasn't come through, nobody is interested in this
    problem, or that it is a non-issue.

Your mail did come through.  We should not conclude that it is a
non-issue merely because nobody has responded.

I asked Handa to look at it, but he hasn't replied yet.
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-01-04 6:42 UTC
Cc: emacs-devel

In article <20051221.090033.182620434.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:

> There is a serious problem with coding tags and utf-16 encodings of
> any flavour: Emacs simply can't recognize the tag.  This is a
> non-trivial problem.

Sorry for the late reply, but I think coding tag is useless
for a file encoded in some of utf-16 variants.

If a file has BOM at the head, BOM should tell the exact
encoding whatever is specified in coding tag.

If a file is encoded without BOM, we must use the less
reliable heuristics to guess utf-16be or utf-16le.  If you
find a coding-tag spec by ignoring all zero bytes at even
byte indexes, it means that the file is, in high
possibility, utf-16be whatever the tag value is.  If you
find a coding-tag spec by ignoring all zero bytes at odd
byte indexes, it means that the file is utf-16le whatever
the tag value is.

So, in any cases, a tag value itself is useless.  Then how
to detect utf-16 more reliably?  In the current Emacs
(i.e. Ver.22), I think we can use auto-coding-regexp-alist
or auto-coding-alist.  In the former case, we can register
BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
for utf-16be.  In the latter case, you can use more
complicated heuristics in a registered function.

But, those are anyway just heuristics; not 100% reliable.
So I think we need a user option to turn it on and off, or
perhaps a user option to select which kind of heuristics.

---
Kenichi Handa
handa@m17n.org
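[Editorial illustration] The zero-byte heuristic described above can be sketched as follows.  This is a Python illustration and not Emacs code: the function name guess_utf16_byte_order and the sample size are assumptions.  It is also stricter than the quoted regexp, since it requires every 16-bit unit in the sample to be a non-NUL ASCII character rather than only a leading run of them.

```python
def guess_utf16_byte_order(data: bytes, sample: int = 256):
    """Guess 'utf-16be', 'utf-16le', or None from where the zero bytes sit."""
    head = data[:sample]
    head = head[: len(head) - len(head) % 2]  # whole 16-bit units only
    if not head:
        return None
    even, odd = head[0::2], head[1::2]  # bytes at even / odd indexes (0-based)
    # ASCII text in UTF-16BE looks like 00 xx 00 xx ...
    if all(b == 0 for b in even) and all(0 < b < 0x80 for b in odd):
        return "utf-16be"
    # ... and in UTF-16LE like xx 00 xx 00 ...
    if all(b == 0 for b in odd) and all(0 < b < 0x80 for b in even):
        return "utf-16le"
    return None
```

As the message says, this is a heuristic: it gives no answer as soon as the sampled text contains a non-ASCII character, and a binary file with the right byte pattern would be misclassified.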
* Re: coding tags and utf-16
From: Werner LEMBERG @ 2006-01-04 14:58 UTC
Cc: groff, bruno, emacs-devel

> > There is a serious problem with coding tags and utf-16 encodings
> > of any flavour: Emacs simply can't recognize the tag.  This is a
> > non-trivial problem.
>
> Sorry for the late reply, but I think coding tag is useless for a
> file encoded in some of utf-16 variants.
>
> If a file has BOM at the head, BOM should tell the exact encoding
> whatever is specified in coding tag.
>
> If a file is encoded without BOM, we must use the less reliable
> heuristics to guess utf-16be or utf-16le.  If you find a coding-tag
> spec by ignoring all zero bytes at even byte indexes, it means that
> the file is, in high possibility, utf-16be whatever the tag value
> is.  If you find a coding-tag spec by ignoring all zero bytes at odd
> byte indexes, it means that the file is utf-16le whatever the tag
> value is.
>
> So, in any cases, a tag value itself is useless.

[...]

I'll do the following for groff's preprocessor, preconv:

  . If the data starts with a BOM, use it, and ignore the coding tag.

  . Otherwise, if there are zero bytes in the first two lines, ignore
    those zero values, emit a warning, and use the coding tag, if any.

  . Otherwise, use the default encoding -- this normally will lead to
    a wrong result and make groff explode, but I consider this better
    than to apply heuristics, especially if you have to recognize both
    UTF16 and UTF32 variants.

This is probably a suboptimal solution but quite easy to implement,
and the user can always explicitly select an encoding on the command
line.  Perhaps someone finds (and implements) a better way which I can
then adapt to preconv.


    Werner
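[Editorial illustration] One possible reading of the three-step procedure above, written as a Python sketch.  This is not preconv's actual code: the name choose_encoding, the warning text, the tag regex and the "latin-1" default are placeholders, and using a coding tag even when no zero bytes are present is an assumption about how the steps combine.

```python
import re

# BOMs from the earlier message, paired with the encoding each one implies.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)
TAG_RE = re.compile(rb"-\*-.*?coding:\s*([-\w.]+).*?-\*-")  # assumed pattern


def choose_encoding(data: bytes, default: str = "latin-1"):
    """Return (encoding, warnings) for the input data."""
    warnings = []
    # Step 1: a BOM wins, and any coding tag is ignored.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, warnings
    # Step 2: drop zero bytes in the first two lines and warn about it.
    head = b"\n".join(data.split(b"\n", 2)[:2])
    if b"\x00" in head:
        warnings.append("zero bytes ignored while looking for a coding tag")
        head = head.replace(b"\x00", b"")
    match = TAG_RE.search(head)
    if match:
        return match.group(1).decode("ascii"), warnings
    # Step 3: neither BOM nor tag, so fall back to the default encoding.
    return default, warnings
```

Returning the warnings instead of printing them keeps the decision logic separate from how the caller chooses to report them.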
* Re: coding tags and utf-16
From: Richard M. Stallman @ 2006-01-05 3:46 UTC
Cc: emacs-devel

    If a file is encoded without BOM, we must use the less
    reliable heuristics to guess utf-16be or utf-16le.  If you
    find a coding-tag spec by ignoring all zero bytes at even
    byte indexes, it means that the file is, in high
    possibility, utf-16be whatever the tag value is.  If you
    find a coding-tag spec by ignoring all zero bytes at odd
    byte indexes, it means that the file is utf-16le whatever
    the tag value is.

Does Emacs already implement these heuristics?

    But, those are anyway just heuristics; not 100% reliable.
    So I think we need a user option to turn it on and off, or
    perhaps a user option to select which kind of heuristics.

Should we install this option now?
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-01-05 4:33 UTC
Cc: emacs-devel

In article <E1EuM4r-00051L-Sf@fencepost.gnu.org>, "Richard M. Stallman" <rms@gnu.org> writes:

>     If a file is encoded without BOM, we must use the less
>     reliable heuristics to guess utf-16be or utf-16le.  If you
>     find a coding-tag spec by ignoring all zero bytes at even
>     byte indexes, it means that the file is, in high
>     possibility, utf-16be whatever the tag value is.  If you
>     find a coding-tag spec by ignoring all zero bytes at odd
>     byte indexes, it means that the file is utf-16le whatever
>     the tag value is.

> Does Emacs already implement these heuristics?

No.

>     But, those are anyway just heuristics; not 100% reliable.
>     So I think we need a user option to turn it on and off, or
>     perhaps a user option to select which kind of heuristics.

> Should we install this option now?

I can't tell whether or not it's important enough to install
now because I never encountered a utf-16 file.

---
Kenichi Handa
handa@m17n.org
* Re: coding tags and utf-16
From: David Kastrup @ 2006-01-05 12:24 UTC
Cc: rms, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> "Richard M. Stallman" <rms@gnu.org> writes:
>
>> Should we install this option now?
>
> I can't tell whether or not it's important enough to install
> now because I never encountered a utf-16 file.

I think the most common occurrence would be system files on MS Windows.

The byte markers are very unique: I think we should heed them unless
there are very important technical considerations speaking against it
(one reason would be if the utf-16 encodings were not
content-preserving for saving binary files.  No idea whether this is
the case).

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
* Re: coding tags and utf-16
From: Andreas Schwab @ 2006-01-06 0:27 UTC
Cc: emacs-devel, rms, Kenichi Handa

David Kastrup <dak@gnu.org> writes:

> Kenichi Handa <handa@m17n.org> writes:
>
>> "Richard M. Stallman" <rms@gnu.org> writes:
>>
>>> Should we install this option now?
>>
>> I can't tell whether or not it's important enough to install
>> now because I never encountered a utf-16 file.
>
> I think the most common occurrence would be system files on MS Windows.

MacOS is also using utf-16 for locale files in application bundles.

Andreas.

--
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: coding tags and utf-16
From: Richard M. Stallman @ 2006-01-05 23:11 UTC
Cc: emacs-devel

    > But, those are anyway just heuristics; not 100% reliable.
    > So I think we need a user option to turn it on and off, or
    > perhaps a user option to select which kind of heuristics.

    > Should we install this option now?

    I can't tell whether or not it's important enough to install
    now because I never encountered a utf-16 file.

Werner sent a message explaining how another program handles them.
Is it feasible to implement that in Emacs?
Would it be so much of a complication that we should
not install it now?
* Re: coding tags and utf-16
From: Werner LEMBERG @ 2006-01-06 1:22 UTC
Cc: emacs-devel, handa

> I can't tell whether or not it's important enough to install now
> because I never encountered a utf-16 file.
>
> Werner sent a message explaining how another program handles them.

This is work in progress, so don't expect thoroughly tested results...


    Werner
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-01-06 11:26 UTC
Cc: emacs-devel

In article <E1EueGK-000837-ED@fencepost.gnu.org>, "Richard M. Stallman" <rms@gnu.org> writes:

>> But, those are anyway just heuristics; not 100% reliable.
>> So I think we need a user option to turn it on and off, or
>> perhaps a user option to select which kind of heuristics.

>> Should we install this option now?

>     I can't tell whether or not it's important enough to install
>     now because I never encountered a utf-16 file.

> Werner sent a message explaining how another program handles
> them.  Is it feasible to implement that in Emacs?
> Would it be so much of a complication that we should
> not install it now?

As Werner wrote, his method is still in progress.  And, it
seems that "emitting warning" is an important point in his
method.  But I think it's not a trivial change to enable
Emacs to emit warning while (or after) detecting a code.

---
Kenichi Handa
handa@m17n.org
* Re: coding tags and utf-16
From: Richard M. Stallman @ 2006-01-07 4:23 UTC
Cc: emacs-devel

    As Werner wrote, his method is still in progress.  And, it
    seems that "emitting warning" is an important point in his
    method.  But I think it's not a trivial change to enable
    Emacs to emit warning while (or after) detecting a code.

Why is that hard?
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-01-07 6:05 UTC
Cc: emacs-devel

In article <E1Ev5c7-0002nt-MJ@fencepost.gnu.org>, "Richard M. Stallman" <rms@gnu.org> writes:

>     As Werner wrote, his method is still in progress.  And, it
>     seems that "emitting warning" is an important point in his
>     method.  But I think it's not a trivial change to enable
>     Emacs to emit warning while (or after) detecting a code.

> Why is that hard?

I didn't say it's hard.  I don't know how hard it is at the
moment.  But, my gut feeling is that the required change is
not simple and not suitable for the Emacs of the current
stage.  First of all, we must start from deciding a precise
recipe of how and when to emit what kind of warning in which
case.

---
Kenichi Handa
handa@m17n.org
* Re: coding tags and utf-16
From: Stefan Monnier @ 2006-01-05 15:56 UTC
Cc: emacs-devel

> So, in any cases, a tag value itself is useless.  Then how
> to detect utf-16 more reliably?  In the current Emacs
> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
> or auto-coding-alist.  In the former case, we can register
> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
> for utf-16be.  In the latter case, you can use more
> complicated heuristics in a registered function.

Can't it be somehow added to detect_coding_utf_16?

> But, those are anyway just heuristics; not 100% reliable.
> So I think we need a user option to turn it on and off, or
> perhaps a user option to select which kind of heuristics.

Shouldn't this be done via the coding-system-priority?


        Stefan
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-01-06 6:31 UTC
Cc: emacs-devel

In article <m1psn61xim.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> So, in any cases, a tag value itself is useless.  Then how
>> to detect utf-16 more reliably?  In the current Emacs
>> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>> or auto-coding-alist.  In the former case, we can register
>> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>> for utf-16be.  In the latter case, you can use more
>> complicated heuristics in a registered function.

> Can't it be somehow added to detect_coding_utf_16?

Yes, but usually it has no effect if, for instance,
iso-8859-1 is more preferred.  If only ASCII and Latin-1
characters are encoded in utf-16, all bytes (including BOM)
are valid for iso-8859-1.

---
Kenichi Handa
handa@m17n.org
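[Editorial illustration] The point that a UTF-16 file is also a byte-wise valid ISO-8859-1 file can be checked with a few lines of Python.  This uses Python's own iso-8859-1 codec, which accepts every byte value, and says nothing about how Emacs's detector ranks the two candidates; the sample text is arbitrary.

```python
text = "caf\u00e9 au lait"
raw = b"\xff\xfe" + text.encode("utf-16-le")  # UTF-16LE preceded by its BOM

# Every byte value maps to some ISO-8859-1 character, so this never raises.
as_latin1 = raw.decode("iso-8859-1")

# The BOM bytes 0xFF 0xFE come out as the two Latin-1 letters y-diaeresis
# and thorn, and the zero bytes survive as NUL characters.
bom_as_latin1 = as_latin1[:2]
round_trips = as_latin1.encode("iso-8859-1") == raw
```

Since decoding cannot fail, validity alone cannot separate the two encodings; only priority rules or extra evidence such as the BOM can.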
* Re: coding tags and utf-16
From: David Kastrup @ 2006-01-06 10:28 UTC
Cc: Stefan Monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In article <m1psn61xim.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>>> So, in any cases, a tag value itself is useless.  Then how
>>> to detect utf-16 more reliably?  In the current Emacs
>>> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>>> or auto-coding-alist.  In the former case, we can register
>>> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>>> for utf-16be.  In the latter case, you can use more
>>> complicated heuristics in a registered function.
>
>> Can't it be somehow added to detect_coding_utf_16?
>
> Yes, but usually it has no effect if, for instance,
> iso-8859-1 is more preferred.  If only ASCII and Latin-1
> characters are encoded in utf-16, all bytes (including BOM)
> are valid for iso-8859-1.

I thought we had discussed this already.  The BOM-encodings should
have priority since the likelihood of a misdetection is negligible
(the character pair does not make sense at the start of a text in
latin-1 in any language): the only thing that can reasonably be
expected to happen is that a binary file is detected as utf-16.  Not
much of an issue, I'd say.

Of course, for the BOM-less utf-16 encodings, priority should depend
on the language environment.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
* Re: coding tags and utf-16
From: Kevin Rodgers @ 2006-02-09 0:32 UTC

David Kastrup wrote:
> Kenichi Handa <handa@m17n.org> writes:
>
>>In article <m1psn61xim.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>
>>>> So, in any cases, a tag value itself is useless.  Then how
>>>> to detect utf-16 more reliably?  In the current Emacs
>>>> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>>>> or auto-coding-alist.  In the former case, we can register
>>>> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>>>> for utf-16be.  In the latter case, you can use more
>>>> complicated heuristics in a registered function.
>>
>>>Can't it be somehow added to detect_coding_utf_16?
>>
>>Yes, but usually it has no effect if, for instance,
>>iso-8859-1 is more preferred.  If only ASCII and Latin-1
>>characters are encoded in utf-16, all bytes (including BOM)
>>are valid for iso-8859-1.
>
> I thought we had discussed this already.  The BOM-encodings should
> have priority since the likelihood of a misdetection is negligible
> (the character pair does not make sense at the start of a text in
> latin-1 in any language): the only thing that can reasonably be
> expected to happen is that a binary file is detected as utf-16.  Not
> much of an issue, I'd say.

Exactly.  So why haven't these entries been added to
auto-coding-regexp-alist?

    ("\\`\xEF\xBB\xBF" . utf-8)
    ("\\`\xFE\xFF" . utf-16-be)
    ("\\`\xFF\xFE" . utf-16-le)
    ("\\`\x00\x00\xFE\xFF" . utf-32-be)
    ("\\`\xFF\xFE\x00\x00" . utf-32-le)

> Of course, for the BOM-less utf-16 encodings, priority should depend
> on the language environment.

Definitely.

--
Kevin Rodgers
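[Editorial illustration] A prefix table like the one proposed above has one subtlety: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the order in which entries are tried matters.  The Python sketch below tries the longer signatures first.  The names BOM_TABLE and detect_bom are hypothetical, the encoding names are Python codec names rather than Emacs coding systems, and whether Emacs tries its alist entries in list order is not verified here.

```python
# Longest signatures first, so FF FE 00 00 is not mistaken for UTF-16LE.
BOM_TABLE = (
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def detect_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

The ordering has a cost of its own: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE, which is one more reason to treat the result as a guess.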
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-02-28 1:08 UTC
Cc: emacs-devel

Sorry for the late response.

In article <dse2i6$d2b$1@sea.gmane.org>, Kevin Rodgers <ihs_4664@yahoo.com> writes:

>> I thought we had discussed this already.  The BOM-encodings should
>> have priority since the likelihood of a misdetection is negligible
>> (the character pair does not make sense at the start of a text in
>> latin-1 in any language): the only thing that can reasonably be
>> expected to happen is that a binary file is detected as utf-16.  Not
>> much of an issue, I'd say.

I've just dug out old mails we exchanged on this topic
(about a year ago).  To my understanding, there was no clear
conclusion.  Here are the extracts:

------------------------------------------------------------
I wrote:
> I think BOM is not that safe because there are many charsets
> who have normal letters at 0xFE and 0xFF.

Jason wrote:
> But what are those characters, and are they likely to appear as a pair
> at the beginning of the file, and nowhere else?

I wrote:
> Sorry, I don't know.

Dave wrote:
>> Exactly what Windows does for what?  Recognizing a utf-16 registry
>> file when opened in the registry editor?

> Auto-detecting utf-16 generally.  Although I don't think it would give
> false positives on iso-8859 text, I don't know if it could with other
> charsets.
>
> I could believe that Windows doesn't just go by byte-order-mark in
> some locales where there might be a problem.  If so, it could be
> useful to do the same thing.
------------------------------------------------------------

For instance, I've just googled the two character sequence
of 0xFE 0xFF of koi8 and found several occurrences.

> Exactly.  So why haven't these entries been added to
> auto-coding-regexp-alist?

>    ("\\`\xEF\xBB\xBF" . utf-8)

As far as I know, UTF-8 should not start with this sequence
unless the text really starts with ZWNBSP (very unlikely).

>    ("\\`\xFE\xFF" . utf-16-be)
>    ("\\`\xFF\xFE" . utf-16-le)

Although it's not clear how safe they are, if no one objects,
I'll add them in auto-coding-regexp-alist.

>    ("\\`\x00\x00\xFE\xFF" . utf-32-be)
>    ("\\`\xFF\xFE\x00\x00" . utf-32-le)

Emacs doesn't support those encodings for the moment.

---
Kenichi Handa
handa@m17n.org
* Re: coding tags and utf-16
From: Benjamin Riefenstahl @ 2006-03-04 20:34 UTC
Cc: Kevin Rodgers

Hi,


Kenichi Handa writes:
>>    ("\\`\xEF\xBB\xBF" . utf-8)
>
> As far as I know, UTF-8 should not start with this sequence unless
> the text really starts with ZWNBSP (very unlikely).

UTF-8 can start with a BOM.  See
<http://www.unicode.org/faq/utf_bom.html#29>.

>>    ("\\`\xFE\xFF" . utf-16-be)
>>    ("\\`\xFF\xFE" . utf-16-le)
>
> Although it's not clear how safe they are, if no one objects,
> I'll add them in auto-coding-regexp-alist.

Shouldn't those be utf-16-[bl]e-with-signature?  Or has the naming
convention changed?


benny
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-03-06 13:04 UTC
Cc: ihs_4664, emacs-devel

In article <m3hd6et0de.fsf@seneca.benny.turtle-trading.net>, Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> Kenichi Handa writes:
>>>    ("\\`\xEF\xBB\xBF" . utf-8)
>>
>> As far as I know, UTF-8 should not start with this sequence unless
>> the text really starts with ZWNBSP (very unlikely).

> UTF-8 can start with a BOM.  See
> <http://www.unicode.org/faq/utf_bom.html#29>.

That's why I wrote "unless ..." part.  For decoding UTF-8,
we should not delete that BOM but treat it as the content of
the text.  For UTF-16, Unicode explicitly says that "The BOM
is not considered part of the content of the text", but for
UTF-8, it doesn't say such a thing.

Anyway, as Unicode doesn't recommend but doesn't inhibit BOM
in UTF-8 either, if people agree, I'll add it too.

>>>    ("\\`\xFE\xFF" . utf-16-be)
>>>    ("\\`\xFF\xFE" . utf-16-le)
>>
>> Although it's not clear how safe they are, if no one objects,
>> I'll add them in auto-coding-regexp-alist.

> Shouldn't those be utf-16-[bl]e-with-signature?  Or has the naming
> convention changed?

Actually utf-16-be is an alias of utf-16be-with-signature
(more precisely, an alias of mule-utf-16be-with-signature)
and is different from utf-16be (and we don't have
utf-16-be-with-signature).  I am responsible for this
confusing naming.  I long ago mistakenly accepted and
committed those names (utf-16-[bl]e), and am now keeping
them for backward compatibility.

Anyway I agree that using utf-16[bl]e-with-signature here is
better.

---
Kenichi Handa
handa@m17n.org
* Re: coding tags and utf-16
From: Benjamin Riefenstahl @ 2006-03-06 19:35 UTC
Cc: ihs_4664, emacs-devel

Hi,


Kenichi Handa writes:
> For decoding UTF-8, we should not delete that BOM but treat it as
> the content of the text.  For UTF-16, Unicode explicitly says that
> "The BOM is not considered part of the content of the text", but for
> UTF-8, it doesn't say such a thing.

NOTEPAD.EXE (the basic MS Windows editor) adds a BOM when writing
UTF-8 files.  When I saw that and tried to discuss it on their
newsgroups, I learned that it seems to be Microsoft's POV that this is
a good thing.

Which means files like that exist.  Treating the BOM as content means
that U+FEFF creeps into the regular content of documents through
cut-and-paste and through components of template systems.  I have
already seen that happening in real life and of course it leads to
stupid bugs.  I think Emacs should do better.

> utf-16-be [==] utf-16be-with-signature [!=] utf-16be

;-)

benny
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-03-07 1:02 UTC
Cc: ihs_4664, emacs-devel

In article <m3wtf7xt6z.fsf@seneca.benny.turtle-trading.net>, Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> Kenichi Handa writes:
>> For decoding UTF-8, we should not delete that BOM but treat it as
>> the content of the text.  For UTF-16, Unicode explicitly says that
>> "The BOM is not considered part of the content of the text", but for
>> UTF-8, it doesn't say such a thing.

> NOTEPAD.EXE (the basic MS Windows editor) adds a BOM when writing
> UTF-8 files.  When I saw that and tried to discuss it on their
> newsgroups, I learned that it seems to be Microsoft's POV that this is
> a good thing.

> Which means files like that exist.  Treating the BOM as content means
> that U+FEFF creeps into the regular content of documents through
> cut-and-paste and through components of template systems.  I have
> already seen that happening in real life and of course it leads to
> stupid bugs.  I think Emacs should do better.

But, it's simply a bug to delete the leading U+FEFF from the
content while decoding utf-8.  Perhaps we should add some
customizable flag to control that behavior after the
release.

>> utf-16-be [==] utf-16be-with-signature [!=] utf-16be

> ;-)

^.^;;;

---
Kenichi Handa
handa@m17n.org
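[Editorial illustration] The two behaviours argued over above, keeping the UTF-8 BOM as content or stripping it on decode, both exist as separate codecs in Python, which makes the trade-off easy to see.  This only demonstrates Python's "utf-8" and "utf-8-sig" codecs; it is not a statement about what Emacs does.

```python
raw = b"\xef\xbb\xbfhello"  # "hello" as written by an editor that adds a BOM

kept = raw.decode("utf-8")          # BOM stays in the text as U+FEFF
stripped = raw.decode("utf-8-sig")  # a leading BOM is removed

# Pasting two "kept" fragments together leaves a stray U+FEFF mid-document,
# which is the cut-and-paste problem described in the thread.
joined = kept + kept
stray_position = joined.index("\ufeff", 1)
```

Offering both variants and letting the caller choose is close in spirit to the customizable flag suggested in the reply above.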
* Re: coding tags and utf-16
From: Tomas Zerolo @ 2006-03-08 5:42 UTC
Cc: Kevin Rodgers, emacs-devel

On Sat, Mar 04, 2006 at 09:34:37PM +0100, Benjamin Riefenstahl wrote:
> Hi,
>
> Kenichi Handa writes:
> >>    ("\\`\xEF\xBB\xBF" . utf-8)
> >
> > As far as I know, UTF-8 should not start with this sequence unless
> > the text really starts with ZWNBSP (very unlikely).
>
> UTF-8 can start with a BOM.  See
> <http://www.unicode.org/faq/utf_bom.html#29>.

This is so sick I nearly can't believe that.  Some entities shouldn't
be accepted as members of any consortia.

Sorry.  I had to say that.

Regards
-- tomás
* Re: coding tags and utf-16
From: Kenichi Handa @ 2006-03-16 2:23 UTC
Cc: ihs_4664, emacs-devel

In article <E1FDtLw-0005XV-00@etlken>, Kenichi Handa <handa@m17n.org> writes:

> Although it's not clear how safe they are, if no one objects,
> I'll add them in auto-coding-regexp-alist.

>>    ("\\`\x00\x00\xFE\xFF" . utf-32-be)
>>    ("\\`\xFF\xFE\x00\x00" . utf-32-le)

As there's no objection, I've just added them to
auto-coding-regexp-alist.

>>    ("\\`\xEF\xBB\xBF" . utf-8)

That one too.

---
Kenichi Handa
handa@m17n.org
Thread overview: 25+ messages

2005-12-21  8:00 coding tags and utf-16 Werner LEMBERG
2005-12-23 23:43 ` Werner LEMBERG
2005-12-24 16:32   ` Richard M. Stallman
2006-01-04  6:42 ` Kenichi Handa
2006-01-04 14:58   ` Werner LEMBERG
2006-01-05  3:46   ` Richard M. Stallman
2006-01-05  4:33     ` Kenichi Handa
2006-01-05 12:24       ` David Kastrup
2006-01-06  0:27         ` Andreas Schwab
2006-01-05 23:11       ` Richard M. Stallman
2006-01-06  1:22         ` Werner LEMBERG
2006-01-06 11:26         ` Kenichi Handa
2006-01-07  4:23           ` Richard M. Stallman
2006-01-07  6:05             ` Kenichi Handa
2006-01-05 15:56   ` Stefan Monnier
2006-01-06  6:31     ` Kenichi Handa
2006-01-06 10:28       ` David Kastrup
2006-02-09  0:32         ` Kevin Rodgers
2006-02-28  1:08           ` Kenichi Handa
2006-03-04 20:34             ` Benjamin Riefenstahl
2006-03-06 13:04               ` Kenichi Handa
2006-03-06 19:35                 ` Benjamin Riefenstahl
2006-03-07  1:02                   ` Kenichi Handa
2006-03-08  5:42               ` Tomas Zerolo
2006-03-16  2:23             ` Kenichi Handa