unofficial mirror of emacs-devel@gnu.org 
* coding tags and utf-16
@ 2005-12-21  8:00 Werner LEMBERG
  2005-12-23 23:43 ` Werner LEMBERG
  2006-01-04  6:42 ` Kenichi Handa
  0 siblings, 2 replies; 25+ messages in thread
From: Werner LEMBERG @ 2005-12-21  8:00 UTC (permalink / raw)



There is a serious problem with coding tags and utf-16 encodings of
any flavour: Emacs simply can't recognize the tag.  This is a
non-trivial problem.  Right now I'm working on a groff preprocessor
which tries to handle this.  I'm doing the following to find the tag
in an encoding-independent way:

  . Check whether the file starts with the BOM (Byte Order Mark) --
    this is one of the following byte sequences:

      UTF-8:  0xEFBBBF
      UTF-16: 0xFEFF or 0xFFFE

    Skip it.

  . Ignore zero bytes while looking for the -*- coding: ... -*-
    stuff.

This heuristic algorithm might not give correct results in all cases
but it should be sufficiently reliable for normal use.
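
The two steps above can be sketched roughly as follows.  This is a
minimal Python illustration, not the groff preprocessor's code: the
function name `find_coding_tag`, the tag regular expression, and the
`max_lines` limit of two lines are all assumptions made for the example.

```python
import re

# BOMs named in the message above; the two UTF-16 ones differ only in byte order.
BOMS = (b"\xef\xbb\xbf", b"\xfe\xff", b"\xff\xfe")

# A "-*- ... coding: NAME ... -*-" tag, matched on raw bytes (assumed syntax).
TAG_RE = re.compile(rb"-\*-.*?coding:\s*([A-Za-z0-9_.-]+).*?-\*-")


def find_coding_tag(data: bytes, max_lines: int = 2):
    """Return the coding tag value found in the first lines, or None."""
    # Step 1: skip a leading BOM, if there is one.
    for bom in BOMS:
        if data.startswith(bom):
            data = data[len(bom):]
            break
    # Step 2: drop every zero byte, so ASCII-range UTF-16 text
    # collapses to plain ASCII before the tag search.
    stripped = data.replace(b"\x00", b"")
    head = b"\n".join(stripped.split(b"\n")[:max_lines])
    match = TAG_RE.search(head)
    return match.group(1).decode("ascii") if match else None
```

The same call works on UTF-8, UTF-16BE and UTF-16LE input because the
search only ever sees the non-zero bytes.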


    Werner


* Re: coding tags and utf-16
  2005-12-21  8:00 coding tags and utf-16 Werner LEMBERG
@ 2005-12-23 23:43 ` Werner LEMBERG
  2005-12-24 16:32   ` Richard M. Stallman
  2006-01-04  6:42 ` Kenichi Handa
  1 sibling, 1 reply; 25+ messages in thread
From: Werner LEMBERG @ 2005-12-23 23:43 UTC (permalink / raw)


> There is a serious problem with coding tags and utf-16 encodings of
> any flavour: Emacs simply can't recognize the tag.  [...]


Surprisingly, I saw no response on the list which either means that my
mail hasn't come through, nobody is interested in this problem, or
that it is a non-issue.

In case it won't get fixed, I suggest adding it to the TODO list,
together with a note in the Emacs manual that coding tags don't work
with utf-16 encoding flavours.


    Werner


> This is a non-trivial problem.  Right now I'm working on a groff
> preprocessor which tries to handle this.  I'm doing the following to
> find the tag in an encoding-independent way:
> 
>   . Check whether the file starts with the BOM (Byte Order Mark) --
>     this is one of the following byte sequences:
> 
>       UTF-8:  0xEFBBBF
>       UTF-16: 0xFEFF or 0xFFFE
> 
>     Skip it.
> 
>   . Ignore zero bytes while looking for the -*- coding: ... -*-
>     stuff.
> 
> This heuristic algorithm might not give correct results in all cases
> but it should be sufficiently reliable for normal use.


* Re: coding tags and utf-16
  2005-12-23 23:43 ` Werner LEMBERG
@ 2005-12-24 16:32   ` Richard M. Stallman
  0 siblings, 0 replies; 25+ messages in thread
From: Richard M. Stallman @ 2005-12-24 16:32 UTC (permalink / raw)
  Cc: emacs-devel

    Surprisingly, I saw no response on the list which either means that my
    mail hasn't come through, nobody is interested in this problem, or
    that it is a non-issue.

Your mail did come through.  We should not conclude that it is a
non-issue merely because nobody has responded.  I asked Handa to look
at it, but he hasn't replied yet.


* Re: coding tags and utf-16
  2005-12-21  8:00 coding tags and utf-16 Werner LEMBERG
  2005-12-23 23:43 ` Werner LEMBERG
@ 2006-01-04  6:42 ` Kenichi Handa
  2006-01-04 14:58   ` Werner LEMBERG
                     ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Kenichi Handa @ 2006-01-04  6:42 UTC (permalink / raw)
  Cc: emacs-devel

In article <20051221.090033.182620434.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:

> There is a serious problem with coding tags and utf-16 encodings of
> any flavour: Emacs simply can't recognize the tag.  This is a
> non-trivial problem.

Sorry for the late reply, but I think a coding tag is useless
for a file encoded in one of the utf-16 variants.

If a file has a BOM at the head, the BOM should determine the
exact encoding, whatever the coding tag specifies.

If a file is encoded without a BOM, we must use less
reliable heuristics to guess utf-16be or utf-16le.  If you
find a coding-tag spec by ignoring all zero bytes at even
byte indexes, it means that the file is, with high
probability, utf-16be, whatever the tag value is.  If you
find a coding-tag spec by ignoring all zero bytes at odd
byte indexes, it means that the file is utf-16le, whatever
the tag value is.
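
The even/odd observation can be illustrated with a small Python sketch.
It is only one possible reading of the heuristic: the function name
`guess_utf16_byte_order`, the `sample` size, and the simple majority
rule are assumptions for the example, not anything Emacs implements.

```python
def guess_utf16_byte_order(data: bytes, sample: int = 64):
    """Guess "utf-16be" or "utf-16le" from where the zero bytes sit, else None."""
    head = data[:sample]
    # Count zero bytes at even and at odd (0-based) byte indexes.
    even_zeros = sum(1 for i in range(0, len(head), 2) if head[i] == 0)
    odd_zeros = sum(1 for i in range(1, len(head), 2) if head[i] == 0)
    if even_zeros > odd_zeros:
        return "utf-16be"  # high (zero) byte comes first: 00 41 for "A"
    if odd_zeros > even_zeros:
        return "utf-16le"  # low byte comes first: 41 00 for "A"
    return None  # no zero bytes, or no clear majority
```

As the message says, the result depends only on the byte layout, so it
comes out the same whatever value a coding tag in the file claims.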

So, in any case, the tag value itself is useless.  Then how
to detect utf-16 more reliably?  In the current Emacs
(i.e. Ver.22), I think we can use auto-coding-regexp-alist
or auto-coding-alist.  In the former case, we can register
BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
for utf-16be.  In the latter case, you can use more
complicated heuristics in a registered function.
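
For readers who want to try the utf-16be pattern outside Emacs, here is
a rough Python equivalent.  It is a sketch only: the names
`UTF16BE_ASCII` and `utf16be_ascii_prefix_len` are made up for the
example, and Python's `\A` stands in for the Emacs "\\`" anchor.

```python
import re

# One or more (zero byte, ASCII byte) pairs anchored at the start of the data,
# mirroring the Emacs regexp "\\`\\(\0[\0-\177]\\)+" (octal 177 is 0x7f).
UTF16BE_ASCII = re.compile(rb"\A(?:\x00[\x00-\x7f])+")


def utf16be_ascii_prefix_len(data: bytes) -> int:
    """Return how many leading bytes look like ASCII text in UTF-16BE."""
    match = UTF16BE_ASCII.match(data)
    return match.end() if match else 0
```

A caller could treat a long matching prefix as stronger evidence than a
short one; where to draw that line is exactly the kind of policy the
message leaves to a user option.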

But, those are anyway just heuristics; not 100% reliable.
So I think we need a user option to turn it on and off, or
perhaps a user option to select which kind of heuristics.

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-01-04  6:42 ` Kenichi Handa
@ 2006-01-04 14:58   ` Werner LEMBERG
  2006-01-05  3:46   ` Richard M. Stallman
  2006-01-05 15:56   ` Stefan Monnier
  2 siblings, 0 replies; 25+ messages in thread
From: Werner LEMBERG @ 2006-01-04 14:58 UTC (permalink / raw)
  Cc: groff, bruno, emacs-devel

> > There is a serious problem with coding tags and utf-16 encodings
> > of any flavour: Emacs simply can't recognize the tag.  This is a
> > non-trivial problem.
> 
> Sorry for the late reply, but I think coding tag is useless for a
> file encoded in some of utf-16 variants.
> 
> If a file has BOM at the head, BOM should tell the exact encoding
> whatever is specified in coding tag.
> 
> If a file is encoded without BOM, we must use the less reliable
> heuristics to guess utf-16be or utf-16le.  If you find a coding-tag
> spec by ignoring all zero bytes at even byte indexes, it means that
> the file is, in high possibility, utf-16be whatever the tag value
> is.  If you find a coding-tag spec by ignoring all zero bytes at odd
> byte indexes, it means that the file is utf-16le whatever the tag
> value is.
> 
> So, in any cases, a tag value itself is useless.  [...]

I'll do the following for groff's preprocessor, preconv:

  . If the data starts with a BOM, use it, and ignore the coding tag.

  . Otherwise, if there are zero bytes in the first two lines, ignore
    those zero values, emit a warning, and use the coding tag, if any.

  . Otherwise, use the default encoding -- this normally will lead to
    a wrong result and make groff explode, but I consider this better
    than to apply heuristics, especially if you have to recognize both
    UTF16 and UTF32 variants.  This is probably a suboptimal solution
    but quite easy to implement, and the user can always explicitly
    select an encoding on the command line.  Perhaps someone finds
    (and implements) a better way which I can then adapt to preconv.
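
The three rules above could look roughly like this in Python.  This is
a sketch, not preconv's code: `choose_encoding`, the tag regular
expression, and the "latin-1" default are assumptions for the example,
and it further assumes that a coding tag is also honoured when the
first two lines contain no zero bytes.  UTF-32 BOMs are left out.

```python
import re
import warnings

BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)
# A "-*- ... coding: NAME ... -*-" tag, matched on raw bytes (assumed syntax).
TAG_RE = re.compile(rb"-\*-.*?coding:\s*([A-Za-z0-9_.-]+).*?-\*-")


def choose_encoding(data: bytes, default: str = "latin-1") -> str:
    """Pick an input encoding following the three rules in the message."""
    # Rule 1: a BOM decides, and any coding tag is ignored.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Rule 2: zero bytes in the first two lines are dropped with a warning,
    # then the coding tag (if any) is used.
    head = b"\n".join(data.split(b"\n")[:2])
    if b"\x00" in head:
        warnings.warn("zero bytes in the first two lines; ignoring them")
        head = head.replace(b"\x00", b"")
    match = TAG_RE.search(head)
    if match:
        return match.group(1).decode("ascii")
    # Rule 3: no BOM and no tag, so fall back to the default encoding.
    return default
```

Note that no byte-order guessing happens anywhere: without a BOM or a
tag the function simply returns the default, as the third rule says.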


      Werner


* Re: coding tags and utf-16
  2006-01-04  6:42 ` Kenichi Handa
  2006-01-04 14:58   ` Werner LEMBERG
@ 2006-01-05  3:46   ` Richard M. Stallman
  2006-01-05  4:33     ` Kenichi Handa
  2006-01-05 15:56   ` Stefan Monnier
  2 siblings, 1 reply; 25+ messages in thread
From: Richard M. Stallman @ 2006-01-05  3:46 UTC (permalink / raw)
  Cc: emacs-devel

    If a file is encoded without BOM, we must use the less
    reliable heuristics to guess utf-16be or utf-16le.  If you
    find a coding-tag spec by ignoring all zero bytes at even
    byte indexes, it means that the file is, in high
    possibility, utf-16be whatever the tag value is.  If you
    find a coding-tag spec by ignoring all zero bytes at odd
    byte indexes, it means that the file is utf-16le whatever
    the tag value is.

Does Emacs already implement these heuristics?

    But, those are anyway just heuristics; not 100% reliable.
    So I think we need a user option to turn it on and off, or
    perhaps a user option to select which kind of heuristics.

Should we install this option now?


* Re: coding tags and utf-16
  2006-01-05  3:46   ` Richard M. Stallman
@ 2006-01-05  4:33     ` Kenichi Handa
  2006-01-05 12:24       ` David Kastrup
  2006-01-05 23:11       ` Richard M. Stallman
  0 siblings, 2 replies; 25+ messages in thread
From: Kenichi Handa @ 2006-01-05  4:33 UTC (permalink / raw)
  Cc: emacs-devel

In article <E1EuM4r-00051L-Sf@fencepost.gnu.org>, "Richard M. Stallman" <rms@gnu.org> writes:

>     If a file is encoded without BOM, we must use the less
>     reliable heuristics to guess utf-16be or utf-16le.  If you
>     find a coding-tag spec by ignoring all zero bytes at even
>     byte indexes, it means that the file is, in high
>     possibility, utf-16be whatever the tag value is.  If you
>     find a coding-tag spec by ignoring all zero bytes at odd
>     byte indexes, it means that the file is utf-16le whatever
>     the tag value is.

> Does Emacs already implement these heuristics?

No.

>     But, those are anyway just heuristics; not 100% reliable.
>     So I think we need a user option to turn it on and off, or
>     perhaps a user option to select which kind of heuristics.

> Should we install this option now?

I can't tell whether or not it's important enough to install
now because I never encountered a utf-16 file.

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-01-05  4:33     ` Kenichi Handa
@ 2006-01-05 12:24       ` David Kastrup
  2006-01-06  0:27         ` Andreas Schwab
  2006-01-05 23:11       ` Richard M. Stallman
  1 sibling, 1 reply; 25+ messages in thread
From: David Kastrup @ 2006-01-05 12:24 UTC (permalink / raw)
  Cc: rms, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> "Richard M. Stallman" <rms@gnu.org> writes:
>
>> Should we install this option now?
>
> I can't tell whether or not it's important enough to install
> now because I never encountered a utf-16 file.

I think the most common occurrence would be system files on MS Windows.
The byte markers are very unique: I think we should heed them unless
there are very important technical considerations speaking against it
(one reason would be if the utf-16 encodings were not
content-preserving for saving binary files.  No idea whether this is
the case).

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: coding tags and utf-16
  2006-01-04  6:42 ` Kenichi Handa
  2006-01-04 14:58   ` Werner LEMBERG
  2006-01-05  3:46   ` Richard M. Stallman
@ 2006-01-05 15:56   ` Stefan Monnier
  2006-01-06  6:31     ` Kenichi Handa
  2 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2006-01-05 15:56 UTC (permalink / raw)
  Cc: emacs-devel

> So, in any cases, a tag value itself is useless.  Then how
> to detect utf-16 more reliably?  In the current Emacs
> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
> or auto-coding-alist.  In the former case, we can register
> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
> for utf-16be.  In the latter case, you can use more
> complicated heuristics in a registered function.

Can't it be somehow added to detect_coding_utf_16?

> But, those are anyway just heuristics; not 100% reliable.
> So I think we need a user option to turn it on and off, or
> perhaps a user option to select which kind of heuristics.

Shouldn't this be done via the coding-system-priority?


        Stefan


* Re: coding tags and utf-16
  2006-01-05  4:33     ` Kenichi Handa
  2006-01-05 12:24       ` David Kastrup
@ 2006-01-05 23:11       ` Richard M. Stallman
  2006-01-06  1:22         ` Werner LEMBERG
  2006-01-06 11:26         ` Kenichi Handa
  1 sibling, 2 replies; 25+ messages in thread
From: Richard M. Stallman @ 2006-01-05 23:11 UTC (permalink / raw)
  Cc: emacs-devel

    >     But, those are anyway just heuristics; not 100% reliable.
    >     So I think we need a user option to turn it on and off, or
    >     perhaps a user option to select which kind of heuristics.

    > Should we install this option now?

    I can't tell whether or not it's important enough to install
    now because I never encountered a utf-16 file.

Werner sent a message explaining how another program handles
them.  Is it feasible to implement that in Emacs?
Would it be so much of a complication that we should
not install it now?


* Re: coding tags and utf-16
  2006-01-05 12:24       ` David Kastrup
@ 2006-01-06  0:27         ` Andreas Schwab
  0 siblings, 0 replies; 25+ messages in thread
From: Andreas Schwab @ 2006-01-06  0:27 UTC (permalink / raw)
  Cc: emacs-devel, rms, Kenichi Handa

David Kastrup <dak@gnu.org> writes:

> Kenichi Handa <handa@m17n.org> writes:
>
>> "Richard M. Stallman" <rms@gnu.org> writes:
>>
>>> Should we install this option now?
>>
>> I can't tell whether or not it's important enough to install
>> now because I never encountered a utf-16 file.
>
> I think the most common occurrence would be system files on MS Windows.

MacOS is also using utf-16 for locale files in application bundles.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: coding tags and utf-16
  2006-01-05 23:11       ` Richard M. Stallman
@ 2006-01-06  1:22         ` Werner LEMBERG
  2006-01-06 11:26         ` Kenichi Handa
  1 sibling, 0 replies; 25+ messages in thread
From: Werner LEMBERG @ 2006-01-06  1:22 UTC (permalink / raw)
  Cc: emacs-devel, handa


>     I can't tell whether or not it's important enough to install now
>     because I never encountered a utf-16 file.
> 
> Werner sent a message explaining how another program handles them.

This is work in progress, so don't expect thoroughly tested results...


    Werner


* Re: coding tags and utf-16
  2006-01-05 15:56   ` Stefan Monnier
@ 2006-01-06  6:31     ` Kenichi Handa
  2006-01-06 10:28       ` David Kastrup
  0 siblings, 1 reply; 25+ messages in thread
From: Kenichi Handa @ 2006-01-06  6:31 UTC (permalink / raw)
  Cc: emacs-devel

In article <m1psn61xim.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  So, in any cases, a tag value itself is useless.  Then how
>>  to detect utf-16 more reliably?  In the current Emacs
>>  (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>>  or auto-coding-alist.  In the former case, we can register
>>  BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>>  for utf-16be.  In the latter case, you can use more
>>  complicated heuristics in a registered function.

> Can't it be somehow added to detect_coding_utf_16?

Yes, but usually it has no effect if, for instance,
iso-8859-1 has a higher priority.  If only ASCII and Latin-1
characters are encoded in utf-16, all bytes (including the BOM)
are valid for iso-8859-1.
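
The point can be checked with Python's codecs, used here only as a
stand-in for Emacs's coding systems (that substitution, and the helper
name `valid_encodings`, are assumptions for the example).  A pure
validity test cannot tell the two encodings apart:

```python
def valid_encodings(data: bytes, candidates=("utf-8", "utf-16", "latin-1")):
    """Return the candidate encodings under which DATA decodes without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # at least one byte sequence is invalid in this encoding
        accepted.append(name)
    return accepted


# "café" as UTF-16 with a BOM: ASCII and Latin-1 characters only.
sample = "caf\u00e9".encode("utf-16")
accepted = valid_encodings(sample)
```

For this sample only utf-8 is rejected; both utf-16 and latin-1 accept
every byte, so the choice between them has to come from priority rather
than from validity.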

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-01-06  6:31     ` Kenichi Handa
@ 2006-01-06 10:28       ` David Kastrup
  2006-02-09  0:32         ` Kevin Rodgers
  0 siblings, 1 reply; 25+ messages in thread
From: David Kastrup @ 2006-01-06 10:28 UTC (permalink / raw)
  Cc: Stefan Monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In article <m1psn61xim.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>>>  So, in any cases, a tag value itself is useless.  Then how
>>>  to detect utf-16 more reliably?  In the current Emacs
>>>  (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>>>  or auto-coding-alist.  In the former case, we can register
>>>  BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>>>  for utf-16be.  In the latter case, you can use more
>>>  complicated heuristics in a registered function.
>
>> Can't it be somehow added to detect_coding_utf_16?
>
> Yes, but usually it has no effect if, for instance,
> iso-8859-1 is more preferred.  If only ASCII and Latin-1
> characters are encoded in utf-16, all bytes (including BOM)
> are valid for iso-8859-1.

I thought we had discussed this already.  The BOM-encodings should
have priority since the likelihood of a misdetection is negligible
(the character pair does not make sense at the start of a text in
latin-1 in any language): the only thing that can reasonably be
expected to happen is that a binary file is detected as utf-16.  Not
much of an issue, I'd say.

Of course, for the BOM-less utf-16 encodings, priority should depend
on the language environment.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: coding tags and utf-16
  2006-01-05 23:11       ` Richard M. Stallman
  2006-01-06  1:22         ` Werner LEMBERG
@ 2006-01-06 11:26         ` Kenichi Handa
  2006-01-07  4:23           ` Richard M. Stallman
  1 sibling, 1 reply; 25+ messages in thread
From: Kenichi Handa @ 2006-01-06 11:26 UTC (permalink / raw)
  Cc: emacs-devel

In article <E1EueGK-000837-ED@fencepost.gnu.org>, "Richard M. Stallman" <rms@gnu.org> writes:

>>      But, those are anyway just heuristics; not 100% reliable.
>>      So I think we need a user option to turn it on and off, or
>>      perhaps a user option to select which kind of heuristics.

>>  Should we install this option now?

>     I can't tell whether or not it's important enough to install
>     now because I never encountered a utf-16 file.

> Werner sent a message explaining how another program handles
> them.  Is it feasible to implement that in Emacs?

> Would it be so much of a complication that we should
> not install it now?

As Werner wrote, his method is still in progress.  And it
seems that "emitting a warning" is an important point in his
method.  But I think it's not a trivial change to enable
Emacs to emit a warning while (or after) detecting a coding system.

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-01-06 11:26         ` Kenichi Handa
@ 2006-01-07  4:23           ` Richard M. Stallman
  2006-01-07  6:05             ` Kenichi Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Richard M. Stallman @ 2006-01-07  4:23 UTC (permalink / raw)
  Cc: emacs-devel

    As Werner wrote, his method is still in progress.  And, it
    seems that "emitting warning" is an important point in his
    method.  But I think it's not a trivial change to enable
    Emacs to emit warning while (or after) detecting a code.

Why is that hard?


* Re: coding tags and utf-16
  2006-01-07  4:23           ` Richard M. Stallman
@ 2006-01-07  6:05             ` Kenichi Handa
  0 siblings, 0 replies; 25+ messages in thread
From: Kenichi Handa @ 2006-01-07  6:05 UTC (permalink / raw)
  Cc: emacs-devel

In article <E1Ev5c7-0002nt-MJ@fencepost.gnu.org>, "Richard M. Stallman" <rms@gnu.org> writes:

>     As Werner wrote, his method is still in progress.  And, it
>     seems that "emitting warning" is an important point in his
>     method.  But I think it's not a trivial change to enable
>     Emacs to emit warning while (or after) detecting a code.

> Why is that hard?

I didn't say it's hard.  I don't know how hard it is at the
moment.  But my gut feeling is that the required change is
not simple and not suitable for Emacs at its current
stage.  First of all, we must start by deciding on a precise
recipe for how and when to emit what kind of warning in which
case.

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-01-06 10:28       ` David Kastrup
@ 2006-02-09  0:32         ` Kevin Rodgers
  2006-02-28  1:08           ` Kenichi Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin Rodgers @ 2006-02-09  0:32 UTC (permalink / raw)


David Kastrup wrote:
> Kenichi Handa <handa@m17n.org> writes: 
> 
>>In article <m1psn61xim.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>
>>>> So, in any cases, a tag value itself is useless.  Then how
>>>> to detect utf-16 more reliably?  In the current Emacs
>>>> (i.e. Ver.22), I think we can use auto-coding-regexp-alist
>>>> or auto-coding-alist.  In the former case, we can register
>>>> BOM patterns and also something like "\\`\\(\0[\0-\177]\\)+"
>>>> for utf-16be.  In the latter case, you can use more
>>>> complicated heuristics in a registered function.
>>
>>>Can't it be somehow added to detect_coding_utf_16?
>>
>>Yes, but usually it has no effect if, for instance,
>>iso-8859-1 is more preferred.  If only ASCII and Latin-1
>>characters are encoded in utf-16, all bytes (including BOM)
>>are valid for iso-8859-1.
> 
> I thought we had discussed this already.  The BOM-encodings should
> have priority since the likelihood of a misdetection is negligible
> (the character pair does not make sense at the start of a text in
> latin-1 in any language): the only thing that can reasonably be
> expected to happen is that a binary file is detected as utf-16.  Not
> much of an issue, I'd say.

Exactly.  So why haven't these entries been added to 
auto-coding-regexp-alist?

("\\`\xEF\xBB\xBF" . utf-8)
("\\`\xFE\xFF" . utf-16-be)
("\\`\xFF\xFE" . utf-16-le)
("\\`\x00\x00\xFE\xFF" . utf-32-be)
("\\`\xFF\xFE\x00\x00" . utf-32-le)
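
A small Python sketch of the same table shows one detail the alist
leaves implicit: the UTF-32LE BOM begins with the UTF-16LE BOM, so the
longer patterns have to be tried first.  The names `BOM_TABLE` and
`sniff_bom` are made up for the example, and the strings are used only
as labels taken from the entries above.

```python
# Longest BOMs first, because b"\xff\xfe\x00\x00" also starts with b"\xff\xfe".
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the label of the first BOM that DATA starts with, or None."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None
```

Even with this ordering, UTF-16LE text whose first character after the
BOM is U+0000 would be reported as utf-32-le; the sketch does not try
to resolve that ambiguity.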

> Of course, for the BOM-less utf-16 encodings, priority should depend
> on the language environment.

Definitely.
-- 
Kevin Rodgers


* Re: coding tags and utf-16
  2006-02-09  0:32         ` Kevin Rodgers
@ 2006-02-28  1:08           ` Kenichi Handa
  2006-03-04 20:34             ` Benjamin Riefenstahl
  2006-03-16  2:23             ` Kenichi Handa
  0 siblings, 2 replies; 25+ messages in thread
From: Kenichi Handa @ 2006-02-28  1:08 UTC (permalink / raw)
  Cc: emacs-devel

Sorry for the late response.

In article <dse2i6$d2b$1@sea.gmane.org>, Kevin Rodgers <ihs_4664@yahoo.com> writes:

>> I thought we had discussed this already.  The BOM-encodings should
>> have priority since the likelihood of a misdetection is negligible
>> (the character pair does not make sense at the start of a text in
>> latin-1 in any language): the only thing that can reasonably be
>> expected to happen is that a binary file is detected as utf-16.  Not
>> much of an issue, I'd say.

I've just dug out old mails we exchanged on this topic
(about a year ago).  To my understanding, there was no
clear conclusion.  Here are the extracts:
------------------------------------------------------------
I wrote:
> I think BOM is not that safe because there are many charsets
> who have normal letters at 0xFE and 0xFF.

Jason wrote:
> But what are those characters, and are they likely to appear as a pair 
> at the beginning of the file, and nowhere else?

I wrote:
> Sorry, I don't know.

Dave wrote:
>> Exactly what Windows does for what?  Recognizing a utf-16 registry
>> file when opened in the registry editor?

> Auto-detecting utf-16 generally.  Although I don't think it would give
> false positives on iso-8859 text, I don't know if it could with other
> charsets.
> 
> I could believe that Windows doesn't just go by byte-order-mark in
> some locales where there might be a problem.  If so, it could be
> useful to do the same thing.
------------------------------------------------------------

For instance, I've just googled the two-character sequence
0xFE 0xFF in koi8 and found several occurrences.

> Exactly.  So why haven't these entries been added to 
> auto-coding-regexp-alist?

> ("\\`\xEF\xBB\xBF" . utf-8)

As far as I know, UTF-8 should not start with this sequence
unless the text really starts with ZWNBSP (very unlikely).

> ("\\`\xFE\xFF" . utf-16-be)
> ("\\`\xFF\xFE" . utf-16-le)

Although it's not clear how safe they are, if no one objects,
I'll add them in auto-coding-regexp-alist.

> ("\\`\x00\x00\xFE\xFF" . utf-32-be)
> ("\\`\xFF\xFE\x00\x00" . utf-32-le)

Emacs doesn't support those encodings at the moment.

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-02-28  1:08           ` Kenichi Handa
@ 2006-03-04 20:34             ` Benjamin Riefenstahl
  2006-03-06 13:04               ` Kenichi Handa
  2006-03-08  5:42               ` Tomas Zerolo
  2006-03-16  2:23             ` Kenichi Handa
  1 sibling, 2 replies; 25+ messages in thread
From: Benjamin Riefenstahl @ 2006-03-04 20:34 UTC (permalink / raw)
  Cc: Kevin Rodgers

Hi,

Kenichi Handa writes:
>> ("\\`\xEF\xBB\xBF" . utf-8)
>
> As far as I know, UTF-8 should not start with this sequence unless
> the text really starts with ZWNBSP (very unlikely).

UTF-8 can start with a BOM.  See
<http://www.unicode.org/faq/utf_bom.html#29>.

>> ("\\`\xFE\xFF" . utf-16-be)
>> ("\\`\xFF\xFE" . utf-16-le)
>
> Although it's not clear how safe they are, if no one objects,
> I'll add them in auto-coding-regexp-alist.

Shouldn't those be utf-16-[bl]e-with-signature?  Or has the naming
convention changed?

benny


* Re: coding tags and utf-16
  2006-03-04 20:34             ` Benjamin Riefenstahl
@ 2006-03-06 13:04               ` Kenichi Handa
  2006-03-06 19:35                 ` Benjamin Riefenstahl
  2006-03-08  5:42               ` Tomas Zerolo
  1 sibling, 1 reply; 25+ messages in thread
From: Kenichi Handa @ 2006-03-06 13:04 UTC (permalink / raw)
  Cc: ihs_4664, emacs-devel

In article <m3hd6et0de.fsf@seneca.benny.turtle-trading.net>, Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> Kenichi Handa writes:
>>> ("\\`\xEF\xBB\xBF" . utf-8)
>> 
>> As far as I know, UTF-8 should not start with this sequence unless
>> the text really starts with ZWNBSP (very unlikely).

> UTF-8 can start with a BOM.  See
> <http://www.unicode.org/faq/utf_bom.html#29>.

That's why I wrote "unless ..." part.  For decoding UTF-8,
we should not delete that BOM but treat it as the content of
the text.  For UTF-16, Unicode explicitly says that "The BOM
is not considered part of the content of the text", but for
UTF-8, it doesn't say such a thing.

Anyway, as Unicode neither recommends nor prohibits a BOM
in UTF-8, if people agree, I'll add it too.

>>> ("\\`\xFE\xFF" . utf-16-be)
>>> ("\\`\xFF\xFE" . utf-16-le)
>> 
>> Although it's not clear how safe they are, if no one objects,
>> I'll add them in auto-coding-regexp-alist.

> Shouldn't those be utf-16-[bl]e-with-signature?  Or has the naming
> convention changed?

Actually utf-16-be is an alias of utf-16be-with-signature
(more precisely, an alias of mule-utf-16be-with-signature)
and is different from utf-16be (and we don't have
utf-16-be-with-signature).  I am responsible for this
confusing naming.  Long ago I mistakenly accepted and
committed those names (utf-16-[bl]e), and I am now keeping them
for backward compatibility.  Anyway, I agree that using
utf-16[bl]e-with-signature here is better.

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-03-06 13:04               ` Kenichi Handa
@ 2006-03-06 19:35                 ` Benjamin Riefenstahl
  2006-03-07  1:02                   ` Kenichi Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Riefenstahl @ 2006-03-06 19:35 UTC (permalink / raw)
  Cc: ihs_4664, emacs-devel

Hi,


Kenichi Handa writes:
> For decoding UTF-8, we should not delete that BOM but treat it as
> the content of the text.  For UTF-16, Unicode explicitly says that
> "The BOM is not considered part of the content of the text", but for
> UTF-8, it doesn't say such a thing.

NOTEPAD.EXE (the basic MS Windows editor) adds a BOM when writing
UTF-8 files.  When I saw that and tried to discuss it on their
newsgroups, I learned that it seems to be Microsoft's POV that this is
a good thing.

Which means files like that exist.  Treating the BOM as content means
that U+FEFF creeps into the regular content of documents through
cut-and-paste and through components of template systems.  I have
already seen that happening in real life and of course it leads to
stupid bugs.  I think Emacs should do better.
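
The template scenario is easy to reproduce with Python's codecs, used
here only as an illustration (the helper name `join_parts` and the
sample fragments are invented for the example).  The "utf-8" codec
keeps a leading BOM as U+FEFF in the text, while "utf-8-sig" drops it:

```python
def join_parts(parts, codec):
    """Decode each byte fragment with CODEC and concatenate the results."""
    return "".join(part.decode(codec) for part in parts)


# Two template fragments, each saved with a UTF-8 BOM at the start.
header = b"\xef\xbb\xbf<h1>Title</h1>"
body = b"\xef\xbb\xbf<p>Text</p>"

kept = join_parts([header, body], "utf-8")         # BOMs treated as content
dropped = join_parts([header, body], "utf-8-sig")  # leading BOMs removed
stray = kept.count("\ufeff")                       # U+FEFF left in the text
```

With the BOM treated as content, a U+FEFF from the second fragment ends
up in the middle of the joined document, which is the kind of stray
character described above.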


> utf-16-be [==] utf-16be-with-signature [!=] utf-16be

;-)


benny


* Re: coding tags and utf-16
  2006-03-06 19:35                 ` Benjamin Riefenstahl
@ 2006-03-07  1:02                   ` Kenichi Handa
  0 siblings, 0 replies; 25+ messages in thread
From: Kenichi Handa @ 2006-03-07  1:02 UTC (permalink / raw)
  Cc: ihs_4664, emacs-devel

In article <m3wtf7xt6z.fsf@seneca.benny.turtle-trading.net>, Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> Kenichi Handa writes:
>> For decoding UTF-8, we should not delete that BOM but treat it as
>> the content of the text.  For UTF-16, Unicode explicitly says that
>> "The BOM is not considered part of the content of the text", but for
>> UTF-8, it doesn't say such a thing.

> NOTEPAD.EXE (the basic MS Windows editor) adds a BOM when writing
> UTF-8 files.  When I saw that and tried to discuss it on their
> newsgroups, I learned that it seems to be Microsoft's POV that this is
> a good thing.

> Which means files like that exist.  Treating the BOM as content means
> that U+FEFF creeps into the regular content of documents through
> cut-and-paste and through components of template systems.  I have
> already seen that happening in real life and of course it leads to
> stupid bugs.  I think Emacs should do better.

But, it's simply a bug to delete the leading U+FEFF from the
content while decoding utf-8.  Perhaps we should add some
customizable flag to control that behavior after the
release.

>> utf-16-be [==] utf-16be-with-signature [!=] utf-16be

> ;-)

^.^;;;

---
Kenichi Handa
handa@m17n.org


* Re: coding tags and utf-16
  2006-03-04 20:34             ` Benjamin Riefenstahl
  2006-03-06 13:04               ` Kenichi Handa
@ 2006-03-08  5:42               ` Tomas Zerolo
  1 sibling, 0 replies; 25+ messages in thread
From: Tomas Zerolo @ 2006-03-08  5:42 UTC (permalink / raw)
  Cc: Kevin Rodgers, emacs-devel



On Sat, Mar 04, 2006 at 09:34:37PM +0100, Benjamin Riefenstahl wrote:
> Hi,
> 
> Kenichi Handa writes:
> >> ("\\`\xEF\xBB\xBF" . utf-8)
> >
> > As far as I know, UTF-8 should not start with this sequence unless
> > the text really starts with ZWNBSP (very unlikely).
> 
> UTF-8 can start with a BOM.  See
> <http://www.unicode.org/faq/utf_bom.html#29>.

This is so sick I nearly can't believe that. Some entities shouldn't be
accepted as members of any consortia.

Sorry. I had to say that.

Regards
-- tomás


* Re: coding tags and utf-16
  2006-02-28  1:08           ` Kenichi Handa
  2006-03-04 20:34             ` Benjamin Riefenstahl
@ 2006-03-16  2:23             ` Kenichi Handa
  1 sibling, 0 replies; 25+ messages in thread
From: Kenichi Handa @ 2006-03-16  2:23 UTC (permalink / raw)
  Cc: ihs_4664, emacs-devel

In article <E1FDtLw-0005XV-00@etlken>, Kenichi Handa <handa@m17n.org> writes:

> Although it's not clear how safe they are, if no one objects,
> I'll add them in auto-coding-regexp-alist.

>> ("\\`\xFE\xFF" . utf-16-be)
>> ("\\`\xFF\xFE" . utf-16-le)

As there's no objection, I've just added them to
auto-coding-regexp-alist.

>> ("\\`\xEF\xBB\xBF" . utf-8)

That one too.

---
Kenichi Handa
handa@m17n.org
