23.0.60; [nxml] BOM and utf-8

unofficial mirror of emacs-devel@gnu.org 

* 23.0.60; [nxml] BOM and utf-8
@ 2008-05-17 12:31 Patrick Drechsler
  2008-05-17 14:13 ` Lennart Borgman (gmail)
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Patrick Drechsler @ 2008-05-17 12:31 UTC (permalink / raw)
  To: emacs-pretest-bug

[-- Attachment #1: Type: text/plain, Size: 3115 bytes --]


Please write in English if possible, because the Emacs maintainers
usually do not have translators to read other languages for them.

Your bug report will be posted to the emacs-pretest-bug@gnu.org mailing list.

Please describe exactly what actions triggered the bug
and the precise symptoms of the bug:

Hi,

is the attached xml file (simple.xml) really invalid (as indicated by
nxhtml) or is this a bug in nxhtml?

describe-char on the first symbol gives (I replaced the BOM part with
XXX):

,----
|         character: XXX (65279, #o177377, #xfeff)
| preferred charset: unicode (Unicode (ISO10646))
|        code point: 0xFEFF
|            syntax: w 	which means: word
|       buffer code: #xEF #xBB #xBF
|         file code: #xEF #xBB #xBF (encoded by coding system utf-8-unix)
|           display: no font available
| 
| Character code properties are not shown: customize what to show
| 
| There is an overlay here:
|  From 1 to 2
|   category             rng-error
|   help-echo            "Missing space after name"
|   priority             1
| 
| 
| There are text properties here:
|   auto-composed        t
|   fontified            t
`----
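
The three bytes listed as buffer code and file code are the UTF-8 encoding of U+FEFF. A minimal Python sketch, added for illustration only (Python is not part of the original report), checks the numbers in the describe-char output:

```python
# Illustrative check of the describe-char numbers above.
bom = "\ufeff"  # the character describe-char reports at position 1

# 65279 decimal, #o177377 octal and #xfeff hex name the same code point.
assert ord(bom) == 65279 == 0o177377 == 0xFEFF

# Encoded as UTF-8 it becomes the three bytes EF BB BF.
encoded = bom.encode("utf-8")
print(encoded.hex(" ").upper())
```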

describe-current-coding-system also looks correct:

,----
| Coding system for saving this buffer:
|   U -- utf-8-unix (alias: mule-utf-8-unix)
`----

Regards,

Patrick

If Emacs crashed, and you have the Emacs process in the gdb debugger,
please include the output from the following gdb commands:
    `bt full' and `xbacktrace'.
If you would like to further debug the crash, please read the file
/home/patrick/prg/stow/emacs-devel/share/emacs/23.0.60/etc/DEBUG for instructions.


In GNU Emacs 23.0.60.4 (i686-pc-linux-gnu, GTK+ Version 2.12.9)
 of 2008-05-08 on golem
Windowing system distributor `The X.Org Foundation', version 11.0.10400090
configured using `configure  '--prefix=/home/patrick/prg/stow/emacs-devel''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default-enable-multibyte-characters: t

Major mode: nXML

Minor modes in effect:
  delete-selection-mode: t
  show-paren-mode: t
  savehist-mode: t
  pc-selection-mode: t
  iswitchb-mode: t
  display-time-mode: t
  shell-dirtrack-mode: t
  tooltip-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
M-x C-g <f10> <menu-bar> <help-menu> <send-emacs-bug-report>

Recent messages:
Loading time...done
Loading iswitchb...done
Loading pc-select...done
Loading savehist...done
Loading paren...done
Emacs startup time: 3 seconds.
For information about GNU Emacs and the GNU system, type C-h C-a.
Using vacuous schema
Quit
Missing space after name


[-- Attachment #2: simple.xml --]
[-- Type: application/xml, Size: 57 bytes --]

<?xml version="1.0" encoding="UTF-8"?>
<data>
</data>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-17 12:31 23.0.60; [nxml] BOM and utf-8 Patrick Drechsler
@ 2008-05-17 14:13 ` Lennart Borgman (gmail)
  2008-05-17 16:57   ` Patrick Drechsler
  2008-05-17 20:38 ` Mark A. Hershberger
  2008-05-18  2:29 ` Stephen J. Turnbull
  2 siblings, 1 reply; 41+ messages in thread
From: Lennart Borgman (gmail) @ 2008-05-17 14:13 UTC (permalink / raw)
  To: Patrick Drechsler; +Cc: emacs-pretest-bug

Patrick Drechsler wrote:
> Hi,
> 
> is the attached xml file (simple.xml) really invalid (as indicated by
> nxhtml) or is this a bug in nxhtml?

Patrick, didn't you open this in nxml-mode, not nxhtml-mode?






* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-17 14:13 ` Lennart Borgman (gmail)
@ 2008-05-17 16:57   ` Patrick Drechsler
  0 siblings, 0 replies; 41+ messages in thread
From: Patrick Drechsler @ 2008-05-17 16:57 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

"Lennart Borgman (gmail)" <lennart.borgman@gmail.com> writes:

> Patrick Drechsler wrote:
>>
>> is the attached xml file (simple.xml) really invalid (as indicated by
>> nxhtml) or is this a bug in nxhtml?
>
> Patrick, didn't you open this in nxml-mode, not nxhtml-mode?

It does not make a difference if I use nxhtml or nxml mode. To
reproduce:

1. emacs -Q simple.xml
2. M-x nxml-mode

-> nxml-mode displays "(nXML Invalid)" in the mode line


-- 
I'm not asleep. I'm just looking at my eyelids!






* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-17 12:31 23.0.60; [nxml] BOM and utf-8 Patrick Drechsler
  2008-05-17 14:13 ` Lennart Borgman (gmail)
@ 2008-05-17 20:38 ` Mark A. Hershberger
  2008-05-21 22:20   ` Patrick Drechsler
  2008-05-18  2:29 ` Stephen J. Turnbull
  2 siblings, 1 reply; 41+ messages in thread
From: Mark A. Hershberger @ 2008-05-17 20:38 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]

Patrick Drechsler <patrick@pdrechsler.de> writes:

> is the attached xml file (simple.xml) really invalid (as indicated by
> nxhtml) or is this a bug in nxhtml?

The file simple.xml is really invalid.

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-prolog-dtd

The XML spec gives the following syntax description for the prolog of an
XML file (I've only copied the relevant parts):

    [3]    S	    ::=  (#x20 | #x9 | #xD | #xA)+
    [22]   prolog   ::=  XMLDecl? Misc* (doctypedecl  Misc*)?
    [23]   XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

Note that there is no S before the literal "<?xml" and that the XML
declaration itself is optional.

So, yes, a file that contains whitespace before "<?xml" is invalid XML.
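
The whitespace rule can be tried out with any conforming parser. The following is a minimal sketch using Python's standard-library parser (an assumption of this illustration, not something used in the thread); `is_well_formed` is a hypothetical helper name. It covers only the leading-whitespace case, not how a parser treats a leading BOM.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the standard-library parser accepts the bytes as XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n<data>\n</data>\n'

print(is_well_formed(declaration))         # declaration at the very start
print(is_well_formed(b" " + declaration))  # one space before "<?xml"
```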

-- 
http://hexmode.com/
GPG Fingerprint: 7E15 362D A32C DFAB E4D2  B37A 735E F10A 2DFC BFF5

Ideas create idols; only wonder leads to knowing.
    -- St. Gregory of Nyssa

[-- Attachment #2: Type: application/pgp-signature, Size: 188 bytes --]


* 23.0.60; [nxml] BOM and utf-8
  2008-05-17 12:31 23.0.60; [nxml] BOM and utf-8 Patrick Drechsler
  2008-05-17 14:13 ` Lennart Borgman (gmail)
  2008-05-17 20:38 ` Mark A. Hershberger
@ 2008-05-18  2:29 ` Stephen J. Turnbull
  2008-05-18  2:30   ` Miles Bader
  2 siblings, 1 reply; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-18  2:29 UTC (permalink / raw)
  To: Patrick Drechsler; +Cc: emacs-pretest-bug

Patrick Drechsler writes:

 > is the attached xml file (simple.xml) really invalid (as indicated by
 > nxhtml) or is this a bug in nxhtml?

Neither.  Emacs is (arguably) reading it incorrectly.

 > describe-char on the first symbol gives (I replaced the BOM part with
 > XXX):

The signature is *not* part of the text according to the Unicode
standard, and if recognized as a signature should be removed by the
I/O system (here, Emacs) before passing it to the XML processor.

 > |         file code: #xEF #xBB #xBF (encoded by coding system utf-8-unix)

There should be an Emacs coding system that removes the BOM.  The XML
standard requires that the XML declaration, if present, be the first
thing in the file.  XML does not recognize the BOM as part of the
prolog, optional or otherwise.  The BOM signals the encoding of the
document, but in XML the atomic constituents are characters; there is
no encoding, and thus no place for a BOM.  (The standard recognizes
that encoding varies from context to context, and provides means for
specifying it, but that's a different issue.)

See Mark Hershberger's reply for more detail on the syntax of an XML
file.
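
As an illustration of an I/O layer removing the signature before the text reaches an XML processor, here is a small Python sketch. It relies on Python's `utf-8` and `utf-8-sig` codecs; the sample bytes are an assumption modelled on the attachment shown earlier, and nothing here describes Emacs internals.

```python
# Sample bytes: UTF-8 signature followed by a small XML document.
raw = b'\xef\xbb\xbf<?xml version="1.0" encoding="UTF-8"?>\n<data>\n</data>\n'

kept = raw.decode("utf-8")          # plain UTF-8 keeps U+FEFF as the first character
stripped = raw.decode("utf-8-sig")  # "utf-8-sig" drops a leading signature if present

print(repr(kept[0]))   # the signature character
print(stripped[:5])    # text now starts with the XML declaration
```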


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  2:29 ` Stephen J. Turnbull
@ 2008-05-18  2:30   ` Miles Bader
  2008-05-18  3:19     ` Eli Zaretskii
  2008-05-18  4:13     ` Stephen J. Turnbull
  0 siblings, 2 replies; 41+ messages in thread
From: Miles Bader @ 2008-05-18  2:30 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, Patrick Drechsler

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
>  > is the attached xml file (simple.xml) really invalid (as indicated by
>  > nxhtml) or is this a bug in nxhtml?
>
> Neither.  Emacs is (arguably) reading it incorrectly.

By "arguably" I presume you're referring to the "Microsoft does <random
stupid thing>, therefore everybody who doesn't do <random stupid thing>
is incorrect" tactic.

I think it would be a lot _more_ arguable that Microsoft apps
which randomly add BOM to the beginning of files where it is invalid are
broken.  In general, other apps that read such files are not expecting
the BOM, and won't be able to deal with it.  So Emacs wouldn't be doing
the user any favors by hiding the BOM from him.

BOM is not part of UTF-8.  UTF-8 files that contain "BOM" are simply
UTF-8 files with a random weird character at the beginning.

-Miles

-- 
In New York, most people don't have cars, so if you want to kill a person, you
have to take the subway to their house.  And sometimes on the way, the train
is delayed and you get impatient, so you have to kill someone on the subway.
  [George Carlin]


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  2:30   ` Miles Bader
@ 2008-05-18  3:19     ` Eli Zaretskii
  2008-05-18  4:19       ` Stephen J. Turnbull
  2008-05-18  8:56       ` Jason Rumney
  2008-05-18  4:13     ` Stephen J. Turnbull
  1 sibling, 2 replies; 41+ messages in thread
From: Eli Zaretskii @ 2008-05-18  3:19 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, stephen, patrick

> From: Miles Bader <miles@gnu.org>
> Date: Sun, 18 May 2008 11:30:05 +0900
> Cc: emacs-pretest-bug@gnu.org, Patrick Drechsler <patrick@pdrechsler.de>
> 
> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
> >  > is the attached xml file (simple.xml) really invalid (as indicated by
> >  > nxhtml) or is this a bug in nxhtml?
> >
> > Neither.  Emacs is (arguably) reading it incorrectly.
> 
> By "arguably" I presume you're referring to the "Microsoft does <random
> stupid thing>, therefore everybody who doesn't do <random stupid thing>
> is incorrect" tactic.

I'm not sure you are barking up the right tree.  AFAIK, Microsoft doesn't
use UTF-8 at all, they use UTF-16 (where a BOM is generally
necessary).





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  2:30   ` Miles Bader
  2008-05-18  3:19     ` Eli Zaretskii
@ 2008-05-18  4:13     ` Stephen J. Turnbull
  2008-05-18  5:40       ` Miles Bader
  2008-05-18  9:14       ` David Kastrup
  1 sibling, 2 replies; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-18  4:13 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, Patrick Drechsler

Miles Bader writes:

 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
 > >  > is the attached xml file (simple.xml) really invalid (as indicated by
 > >  > nxhtml) or is this a bug in nxhtml?
 > >
 > > Neither.  Emacs is (arguably) reading it incorrectly.
 > 
 > By "arguably" I presume you're referring to the "Microsoft does <random
 > stupid thing>, therefore everybody who doesn't do <random stupid thing>
 > is incorrect" tactic.

No, by "arguably" I'm referring to the fact that although the optional
UTF-8 signature has been part of ISO/IEC 10646-1 and Unicode for a
decade or so, not to mention Internet STD 63 (aka RFC 3629), I fully
expected somebody like you to pop up and argue about it.

It is a bad standard (see STD 63) and possibly Microsoft-induced, but
it *is* the standard and is showing no signs of going away; see
Section 16.8 of *The Unicode Standard*, v5.0.  In fact, the trend is
the other way around: the ancient RFCs 2044 and 2279 don't mention it
either way, but STD 63 found it necessary to *add* it.

 > In general, other apps that read such files are not expecting the
 > BOM, and won't be able to deal with it.  So Emacs wouldn't be doing
 > the user any favors by hiding the BOM from him.

So pop up a warning to the effect that the BOM was stripped per the
Unicode standard, and that if it needs to be preserved, set
UNICODE_ME_SOFTLY in the environment or bind `unicode-me-softly'
around the codec.

Alternatively, sabotage the Microsoft users by silently eating the BOM
on the way in, and writing the file in GNU substandard[1] format on the
way out.

Footnotes: 
[1]  A substandard is a standard with stupid optional features
subtracted. :-)


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  3:19     ` Eli Zaretskii
@ 2008-05-18  4:19       ` Stephen J. Turnbull
  2008-05-18  8:56       ` Jason Rumney
  1 sibling, 0 replies; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-18  4:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-pretest-bug, patrick, Miles Bader

Eli Zaretskii writes:

 > I'm not sure you are barking up the right tree.  AFAIK, Microsoft doesn't
 > use UTF-8 at all, they use UTF-16 (where a BOM is generally
 > necessary).

They do use UTF-8 in some text formats, at least Notepad and Wordpad
used to produce BOM-prefixed UTF-8 for "text" and (IIRC) UTF-16 for
"Unicode text".





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  4:13     ` Stephen J. Turnbull
@ 2008-05-18  5:40       ` Miles Bader
  2008-05-18  9:14       ` David Kastrup
  1 sibling, 0 replies; 41+ messages in thread
From: Miles Bader @ 2008-05-18  5:40 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, Patrick Drechsler

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
> No, by "arguably" I'm referring to the fact that although the optional
> UTF-8 signature has been part of ISO/IEC 10646-1 and Unicode for a
> decade or so, not to mention Internet STD 63 (aka RFC 3629), I fully
> expected somebody like you to pop up and argue about it.

"Somebody like me"?

-Miles

-- 
I'd rather be consing.





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  3:19     ` Eli Zaretskii
  2008-05-18  4:19       ` Stephen J. Turnbull
@ 2008-05-18  8:56       ` Jason Rumney
  2008-05-18 11:00         ` Patrick Drechsler
  2008-05-18 15:19         ` joakim
  1 sibling, 2 replies; 41+ messages in thread
From: Jason Rumney @ 2008-05-18  8:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-pretest-bug, stephen, patrick, Miles Bader

Eli Zaretskii wrote:
> I'm not sure you are barking up the right tree.  AFAIK, Microsoft doesn't
> use UTF-8 at all, they use UTF-16 (where a BOM is generally
> necessary).
>   

What Miles is talking about is certain Microsoft software (including 
their XML library), which when saving to UTF-8 writes a UTF-8 encoded 
0xFEFF at the start of the file. It's probably caused by first encoding
in UTF-16 then transcoding to UTF-8.
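
Whether the Microsoft software actually works this way is a guess; the Python sketch below only shows that the mechanism is plausible. It assumes a hypothetical application that holds UTF-16LE data with a BOM and transcodes it character by character without treating the BOM specially.

```python
# UTF-16LE bytes as a hypothetical application might hold them: BOM, then text.
utf16 = b"\xff\xfe" + "<data/>".encode("utf-16-le")

# Decoding with the endian-specific codec does not consume the BOM,
# so U+FEFF stays in the text as an ordinary character ...
text = utf16.decode("utf-16-le")

# ... and re-encoding that text as UTF-8 yields the EF BB BF prefix.
utf8 = text.encode("utf-8")
print(utf8[:3].hex(" ").upper())
```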







* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  4:13     ` Stephen J. Turnbull
  2008-05-18  5:40       ` Miles Bader
@ 2008-05-18  9:14       ` David Kastrup
  2008-05-19  3:05         ` Stephen J. Turnbull
  1 sibling, 1 reply; 41+ messages in thread
From: David Kastrup @ 2008-05-18  9:14 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> So pop up a warning to the effect that the BOM was stripped per the
> Unicode standard, and that if it needs to be preserved, set
> UNICODE_ME_SOFTLY in the environment or bind `unicode-me-softly'
> around the codec.

It would be sufficient to use an encoding variation which adds the bom
back on writing.
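
A sketch of what such an encoding variation would do, written in Python rather than as an Emacs coding system: strip the signature on read, remember that it was there, and put it back on write. The function names `load` and `save` are hypothetical and chosen for this illustration only.

```python
import codecs
from typing import Tuple


def load(raw: bytes) -> Tuple[str, bool]:
    """Decode UTF-8 bytes, returning the text and whether a signature was present."""
    had_bom = raw.startswith(codecs.BOM_UTF8)
    return raw.decode("utf-8-sig"), had_bom


def save(text: str, had_bom: bool) -> bytes:
    """Encode text as UTF-8, restoring the signature only if the input had one."""
    body = text.encode("utf-8")
    return codecs.BOM_UTF8 + body if had_bom else body


original = codecs.BOM_UTF8 + b"<data>\n</data>\n"
text, had_bom = load(original)
edited = text.replace("<data>", "<data>\n  <item/>")
print(save(edited, had_bom))
```

With this approach an unedited file is written back byte for byte, and the text handed to an XML validator never contains U+FEFF.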

I am actually surprised that this is not done right now: I thought we
had a discussion about having the BOM-encodings early in the automatic
encoding detections.

> Alternatively, sabotage the Microsoft users by silently eating the BOM
> on the way in, and writing the file in GNU substandard[1] format on
> the way out.

In my impression, Emacs developers are not nonchalant about having
Emacs write a byte sequence differing from what it read in (except where
it can't help it, as with non-canonically encoded valid texts in
shift-character-based encodings), and that is one of its better features.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  8:56       ` Jason Rumney
@ 2008-05-18 11:00         ` Patrick Drechsler
  2008-05-19  3:11           ` Stephen J. Turnbull
  2008-05-18 15:19         ` joakim
  1 sibling, 1 reply; 41+ messages in thread
From: Patrick Drechsler @ 2008-05-18 11:00 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

Jason Rumney <jasonr@gnu.org> writes:

> Eli Zaretskii wrote:
>> I'm not sure you are barking up the right tree.  AFAIK, Microsoft
>> doesn't use UTF-8 at all, they use UTF-16 (where a BOM is generally
>> necessary).
>
> What Miles is talking about is certain Microsoft software (including
> their XML library), which when saving to UTF-8 writes a UTF-8 encoded
> 0xFEFF at the start of the file. It's probably caused by first encoding
> in UTF-16 then transcoding to UTF-8.

This seems to be the problem in my case: A microsoft .NET application
adds the BOM to the xml file which is encoded as utf-8.

Thanks to all for the instructive feedback on this issue! 

I second the opinion that it would be nice to have the option to hide
(or remove) the BOM while editing the file in Emacs and reinserting it
(if it was removed) when done editing. Otherwise one is not able to
validate the rest of the xml file using nxml.

Cheers,

Patrick 


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  8:56       ` Jason Rumney
  2008-05-18 11:00         ` Patrick Drechsler
@ 2008-05-18 15:19         ` joakim
  1 sibling, 0 replies; 41+ messages in thread
From: joakim @ 2008-05-18 15:19 UTC (permalink / raw)
  To: emacs-devel

Jason Rumney <jasonr@gnu.org> writes:


> What Miles is talking about is certain Microsoft software (including
> their XML library), which when saving to UTF-8 writes a UTF-8 encoded
> 0xFEFF at the start of the file. It's probably caused by first encoding
> in UTF-16 then transcoding to UTF-8.
>

Anyway, having Emacs handle the BOM in a useful manner might help
convince a couple of non-Emacs users at my workplace of Emacs'
superiority.


-- 
Joakim Verona





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19  3:05         ` Stephen J. Turnbull
@ 2008-05-18 23:40           ` David Kastrup
  2008-05-19 20:34             ` Stephen J. Turnbull
  2008-05-19  6:32           ` Lennart Borgman (gmail)
  1 sibling, 1 reply; 41+ messages in thread
From: David Kastrup @ 2008-05-18 23:40 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> In any case, maintaining faithfulness of representation is simply not
> possible, as you point out

With some coding systems.  But the latin-* and utf-* can maintain the
binary stream since their coding is required to be canonical in the
standard.  Everything that is not canonical (including the byte
sequences for encoding out-of-line octets) is encoded as out-of-line
octets as far as I understand.
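
The idea of carrying non-canonical bytes through as out-of-line octets has a rough analogue in Python's `surrogateescape` error handler. The sketch below is only an analogy under that assumption; it does not show how Emacs represents such bytes internally, and `roundtrip` is a hypothetical helper.

```python
def roundtrip(raw: bytes) -> bytes:
    """Decode as UTF-8 and encode again, carrying undecodable bytes through."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


valid = "Grüße".encode("utf-8")
overlong = b"abc\xc0\xafdef"  # non-canonical (overlong) encoding of "/"

print(roundtrip(valid) == valid)
print(roundtrip(overlong) == overlong)
```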

> (safe-character-sets or whatever you call your analog to latin-unity
> being another case).  It's also not at all obvious that that is a very
> useful requirement when dealing with a character-oriented standard
> like Unicode or XML, since you can expect many applications to
> canonicalize the text "behind your back".

That's not an issue.  But for example, you can use Emacs to load some
library in the coding its texts are encoded in, search and edit a string
in overwrite mode (as long as it does not get longer) and save again,
and the result will usually work.

Also you can load, edit and save a text file in collaborative
environments, and the diffs/patches will be just in the edited areas
(this will supposedly work better with Emacs-23 than Emacs-22).  Those
are quite important features.

> Users should get used to it, and we should document how to force Emacs
> to error rather than do anything behind your back for those who need
> binary faithfulness rather than text faithfulness.

Since binary faithfulness implies text faithfulness, there is no reason
not to do the right thing instead of erroring out.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18  9:14       ` David Kastrup
@ 2008-05-19  3:05         ` Stephen J. Turnbull
  2008-05-18 23:40           ` David Kastrup
  2008-05-19  6:32           ` Lennart Borgman (gmail)
  0 siblings, 2 replies; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-19  3:05 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

David Kastrup writes:

 > It would be sufficient to use an encoding variation which adds the bom
 > back on writing.
 > 
 > I am actually surprised that this is not done right now: I thought we
 > had a discussion about having the BOM-encodings early in the automatic
 > encoding detections.

IIRC this is an issue recently reported by Eli, discussed, and ISTR
already fixed by Handa-san.  Don't have time to dig it up though.
Something about -le vs -littleendian.

 > > Alternatively, sabotage the Microsoft users by silently eating the BOM
 > > on the way in, and writing the file in GNU substandard[1] format on
 > > the way out.
 > 
 > Emacs developers are not nonchalant about having Emacs write a byte
 > sequence differing from what it read in

OK, I should always use smileys on this list, my bad.  The main point
was to get in the "substandard" joke, YHBT HAND.  And I was
recommending that for Miles's benefit, not as an Emacs default.

In any case, maintaining faithfulness of representation is simply not
possible, as you point out (safe-character-sets or whatever you call
your analog to latin-unity being another case).  It's also not at all
obvious that that is a very useful requirement when dealing with a
character-oriented standard like Unicode or XML, since you can expect
many applications to canonicalize the text "behind your back".

Users should get used to it, and we should document how to force Emacs
to error rather than do anything behind your back for those who need
binary faithfulness rather than text faithfulness.


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18 11:00         ` Patrick Drechsler
@ 2008-05-19  3:11           ` Stephen J. Turnbull
  2008-05-19 14:32             ` Patrick Drechsler
  0 siblings, 1 reply; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-19  3:11 UTC (permalink / raw)
  To: Patrick Drechsler; +Cc: emacs-pretest-bug, emacs-devel

Patrick Drechsler writes:

 > I second the opinion that it would be nice to have the option to hide
 > (or remove) the BOM while editing the file in Emacs and reinserting it
 > (if it was removed) when done editing.

I believe that there is a utf-8-signature or similarly named coding
system which does this.  (It's called utf-8-bom in XEmacs but IIRC
Emacs uses the more accurate name, since UTF-8 is of course always
bigendian.)

You can use the auto-coding-alist (or something like that) to ensure
that files with certain names always get the strip-BOM-on-input,
prepend-BOM-on-output behavior.


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19  3:05         ` Stephen J. Turnbull
  2008-05-18 23:40           ` David Kastrup
@ 2008-05-19  6:32           ` Lennart Borgman (gmail)
  1 sibling, 0 replies; 41+ messages in thread
From: Lennart Borgman (gmail) @ 2008-05-19  6:32 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

Stephen J. Turnbull wrote:
>  > I am actually surprised that this is not done right now: I thought we
>  > had a discussion about having the BOM-encodings early in the automatic
>  > encoding detections.
> 
> IIRC this is an issue recently reported by Eli, discussed, and ISTR
> already fixed by Handa-san.  Don't have time to dig it up though.
> Something about -le vs -littleendian.

Maybe that was done for xml-mode, but not for nxml-mode?





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19  3:11           ` Stephen J. Turnbull
@ 2008-05-19 14:32             ` Patrick Drechsler
  2008-05-19 18:56               ` Eli Zaretskii
  0 siblings, 1 reply; 41+ messages in thread
From: Patrick Drechsler @ 2008-05-19 14:32 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Patrick Drechsler writes:
>
>  > I second the opinion that it would be nice to have the option to
>  > hide (or remove) the BOM while editing the file in Emacs and
>  > reinserting it (if it was removed) when done editing.
>
> I believe that there is a utf-8-signature or similarly named coding
> system which does this.  (It's called utf-8-bom in XEmacs but IIRC
> Emacs uses the more accurate name, since UTF-8 is of course always
> bigendian.)

Thanks for the feedback Stephen.

I am not able to find this coding system by searching through the list
of possible coding systems (describe-coding-system -> TAB). The only
appearances of the string "sig" are all utf-16 related (the string "bom"
did not return any results either):

,----
| 4 matches for "sig" in buffer: *Completions*
|     386:utf-16be-with-signature 	utf-16be-with-signature-dos
|     387:utf-16be-with-signature-mac 	utf-16be-with-signature-unix
|     390:utf-16le-with-signature 	utf-16le-with-signature-dos
|     391:utf-16le-with-signature-mac 	utf-16le-with-signature-unix
`----

> You can use the auto-coding-alist (or something like that) to ensure
> that files with certain names always get the strip-BOM-on-input,
> prepend-BOM-on-output behavior.

Cheers,

Patrick 





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19 14:32             ` Patrick Drechsler
@ 2008-05-19 18:56               ` Eli Zaretskii
  2008-05-20 15:16                 ` Patrick Drechsler
  0 siblings, 1 reply; 41+ messages in thread
From: Eli Zaretskii @ 2008-05-19 18:56 UTC (permalink / raw)
  To: Patrick Drechsler; +Cc: emacs-devel

> From: Patrick Drechsler <patrick@pdrechsler.de>
> Date: Mon, 19 May 2008 16:32:10 +0200
> Cc: emacs-devel@gnu.org
> 
> > I believe that there is a utf-8-signature or similarly named coding
> > system which does this.  (It's called utf-8-bom in XEmacs but IIRC
> > Emacs uses the more accurate name, since UTF-8 is of course always
> > bigendian.)
> 
> Thanks for the feedback Stephen.
> 
> I am not able to find this coding system by searching through the list
> of possible coding systems (describe-coding-system -> TAB).

That's because Emacs doesn't have it.




* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-18 23:40           ` David Kastrup
@ 2008-05-19 20:34             ` Stephen J. Turnbull
  2008-05-19 20:57               ` David Kastrup
  0 siblings, 1 reply; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-19 20:34 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > In any case, maintaining faithfulness of representation is simply not
 > > possible, as you point out
 > 
 > With some coding systems.  But the latin-* and utf-* can maintain the
 > binary stream since their coding is required to be canonical in the
 > standard.

latin-* will do so because of their extremely limited range.  It's
unfortunate that programmer intuitions about text have been
Americanized (== drastically limited) by these encodings.

utf-* can maintain representation in the very limited sense you have
in mind, and I know that is very useful to you in dealing with non-
conforming applications like TeX.  However, you still run into the
problem that faithfulness of representation is not a goal of Unicode.

 > > It's also not at all obvious that that is a very
 > > useful requirement when dealing with a character-oriented standard
 > > like Unicode or XML, since you can expect many applications to
 > > canonicalize the text "behind your back".
 > 
 > That's not an issue.

What do you mean by "that's not an issue?"  How can you know when I
haven't named the application?

 > Also you can load, edit and save a text file in collaborative
 > environments, and the diffs/patches will be just in the edited areas
 > (this will supposedly work better with Emacs-23 than Emacs-22).  Those
 > are quite important features.

Sure, and Emacs must provide coding systems that preserve them, and
generally use those coding systems by default.  Did anybody say
otherwise?

 > > Users should get used to it, and we should document how to force Emacs
 > > to error rather than do anything behind your back for those who need
 > > binary faithfulness rather than text faithfulness.
 > 
 > Since binary faithfulness implies text faithfulness, there is no reason
 > not to do the right thing instead of erroring out.

"There is no reason"?  How arrogant of you!  Rather, "David Kastrup
lacks the knowledge of the reasons."  Here are three examples:

Binary faithfulness may imply breaking text programs.  For example,
`forward-char' and `replace-string' will give surprising results in a
buffer using Unicode internally that contains Unicode in NFD
normalization (and these anomalies will be noticeable in all Western
European languages excluding English).  Binary faithfulness may imply
inefficiency.  For example, files need not be normalized, which would
imply keeping a copy of the whole file and doing a Unicode diff to
determine which parts of the file need to be saved from the buffer and
which parts from the saved copy.  Binary faithfulness may be
incompatible with other user demands, for example if a user introduces
Latin-2 characters into a Latin-9 text.

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19 20:34             ` Stephen J. Turnbull
@ 2008-05-19 20:57               ` David Kastrup
  2008-05-19 23:36                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 41+ messages in thread
From: David Kastrup @ 2008-05-19 20:57 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>  > > In any case, maintaining faithfulness of representation is simply not
>  > > possible, as you point out
>  > 
>  > With some coding systems.  But the latin-* and utf-* can maintain
>  > the binary stream since their coding is required to be canonical in
>  > the standard.
>
> latin-* will do so because of their extremely limited range.  It's
> unfortunate that programmer intuitions about text have been
> Americanized (== drastically limited) by these encodings.
>
> utf-* can maintain representation in the very limited sense you have
> in mind, and I know that is very useful to you in dealing with non-
> conforming applications like TeX.  However, you still run into the
> problem that faithfulness of representation is not a goal of Unicode.

I am not interested in the "goal of Unicode" but in that of Emacs.
Unicode is about text files.  But Emacs communicates via byte streams
and those are not necessarily text, or necessarily all text.

>  > > It's also not at all obvious that that is a very useful
>  > > requirement when dealing with a character-oriented standard like
>  > > Unicode or XML, since you can expect many applications to
>  > > canonicalize the text "behind your back".
>  > 
>  > That's not an issue.
>
> What do you mean by "that's not an issue?"  How can you know when I
> haven't named the application?

Because we are not talking about what arbitrary applications may do, but
what Emacs should do.  There may be other applications that tend to
garble byte streams, and there might even be some Elisp applications
that garble byte streams.  But that does not mean that the Emacs core
should feel nonchalant about garbling byte streams.

>  > Also you can load, edit and save a text file in collaborative
>  > environments, and the diffs/patches will be just in the edited
>  > areas (this will supposedly work better with Emacs-23 than
>  > Emacs-22).  Those are quite important features.
>
> Sure, and Emacs must provide coding systems that preserve them, and
> generally use those coding systems by default.  Did anybody say
> otherwise?

So what was your point supposed to be?

>  > > Users should get used to it, and we should document how to force
>  > > Emacs to error rather than do anything behind your back for those
>  > > who need binary faithfulness rather than text faithfulness.
>  > 
>  > Since binary faithfulness implies text faithfulness, there is no
>  > reason not to do the right thing instead of erroring out.
>
> "There is no reason"?  How arrogant of you!  Rather, "David Kastrup
> lacks the knowledge of the reasons."  Here are three examples:
>
> Binary faithfulness may imply breaking text programs.  For example,
> `forward-char' and `replace-string' will give surprising results in a
> buffer using Unicode internally that contains Unicode in NFD
> normalization (and these anomalies will be noticeable in all Western
> European languages excluding English).

So forward-char and replace-string should be made to work as expected on
non-normalized texts.  One could even normalize texts and use text
properties in order to restore the non-normalized form when
communicating externally.

> Binary faithfulness may imply inefficiency.  For example, files need
> not be normalized, which would imply keeping a copy of the whole file
> and doing a Unicode diff to determine which parts of the file need to
> be saved from the buffer and which parts from the saved copy.

That sounds more like "binary faithfulness may inspire stupidity".  Of
course one needs to look for reasonable implementations.  Inefficiency
has not kept us from moving the Emacs-20.1 MULE model (where buffer and
string offsets were byte-oriented) to the 20.7 model (no idea where the
transition happened exactly) with character-based buffer and string
offsets.  Sometimes one has to balance sanity and efficiency.  And there
are ways for getting a reasonable amount of efficiency back.

> Binary faithfulness may be incompatible with other user demands, for
> example if a user introduces Latin-2 characters into a Latin-9 text.

Why do you think we switched to utf-8 internally and got rid of latin
unification?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum




* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19 20:57               ` David Kastrup
@ 2008-05-19 23:36                 ` Stephen J. Turnbull
  2008-05-20  7:13                   ` David Kastrup
  0 siblings, 1 reply; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-19 23:36 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

David Kastrup writes:

 > I am not interested in the "goal of Unicode" but in that of Emacs.
 > Unicode is about text files.  But Emacs communicates via byte streams
 > and those are not necessarily text, or necessarily all text.

Some Emacs files *are* text, and getting them to behave correctly will
require understanding "the goals of Unicode".  Since Unicode is now
the underlying representation of multibyte buffers, you don't have a
choice about this.  Cf. Thomas Morgan's recent post on "disappearing
cursor".

 > > Sure, and Emacs must provide coding systems that preserve them, and
 > > generally use those coding systems by default.  Did anybody say
 > > otherwise?
 > 
 > So what was your point supposed to be?

That Miles could use a BOM-swallowing encoding on input and a non-BOM-
producing encoding on output to enforce his view of Microsoft
conventions on others.  I told Patrick what I thought *Emacs* should
do, but apparently it doesn't do that yet.

 > So forward-char and replace-string should be made to work as
 > expected on non-normalized texts.

Good luck.  I don't know how to do that, and doubt that it is
possible.  I do not think that "as expected" can be well defined,
because for purposes like computing storage requirements composing
characters should be considered characters, while for others like
computing the number of columns occupied by a line they should not.

 > > Binary faithfulness may be incompatible with other user demands, for
 > > example if a user introduces Latin-2 characters into a Latin-9 text.
 > 
 > Why do you think we switched to utf-8 internally and got rid of latin
 > unification?

David, don't you realize that is not a response to what I wrote?

I think it's time to stop this thread until you address the issues
instead of me.

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19 23:36                 ` Stephen J. Turnbull
@ 2008-05-20  7:13                   ` David Kastrup
  2008-05-30  2:47                     ` Kenichi Handa
  0 siblings, 1 reply; 41+ messages in thread
From: David Kastrup @ 2008-05-20  7:13 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, Patrick Drechsler, Miles Bader

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > I am not interested in the "goal of Unicode" but in that of Emacs.
>  > Unicode is about text files.  But Emacs communicates via byte streams
>  > and those are not necessarily text, or necessarily all text.
>
> Some Emacs files *are* text, and getting them to behave correctly will
> require understanding "the goals of Unicode".  Since Unicode is now
> the underlying representation of multibyte buffers, you don't have a
> choice about this.  Cf. Thomas Morgan's recent post on "disappearing
> cursor".

Sigh.  Bugs are there to be fixed, not to be used as an excuse for more
bugs.  The interpretation of Unicode is a matter of the display engine,
not of the byte stream encoders/decoders.

>  > > Sure, and Emacs must provide coding systems that preserve them,
>  > > and generally use those coding systems by default.  Did anybody
>  > > say otherwise?
>  > 
>  > So what was your point supposed to be?
>
> That Miles could use a BOM-swallowing encoding on input and a non-BOM-
> producing encoding on output to enforce his view of Microsoft
> conventions on others.

I suppose you underestimate Miles here.

>  > So forward-char and replace-string should be made to work as
>  > expected on non-normalized texts.
>
> Good luck.  I don't know how to do that, and doubt that it is
> possible.

We have similar issues with case-folding replacements.  Anyway: one
problem is not an excuse to introduce unrelated bugs elsewhere.  Moving
character unification to a place where it does more damage does not
magically make the problem different.

> I do not think that "as expected" can be well defined, because for
> purposes like computing storage requirements composing characters
> should be considered characters, while for others like computing the
> number of columns occupied by a line they should not.

Again, you are being destructive.  Problems don't present an excuse for
being sloppy.  If one can see a problem that can't be fixed in
principle, then one should try confining it to those operations where it
is inherent instead of spreading its effects all around and making
everything unpredictable.

Yes, there are questions in the presence of composing characters of what
one wants to have forward-char and replace-string and overwrite-mode do.
One reasonable approach is to consider Unicode glyphs as an inseparable
entity with regard to user commands.  It is basically Emacs 20.2 all
over.  But composed Unicode glyphs have no single code points.  They are
vectors.  As long as a character representation as scalar integers
remains valid, Unicode code points is all that we can do.

>  > > Binary faithfulness may be incompatible with other user demands,
>  > > for example if a user introduces Latin-2 characters into a
>  > > Latin-9 text.
>  > 
>  > Why do you think we switched to utf-8 internally and got rid of
>  > latin unification?
>
> David, don't you realize that is not a response to what I wrote?
>
> I think it's time to stop this thread until you address the issues
> instead of me.

Whatever.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-19 18:56               ` Eli Zaretskii
@ 2008-05-20 15:16                 ` Patrick Drechsler
  0 siblings, 0 replies; 41+ messages in thread
From: Patrick Drechsler @ 2008-05-20 15:16 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Patrick Drechsler <patrick@pdrechsler.de>
>> Date: Mon, 19 May 2008 16:32:10 +0200
>> Cc: emacs-devel@gnu.org
>> 
>> > I believe that there is a utf-8-signature or similarly named coding
>> > system which does this.  (It's called utf-8-bom in XEmacs but IIRC
>> > Emacs uses the more accurate name, since UTF-8 is of course always
>> > bigendian.)
>> 
>> Thanks for the feedback Stephen.
>> 
>> I am not able to find this coding system by searching through the list
>> of possible coding systems (describe-coding-system -> TAB).
>
> That's because Emacs doesn't have it.

OK, thanks for the confirmation.

Cheers,

Patrick 
-- 
God put me on earth to accomplish a certain number of things.
Right now I am so far behind I will never die.
                               -- Bill Waterson, Calvin and Hobbes






* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-17 20:38 ` Mark A. Hershberger
@ 2008-05-21 22:20   ` Patrick Drechsler
  2008-05-21 22:37     ` Patrick Drechsler
  0 siblings, 1 reply; 41+ messages in thread
From: Patrick Drechsler @ 2008-05-21 22:20 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

mah@everybody.org (Mark A. Hershberger) writes:

> Patrick Drechsler <patrick@pdrechsler.de> writes:
>
>> is the attached xml file (simple.xml) really invalid (as indicated by
>> nxhtml) or is this a bug in nxhtml?

s/nxhtml/nxml/

> The file simple.xml is really invalid.
>
> http://www.w3.org/TR/2006/REC-xml-20060816/#sec-prolog-dtd
>
> The XML spec gives the following syntax description for the prolog of an
> XML file (I've only copied the relevant parts):
>
>     [3]    S	    ::=  (#x20 | #x9 | #xD | #xA)+
>     [22]   prolog   ::=  XMLDecl? Misc* (doctypedecl  Misc*)?
>     [23]   XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
>
> Note that there is no S before the literal "<?xml" and that "<?xml" is
> optional.
>
> So, yes, a file that contains whitespace before "<?xml" is invalid XML.

Finally having some spare time I read the specs from above and I have a
followup question concerning your last sentence:

The BOM in my example file is not whitespace, it is xEF xBB xBF (it is
only displayed as whitespace by Emacs). According to the W3C site this
is valid:

,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
| Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
| begin with the Byte Order Mark described by Annex H of [ISO/IEC
| 10646:2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
| (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
| signature, not part of either the markup or the character data of the
| XML document. XML processors MUST be able to use this character to
| differentiate between UTF-8 and UTF-16 encoded documents.
`----

Also http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-no-ext-info

and

,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
| If an XML entity is in a file, the Byte-Order Mark and encoding
| declaration are used (if present) to determine the character encoding.
`----

sound like a BOM is a legal (although optional) part of a xml file coded
in utf-8.

But I am not an expert, so please correct my potentially incorrect
interpretation.

In case my interpretation is correct, this is a bug in emacs' nxml mode.
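
To make the distinction concrete, here is a minimal sketch in Python.
It is not a validator and says nothing about how nxml itself works; the
helper names are invented for this illustration.  It only encodes the
two points above: U+FEFF is not in the S production, and a leading
UTF-8 signature is removed before the prolog grammar is applied.

```python
import codecs

# Production [3] of the XML spec: S ::= (#x20 | #x9 | #xD | #xA)+
XML_S = {0x20, 0x09, 0x0D, 0x0A}


def is_xml_whitespace(ch: str) -> bool:
    """True if ch is one of the four characters allowed by production S."""
    return ord(ch) in XML_S


def starts_with_xml_decl(raw: bytes) -> bool:
    """True if raw begins with '<?xml', ignoring one leading UTF-8 BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        # The BOM is an encoding signature, not markup or character data.
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.startswith(b"<?xml")
```

Under this reading a file that starts with EF BB BF followed by "<?xml"
passes the check, while a file that starts with a real space does not.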

Cheers,

Patrick 
-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.  (Kernighan)

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-21 22:20   ` Patrick Drechsler
@ 2008-05-21 22:37     ` Patrick Drechsler
  2008-05-22  1:33       ` Mark A. Hershberger
  2008-05-22  4:17       ` tomas
  0 siblings, 2 replies; 41+ messages in thread
From: Patrick Drechsler @ 2008-05-21 22:37 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

Patrick Drechsler <patrick@pdrechsler.de> writes:

> mah@everybody.org (Mark A. Hershberger) writes:
>
>> Patrick Drechsler <patrick@pdrechsler.de> writes:
>>
>>> is the attached xml file (simple.xml) really invalid (as indicated by
>>> nxhtml) or is this a bug in nxhtml?
>
> s/nxhtml/nxml/
>
>> The file simple.xml is really invalid.
>>
>> http://www.w3.org/TR/2006/REC-xml-20060816/#sec-prolog-dtd
>>
>> The XML spec gives the following syntax description for the prolog of an
>> XML file (I've only copied the relevant parts):
>>
>>     [3]    S	    ::=  (#x20 | #x9 | #xD | #xA)+
>>     [22]   prolog   ::=  XMLDecl? Misc* (doctypedecl  Misc*)?
>>     [23]   XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
>>
>> Note that there is no S before the literal "<?xml" and that "<?xml" is
>> optional.
>>
>> So, yes, a file that contains whitespace before "<?xml" is invalid XML.
>
> Finally having some spare time I read the specs from above and I have a
> followup question concerning your last sentence:
>
> The BOM in my example file is not whitespace, it is xEF xBB xBF (it is
> only displayed as whitespace by Emacs). According to the W3C site this
> is valid:
>
> ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
> | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
> | begin with the Byte Order Mark described by Annex H of [ISO/IEC
> | 10646:2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
> | (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
> | signature, not part of either the markup or the character data of the
> | XML document. XML processors MUST be able to use this character to
> | differentiate between UTF-8 and UTF-16 encoded documents.
> `----
>
> Also http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-no-ext-info

sorry, wrong link: http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing

> and
>
> ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
> | If an XML entity is in a file, the Byte-Order Mark and encoding
> | declaration are used (if present) to determine the character encoding.
> `----
>
> sound like a BOM is a legal (although optional) part of a xml file coded
> in utf-8.
>
> But I am not an expert, so please correct my potentially incorrect
> interpretation.
>
> In case my interpretation is correct, this is a bug in emacs' nxml mode.
>
> Cheers,
>
> Patrick 

-- 
._q0p_.  
'=(_)='
 / V \
(_/^\_)





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-21 22:37     ` Patrick Drechsler
@ 2008-05-22  1:33       ` Mark A. Hershberger
  2008-05-22 14:43         ` Tom Tromey
  2008-05-22  4:17       ` tomas
  1 sibling, 1 reply; 41+ messages in thread
From: Mark A. Hershberger @ 2008-05-22  1:33 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

Patrick Drechsler <patrick@pdrechsler.de> writes:

> Patrick Drechsler <patrick@pdrechsler.de> writes:

>> Finally having some spare time I read the specs from above and I have a
>> followup question concerning your last sentence:
>>
>> The BOM in my example file is not whitespace, it is xEF xBB xBF (it is
>> only displayed as whitespace by Emacs). According to the W3C site this
>> is valid:

You're right.  I skimmed too fast and missed that important bit.

However, you'll note that the spec still does not leave room for a BOM.
I don't know enough about UTF-8 encoding and XML to know if this
matters.

Mark.

-- 
http://hexmode.com/
GPG Fingerprint: 7E15 362D A32C DFAB E4D2  B37A 735E F10A 2DFC BFF5

Ideas create idols; only wonder leads to knowing.
    -- St. Gregory of Nyssa





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-21 22:37     ` Patrick Drechsler
  2008-05-22  1:33       ` Mark A. Hershberger
@ 2008-05-22  4:17       ` tomas
  2008-05-22  4:33         ` Miles Bader
  2008-05-22 17:34         ` Stephen J. Turnbull
  1 sibling, 2 replies; 41+ messages in thread
From: tomas @ 2008-05-22  4:17 UTC (permalink / raw)
  To: Patrick Drechsler; +Cc: emacs-pretest-bug, emacs-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, May 22, 2008 at 12:37:11AM +0200, Patrick Drechsler wrote:
> Patrick Drechsler <patrick@pdrechsler.de> writes:

This would be rather a question to w3.org, but...

> > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
> > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
> > | begin with the Byte Order Mark [...]
> > |        [...]  XML processors MUST be able to use this character to
> > | differentiate between UTF-8 and UTF-16 encoded documents.
> > `----

...and how are the XML processors supposed to achieve that? Is there a
second variant of BOM, indicating UTF-8?

> > and
> >
> > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
> > | If an XML entity is in a file, the Byte-Order Mark and encoding
> > | declaration are used (if present) to determine the character encoding.
> > `----

...or is rather the absence of a BOM the indicator for UTF-8?

Am I completely whacko, or are they?

Sorry. I am confused.

Ah, and BTW: interpreting the BOM as whitespace is not that far off --
as stated in <http://unicode.org/faq/utf_bom.html#38>:

 | Q: What should I do with U+FEFF in the middle of a file?
 | 
 | A: In the absence of a protocol supporting its use as a BOM and when not
 | at the beginning of a text stream, U+FEFF should normally not occur. For
 | backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING
 | SPACE (ZWNBSP), and is then part of the content of the file or string.
 [...]

But that would be "in the middle of a file", not at the beginning, as
our case is.

I'd appreciate any insights.

Thanks
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFINPPpBcgs9XrR2kYRAutgAJ9BXb32mnDV53T3RTOBu4LGmOfHIgCfUxNG
EJYtPO908ac75bw1vERvRyQ=
=IQaH
-----END PGP SIGNATURE-----

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-22  4:17       ` tomas
@ 2008-05-22  4:33         ` Miles Bader
  2008-05-22  8:28           ` Jason Rumney
  2008-05-27  8:22           ` tomas
  2008-05-22 17:34         ` Stephen J. Turnbull
  1 sibling, 2 replies; 41+ messages in thread
From: Miles Bader @ 2008-05-22  4:33 UTC (permalink / raw)
  To: tomas; +Cc: emacs-pretest-bug, Patrick Drechsler, emacs-devel

tomas@tuxteam.de writes:
>> > |        [...]  XML processors MUST be able to use this character to
>> > | differentiate between UTF-8 and UTF-16 encoded documents.
>> > `----
>
> ...and how are the XML processors supposed to achieve that? Is there a
> second variant of BOM, indicating UTF-8?

The encoding of BOM (incidentally, isn't this name for it obsolete?) is
different in utf-8 and utf-16.
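
A short sketch of what "different encoding" means here, using Python's
built-in codecs; the helper name `bom_bytes` is made up for the example.
The explicit `-be` and `-le` codec names are used because Python's plain
"utf-16" codec would add a signature of its own.

```python
BOM = "\ufeff"  # the single character U+FEFF


def bom_bytes(encoding: str) -> bytes:
    """Return the byte sequence that encodes U+FEFF in the given encoding."""
    return BOM.encode(encoding)


utf8_bom = bom_bytes("utf-8")          # three octets
utf16_be_bom = bom_bytes("utf-16-be")  # two octets, big-endian order
utf16_le_bom = bom_bytes("utf-16-le")  # two octets, little-endian order
```

The three results are distinct byte strings, so the first few octets of
a file are enough to tell which encoding the signature was written in.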

-Miles

-- 
Quotation, n. The act of repeating erroneously the words of another. The words
erroneously repeated.




* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-22  4:33         ` Miles Bader
@ 2008-05-22  8:28           ` Jason Rumney
  2008-05-27  8:22           ` tomas
  1 sibling, 0 replies; 41+ messages in thread
From: Jason Rumney @ 2008-05-22  8:28 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, Patrick Drechsler, tomas, emacs-devel

Miles Bader wrote:
> The encoding of BOM (incidentally, isn't this name for it obsolete?)

The Unicode Consortium seems very confused over this, probably because 
of the format they've chosen for the character info tables they publish, 
which they don't want to change now because software relies on it (the 
only way of specifying an alternate name seems to be to deprecate the 
old name to the "1.0 name" field, and use the new name as the preferred 
name). It was renamed to ZWNBSP in Unicode 2.0 to reflect its dual 
purpose as BOM and zero width no break space.  Then in a later version 
of Unicode its use as ZWNBSP was deprecated, but the official name was 
not changed back (swapping the ZWNBSP and BOM names would not be 
strictly correct, as ZWNBSP was not its name in Unicode 1.0).

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-22  1:33       ` Mark A. Hershberger
@ 2008-05-22 14:43         ` Tom Tromey
  2008-05-22 21:24           ` Miles Bader
  0 siblings, 1 reply; 41+ messages in thread
From: Tom Tromey @ 2008-05-22 14:43 UTC (permalink / raw)
  To: Mark A. Hershberger; +Cc: emacs-pretest-bug, emacs-devel

>>>>> "Mark" == Mark A Hershberger <mah@everybody.org> writes:

Mark> However, you'll note that the spec still does not leave room for a BOM.
Mark> I don't know enough about UTF-8 encoding and XML to know if this
Mark> matters.

Note that this is an issue not just for XML files.  I recently added
some code to libcpp to ignore UTF-8 BOMs -- because some Windows
editors add them by default and because, supposedly, Visual Studio
requires them.

So, I think it would be nice for interoperability to handle them
properly, the same way Emacs handles different kinds of line endings.

Tom

* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-22  4:17       ` tomas
  2008-05-22  4:33         ` Miles Bader
@ 2008-05-22 17:34         ` Stephen J. Turnbull
  2008-05-23  9:05           ` tomas
  1 sibling, 1 reply; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-22 17:34 UTC (permalink / raw)
  To: tomas; +Cc: emacs-pretest-bug, Patrick Drechsler, emacs-devel

tomas@tuxteam.de writes:

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
 > > > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
 > > > | begin with the Byte Order Mark [...]
 > > > |        [...]  XML processors MUST be able to use this character to
 > > > | differentiate between UTF-8 and UTF-16 encoded documents.
 > > > `----
 > 
 > ...and how are the XML processors supposed to achieve that? Is there a
 > second variant of BOM, indicating UTF-8?

Well, note that the BOM is three octets in UTF-8 but only two in
UTF-16.  Dead giveaway, there.

 > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
 > > > | If an XML entity is in a file, the Byte-Order Mark and encoding
 > > > | declaration are used (if present) to determine the character encoding.
 > > > `----
 > 
 > ...or is rather the absence of a BOM the indicator for UTF-8?

Absence of a BOM is *an* indicator for UTF-8, as is presence of the
BOM encoded in UTF-8 as the first 3 octets of a file.

 > Am I completely whacko, or are they?

Neither.  You live in a relatively sane world, they live in a world
which contains the sovereign nations of Japan and Microsoft.

 > But that would be "in the middle of a file", not at the beginning, as
 > our case is.
 > 
 > I'd appreciate any insights.

If there is a higher level protocol which tells you what to do about
the BOM, obey it.  This is the case for XML files.

Otherwise, if U+FEFF occurs at the beginning of a file which otherwise
seems to be valid Unicode (in two-octet and four-octet versions, that
means containing no instances of 0xFFFF and only one endianness of
0xFEFF, in UTF-8, doesn't violate the constraints of UTF-8, and
doesn't contain any sequences that decode to U+FFFF or U+FFFE), ignore
it and start processing with the next character.
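
For the UTF-8 case only, that rule might be sketched roughly as follows.
This is Python, not anything Emacs does, and the function name is an
assumption made for the illustration; the two-octet and four-octet
forms are deliberately left out.

```python
from typing import Optional


def decode_ignoring_utf8_signature(raw: bytes) -> Optional[str]:
    """Decode raw as UTF-8 and drop a leading U+FEFF.

    Returns None when the data does not look like valid Unicode in the
    sense described above.
    """
    try:
        text = raw.decode("utf-8")  # rejects malformed UTF-8 sequences
    except UnicodeDecodeError:
        return None
    # Sequences that decode to U+FFFF or U+FFFE disqualify the data.
    if "\uffff" in text or "\ufffe" in text:
        return None
    # Ignore the signature and start with the next character.
    return text[1:] if text.startswith("\ufeff") else text
```

A U+FEFF that appears later in the text is left alone, since only the
occurrence at the very beginning is treated as a signature.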

The next question is, where is this "XML processing" done?  As David
Kastrup points out, Emacs must be able to pass the BOM through to the
buffer, and he may be correct that that is the best default behavior.
I don't see any way for Emacs to determine when pass through is
appropriate and when not in the coding system; the coding system is
normally invoked at a level where Emacs cannot know that it will be
processed by nxml.

Therefore I think that for the purposes of XML conformance, nxml, not
Emacs, must be considered the XML processor, and nxml's failure to
recognize the BOM and ignore it (for the purpose of checking validity)
is a bug.

However, I'd be careful.  Maybe somebody should ask James Clark why he
did things this way.  He may have had an excellent reason for
insisting that Emacs strip BOMs before passing the file on to nxml.

Or maybe he just never saw UTF-8 signatures, or maybe he never edits
binary files using text encodings, and didn't consider use cases other
than editing text files important enough to provide for the BOM in
nxml.


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-22 14:43         ` Tom Tromey
@ 2008-05-22 21:24           ` Miles Bader
  0 siblings, 0 replies; 41+ messages in thread
From: Miles Bader @ 2008-05-22 21:24 UTC (permalink / raw)
  To: Tom Tromey; +Cc: emacs-pretest-bug, Mark A. Hershberger, emacs-devel

Tom Tromey <tromey@redhat.com> writes:
> Note that this is an issue not just for XML files.  I recently added
> some code to libcpp to ignore UTF-8 BOMs -- because some Windows
> editors add them by default and because, supposedly, Visual Studio
> requires them.

This is the one that's _really_ stupid...

-Miles

-- 
97% of everything is grunge





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-22 17:34         ` Stephen J. Turnbull
@ 2008-05-23  9:05           ` tomas
  2008-05-23 21:23             ` Stephen J. Turnbull
  0 siblings, 1 reply; 41+ messages in thread
From: tomas @ 2008-05-23  9:05 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, Patrick Drechsler, tomas, emacs-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, May 23, 2008 at 02:34:46AM +0900, Stephen J. Turnbull wrote:
> tomas@tuxteam.de writes:
> 
>  > > > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
>  > > > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
>  > > > | begin with the Byte Order Mark [...]
>  > > > |        [...]  XML processors MUST be able to use this character to
>  > > > | differentiate between UTF-8 and UTF-16 encoded documents.
>  > > > `----
>  > 
>  > ...and how are the XML processors supposed to achieve that? Is there a
>  > second variant of BOM, indicating UTF-8?
> 
> Well, note that the BOM is three octets in UTF-8 but only two in
> UTF-16.  Dead giveaway, there.

Duh. Thanks. That's what I was missing.

[...]

>  > Am I completely whacko, or are they?
> 
> Neither.  You live in a relatively sane world, they live in a world
> which contains the sovereign nations of Japan and Microsoft.

Thanks for your kind words :-)

As for whether Emacs or nxml has the burden of skipping the BOM -- that
would correspond to whether nxml "within" Emacs is "seeing" a piece of
XML or a whole XML document, right?

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFINojHBcgs9XrR2kYRAtieAJsEoakhvgRrjisQ9XhIjAap5mISBACaAjrk
7IuDZQjZvvdFoadb90lSygE=
=022X
-----END PGP SIGNATURE-----





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-23  9:05           ` tomas
@ 2008-05-23 21:23             ` Stephen J. Turnbull
  2008-05-27  8:20               ` tomas
  0 siblings, 1 reply; 41+ messages in thread
From: Stephen J. Turnbull @ 2008-05-23 21:23 UTC (permalink / raw)
  To: tomas; +Cc: emacs-pretest-bug, Patrick Drechsler, emacs-devel

tomas@tuxteam.de writes:

 > As for whether Emacs or nxml has the burden of skipping the BOM -- that
 > would correspond to whether nxml "within" Emacs is "seeing" a piece of
 > XML or a whole XML document, right?

No, I don't think so.  First, as I tried to explain, I don't think
that Emacs can reliably "know" that the BOM needs to be skipped at
decoding time.  Second, if the "piece" is what XML calls a "parsed
external entity" (analogous to an include file), it must be subjected
to BOM processing according to section 4.3.3 of the XML standard.  On
the other hand, if the fragment is generated internally to Emacs, then
there should be no BOM, because the BOM is not part of the text of an
XML document: "This is an encoding signature, not part of either the
markup or the character data of the XML document."  Moreover, the BOM
will not be produced with character semantics (as ZWNBSP) by modern
(since Unicode 3.2) Unicode processes.

So I think there is almost never going to be harm in nxml stripping
the BOM, whereas Emacs has to be much more careful.
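
As a rough illustration of what "nxml stripping the BOM" could look like at the character level, here is a Python sketch. It is not how nxml is implemented; `split_signature` and `restore_signature` are hypothetical names. It separates a leading signature from the text handed to the validator and lets the caller put it back when saving.

```python
from typing import Tuple

BOM = "\ufeff"


def split_signature(text: str) -> Tuple[bool, str]:
    """Return (had_signature, body) for already-decoded text.

    Only a U+FEFF in first position counts as a signature;
    any later U+FEFF is left alone as ordinary character data.
    """
    if text.startswith(BOM):
        return True, text[len(BOM):]
    return False, text


def restore_signature(had_signature: bool, body: str) -> str:
    """Re-attach the signature on output if the input carried one."""
    return BOM + body if had_signature else body
```

Keeping the flag separate from the body means the validator never sees the signature, while the file on disk round-trips unchanged.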


* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-23 21:23             ` Stephen J. Turnbull
@ 2008-05-27  8:20               ` tomas
  0 siblings, 0 replies; 41+ messages in thread
From: tomas @ 2008-05-27  8:20 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, Patrick Drechsler, tomas, emacs-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, May 24, 2008 at 06:23:18AM +0900, Stephen J. Turnbull wrote:
> tomas@tuxteam.de writes:
> 
>  > As for whether Emacs or nxml has the burden of skipping the BOM -- that
>  > would correspond to whether nxml "within" Emacs is "seeing" a piece of
>  > XML or a whole XML document, right?
> 
> No, I don't think so.  First, as I tried to explain, I don't think
> that Emacs can reliably "know" that the BOM needs to be skipped at
> decoding time [...]

Yes, that makes sense to me now.

Thanks
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFIO8RLBcgs9XrR2kYRAncjAJ938duk33W3ZcicieUZRjL/bL8qMwCdFD5S
bE1vV1K+rhD7nSNq/DeqDd8=
=cXki
-----END PGP SIGNATURE-----





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-22  4:33         ` Miles Bader
  2008-05-22  8:28           ` Jason Rumney
@ 2008-05-27  8:22           ` tomas
  1 sibling, 0 replies; 41+ messages in thread
From: tomas @ 2008-05-27  8:22 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, Patrick Drechsler, tomas, emacs-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, May 22, 2008 at 01:33:21PM +0900, Miles Bader wrote:
> tomas@tuxteam.de writes:
> >> > |        [...]  XML processors MUST be able to use this character to
> >> > | differentiate between UTF-8 and UTF-16 encoded documents.
> >> > `----
> >
> > ...and how are the XML processors supposed to achieve that? Is there a
> > second variant of BOM, indicating UTF-8?
> 
> The encoding of BOM (incidentally, isn't this name for it obsolete?) is
> different in utf-8 and utf-16.

Stephen noticed as well: this was just the missing piece in my head :)

(BTW: we might redefine the acronym to stand for "Bad Old Microsoft"?).

Regards
- -- tomás
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFIO8TdBcgs9XrR2kYRAvqtAJ9jf4XT5nAgAeeNGjgi9T0f27BkewCfX4HQ
velBGbYVITaFqHsgGRhFlnA=
=88Eg
-----END PGP SIGNATURE-----





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-20  7:13                   ` David Kastrup
@ 2008-05-30  2:47                     ` Kenichi Handa
  2008-05-30  3:44                       ` Miles Bader
  0 siblings, 1 reply; 41+ messages in thread
From: Kenichi Handa @ 2008-05-30  2:47 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-pretest-bug, stephen, patrick, miles

I have not followed the discussion of this thread, but FYI,
I've just added two new coding systems: utf-8-auto and
utf-8-with-signature.  utf-8 still decodes the first 0xFEFF
as a normal character.
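
For readers more familiar with Python than with Emacs coding systems, a loosely analogous split exists between Python's `utf-8` and `utf-8-sig` codecs. The sketch below is only an analogy, not a description of how the Emacs coding systems are implemented, and `decode_both` is a made-up helper name.

```python
from typing import Tuple


def decode_both(raw: bytes) -> Tuple[str, str]:
    """Decode the same octets with and without signature handling."""
    plain = raw.decode("utf-8")       # a leading U+FEFF stays in the text
    signed = raw.decode("utf-8-sig")  # a leading signature is dropped
    return plain, signed


# Encoding with utf-8-sig writes the three-octet signature first.
with_signature = "<a/>".encode("utf-8-sig")
```

With plain `utf-8` the first character of a signed file is U+FEFF, which matches the behaviour described above for the unchanged utf-8 coding system.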

---
Kenichi Handa
handa@ni.aist.go.jp





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-30  2:47                     ` Kenichi Handa
@ 2008-05-30  3:44                       ` Miles Bader
  2008-05-30  3:59                         ` Kenichi Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Miles Bader @ 2008-05-30  3:44 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-pretest-bug, stephen, patrick

Kenichi Handa <handa@m17n.org> writes:
> I have not followed the discussion of this thread, but FYI,
> I've just added two new coding systems: utf-8-auto and
> utf-8-with-signature.  utf-8 still decodes the first 0xFEFF
> as a normal character.

That seems reasonable ... what's their ordering in the coding
priority list by default...?

-Miles

-- 
Generous, adj. Originally this word meant noble by birth and was rightly
applied to a great multitude of persons. It now means noble by nature and is
taking a bit of a rest.





* Re: 23.0.60; [nxml] BOM and utf-8
  2008-05-30  3:44                       ` Miles Bader
@ 2008-05-30  3:59                         ` Kenichi Handa
  0 siblings, 0 replies; 41+ messages in thread
From: Kenichi Handa @ 2008-05-30  3:59 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, stephen, patrick

In article <buotzggh2vt.fsf@dhapc248.dev.necel.com>, Miles Bader <miles.bader@necel.com> writes:

> Kenichi Handa <handa@m17n.org> writes:
> > I have not followed the discussion of this thread, but FYI,
> > I've just added two new coding systems: utf-8-auto and
> > utf-8-with-signature.  utf-8 still decodes the first 0xFEFF
> > as a normal character.

> That seems reasonable ... what's their ordering in the coding
> priority list by default...?

They are fairly low unless explicitly preferred.

---
Kenichi Handa
handa@ni.aist.go.jp





end of thread, other threads:[~2008-05-30  3:59 UTC | newest]

Thread overview: 41+ messages
2008-05-17 12:31 23.0.60; [nxml] BOM and utf-8 Patrick Drechsler
2008-05-17 14:13 ` Lennart Borgman (gmail)
2008-05-17 16:57   ` Patrick Drechsler
2008-05-17 20:38 ` Mark A. Hershberger
2008-05-21 22:20   ` Patrick Drechsler
2008-05-21 22:37     ` Patrick Drechsler
2008-05-22  1:33       ` Mark A. Hershberger
2008-05-22 14:43         ` Tom Tromey
2008-05-22 21:24           ` Miles Bader
2008-05-22  4:17       ` tomas
2008-05-22  4:33         ` Miles Bader
2008-05-22  8:28           ` Jason Rumney
2008-05-27  8:22           ` tomas
2008-05-22 17:34         ` Stephen J. Turnbull
2008-05-23  9:05           ` tomas
2008-05-23 21:23             ` Stephen J. Turnbull
2008-05-27  8:20               ` tomas
2008-05-18  2:29 ` Stephen J. Turnbull
2008-05-18  2:30   ` Miles Bader
2008-05-18  3:19     ` Eli Zaretskii
2008-05-18  4:19       ` Stephen J. Turnbull
2008-05-18  8:56       ` Jason Rumney
2008-05-18 11:00         ` Patrick Drechsler
2008-05-19  3:11           ` Stephen J. Turnbull
2008-05-19 14:32             ` Patrick Drechsler
2008-05-19 18:56               ` Eli Zaretskii
2008-05-20 15:16                 ` Patrick Drechsler
2008-05-18 15:19         ` joakim
2008-05-18  4:13     ` Stephen J. Turnbull
2008-05-18  5:40       ` Miles Bader
2008-05-18  9:14       ` David Kastrup
2008-05-19  3:05         ` Stephen J. Turnbull
2008-05-18 23:40           ` David Kastrup
2008-05-19 20:34             ` Stephen J. Turnbull
2008-05-19 20:57               ` David Kastrup
2008-05-19 23:36                 ` Stephen J. Turnbull
2008-05-20  7:13                   ` David Kastrup
2008-05-30  2:47                     ` Kenichi Handa
2008-05-30  3:44                       ` Miles Bader
2008-05-30  3:59                         ` Kenichi Handa
2008-05-19  6:32           ` Lennart Borgman (gmail)
