unofficial mirror of emacs-devel@gnu.org 
* 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
@ 2008-02-15  9:06 Sébastien Vauban
  2008-02-15 22:32 ` Edward O'Connor
  0 siblings, 1 reply; 38+ messages in thread
From: Sébastien Vauban @ 2008-02-15  9:06 UTC (permalink / raw)
  To: emacs-pretest-bug; +Cc: alexandre, monnier


Hi,

Please describe exactly what actions triggered the bug
and the precise symptoms of the bug:

When I'm opening an XML file that I've saved in iso-latin-1
encoding, it's not correctly opened if there is no header,
cookie or local variable telling about my encoding.

This is what it looks like when opened, since it is read by default
as utf-8 (while actually being iso-latin-1):

--8<---------------cut here---------------start------------->8---
    <enumType name="test">
      <pair value="11">D\373 \340 l'entreprise</pair>
      <pair value="12">Stagiaire occup\351</pair>
      <pair value="36">Autres</pair>
    </enumType>
--8<---------------cut here---------------end--------------->8---

    PS- I've manually edited the accented characters, rewriting û
        (and the others) with \373 so that you can see what I
        see.

The bug is: the default for XML files shouldn't be `utf-8' but
`undecided' (i.e. the usual default, which tries to auto-detect
the encoding).
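
For illustration, the mis-decoding can be reproduced without visiting a
file.  This is only a minimal sketch, assuming an Emacs version in which
undecodable bytes are kept as raw bytes in the `eight-bit' charset; the
sample bytes are the start of the first <pair> line above:

```elisp
;; "Dû à" as saved in iso-latin-1: \373 is û, \340 is à (unibyte string).
(let ((bytes "D\373 \340"))
  (list
   ;; Read as utf-8, neither byte starts a valid sequence, so both are
   ;; kept as raw bytes; `eight-bit' then shows up among the charsets.
   (find-charset-string (decode-coding-string bytes 'utf-8))
   ;; Read as iso-latin-1, the same bytes decode to the intended text.
   (decode-coding-string bytes 'iso-latin-1)))
```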


If Emacs crashed, and you have the Emacs process in the gdb debugger,
please include the output from the following gdb commands:
    `bt full' and `xbacktrace'.
If you would like to further debug the crash, please read the file
/usr/share/emacs/23.0.60/etc/DEBUG for instructions.


In GNU Emacs 23.0.60.1 (i486-pc-linux-gnu, GTK+ Version 2.12.0)
 of 2007-12-30 on iridium
 (emacs-snapshot package, version 1:20071229-1~gutsy)
Windowing system distributor `The X.Org Foundation', version 11.0.10300000
configured using `configure  '--build' 'i486-linux-gnu' '--host' 'i486-linux-gnu' '--prefix=/usr' '--sharedstatedir=/var/lib' '--libexecdir=/usr/lib' '--localstatedir=/var' '--infodir=/usr/share/info' '--mandir=/usr/share/man' '--with-pop=yes' '--enable-locallisppath=/etc/emacs-snapshot:/etc/emacs:/usr/local/share/emacs/23.0.60/site-lisp:/usr/local/share/emacs/site-lisp:/usr/share/emacs/23.0.60/site-lisp:/usr/share/emacs/site-lisp:/usr/share/emacs/23.0.60/leim' '--with-x=yes' '--with-x-toolkit=gtk' '--enable-font-backend' '--with-xft' '--with-freetype' 'build_alias=i486-linux-gnu' 'host_alias=i486-linux-gnu' 'CFLAGS=-DDEBIAN -DSITELOAD_PURESIZE_EXTRA=5000 -g -O2''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default-enable-multibyte-characters: t

Major mode: Article

Minor modes in effect:
  show-paren-mode: t
  recentf-mode: t
  auto-image-file-mode: t
  shell-dirtrack-mode: t
  global-hi-lock-mode: t
  icomplete-mode: t
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-compression-mode: t
  temp-buffer-resize-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t
  abbrev-mode: t

Recent input:
<left> <left> <left> <left> <left> <left> <left> <left> 
<left> <left> <left> <left> <left> <left> <left> <left> 
<left> <left> <left> <left> <left> <left> <left> <left> 
<right> SPC R E G <backspace> T <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <left> <left> 
<left> <left> <left> <left> <left> <left> <left> <right> 
` <right> <right> <right> <right> <right> <right> <right> 
<right> <right> ' <down> <down> <left> <left> <left> 
<left> <left> <left> <left> <left> <left> <left> <left> 
<left> <left> <right> i s o - <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <up> <up> 
<up> <up> <up> <up> <up> <up> <up> <up> <up> <up> <up> 
<up> <up> <up> <up> <up> <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <C-home> 
C-c C-c . h e <tab> <return> <f6> <down> <down> <return> 
<f6> <f6> <f3> m t o f i <backspace> <backspace> <backspace> 
<backspace> o t i f . * x m l <backspace> <backspace> 
<backspace> <backspace> <backspace> d e l a i t r a 
i t _ e t . x m l <down> <down> <down> <down> <down> 
<return> <up> <up> <up> <up> <up> <up> <up> <up> <up> 
<up> <up> <up> <up> <up> <up> <up> <up> <up> <up> <up> 
<up> <up> <up> <up> C-SPC <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <down> 
<down> <down> M-w <up> <up> <up> <up> <up> <up> <up> 
<up> <up> <up> <up> <up> <up> <up> <up> <up> <up> <up> 
<up> <up> <f6> <down> C-SPC <down> <down> <down> <down> 
<down> <down> <down> <down> <down> <down> <down> <down> 
M-w M-x <up> <up> <return>

Recent messages:
Mark set
10.10.10.51 
Domain name is `missioncriticalit.com', setting `smtpmail-smtp-server' to `mail.missioncriticalit.com'
Sending...
Sending news via ^$\|\(^comp\.emacs$\)\|\(^comp\.emacs\.xemacs$\)\|\(^fr\.comp\.applications\.emacs$\)\|\(^gnu\.emacs\.help$\) using nnvirtual...
nnimap: Updating info for nnimap:INBOX.help...done
Sending...done
Mark set
Saved text from "    <enumType name="motifDelaiTrait_et">"
Mark set

-- 
Sébastien Vauban




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15  9:06 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8) Sébastien Vauban
@ 2008-02-15 22:32 ` Edward O'Connor
  2008-02-15 22:54   ` Jason Rumney
  2008-02-16  4:03   ` Stefan Monnier
  0 siblings, 2 replies; 38+ messages in thread
From: Edward O'Connor @ 2008-02-15 22:32 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-pretest-bug

Sébastien wrote:

> When I'm opening an XML file that I've saved in iso-latin-1
> encoding, it's not correctly opened if there is no header,
> cookie or local variable telling about my encoding.
>
> This is what it looks like when opened, since it is read by default
> as utf-8 (while actually being iso-latin-1):
>     <enumType name="test">
>       <pair value="11">D\373 \340 l'entreprise</pair>
>       <pair value="12">Stagiaire occup\351</pair>
>       <pair value="36">Autres</pair>
>     </enumType>

My understanding is that the XML specification requires XML documents
lacking an explicit <?xml encoding='foo'?> instruction to be UTF-8 or
UTF-16, so Emacs does the right thing already.


Ted






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15 22:32 ` Edward O'Connor
@ 2008-02-15 22:54   ` Jason Rumney
  2008-02-15 23:24     ` Miles Bader
  2008-02-16  4:03   ` Stefan Monnier
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Rumney @ 2008-02-15 22:54 UTC (permalink / raw)
  To: Edward O'Connor; +Cc: emacs-pretest-bug, emacs-devel

Edward O'Connor wrote:
> My understanding is that the XML specification requires XML documents
> lacking an explicit <?xml encoding='foo'?> instruction to be UTF-8 or
> UTF-16, so Emacs does the right thing already.
>   

Emacs goes beyond doing the right thing at the moment. The right thing 
would be to guide users into using utf-8 by making that the default 
encoding for *new* XML files, and perhaps warning if an existing file 
was detected as non-utf-8 without a charset declaration in the header. 
Forcing users into using utf-8 by ignoring explicit requests to save the 
file as latin-1 and by opening latin-1 encoded files as utf-8 even when 
the decoding fails is not the right behaviour. Our users are not slaves 
to specifications.






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15 22:54   ` Jason Rumney
@ 2008-02-15 23:24     ` Miles Bader
  2008-02-15 23:34       ` Jason Rumney
  2008-02-18  2:49       ` Jason Rumney
  0 siblings, 2 replies; 38+ messages in thread
From: Miles Bader @ 2008-02-15 23:24 UTC (permalink / raw)
  To: Jason Rumney; +Cc: emacs-pretest-bug, Edward O'Connor, emacs-devel

Jason Rumney <jasonr@gnu.org> writes:
> Emacs goes beyond doing the right thing at the moment. The right thing
> would be to guide users into using utf-8 by making that the default
> encoding for *new* XML files, and perhaps warning if an existing file
> was detected as non-utf-8 without a charset declaration in the
> header. Forcing users into using utf-8 by ignoring explicit requests to
> save the file as latin-1 and by opening latin-1 encoded files as utf-8
> even when the decoding fails is not the right behaviour. Our users are
> not slaves to specifications.

Perhaps the "best" thing would be to temporarily do a
(prefer-coding-system 'utf-8) when reading xml files without an encoding
header, instead of _forcing_ the coding system to be utf-8.

However it doesn't look like the current Emacs mechanism for
format-specific coding systems (`auto-coding-functions') explicitly
supports that functionality.  Maybe sgml-xml-auto-coding-function could use
whatever lower-level function does coding-system-detection using only
the characters in the buffer (I can't seem to find it, but there must be
such a thing...).
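
The detection primitives in question may well be `detect-coding-region'
and `detect-coding-string'.  A rough sketch of the "prefer utf-8, but
don't force it" idea, applied to a unibyte string of file bytes, could
look like this.  `my-xml-guess-coding-system' is a made-up name, and the
`eight-bit' check assumes an Emacs where undecodable bytes end up in that
charset:

```elisp
(defun my-xml-guess-coding-system (bytes)
  "Return `utf-8' if the unibyte string BYTES decodes cleanly as UTF-8.
Otherwise return the most preferred coding system found by the
normal detection mechanism."
  (if (memq 'eight-bit
            (find-charset-string (decode-coding-string bytes 'utf-8)))
      ;; Some bytes were not valid UTF-8: fall back to auto-detection.
      (detect-coding-string bytes t)
    ;; Everything decoded cleanly, so utf-8 is a safe choice.
    'utf-8))
```

What the fallback branch returns depends on the user's coding-system
priorities, so no particular result is claimed for the Latin-1 case.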

-Miles

-- 
Scriptures, n. The sacred books of our holy religion, as distinguished from
the false and profane writings on which all other faiths are based.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15 23:24     ` Miles Bader
@ 2008-02-15 23:34       ` Jason Rumney
  2008-02-15 23:42         ` Miles Bader
  2008-02-18  2:49       ` Jason Rumney
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Rumney @ 2008-02-15 23:34 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, Edward O'Connor, emacs-devel

Miles Bader wrote:
> However it doesn't look like the current Emacs mechanism for
> format-specific coding systems (`auto-coding-functions') explicitly
> supports that functionality.  Maybe sgml-xml-auto-coding-function could use
> whatever lower-level function does coding-system-detection using only
> the characters in the buffer (I can't seem to find it, but there must be
> such a thing...).
>   

file-coding-system-alist would be the place where I'd expect to see it 
used, but there utf-8 is hard-coded for xml files.






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15 23:34       ` Jason Rumney
@ 2008-02-15 23:42         ` Miles Bader
  2008-02-16  3:42           ` Miles Bader
  0 siblings, 1 reply; 38+ messages in thread
From: Miles Bader @ 2008-02-15 23:42 UTC (permalink / raw)
  To: Jason Rumney; +Cc: emacs-pretest-bug, Edward O'Connor, emacs-devel

Jason Rumney <jasonr@gnu.org> writes:
>> However it doesn't look like the current Emacs mechanism for
>> format-specific coding systems (`auto-coding-functions') explicitly
>> supports that functionality.  Maybe sgml-xml-auto-coding-function could use
>> whatever lower-level function does coding-system-detection using only
>> the characters in the buffer (I can't seem to find it, but there must be
>> such a thing...).
>
> file-coding-system-alist would be the place where I'd expect to see it
> used, but there utf-8 is hard-coded for xml files.

In any case, it would seem nice to support "prefer this (or these)
coding system(s)" in addition to "always use this coding system".

Currently the doc for file-coding-system-alist says: 

   ...
   If VAL is a cons of coding systems, the car part is used for decoding,
   and the cdr part is used for encoding.
   ...

Maybe this could be extended to something like:

   ...
   If VAL is a cons of coding systems, the car part is used for decoding,
   and the cdr part is used for encoding.  The cdr may be a symbol, in
   which case it is always used, or a list of coding systems, which are
   added to the front of the list used by the normal coding-system
   detection mechanism, as if by `prefer-coding-system'.
   ...

?

-Miles

-- 
(\(\
(^.^)
(")")
*This is the cute bunny virus, please copy this into your sig so it can spread.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15 23:42         ` Miles Bader
@ 2008-02-16  3:42           ` Miles Bader
  0 siblings, 0 replies; 38+ messages in thread
From: Miles Bader @ 2008-02-16  3:42 UTC (permalink / raw)
  To: Jason Rumney; +Cc: emacs-pretest-bug, Edward O'Connor, emacs-devel

Miles Bader <miles@gnu.org> writes:
> Maybe this could be extended to something like:
>
>    ...
>    If VAL is a cons of coding systems, the car part is used for decoding,
>    and the cdr part is used for encoding.  The cdr may be a symbol, in
>    which case it is always used, or a list of coding systems, which are
>    added to the front of the list used by the normal coding-system
>    detection mechanism, as if by `prefer-coding-system'.
>    ...

Oh, whoops, I confused the car and the cdr; I meant:

   If VAL is a cons of coding systems, the car part is used for decoding,
   and the cdr part is used for encoding.  The car may be a symbol, in
   which case it is always used, or a list of coding systems, which are
   added to the front of the list used by the normal coding-system
   detection mechanism, as if by `prefer-coding-system'.

That would mean you could use a value like:

   ((utf-8) . utf-8)

to indicate "do normal coding detection on read, but utf-8 is
preferred [over the user's normal coding list]."
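
A sketch of how such a value might be interpreted.  `my-decoding-policy'
is a hypothetical helper written only to illustrate the proposal; it is
not an existing Emacs function:

```elisp
(defun my-decoding-policy (val)
  "Classify the decoding half of VAL under the proposed extension.
VAL is a cons (DECODING . ENCODING) as in `file-coding-system-alist'."
  (let ((decoding (car val)))
    (if (consp decoding)
        ;; A list: run normal detection, trying these coding systems first.
        (cons 'prefer decoding)
      ;; A symbol: always decode with exactly this coding system.
      (list 'force decoding))))

(my-decoding-policy '((utf-8) . utf-8))   ; => (prefer utf-8)
(my-decoding-policy '(utf-8 . utf-8))     ; => (force utf-8)
```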

-Miles

-- 
Ocean, n. A body of water covering seven-tenths of a world designed for Man -
who has no gills.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15 22:32 ` Edward O'Connor
  2008-02-15 22:54   ` Jason Rumney
@ 2008-02-16  4:03   ` Stefan Monnier
  2008-02-16  7:17     ` Stephen J. Turnbull
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Monnier @ 2008-02-16  4:03 UTC (permalink / raw)
  To: Edward O'Connor; +Cc: emacs-pretest-bug, emacs-devel

>> When I'm opening an XML file that I've saved in iso-latin-1
>> encoding, it's not correctly opened if there is no header,
>> cookie or local variable telling about my encoding.
>> 
>> This is what it looks like when opened, since it is read by default
>> as utf-8 (while actually being iso-latin-1):
>> <enumType name="test">
>> <pair value="11">D\373 \340 l'entreprise</pair>
>> <pair value="12">Stagiaire occup\351</pair>
>> <pair value="36">Autres</pair>
>> </enumType>

> My understanding is that the XML specification requires XML documents
> lacking an explicit <?xml encoding='foo'?> instruction to be UTF-8 or
> UTF-16, so Emacs does the right thing already.

The OP's file is *not* an XML document.  The XML document is generated
from various parts, including this file.  This is very similar to the
LaTeX \usepackage[XXX]{inputenc} which is only placed in the root file,
but not in the files included from that one.


        Stefan





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16  4:03   ` Stefan Monnier
@ 2008-02-16  7:17     ` Stephen J. Turnbull
  2008-02-16  9:58       ` Jason Rumney
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-16  7:17 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-pretest-bug, Edward O'Connor, emacs-devel

Stefan Monnier writes:

 > The OP's file is *not* an XML document.  The XML document is generated
 > from various parts, including this file.  This is very similar to the
 > LaTeX \usepackage[XXX]{inputenc} which is only placed in the root file,
 > but not in the files included from that one.

Man, don't go there.  XML documents *and fragments* should be presumed
Unicode until the user explicitly says otherwise.  Make it easy for the
user to do that, make it as easy as you like (eg, with a variable
containing the regexp ".*\.xml$"), but don't try to guess until the
user asks you to, please.
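
One way a user could say otherwise today, sketched under the assumption
that the XML default comes from an entry in `file-coding-system-alist'
(the regexp below is only an example):

```elisp
;; Decode files named *.xml with the normal auto-detection instead of a
;; fixed coding system.  This updates an existing entry that has exactly
;; the same regexp, and adds a new entry otherwise.
(modify-coding-system-alist 'file "\\.xml\\'" 'undecided)
```

Other mechanisms, such as `auto-coding-functions', may still apply to
the file, so this alone is not guaranteed to change what happens.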






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16  7:17     ` Stephen J. Turnbull
@ 2008-02-16  9:58       ` Jason Rumney
  2008-02-16 11:23         ` Stephen J. Turnbull
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Rumney @ 2008-02-16  9:58 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Stefan Monnier,
	Edward O'Connor

Stephen J. Turnbull wrote:
> Stefan Monnier writes:
>
> Man, don't go there.  XML documents *and fragments* should be presumed
> Unicode until the user explicitly says otherwise.  Make it easy for the
> user to do that, make it as easy as you like (eg, with a variable
> containing the regexp ".*\.xml$"), but don't try to guess until the
> user asks you to, please.
>   

I think you're misunderstanding. Currently we use utf-8 in the absence of a
coding tag, even if it causes a decoding error. And when the user 
explicitly sets the file-coding-system to latin-1, we ignore it and save 
as utf-8. That is the behaviour we want to change, not the behaviour of 
defaulting to UTF-8.







* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16  9:58       ` Jason Rumney
@ 2008-02-16 11:23         ` Stephen J. Turnbull
  2008-02-16 12:07           ` Lennart Borgman (gmail)
  2008-02-16 17:03           ` Jason Rumney
  0 siblings, 2 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-16 11:23 UTC (permalink / raw)
  To: Jason Rumney
  Cc: Edward O'Connor, emacs-pretest-bug, Stefan Monnier,
	emacs-devel

Jason Rumney writes:
 > Stephen J. Turnbull wrote:
 > > Stefan Monnier writes:

 > > > [Typically this user is dealing with a fragment of a larger
 > > > document, not the whole document.]

 > > Man, don't go there.  XML documents *and fragments* should be presumed
 > > Unicode until the user explicitly says otherwise.

 > I think you're misunderstanding.

No, I'm not.  Look at the subject.

 > Currently we use utf-8 in the absence of a coding tag, even if it
 > causes a decoding error.

Signaling an error in this situation is appropriate.  Guessing what is
meant (eg, by falling back to undecided) is not.  It's quite possible
that the user is unintentionally in a Latin-1 environment and would
thank you if you reminded them that they should save in UTF-8.

 > And when the user explicitly sets the file-coding-system to
 > latin-1, we ignore it and save as utf-8.

PSGML and nXML were both written by James Clark; if they try to
enforce Unicode, I'd suggest that maybe somebody who knows more about
XML than all of us put together made that decision.

Of course in the end the users have reasons whereof OASIS does not
know, so that there must be escapes for users with legacy documents,
or with legacy document standards.  But users should be strongly
encouraged to use XML mechanisms, *not* those of Mule, to cope.  Mule
is designed to cope with environments where there are few rules and
those are poorly understood by programmers and users alike.  XML is
about having good rules, well understood by programmers (especially of
the UI) so that users don't have to.

PSGML and AUCTeX, at least, provide methods by which a master document
can be associated with a document fragment which provides various
kinds of context for the fragment -- it's not rocket science.  I would
imagine that nXML does too.  So, for example, upon detecting a coding
conflict, Emacs could offer to (1) insert an appropriate processing
instruction, or (2) associate the current fragment with an existing
master document via file locals, or (3) associate the fragment with a
dummy master document that lives entirely in Customize.  Those
documents could provide other context too, such as importing DTDs and
entities.








* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16 11:23         ` Stephen J. Turnbull
@ 2008-02-16 12:07           ` Lennart Borgman (gmail)
  2008-02-17  3:52             ` Stephen J. Turnbull
  2008-02-16 17:03           ` Jason Rumney
  1 sibling, 1 reply; 38+ messages in thread
From: Lennart Borgman (gmail) @ 2008-02-16 12:07 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Edward O'Connor,
	Stefan Monnier, Jason Rumney

Stephen J. Turnbull wrote:
> PSGML and AUCTeX, at least, provide methods by which a master document
> can be associated with a document fragment which provides various
> kinds of context for the fragment -- it's not rocket science.  I would
> imagine that nXML does too.  So, for example, upon detecting a coding
> conflict, Emacs could offer to (1) insert an appropriate processing
> instruction, or (2) associate the current fragment with an existing
> master document via file locals, or (3) associate the fragment with a
> dummy master document that lives entirely in Customize.  Those
> documents could provide other context too, such as importing DTDs and
> entities.

Could you please tell more about how to find a master document in PSGML?





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16 11:23         ` Stephen J. Turnbull
  2008-02-16 12:07           ` Lennart Borgman (gmail)
@ 2008-02-16 17:03           ` Jason Rumney
  2008-02-16 17:31             ` David Kastrup
  2008-02-17  3:53             ` Stephen J. Turnbull
  1 sibling, 2 replies; 38+ messages in thread
From: Jason Rumney @ 2008-02-16 17:03 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Edward O'Connor,
	Stefan Monnier

Stephen J. Turnbull wrote:

> Jason Rumney writes:
>
>  > I think you're misunderstanding.
>
> No, I'm not.  Look at the subject.
>   

If the subject is all you have read of this thread, then you are 
definitely misunderstanding.

>  > And when the user explicitly sets the file-coding-system to
>  > latin-1, we ignore it and save as utf-8.
>
> PSGML and nXML were both written by James Clark;

This is nothing to do with psgml (which is not part of Emacs), or nxml. 
AFAIK psgml does nothing about encoding, as sgml files are not 
necessarily in any particular encoding. Nxml had some code which as far 
as I understand is similar to what I was proposing as a solution here. I 
removed that code shortly after it was merged into Emacs since Emacs 
already had code which seemed to do the same thing for XML files, 
regardless of mode. It is only now that I found out that the code in 
Emacs is overly simplistic, forcing UTF-8 even against users' wishes.







* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16 17:03           ` Jason Rumney
@ 2008-02-16 17:31             ` David Kastrup
  2008-02-17  3:53             ` Stephen J. Turnbull
  1 sibling, 0 replies; 38+ messages in thread
From: David Kastrup @ 2008-02-16 17:31 UTC (permalink / raw)
  To: Jason Rumney
  Cc: emacs-pretest-bug, Stephen J. Turnbull, Edward O'Connor,
	Stefan Monnier, emacs-devel

Jason Rumney <jasonr@gnu.org> writes:

> This is nothing to do with psgml (which is not part of Emacs), or
> nxml. AFAIK psgml does nothing about encoding, as sgml files are not
> necessarily in any particular encoding. Nxml had some code which as
> far as I understand is similar to what I was proposing as a solution
> here. I removed that code shortly after it was merged into Emacs since
> Emacs already had code which seemed to do the same thing for XML
> files, regardless of mode. It is only now that I found out that the
> code in Emacs is overly simplistic, forcing UTF-8 even against users'
> wishes.

Unless the user explicitly requests some encoding, the default should be
(irrespective of language environment) utf-8.  If the user explicitly
requests some encoding, that's the encoding for saving.  In either case,
if the resulting encoding does not agree with a prospective XML coding
cookie (the absence of which indicates utf-8), Emacs should offer to add
or change the coding cookie.

Something like that.
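
A sketch of the check such an offer would need.  Both function names
are hypothetical, and the regexp is a simplification that ignores, for
example, whitespace around the `=' in the declaration:

```elisp
(defun my-xml-declared-encoding ()
  "Return the encoding named in the buffer's XML declaration as a symbol.
Return nil if the buffer has no declaration or it names no encoding."
  (save-excursion
    (goto-char (point-min))
    (when (looking-at "<\\?xml[^>]*encoding=[\"']\\([^\"']+\\)[\"']")
      (intern (downcase (match-string 1))))))

(defun my-xml-coding-mismatch-p (coding)
  "Return non-nil if CODING disagrees with the buffer's XML declaration.
A missing declaration is taken to mean utf-8."
  (let ((declared (or (my-xml-declared-encoding) 'utf-8)))
    ;; An unknown declared name counts as a mismatch; otherwise compare
    ;; the base coding systems, ignoring aliases and end-of-line variants.
    (not (and (coding-system-p declared)
              (eq (coding-system-base declared)
                  (coding-system-base coding))))))
```

When checking before a save, `buffer-file-coding-system' would be the
natural argument.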

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16 12:07           ` Lennart Borgman (gmail)
@ 2008-02-17  3:52             ` Stephen J. Turnbull
  2008-02-17 14:31               ` Lennart Borgman (gmail)
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-17  3:52 UTC (permalink / raw)
  To: Lennart Borgman (gmail)
  Cc: emacs-pretest-bug, Jason Rumney, Edward O'Connor,
	Stefan Monnier, emacs-devel

Lennart Borgman (gmail) writes:

 > Could you please tell more about how to find a master document in PSGML?

Here are the local variables from our top page's source (we use genpage,
which is getting kind of old and creaky but still does the job).  The
interesting variable is sgml-parent-document.

  <!-- Keep this comment at the end of the file
  Local variables:
  mode: xml
  sgml-omittag:nil
  sgml-shorttag:nil
  sgml-namecase-general:nil
  sgml-general-insert-case:lower
  sgml-minimize-attributes:nil
  sgml-always-quote-attributes:t
  sgml-indent-step:2
  sgml-indent-data:t
  sgml-parent-document:("template.html" "html" "body" "table" "tr" "td")
  sgml-exposed-tags:nil
  sgml-local-catalogs:nil
  sgml-local-ecat-files:nil
  End:
  -->





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-16 17:03           ` Jason Rumney
  2008-02-16 17:31             ` David Kastrup
@ 2008-02-17  3:53             ` Stephen J. Turnbull
  2008-02-18  3:22               ` Miles Bader
  1 sibling, 1 reply; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-17  3:53 UTC (permalink / raw)
  To: Jason Rumney
  Cc: emacs-pretest-bug, Edward O'Connor, Stefan Monnier,
	emacs-devel

Jason Rumney writes:
 > Stephen J. Turnbull wrote:
 > 
 > > Jason Rumney writes:
 > >
 > >  > I think you're misunderstanding.
 > >
 > > No, I'm not.  Look at the subject.

 > If the subject is all you have read of this thread,

Are you calling me out?  Good bye.  Good luck.






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-17  3:52             ` Stephen J. Turnbull
@ 2008-02-17 14:31               ` Lennart Borgman (gmail)
  2008-02-17 22:24                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 38+ messages in thread
From: Lennart Borgman (gmail) @ 2008-02-17 14:31 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, Jason Rumney, Edward O'Connor,
	Stefan Monnier, emacs-devel

Stephen J. Turnbull wrote:
> Lennart Borgman (gmail) writes:
> 
>  > Could you please tell more about how to find a master document in PSGML?
> 
> Here are the local variables from our top page's source (we use genpage,
> which is getting kind of old and creaky but still does the job).  The
> interesting variable is sgml-parent-document.
> 
>   <!-- Keep this comment at the end of the file
>   Local variables:
>   mode: xml
>   sgml-omittag:nil
>   sgml-shorttag:nil
>   sgml-namecase-general:nil
>   sgml-general-insert-case:lower
>   sgml-minimize-attributes:nil
>   sgml-always-quote-attributes:t
>   sgml-indent-step:2
>   sgml-indent-data:t
>   sgml-parent-document:("template.html" "html" "body" "table" "tr" "td")
>   sgml-exposed-tags:nil
>   sgml-local-catalogs:nil
>   sgml-local-ecat-files:nil
>   End:
>   -->

Thanks, but where are these variables defined?





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-17 14:31               ` Lennart Borgman (gmail)
@ 2008-02-17 22:24                 ` Stephen J. Turnbull
  2008-02-17 22:27                   ` Miles Bader
  2008-02-18 16:35                   ` 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8) Lennart Borgman (gmail)
  0 siblings, 2 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-17 22:24 UTC (permalink / raw)
  To: Lennart Borgman (gmail)
  Cc: emacs-pretest-bug, emacs-devel, Edward O'Connor,
	Stefan Monnier

Lennart Borgman (gmail) writes:
 > Stephen J. Turnbull wrote:
 > > Lennart Borgman (gmail) writes:
 > > 
 > >  > Could you please tell more about how to find a master document in PSGML?
 > > 
 > > Here are the local variables from our top page's source [...].  The
 > > interesting variable is sgml-parent-document.

 > Thanks, but where are these variables defined?

Here's the docstring for sgml-parent-document.  I suppose the rest are
similarly from psgml.el or related libraries.

`sgml-parent-document' is a variable declared in Lisp.
  -- loaded from "psgml"

Value: nil

Setting it would make its value buffer-local.

Documentation:
*How to handle the current file as part of a bigger document.

The variable describes how the current file's content fit into the element
hierarchy.  The value should have the form

  (PARENT-FILE CONTEXT-ELEMENT* TOP-ELEMENT (HAS-SEEN-ELEMENT*)?)

PARENT-FILE	is a string, the name of the file containing the
		document entity.
CONTEXT-ELEMENT is a string, that is the name of an element type.
		It can occur 0 or more times and is used to set up
		exceptions and short reference map.  Good candidates
		for these elements are the elements open when the
		entity pointing to the current file is used.
TOP-ELEMENT	is a string that is the name of the element type
		of the top level element in the current file.  The file
		should contain one instance of this element, unless
		the last (Lisp) element of `sgml-parent-document' is a
		list.  If it is a list, the top level of the file
		should follow the content model of top-element.
HAS-SEEN-ELEMENT is a string that is the name of an element type.  This
	        element is satisfied in the content model of top-element.

Setting this variable automatically makes it local to the current buffer.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-17 22:24                 ` Stephen J. Turnbull
@ 2008-02-17 22:27                   ` Miles Bader
  2008-02-18  0:07                     ` Stephen J. Turnbull
  2008-02-18 16:35                   ` 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8) Lennart Borgman (gmail)
  1 sibling, 1 reply; 38+ messages in thread
From: Miles Bader @ 2008-02-17 22:27 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	Stefan Monnier, emacs-devel

"Stephen J. Turnbull" <turnbull@sk.tsukuba.ac.jp> writes:
>  > Thanks, but where are these variables defined?
>
> Here's the docstring for sgml-parent-document.  I suppose the rest are
> similarly from psgml.el or related libraries.

You know, it's nice if people use self-describing documents, but Emacs
is an _editor_, not an xml validation system.  It should handle things
gracefully, as best it can, even if the user is being a bit lazy.

-Miles

-- 
Brain, n. An apparatus with which we think we think.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-17 22:27                   ` Miles Bader
@ 2008-02-18  0:07                     ` Stephen J. Turnbull
  2008-02-18  3:16                       ` Miles Bader
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-18  0:07 UTC (permalink / raw)
  To: Miles Bader
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	Stefan Monnier, emacs-devel


 > You know, it's nice if people use self-describing documents, but Emacs
 > is an _editor_, not an xml validation system.

Emacs is an editor, so what?  That's hardly an exclusive description
of Emacs!  If PSGML is loaded it is also an SGML validation system;
I'm not sure whether the state of the art includes XML validation, but
there's no reason it shouldn't, and every reason it should.

 > It should handle things gracefully, as best it can, even if the
 > user is being a bit lazy.

What's ungraceful about C-x C-s responding with

    Warning:  This XML document does not seem to conform to XML
    charset declaration rules.  Would you like to
    (1) add an XML processor instruction (coding cookie)
    (2) link to a parent document which specifies the encoding
    (3) create a dummy parent document (available only to your sessions)
    (4) save as is in the [buffer-file-coding-system] encoding?

Doesn't it already do similar things if b-f-c-s is iso-8859-1 and you
try to save Japanese?
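
As a rough illustration of that save-time check (a Python sketch, not
Emacs code; `encodable_in` and the candidate list are names made up for
this example), one can ask which encodings are able to represent the
buffer text at all.  Emacs makes a comparable decision through
`select-safe-coding-system' before writing a file.

```python
def encodable_in(text, candidates):
    """Return the candidate encodings that can represent every character of text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character has no representation in this encoding.
            continue
        usable.append(name)
    return usable


# Hypothetical candidate list, chosen only for the demonstration.
candidates = ["ascii", "iso-8859-1", "utf-8"]
french = encodable_in("Stagiaire occupé", candidates)
japanese = encodable_in("日本語", candidates)
```

Here `french` keeps both iso-8859-1 and utf-8, while `japanese` keeps
only utf-8; an editor that finds the current encoding missing from the
list has a reason to ask the user before saving.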

Nor is it an issue of lazy users.  Users *should* be lazy.  *We*
should do as much work as possible for them, and make standards
conformance the path of least resistance.  But what I think is
happening here is that some *developers* simply dislike standards
(standards "enslave" you, dontcha know?), and others are lazy.






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-15 23:24     ` Miles Bader
  2008-02-15 23:34       ` Jason Rumney
@ 2008-02-18  2:49       ` Jason Rumney
  2008-02-18  3:01         ` Jason Rumney
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Rumney @ 2008-02-18  2:49 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, Edward O'Connor, emacs-devel

Miles Bader wrote:
> Perhaps the "best" thing would be to temporarily do a
> (prefer-coding-system 'utf-8) when reading xml files without an encoding
> header, instead of _forcing_ the coding system to be utf-8.
>   
I made this change in sgml-xml-auto-coding-function, and also introduced 
a new function to do the same for .xml files without an xml declaration, 
which were hardcoded to utf-8 by file-coding-system-alist.

I also disabled the write-hook in nxml-mode that was forcing utf-8 on 
write, since its most important functionality of honoring changes to the 
encoding attribute in the xml declaration is already duplicated 
elsewhere in Emacs (a write equivalent of sgml-xml-auto-coding-function?).
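
For readers unfamiliar with the encoding attribute mentioned here, a
minimal Python sketch of reading it follows.  This is not the code in
sgml-xml-auto-coding-function; `declared_encoding` is a hypothetical
helper, and it assumes an ASCII-compatible encoding (so not UTF-16)
with the declaration at the very start of the data.

```python
import re

# Encoding pseudo-attribute of an XML or text declaration at the start of
# the data.  The version part is optional, as in a text declaration.
_XML_DECL = re.compile(
    rb"""\A<\?xml"""
    rb"""(?:\s+version\s*=\s*["'][^"']*["'])?"""
    rb"""\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(data):
    """Return the declared encoding name, or None if nothing is declared."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


with_decl = declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<a/>')
without_decl = declared_encoding(b'<enumType name="test"/>')
```

A file like the one in the original report yields None, which is the
case where the editor has to fall back on a default or on detection.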









* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-18  2:49       ` Jason Rumney
@ 2008-02-18  3:01         ` Jason Rumney
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Rumney @ 2008-02-18  3:01 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-pretest-bug, Edward O'Connor, emacs-devel

Jason Rumney wrote:
> I made this change in sgml-xml-auto-coding-function, and also 
> introduced a new function to do the same for .xml files without an xml 
> declaration, which were hardcoded to utf-8 by file-coding-system-alist.
I forgot to say, I also added a warning when the encoding is in 
violation of the spec.






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-18  0:07                     ` Stephen J. Turnbull
@ 2008-02-18  3:16                       ` Miles Bader
  2008-02-18  6:26                         ` Stephen J. Turnbull
  0 siblings, 1 reply; 38+ messages in thread
From: Miles Bader @ 2008-02-18  3:16 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Lennart Borgman (gmail),
	Edward O'Connor, Stefan Monnier

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
> What's ungraceful about C-x C-s responding with
>     Warning:  This XML document does not seem to conform to XML
>     charset declaration rules.  Would you like to
>     (1) add an XML processor instruction (coding cookie)
>     (2) link to a parent document which specifies the encoding
>     (3) create a dummy parent document (available only to your sessions)
>     (4) save as is in the [buffer-file-coding-system] encoding?

The discussion is about what to do when visiting a file, not saving it.

[Whether the above sort of "extreme nannying" is a good idea when
saving probably depends on who you ask; many would appreciate the
help, but knowing Emacs users, it seems a good bet some would chafe
at the presumption.]

For visiting, I rather like what I suggested earlier, changing Emacs'
format-specific-coding mechanisms to support format-specific coding
_preferences_ as well as "absolute codings".

-Miles

-- 
Guilt, n. The condition of one who is known to have committed an indiscretion,
as distinguished from the state of him who has covered his tracks.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-17  3:53             ` Stephen J. Turnbull
@ 2008-02-18  3:22               ` Miles Bader
  2008-02-18  6:01                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 38+ messages in thread
From: Miles Bader @ 2008-02-18  3:22 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Edward O'Connor,
	Stefan Monnier, Jason Rumney

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
> Are you calling me out?  Good bye.  Good luck.

What, you cannot be wrong?!

-miles

-- 
Bacchus, n. A convenient deity invented by the ancients as an excuse for
getting drunk.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-18  3:22               ` Miles Bader
@ 2008-02-18  6:01                 ` Stephen J. Turnbull
  0 siblings, 0 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-18  6:01 UTC (permalink / raw)
  To: Miles Bader
  Cc: emacs-pretest-bug, Jason Rumney, Edward O'Connor,
	Stefan Monnier, emacs-devel

Miles Bader writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
 > > Are you calling me out?  Good bye.  Good luck.
 > 
 > What, you cannot be wrong?!

Not in this.  Specifically:

Mr. Rumney accused me of not understanding what was being discussed
for no reason I can see, and then of not reading the thread merely
because I cited the subject of the thread.  This after likening
standard conformance to slavery in a reply to Mr. O'Connor.  I took
exception to his rude behavior, I do not intend to converse with
Mr. Rumney on this subject further, and I wish you all luck if that is
acceptable behavior here.

All matters of opinion or personal choice, and those are mine.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-18  3:16                       ` Miles Bader
@ 2008-02-18  6:26                         ` Stephen J. Turnbull
  2008-02-18  6:40                           ` Miles Bader
  2008-02-18 14:59                           ` Projects and multi-file documents (was: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)) Stefan Monnier
  0 siblings, 2 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-18  6:26 UTC (permalink / raw)
  To: Miles Bader
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	Stefan Monnier, emacs-devel

Miles Bader writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
 > > What's ungraceful about C-x C-s responding with
 > >     Warning:  This XML document does not seem to conform to XML
 > >     charset declaration rules.  Would you like to
 > >     (1) add an XML processor instruction (coding cookie)
 > >     (2) link to a parent document which specifies the encoding
 > >     (3) create a dummy parent document (available only to your sessions)
 > >     (4) save as is in the [buffer-file-coding-system] encoding?
 > 
 > The discussion is about what to do when visiting a file, not saving
 > it.

Do the same thing at visit time by default.  It's not like the
implementation would differ, it's just it would be a post-visit hook
instead of a pre-save hook.

 > [Whether the above sort of "extreme nannying"

"Extreme nannying"?  It's just an extension of the idea of providing a
document skeleton for a single file to multi-file documents.

 > knowing Emacs users, it seems a good bet some would chafe
 > at the presumption.]

Let them chafe, and provide some lotion in the form of configurability
for those who know what they're doing and why.  But here, we're
talking about defaults.

 > For visiting, I rather like what I suggested earlier, changing Emacs'
 > format-specific-coding mechanisms to support format-specific coding
 > _preferences_ as well as "absolute codings".

IMO, users who know enough about coding systems and the various
formats to use such a facility would be just as happy to use
`file-coding-system-alist'.  Learning about it will be a burden for
the "naive" users who want things to "just work".  And maintaining a
database of defaults will be a burden on maintainers disproportionate
to the benefit.

Also, I appealed to PSGML and AUCTeX for a reason.  There are many
cases now where multi-file documents are the norm.  I think that not
only DTP files, but also multi-file source programs could benefit from
these techniques, over and above what we get from tag tables already.

As users get used to thinking of it as a general facility rather than
a property of just one mode, they'll generate neat ideas about what to
do with it.





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-18  6:26                         ` Stephen J. Turnbull
@ 2008-02-18  6:40                           ` Miles Bader
  2008-02-19  7:17                             ` Stephen J. Turnbull
  2008-02-18 14:59                           ` Projects and multi-file documents (was: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)) Stefan Monnier
  1 sibling, 1 reply; 38+ messages in thread
From: Miles Bader @ 2008-02-18  6:40 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Lennart Borgman (gmail),
	Edward O'Connor, Stefan Monnier

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
>  > The discussion is about what to do when visiting a file, not saving
>  > it.
>
> Do the same thing at visit time by default.  It's not like the
> implementation would differ, it's just it would be a post-visit hook
> instead of a pre-save hook.

That isn't going to fly.  What users are willing to put up with when
saving a file is very different from what they're willing to put up with
when visiting one.  [It might be useful as a configurable _option_, of
course, for those people who need it.]

>  > For visiting, I rather like what I suggested earlier, changing Emacs'
>  > format-specific-coding mechanisms to support format-specific coding
>  > _preferences_ as well as "absolute codings".
>
> IMO, users who know enough about coding systems and the various
> formats to use such a facility would be just as happy to use
> `file-coding-system-alist'.  Learning about it will be a burden for
> the "naive" users who want things to "just work".  And maintaining a
> database of defaults will be a burden on maintainers disproportionate
> to the benefit.

I think maybe you misunderstood my proposal.

Emacs currently, for xml and related files, has various mechanisms that
_force_ the encoding to be utf-8 when reading (visiting) the file.  [If
the file has no coding: tag.]

I'm suggesting these mechanisms have the ability to be "softer", so that
instead of forcing the encoding to utf-8, utf-8 is merely pushed to the
front of the coding priority list for coding recognition, as if by the
`prefer-coding-system' function (while reading that file), and Emacs'
normal automagic recognition allowed to do its stuff.  This "preference"
is _not_ set by the user, but rather by the same emacs magic that
currently sets an "absolute" utf-8 coding [on non-tagged files].

That would mean that "proper" xml files would continue to work just as
now, using utf-8 (with no danger of being screwed up due to the user's
language environment), but that "improper" xml files would stand a much
better chance of showing up in the user's buffer as something readable
rather than gibberish.  [This doesn't guarantee that other apps will do
the right thing of course -- though in the case being discussed, the
they would have -- but while in Emacs, things tend to make a lot more sense
if the file was read as something reasonable.]
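
The difference between forcing and preferring can be sketched in a few
lines of Python.  This is only an analogy for the proposal, not how
Emacs' detection works internally; `decode_forced`, `decode_preferred`
and the two-entry priority tuple are assumptions made for the example.

```python
def decode_forced(data, encoding="utf-8"):
    """Force one encoding; undecodable bytes survive only as escapes."""
    return data.decode(encoding, errors="backslashreplace")


def decode_preferred(data, priority=("utf-8", "iso-8859-1")):
    """Try each encoding in priority order and report which one fit."""
    for encoding in priority:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no encoding in the priority list fits the data")


latin1_bytes = "Stagiaire occupé".encode("iso-8859-1")
utf8_bytes = "Stagiaire occupé".encode("utf-8")

forced = decode_forced(latin1_bytes)        # escape sequence instead of the accent
preferred = decode_preferred(latin1_bytes)  # falls back to iso-8859-1
proper = decode_preferred(utf8_bytes)       # still read as utf-8
```

With forcing, the Latin-1 file shows up with an escaped byte much like
the \351 in the original report; with a preference, the UTF-8 file is
still read as UTF-8 and the Latin-1 file comes out readable.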

-Miles

-- 
Consult, v.i. To seek another's disapproval of a course already decided on.





* Projects and multi-file documents (was: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8))
  2008-02-18  6:26                         ` Stephen J. Turnbull
  2008-02-18  6:40                           ` Miles Bader
@ 2008-02-18 14:59                           ` Stefan Monnier
  2008-02-18 18:51                             ` Projects and multi-file documents Ralf Angeli
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Monnier @ 2008-02-18 14:59 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Lennart Borgman (gmail),
	Edward O'Connor, Miles Bader

> Also, I appealed to PSGML and AUCTeX for a reason.  There are many
> cases now where multi-file documents are the norm.  I think that not
> only DTP files, but also multi-file source programs could benefit from
> these techniques, over and above what we get from tag tables already.

Yes, Emacs needs to be improved w.r.t its support for multi-file
documents, and projects in general.


        Stefan





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-17 22:24                 ` Stephen J. Turnbull
  2008-02-17 22:27                   ` Miles Bader
@ 2008-02-18 16:35                   ` Lennart Borgman (gmail)
  1 sibling, 0 replies; 38+ messages in thread
From: Lennart Borgman (gmail) @ 2008-02-18 16:35 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

Stephen J. Turnbull wrote:
> Here's the docstring for sgml-parent-document.  I suppose the rest are
> similarly from psgml.el or related libraries.
> 
> `sgml-parent-document' is a variable declared in Lisp.
>   -- loaded from "psgml"

I see, thanks. I thought it was something in sgml-mode.el.





* Re: Projects and multi-file documents
  2008-02-18 14:59                           ` Projects and multi-file documents (was: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)) Stefan Monnier
@ 2008-02-18 18:51                             ` Ralf Angeli
  0 siblings, 0 replies; 38+ messages in thread
From: Ralf Angeli @ 2008-02-18 18:51 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	emacs-devel, Stephen J. Turnbull, Miles Bader

* Stefan Monnier (2008-02-18) writes:

>> Also, I appealed to PSGML and AUCTeX for a reason.  There are many
>> cases now where multi-file documents are the norm.  I think that not
>> only DTP files, but also multi-file source programs could benefit from
>> these techniques, over and above what we get from tag tables already.
>
> Yes, Emacs needs to be improved w.r.t its support for multi-file
> documents, and projects in general.

AUCTeX is likely not a role model for multi-file handling.  The use of
file variables for the master file makes it only easy to search upwards
but not downwards.  RefTeX might be more interesting to look at.  Here
the result of parsing a multi-file LaTeX document is stored in a
separate file.

-- 
Ralf





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-18  6:40                           ` Miles Bader
@ 2008-02-19  7:17                             ` Stephen J. Turnbull
  2008-02-19  7:19                               ` Miles Bader
  2008-02-19 15:50                               ` Stefan Monnier
  0 siblings, 2 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-19  7:17 UTC (permalink / raw)
  To: Miles Bader
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	Stefan Monnier, emacs-devel

Miles Bader writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > Do the same thing at visit time by default.  It's not like the
 > > implementation would differ, it's just it would be a post-visit hook
 > > instead of a pre-save hook.
 > 
 > That isn't going to fly.  What users are willing to put up with when
 > saving a file is very different from what they're willing to put up
 > with when visiting one.

Oh, I think it will indeed fly.  First of all, there will be an Emacsy
"the-user-is-always-right flag"; we're discussing the default here,
which IMO should lean heavily to standard conformance and protecting
the user from automatic decisions they may not understand.

 > I think maybe you misunderstood my proposal.

I missed the detail that you planned to hamstring
`prefer-coding-system', yes.  That's really minor though, in view of
the fundamental disagreement.

My position is that XML has a perfectly acceptable in-band way to
announce encodings.  Contrary to what I understood Stefan to be
saying, it is per-file and required by the standard.[1]  This gives
strong reason to believe that most users will be happy to add text
declarations, especially in free software where they'll be using
high-quality XML implementations.

Furthermore, my position is that in the event that the user chooses
not to use an XML text declaration to declare the encoding, use of
Mule detection mechanisms (including coding: cookies) is just asking
for trouble, because they impose risks both of giving Unicode to users
who want a legacy encoding and of giving a legacy encoding to users
who want Unicode.  The fact that your proposal produces buffers that
*look* like text even though they *are* gibberish according to the
standard (or according to some nonconforming application!) is in no
way a point in its favor!

Of course, Mule's well-tuned detection facilities should be used to
*advise* the user about what encoding is in the buffer, and therefore
what to put in the text declaration or some out of band means of
declaring the encoding.  But in the absence of explicit declaration
(including setting the Emacs-I-dont-need-none-o-yer-XML-lip flag to t)
by the user, the user should be asked to confirm the encoding.

Footnotes: 
[1]  http://www.w3.org/TR/REC-xml/#sec-TextDecl for the definition;
loc. cit. #charencoding for the "MUST".






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-19  7:17                             ` Stephen J. Turnbull
@ 2008-02-19  7:19                               ` Miles Bader
  2008-02-19 21:03                                 ` Stephen J. Turnbull
  2008-02-19 15:50                               ` Stefan Monnier
  1 sibling, 1 reply; 38+ messages in thread
From: Miles Bader @ 2008-02-19  7:19 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Lennart Borgman (gmail),
	Edward O'Connor, Stefan Monnier

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
>  > I think maybe you misunderstood my proposal.
>
> I missed the detail that you planned to hamstring
> `prefer-coding-system', yes.  That's really minor though, in view of
> the fundamental disagreement.

"Hamstring `prefer-coding-system'"?

I proposed _adding_ some functionality that _uses_ prefer-coding-system
(or more likely, uses whatever underlying mechanism prefer-coding-system
uses).  No functionality would be removed.  How on earth is that
"hamstringing"?

Yeesh.

-Miles

-- 
Any man who is a triangle, has thee right, when in Cartesian Space, to
have angles, which when summed, come to know more, nor no less, than
nine score degrees, should he so wish.  [TEMPLE OV THEE LEMUR]





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-19  7:17                             ` Stephen J. Turnbull
  2008-02-19  7:19                               ` Miles Bader
@ 2008-02-19 15:50                               ` Stefan Monnier
  2008-02-19 22:02                                 ` Stephen J. Turnbull
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Monnier @ 2008-02-19 15:50 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Lennart Borgman (gmail),
	Edward O'Connor, Miles Bader

> My position is that XML has a perfectly acceptable in-band way to
> announce encodings.  Contrary to what I understood Stefan to be
> saying, it is per-file and required by the standard.[1]  This gives
> strong reason to believe that most users will be happy to add text
> declarations, especially in free software where they'll be using
> high-quality XML implementations.

My understanding of the OP's situation is that his files are not XML
files, but plaintext files that happen to contain XML fragments.
I'd expect that those fragments will be put together via
text-concatenation to generate a complete XML document.

I don't know much about XML: does XML allow the in-band encoding
declaration to appear sprinkled multiple times inside the document?


        Stefan





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-19  7:19                               ` Miles Bader
@ 2008-02-19 21:03                                 ` Stephen J. Turnbull
  2008-02-19 22:47                                   ` Jason Rumney
  2008-02-19 22:58                                   ` Miles Bader
  0 siblings, 2 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-19 21:03 UTC (permalink / raw)
  To: Miles Bader
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	Stefan Monnier, emacs-devel

Miles Bader writes:

 > I proposed _adding_ some functionality that _uses_ prefer-coding-system
 > (or more likely, uses whatever underlying mechanism prefer-coding-system
 > uses).  No functionality would be removed.

Not removed, disabled (in some cases).  Specifically, if the *user* or
some application programmer uses `prefer-coding-system' with a
non-Unicode (non-UTF-8?) argument, he won't get the result he expects
for some XML files.  (This is true of my proposal as well, but I'm
proposing that XML encoding be explicitly decoupled from Mule
guesswork, so it doesn't bother me.)

In case you've forgotten, this is precisely the kind of behavior that
distresses the OP.






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-19 15:50                               ` Stefan Monnier
@ 2008-02-19 22:02                                 ` Stephen J. Turnbull
  0 siblings, 0 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-19 22:02 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: emacs-pretest-bug, Miles Bader, Lennart Borgman (gmail),
	Edward O'Connor, emacs-devel

Stefan Monnier writes:

 > My understanding of the OP's situation is that his files are not XML
 > files, but plaintext files that happen to contain XML fragments.

Interpreting the XML 1.0 standard, if those XML fragments are intended
to be parsed by the XML processor as part of the document, they are
(conceptually) "external entities".  How that affects XML processing
will depend on exactly what you mean by "text-concatenation".

ISTM there are two possibilities.  First, use the XML facilities (ie,
an entity reference).  That looks like this (there's also a "PUBLIC"
entity version):

<!ENTITY open-hatch
         SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">

Blah blah blah
&open-hatch;
foo bar baz.

Entity reference has the advantage of using XML catalogs and the like
to find the entity (similar to the way C's #include allows cpp to use
an include path).  The XML specification requires entities to declare
their own encoding using a text declaration, unless it is UTF-8 or can
be detected using the Byte Order Mark.  IMO this is the obvious way to
do things if your XML processor supports external entity reference.

Second, use some kind of preprocessor for concatenation, such as cat
or cpp.  In this case, a text declaration can't be used because it
must appear as the first thing in the entity, but the XML processor will
see only a single entity, the whole document.  In that case the XML
specification says nothing about the fragments.

However, because the XML specification mandates a fatal error[1] when
a processor detects any encoding inconsistency or ambiguity, to users
the risks of guessing about fragment encodings are potentially high
(at least in annoyance).  So I advocate using a multientity framework
(for this purpose among others) where some sort of master document is
available to check consistency, rather than Mule guesswork on a
file-by-file basis.
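
A consistency check of that kind might look roughly like the Python
sketch below.  The function name, the fragment file names and the idea
of passing the master document's encoding as a plain string are all
assumptions for illustration; nothing here is taken from PSGML or Emacs.

```python
def inconsistent_fragments(fragments, master_encoding):
    """Return the names of fragments whose bytes are not valid in master_encoding."""
    bad = []
    for name, data in fragments.items():
        try:
            data.decode(master_encoding)
        except UnicodeDecodeError:
            bad.append(name)
    return sorted(bad)


# Hypothetical fragments that a preprocessor would concatenate.
fragments = {
    "enum-a.xml": '<pair value="36">Autres</pair>'.encode("utf-8"),
    "enum-b.xml": '<pair value="12">Stagiaire occupé</pair>'.encode("iso-8859-1"),
}
problems = inconsistent_fragments(fragments, "utf-8")
```

Note the limit of this check: it is informative when the master
encoding is UTF-8, but a single-byte encoding such as iso-8859-1
accepts any byte sequence, so a mismatch in that direction goes
unnoticed.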

 > I don't know much about XML:

The XML specification is rather short (especially compared to the
SGML specification), yet self-contained.


Footnotes: 
[1]  Not necessarily termination of the process, but normal processing
must terminate, and the XML processor permanently enters an error mode.
Very annoying at best.






* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-19 21:03                                 ` Stephen J. Turnbull
@ 2008-02-19 22:47                                   ` Jason Rumney
  2008-02-19 22:58                                   ` Miles Bader
  1 sibling, 0 replies; 38+ messages in thread
From: Jason Rumney @ 2008-02-19 22:47 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	emacs-devel, Stefan Monnier, Miles Bader

Stephen J. Turnbull wrote:
> Not removed, disabled (in some cases).  Specifically, if the *user* or
> some application programmer uses `prefer-coding-system' with a
> non-Unicode (non-UTF-8?) argument, he won't get the result he expects
> for some XML files.  (This is true of my proposal as well, but I'm
> proposing that XML encoding be explicitly decoupled from Mule
> guesswork, so it doesn't bother me.)
>
> In case you've forgotten, this is precisely the kind of behavior that
> distresses the OP.
>   


That's the behaviour that the OP thinks distresses him. But other 
encodings are highly unlikely to be mistaken for UTF-8, so in practice, 
pushing UTF-8 to the front of the prefer-coding-system queue is unlikely 
to distress him.

What is really distressing the OP is that UTF-8 was previously forced, 
which caused his file to load with binary non-characters in place of his 
latin-1 characters, and if he doesn't notice it and edits the file, the 
only coding system he can save as is "raw-text" (and I'm not sure 
whether the result will be recoverable once he does that).

Compounding that, nxml-mode was ignoring the request to save as 
raw-text and forcing utf-8 again, which fails, so the changes cannot be 
saved.







* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-19 21:03                                 ` Stephen J. Turnbull
  2008-02-19 22:47                                   ` Jason Rumney
@ 2008-02-19 22:58                                   ` Miles Bader
  2008-02-20  0:43                                     ` Stephen J. Turnbull
  1 sibling, 1 reply; 38+ messages in thread
From: Miles Bader @ 2008-02-19 22:58 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, emacs-devel, Lennart Borgman (gmail),
	Edward O'Connor, Stefan Monnier

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
> > I proposed _adding_ some functionality that _uses_
> > prefer-coding-system (or more likely, uses whatever underlying
> > mechanism prefer-coding-system uses).  No functionality would be
> > removed.
>
> Not removed, disabled (in some cases).  Specifically, if the *user* or
> some application programmer uses `prefer-coding-system' with a
> non-Unicode (non-UTF-8?) argument, he won't get the result he expects
> for some XML files.

That is already the case.  My suggestion would not make it worse.  If
anything, improve the situation by allowing prefer-coding-system to work
more often.  In any case, no hamstringing.

> In case you've forgotten, this is precisely the kind of behavior that
> distresses the OP.

And my suggestion would likely _improve_ the situation from the point of view
of the OP.  Maybe not entirely to his satisfaction -- there are probably
cases where latin1 can be mistaken for utf8, and to fix that would
likely require more explicit action on his part -- but more
automatically.

You're probably right that to make things really work the way the OP
wants, some mechanism to help him set up the desired file associations
would be more reliable.

However I do not think it's a good idea to make simple _visiting_ of a
file modify that file.  That would be _really_ annoying.  If it's deemed
desirable to use some sort of user query to set up encoding info in a
funny case like this, the information should be kept in memory (of
course it could be made permanent if the user chose to save that file
for other reasons).

-Miles

-- 
The automobile has not merely taken over the street, it has dissolved the
living tissue of the city.  Its appetite for space is absolutely insatiable;
moving and parked, it devours urban land, leaving the buildings as mere islands
of habitable space in a sea of dangerous and ugly traffic.
[James Marston Fitch, New York Times, 1 May 1960]





* Re: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)
  2008-02-19 22:58                                   ` Miles Bader
@ 2008-02-20  0:43                                     ` Stephen J. Turnbull
  0 siblings, 0 replies; 38+ messages in thread
From: Stephen J. Turnbull @ 2008-02-20  0:43 UTC (permalink / raw)
  To: Miles Bader
  Cc: emacs-pretest-bug, Lennart Borgman (gmail), Edward O'Connor,
	Stefan Monnier, emacs-devel

Miles Bader writes:

 > of the OP.  Maybe not entirely to his satisfaction -- there are probably
 > cases where latin1 can be mistaken for utf8,

Like all cases where the file only contains ASCII, and the user wishes
to add Latin 1 content.

 > However I do not think it's a good idea to make simple _visiting_ of a
 > file modify that file.

Why do you think that I suggested anything of the kind?





end of thread, other threads:[~2008-02-20  0:43 UTC | newest]

Thread overview: 38+ messages
2008-02-15  9:06 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8) Sébastien Vauban
2008-02-15 22:32 ` Edward O'Connor
2008-02-15 22:54   ` Jason Rumney
2008-02-15 23:24     ` Miles Bader
2008-02-15 23:34       ` Jason Rumney
2008-02-15 23:42         ` Miles Bader
2008-02-16  3:42           ` Miles Bader
2008-02-18  2:49       ` Jason Rumney
2008-02-18  3:01         ` Jason Rumney
2008-02-16  4:03   ` Stefan Monnier
2008-02-16  7:17     ` Stephen J. Turnbull
2008-02-16  9:58       ` Jason Rumney
2008-02-16 11:23         ` Stephen J. Turnbull
2008-02-16 12:07           ` Lennart Borgman (gmail)
2008-02-17  3:52             ` Stephen J. Turnbull
2008-02-17 14:31               ` Lennart Borgman (gmail)
2008-02-17 22:24                 ` Stephen J. Turnbull
2008-02-17 22:27                   ` Miles Bader
2008-02-18  0:07                     ` Stephen J. Turnbull
2008-02-18  3:16                       ` Miles Bader
2008-02-18  6:26                         ` Stephen J. Turnbull
2008-02-18  6:40                           ` Miles Bader
2008-02-19  7:17                             ` Stephen J. Turnbull
2008-02-19  7:19                               ` Miles Bader
2008-02-19 21:03                                 ` Stephen J. Turnbull
2008-02-19 22:47                                   ` Jason Rumney
2008-02-19 22:58                                   ` Miles Bader
2008-02-20  0:43                                     ` Stephen J. Turnbull
2008-02-19 15:50                               ` Stefan Monnier
2008-02-19 22:02                                 ` Stephen J. Turnbull
2008-02-18 14:59                           ` Projects and multi-file documents (was: 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8)) Stefan Monnier
2008-02-18 18:51                             ` Projects and multi-file documents Ralf Angeli
2008-02-18 16:35                   ` 23.0.60; Defaut encoding for XML files should be undefined (instead of utf-8) Lennart Borgman (gmail)
2008-02-16 17:03           ` Jason Rumney
2008-02-16 17:31             ` David Kastrup
2008-02-17  3:53             ` Stephen J. Turnbull
2008-02-18  3:22               ` Miles Bader
2008-02-18  6:01                 ` Stephen J. Turnbull
