utf-16le vs utf-16-le

unofficial mirror of emacs-devel@gnu.org 

* utf-16le vs utf-16-le
@ 2008-04-13 14:54 Eli Zaretskii
  2008-04-13 19:32 ` Stefan Monnier
                   ` (3 more replies)
  0 siblings, 4 replies; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-13 14:54 UTC (permalink / raw)
  To: emacs-devel

These two encodings have confusingly similar names, but significantly
different semantics: one expects a BOM, the other does not.  (I'll bet
a sixpack of beer that most of you will not know which one is which.)
A similar problem exists with the -be variant of UTF-16.

The fact that we have utf-16le-with-signature, but don't have the
corresponding -without-signature, also doesn't help.

I tripped over these when I tried to read debugging logs saved by
MS-Windows, which are in UTF-16 without a BOM: I used utf-16-le, which
swallowed the first character.  When I realized it was due to a BOM,
I had to read the doc strings of each encoding to find out
what I did wrong.

Can we please come up with some more self-explanatory names, and lose
the confusing le vs -le thing?  Please?
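
A minimal sketch of the symptom described above, written in Python
rather than Emacs Lisp. The function names are hypothetical and are not
Emacs internals; the first one only imitates a "with signature" decoder
that drops the first 16-bit unit without looking at it. Note that
Python's own codec name "utf-16-le" means "no BOM", which is the
opposite of the old Emacs alias discussed in this thread.

```python
def decode_skipping_first_unit(data: bytes) -> str:
    """Hypothetical decoder that assumes a BOM and drops the first
    16-bit unit without checking what it is."""
    return data[2:].decode("utf-16-le")


def decode_without_signature(data: bytes) -> str:
    """Decoder that treats every 16-bit unit as text."""
    return data.decode("utf-16-le")


# A log line encoded as UTF-16 little-endian with no BOM.
log = "Error 5".encode("utf-16-le")

print(decode_skipping_first_unit(log))  # the leading "E" is lost
print(decode_without_signature(log))    # the text survives intact
```

Picking the wrong one of two near-identical names silently selects the
first behaviour, which is why the naming matters.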

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-13 14:54 utf-16le vs utf-16-le Eli Zaretskii
@ 2008-04-13 19:32 ` Stefan Monnier
  2008-04-14  5:17   ` Kenichi Handa
  2008-04-13 22:23 ` Stephen J. Turnbull
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 56+ messages in thread
From: Stefan Monnier @ 2008-04-13 19:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> Can we please come up with some more self-explanatory names, and lose
> the confusing le vs -le thing?  Please?

That would be nice, indeed.  Also, the encodings that use a BOM should not
just ignore the first char, but should only do so if the first char is
indeed a BOM.
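
The behaviour asked for here could look roughly like the following
Python sketch. The function name is an assumption made for
illustration; it is not how the Emacs decoder is written.

```python
import codecs


def decode_with_optional_signature(data: bytes) -> str:
    """Strip the leading 16-bit unit only when it really is a
    little-endian BOM; otherwise keep it as ordinary text."""
    bom = codecs.BOM_UTF16_LE  # b"\xff\xfe"
    if data.startswith(bom):
        data = data[len(bom):]
    return data.decode("utf-16-le")


with_bom = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
without_bom = "abc".encode("utf-16-le")

print(decode_with_optional_signature(with_bom))
print(decode_with_optional_signature(without_bom))
```

Both inputs decode to the same text, so a missing signature no longer
costs the first character.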


        Stefan





* utf-16le vs utf-16-le
  2008-04-13 14:54 utf-16le vs utf-16-le Eli Zaretskii
  2008-04-13 19:32 ` Stefan Monnier
@ 2008-04-13 22:23 ` Stephen J. Turnbull
  2008-04-14  3:19   ` Eli Zaretskii
  2008-04-14  5:17 ` Kenichi Handa
  2008-04-14  7:02 ` tomas
  3 siblings, 1 reply; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-13 22:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > These two encodings have confusingly similar names, but significantly
 > different semantics: one expects a BOM, the other does not.

 > I tripped over these when I tried to read debugging logs saved by
 > MS-Windows, which are in UTF-16 without a BOM: I used utf-16-le, which
 > swallowed the first character.  When I realized it was due to a BOM,
 > I had to read the doc strings of each encoding to find out
 > what I did wrong.

Are you saying it was eating non-BOM characters?  But that's clearly a
bug in the codec.  If it's going to expect a BOM, it should error if
it doesn't get one, not eat the character.
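
For comparison, a strict decoder of the kind suggested here, one that
signals an error instead of eating a character, might be sketched in
Python as follows. The exception class and function name are
hypothetical.

```python
import codecs


class MissingSignatureError(ValueError):
    """Raised when a required BOM is absent (hypothetical name)."""


def decode_requiring_signature(data: bytes) -> str:
    """Refuse input that does not start with a little-endian BOM."""
    bom = codecs.BOM_UTF16_LE
    if not data.startswith(bom):
        raise MissingSignatureError("input does not start with a BOM")
    return data[len(bom):].decode("utf-16-le")


try:
    decode_requiring_signature("log".encode("utf-16-le"))
except MissingSignatureError as err:
    print("rejected:", err)
```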

This business of having presence or absence of signatures determined
by coding systems has always felt wrong to me.  Signatures are
generally related to higher-level protocols (eg, XML mandates them for
UTF-16, while the MS logging facility de facto prohibits them).  So
whether a signature is used or not should be a buffer-local variable,
not a property of the coding system.


* Re: utf-16le vs utf-16-le
  2008-04-13 22:23 ` Stephen J. Turnbull
@ 2008-04-14  3:19   ` Eli Zaretskii
  2008-04-14  7:32     ` Stephen J. Turnbull
  0 siblings, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-14  3:19 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: emacs-devel@gnu.org
> Date: Mon, 14 Apr 2008 07:23:45 +0900
> 
> Are you saying it was eating non-BOM characters?

Yes, definitely.

> But that's clearly a bug in the codec.  If it's going to expect a
> BOM, it should error if it doesn't get one, not eat the character.

Maybe it is (I didn't yet have time to look at the code), but there
could be a good reason for that.  If it's so easy to recognize the
BOM, why do we need versions with and without it?

Anyway, it was the naming issue that I was complaining about, not
the swallowing of a non-BOM character.  A user shouldn't be required
to read a doc string of a coding system each time she uses it; the
name of the coding system should be all the clue she needs to decide
which one is appropriate.


* Re: utf-16le vs utf-16-le
  2008-04-13 14:54 utf-16le vs utf-16-le Eli Zaretskii
  2008-04-13 19:32 ` Stefan Monnier
  2008-04-13 22:23 ` Stephen J. Turnbull
@ 2008-04-14  5:17 ` Kenichi Handa
  2008-04-14 13:57   ` Stefan Monnier
  2008-04-14  7:02 ` tomas
  3 siblings, 1 reply; 56+ messages in thread
From: Kenichi Handa @ 2008-04-14  5:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

In article <E1Jl3bC-0005N2-PJ@fencepost.gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> These two encodings have confusingly similar names, but significantly
> different semantics: one expects a BOM, the other does not.  (I'll bet
> a sixpack of beer that most of you will not know which one is which.)
> A similar problem exists with the -be variant of UTF-16.

The correct names for "without BOM" versions are utf-16le
and utf-16be (RFC2781).

The two coding systems utf-16-le and utf-16-be were
introduced as "with BOM" versions by Dave.  I noticed that
those names are very confusing when I was going to introduce
"without BOM" versions as utf-16be and utf-16le.  But as it
was after the release of some official version of Emacs
(perhaps 21.3), to keep backward compatibility, I couldn't
delete utf-16-be/le.  So, I renamed them to
utf-16be-with-signature and utf-16le-with-signature and made
utf-16-be and utf-16-le just their aliases, hoping that new
people use only these names:
  utf-16 utf-16le utf-16be utf-16le-with-signature utf-16be-with-signature

Stefan, if you think it's ok to break backward compatibility
here, I'll delete the aliases utf-16-be and utf-16-le.

---
Kenichi Handa
handa@ni.aist.go.jp


* Re: utf-16le vs utf-16-le
  2008-04-13 19:32 ` Stefan Monnier
@ 2008-04-14  5:17   ` Kenichi Handa
  2008-04-14  6:10     ` David Kastrup
                       ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Kenichi Handa @ 2008-04-14  5:17 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: eliz, emacs-devel

In article <jwvskxppn5i.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Also, the encodings that use a BOM should not
> just ignore the first char, but should only do so if the first char is
> indeed a BOM.

I'll fix that soon.

---
Kenichi Handa
handa@ni.aist.go.jp





* Re: utf-16le vs utf-16-le
  2008-04-14  5:17   ` Kenichi Handa
@ 2008-04-14  6:10     ` David Kastrup
  2008-04-14 18:54       ` Eli Zaretskii
  2008-04-14 17:38     ` Eli Zaretskii
  2008-04-14 18:57     ` Eli Zaretskii
  2 siblings, 1 reply; 56+ messages in thread
From: David Kastrup @ 2008-04-14  6:10 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: eliz, Stefan Monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In article <jwvskxppn5i.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>> Also, the encodings that use a BOM should not
>> just ignore the first char, but should only do so if the first char is
>> indeed a BOM.
>
> I'll fix that soon.

How does recode-region work with encodings having a BOM?  Probably the
problem is not dissimilar to working with shift encodings.  Still I have
a hard time picturing either.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum





* Re: utf-16le vs utf-16-le
  2008-04-13 14:54 utf-16le vs utf-16-le Eli Zaretskii
                   ` (2 preceding siblings ...)
  2008-04-14  5:17 ` Kenichi Handa
@ 2008-04-14  7:02 ` tomas
  2008-04-14 17:45   ` Eli Zaretskii
  3 siblings, 1 reply; 56+ messages in thread
From: tomas @ 2008-04-14  7:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Apr 13, 2008 at 10:54:30AM -0400, Eli Zaretskii wrote:
> These two encodings have confusingly similar names, but significantly
> different semantics: one expects a BOM, the other does not.  (I'll bet
> a sixpack of beer that most of you will not know which one is which.)

I'd owe you one by now :-)

[...]

> I tripped over these when I tried to read debugging logs saved by
> MS-Windows, which are in UTF-16 without a BOM: [...]

This is courtesy of the same folks who like to put BOMs in UTF-8. I'm
speechless (again).

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFIAwF2Bcgs9XrR2kYRAjBLAJ9gDshIobO1WEgDK65WdxI+C1stZACfa1tO
6RKbsr8NYQhg59GS3Yhl96c=
=Q1Og
-----END PGP SIGNATURE-----





* Re: utf-16le vs utf-16-le
  2008-04-14  3:19   ` Eli Zaretskii
@ 2008-04-14  7:32     ` Stephen J. Turnbull
  2008-04-14  8:20       ` David Kastrup
  0 siblings, 1 reply; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-14  7:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > Maybe it is (I didn't yet have time to look at the code), but there
 > could be a good reason for that.  If it's so easy to recognize the
 > BOM, why do we need versions with and without it?

I don't know; in fact, I think it's a bad idea.  That's what
the part of my message that you snipped was saying.  But I'll have to
defer to Handa-san on that.






* Re: utf-16le vs utf-16-le
  2008-04-14  7:32     ` Stephen J. Turnbull
@ 2008-04-14  8:20       ` David Kastrup
  2008-04-14 18:25         ` Stephen J. Turnbull
  0 siblings, 1 reply; 56+ messages in thread
From: David Kastrup @ 2008-04-14  8:20 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > Maybe it is (I didn't yet have time to look at the code), but there
>  > could be a good reason for that.  If it's so easy to recognize the
>  > BOM, why do we need versions with and without it?
>
> I don't know; in fact, I think it's a bad idea.  That's what
> the part of my message that you snipped was saying.  But I'll have to
> defer to Handa-san on that.

I think it obvious: if a BOM mark gets detected on read, one wants to
have it removed from the buffer and reinserted on saving the buffer.
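
A rough Python sketch of that round trip: the signature is stripped on
read, remembered in a flag, and written back on save. The `Visited`
class and both function names are hypothetical stand-ins, not Emacs
data structures.

```python
import codecs
from dataclasses import dataclass

BOM = codecs.BOM_UTF16_LE  # b"\xff\xfe"


@dataclass
class Visited:
    """Hypothetical stand-in for a buffer visiting a file."""
    text: str      # decoded contents, without the signature
    had_bom: bool  # remembered so that saving can restore it


def read_file_bytes(data: bytes) -> Visited:
    """Decode file contents, removing a leading BOM if present."""
    had_bom = data.startswith(BOM)
    if had_bom:
        data = data[len(BOM):]
    return Visited(text=data.decode("utf-16-le"), had_bom=had_bom)


def write_file_bytes(buf: Visited) -> bytes:
    """Encode the contents, reinserting the BOM if one was read."""
    body = buf.text.encode("utf-16-le")
    return BOM + body if buf.had_bom else body


original = BOM + "hello".encode("utf-16-le")
assert write_file_bytes(read_file_bytes(original)) == original
```

Whether the flag lives in the coding system or in a per-buffer
variable is exactly what the rest of the thread argues about; the
sketch takes no side on that.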

I am just not sure what the semantics for recoding/encoding/decoding
regions are.  They should not mess with BOM in any case, I would
suppose.  But then reading a file is not equivalent to reading it
literally in unibyte mode and then decoding the buffer-region.

Maybe there never was such an equivalence (can't be for shift codes, can
it?).

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: utf-16le vs utf-16-le
  2008-04-14  5:17 ` Kenichi Handa
@ 2008-04-14 13:57   ` Stefan Monnier
  0 siblings, 0 replies; 56+ messages in thread
From: Stefan Monnier @ 2008-04-14 13:57 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Eli Zaretskii, emacs-devel

>> These two encodings have confusingly similar names, but significantly
>> different semantics: one expects a BOM, the other does not.  (I'll bet
>> a sixpack of beer that most of you will not know which one is which.)
>> A similar problem exists with the -be variant of UTF-16.

> The correct names for "without BOM" versions are utf-16le
> and utf-16be (RFC2781).

> The two coding systems utf-16-le and utf-16-be were
> introduced as "with BOM" versions by Dave.  I noticed that
> those names are very confusing when I was going to introduce
> "without BOM" versions as utf-16be and utf-16le.  But as it
> was after the release of some official version of Emacs
> (perhaps 21.3), to keep backward compatibility, I couldn't
> delete utf-16-be/le.  So, I renamed them to
> utf-16be-with-signature and utf-16le-with-signature and made
> utf-16-be and utf-16-le just their aliases, hoping that new
> people use only these names:
>   utf-16 utf-16le utf-16be utf-16le-with-signature utf-16be-with-signature

That makes sense.

> Stefan, if you think it's ok to break backward compatibility
> here, I'll delete the aliases utf-16-be and utf-16-le.

Can you please check Emacs's own code as well as try and see if other
packages might rely on them?  I expect that most external packages would
be OK since they'd either not care about it or else they'd probably
already have to handle the case where utf-16-be is absent (for
compatibility with Emacs-21.1 and/or XEmacs).


        Stefan





* Re: utf-16le vs utf-16-le
  2008-04-14  5:17   ` Kenichi Handa
  2008-04-14  6:10     ` David Kastrup
@ 2008-04-14 17:38     ` Eli Zaretskii
  2008-04-14 18:57     ` Eli Zaretskii
  2 siblings, 0 replies; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-14 17:38 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: monnier, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: eliz@gnu.org, emacs-devel@gnu.org
> Date: Mon, 14 Apr 2008 14:17:59 +0900
> 
> In article <jwvskxppn5i.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
> 
> > Also, the encodings that use a BOM should not
> > just ignore the first char, but should only do so if the first char is
> > indeed a BOM.
> 
> I'll fix that soon.

Thank you.





* Re: utf-16le vs utf-16-le
  2008-04-14  7:02 ` tomas
@ 2008-04-14 17:45   ` Eli Zaretskii
  2008-04-15  7:38     ` tomas
  0 siblings, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-14 17:45 UTC (permalink / raw)
  To: tomas; +Cc: emacs-devel

> Date: Mon, 14 Apr 2008 07:02:14 +0000
> Cc: emacs-devel@gnu.org
> From: tomas@tuxteam.de
> 
> > I tripped over these when I tried to read debugging logs saved by
> > MS-Windows, which are in UTF-16 without a BOM: [...]
> 
> This is courtesy of the same folks who like to put BOMs in UTF-8. I'm
> speechless (again).

Actually, I don't necessarily see anything wrong with the lack of BOM
in this case: these are Windows-internal log files, meant to be read
by utilities that know the encoding, not by general-purpose text
editors.  UTF-16 is the native encoding used by Windows low-level APIs
and the kernel for non-ASCII text, so seeing that in a temporary file
shouldn't be a surprise.  And of course, Windows doesn't need a BOM
because it uses only one endianness.

A BOM in UTF-8 is another matter, of course...


* Re: utf-16le vs utf-16-le
  2008-04-14  8:20       ` David Kastrup
@ 2008-04-14 18:25         ` Stephen J. Turnbull
  2008-04-14 18:46           ` Eli Zaretskii
  2008-04-14 20:20           ` Stefan Monnier
  0 siblings, 2 replies; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-14 18:25 UTC (permalink / raw)
  To: David Kastrup; +Cc: Eli Zaretskii, emacs-devel

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > I don't know; in fact, I think [having BOM-specific coding
 > > systems is] a bad idea.  That's what the part of my message that
 > > you snipped was saying.  But I'll have to defer to Handa-san on
 > > that.
 > 
 > I think it obvious: if a BOM mark gets detected on read, one wants
 > to have it removed from the buffer and reinserted on saving the
 > buffer.

I agree, as you state it, it's obvious.  My question is "why does that
need to be part of the coding system?"  At present the UTF-16 and
UTF-32 Unicode coding systems (in the abstract) have *twenty-seven*
variants each (BOM-required, BOM-prohibited, BOM-autodetected X be,
le, system-dependent X CR, LF, CRLF), and UTF-8 needs *nine*.  This is
nuts, from a user-education standpoint.
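
The counts follow from multiplying independent choices: 3 x 3 x 3 = 27
for UTF-16 or UTF-32, and 3 x 3 = 9 for UTF-8, which has no byte-order
axis. A small Python enumeration makes this concrete; the labels are
placeholders chosen for illustration, not actual coding-system names.

```python
from itertools import product

# Placeholder labels for the three independent axes named above.
BOM_POLICY = ("bom-required", "bom-prohibited", "bom-autodetected")
BYTE_ORDER = ("be", "le", "system-dependent")
EOL = ("cr", "lf", "crlf")

# UTF-16 and UTF-32 each vary along all three axes.
utf16_variants = list(product(BOM_POLICY, BYTE_ORDER, EOL))

# UTF-8 has no byte order, so only two axes remain.
utf8_variants = list(product(BOM_POLICY, EOL))

print(len(utf16_variants), len(utf8_variants))
```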

What I proposed was a more generic concept where use of signatures and
the EOL convention would (at least to the user) appear as buffer-local
variables.

 > I am just not sure what the semantics for recoding/encoding/decoding
 > regions are.  They should not mess with BOM in any case, I would
 > suppose.  But then reading a file is not equivalent to reading it
 > literally in unibyte mode and then decoding the buffer-region.

That's correct.  The thing is, processing the BOM is a question of
*initialization* of a stream.
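
Python's incremental "utf-16" decoder happens to behave this way and
can serve as an illustration (it says nothing about how Emacs
implements it): the BOM is consumed once, when the stream is
initialized, and a later U+FEFF is passed through as an ordinary
character. The `decode_stream` helper is a name made up for this
sketch.

```python
import codecs


def decode_stream(chunks):
    """Decode an iterable of byte chunks as one UTF-16 stream.
    The byte order is fixed by the BOM at the very start only."""
    decoder = codecs.getincrementaldecoder("utf-16")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


chunks = [
    b"\xff\xfeh\x00",   # BOM followed by "h"
    b"i\x00\xff\xfe",   # "i" followed by a second U+FEFF
]
text = decode_stream(chunks)
print(repr(text))
```

Decoding each chunk independently with a BOM-aware decoder would
instead treat the second chunk as a fresh stream, which is the
file-versus-region mismatch raised in the quoted text.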

 > Maybe there never was such an equivalence (can't be for shift codes, can
 > it?).

In my view, there cannot be an equivalence.  An Emacs buffer in
unibyte mode is a *different* stream from the file it was read from,
and the decision about BOM processing will have to be made differently
from the way the decision is made at the time of reading from the
file.  You could add yet another option for BOM mode, namely "if this
stream is an Emacs buffer that is visiting a file in unibyte mode, then
do BOM processing on conversion as if you were reading in the file in
multibyte mode."  I don't much like this....


* Re: utf-16le vs utf-16-le
  2008-04-14 18:25         ` Stephen J. Turnbull
@ 2008-04-14 18:46           ` Eli Zaretskii
  2008-04-14 21:01             ` Stephen J. Turnbull
  2008-04-14 20:20           ` Stefan Monnier
  1 sibling, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-14 18:46 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Tue, 15 Apr 2008 03:25:51 +0900
> Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org
> 
> I agree, as you state it, it's obvious.  My question is "why does that
> need to be part of the coding system?"

Well, consistency with other ``add-ons'', such as EOL format, is one
reason.

> At present the UTF-16 and
> UTF-32 Unicode coding systems (in the abstract) have *twenty-seven*
> variants each (BOM-required, BOM-prohibited, BOM-autodetected X be,
> le, system-dependent X CR, LF, CRLF), and UTF-8 needs *nine*.

Which 9 are needed by UTF-8?  I only see 4: the auto-detecting one,
then one each for -unix, -dos, and -mac.  What am I missing?

> What I proposed was a more generic concept where use of signatures and
> the EOL convention would (at least to the user) appear as buffer-local
> variables.

Don't forget that en/decoding is used on strings as well, not only on
buffers.  Buffer-local variables won't cut it, I think.





* Re: utf-16le vs utf-16-le
  2008-04-14  6:10     ` David Kastrup
@ 2008-04-14 18:54       ` Eli Zaretskii
  2008-04-14 19:04         ` David Kastrup
  0 siblings, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-14 18:54 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel, monnier, handa

> From: David Kastrup <dak@gnu.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  eliz@gnu.org,  emacs-devel@gnu.org
> Date: Mon, 14 Apr 2008 08:10:07 +0200
> 
> How does recode-region work with encodings having a BOM?  Probably the
> problem is not dissimilar to working with shift encodings.  Still I have
> a hard time picturing either.

I'm not following: what exactly puzzles you in this, and why is
recode-region an issue?





* Re: utf-16le vs utf-16-le
  2008-04-14  5:17   ` Kenichi Handa
  2008-04-14  6:10     ` David Kastrup
  2008-04-14 17:38     ` Eli Zaretskii
@ 2008-04-14 18:57     ` Eli Zaretskii
  2 siblings, 0 replies; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-14 18:57 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: monnier, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Mon, 14 Apr 2008 14:17:59 +0900
> Cc: eliz@gnu.org, emacs-devel@gnu.org
> 
> In article <jwvskxppn5i.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
> 
> > Also, the encodings that use a BOM should not
> > just ignore the first char, but should only do so if the first char is
> > indeed a BOM.
> 
> I'll fix that soon.

In case it wasn't clear: this problem exists on the release branch
(and in Emacs 22.2).  I didn't try the trunk (didn't have it on the
machine where I found this problem), and from code inspection in
coding.c, it looks like we already do TRT with a BOM that isn't a BOM.
But in Emacs 22.x, the UTF-16 decoder is implemented in CCL, not in C.





* Re: utf-16le vs utf-16-le
  2008-04-14 18:54       ` Eli Zaretskii
@ 2008-04-14 19:04         ` David Kastrup
  0 siblings, 0 replies; 56+ messages in thread
From: David Kastrup @ 2008-04-14 19:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, monnier, handa

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  eliz@gnu.org,  emacs-devel@gnu.org
>> Date: Mon, 14 Apr 2008 08:10:07 +0200
>> 
>> How does recode-region work with encodings having a BOM?  Probably the
>> problem is not dissimilar to working with shift encodings.  Still I have
>> a hard time picturing either.
>
> I'm not following: what exactly puzzles you in this, and why is
> recode-region an issue?

Because it is not clear whether it should add or remove BOM marks (like
visiting a file would).  I've gone into more detail in a different mail
to the list I think, so it might make sense replying there.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum





* Re: utf-16le vs utf-16-le
  2008-04-14 18:25         ` Stephen J. Turnbull
  2008-04-14 18:46           ` Eli Zaretskii
@ 2008-04-14 20:20           ` Stefan Monnier
  2008-04-14 20:58             ` David Kastrup
  2008-04-14 21:35             ` Stephen J. Turnbull
  1 sibling, 2 replies; 56+ messages in thread
From: Stefan Monnier @ 2008-04-14 20:20 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel

>> > I don't know; in fact, I think [having BOM-specific coding
>> > systems is] a bad idea.  That's what the part of my message that
>> > you snipped was saying.  But I'll have to defer to Handa-san on
>> > that.
>> 
>> I think it obvious: if a BOM mark gets detected on read, one wants
>> to have it removed from the buffer and reinserted on saving the
>> buffer.

> I agree, as you state it, it's obvious.  My question is "why does that
> need to be part of the coding system?"  At present the UTF-16 and
> UTF-32 Unicode coding systems (in the abstract) have *twenty-seven*
> variants each (BOM-required, BOM-prohibited, BOM-autodetected X be,
> le, system-dependent X CR, LF, CRLF), and UTF-8 needs *nine*.  This is
> nuts, from a user-education standpoint.

For what it's worth, I do think it would make sense to try and move the
BOM-processing outside of the coding-system proper.  For me a good test
for coding-system-worthiness is "what if I use it for a process rather
than a file".  Based on this test, I'm not sure if BOMs really fit in
(other than for auto-detection and automatically stripping them, maybe).

> What I proposed was a more generic concept where use of signatures and
> the EOL convention would (at least to the user) appear as buffer-local
> variables.

Here, I disagree: EOL processing definitely needs to take place when
talking to subprocesses, so EOL-handling doesn't belong in buffer-local
vars but in the coding-system.


        Stefan





* Re: utf-16le vs utf-16-le
  2008-04-14 20:20           ` Stefan Monnier
@ 2008-04-14 20:58             ` David Kastrup
  2008-04-14 22:19               ` Stefan Monnier
  2008-04-14 21:35             ` Stephen J. Turnbull
  1 sibling, 1 reply; 56+ messages in thread
From: David Kastrup @ 2008-04-14 20:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Stephen J. Turnbull, Eli Zaretskii, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> > I don't know; in fact, I think [having BOM-specific coding
>>> > systems is] a bad idea.  That's what the part of my message that
>>> > you snipped was saying.  But I'll have to defer to Handa-san on
>>> > that.
>>> 
>>> I think it obvious: if a BOM mark gets detected on read, one wants
>>> to have it removed from the buffer and reinserted on saving the
>>> buffer.
>
>> I agree, as you state it, it's obvious.  My question is "why does that
>> need to be part of the coding system?"  At present the UTF-16 and
>> UTF-32 Unicode coding systems (in the abstract) have *twenty-seven*
>> variants each (BOM-required, BOM-prohibited, BOM-autodetected X be,
>> le, system-dependent X CR, LF, CRLF), and UTF-8 needs *nine*.  This is
>> nuts, from a user-education standpoint.
>
> For what it's worth, I do think it would make sense to try and move
> the BOM-processing outside of the coding-system proper.  For me a good
> test for coding-system-worthiness is "what if I use it for a process
> rather than a file".  Based on this test, I'm not sure if BOMs really
> fit in (other than for auto-detection and automatically stripping
> them, maybe).

Hm?  I don't see why starting communication with a BOM or not would
_not_ fit in.

>> What I proposed was a more generic concept where use of signatures
>> and the EOL convention would (at least to the user) appear as
>> buffer-local variables.
>
> Here, I disagree: EOL processing definitely needs to take place when
> talking to subprocesses, so EOL-handling doesn't belong in
> buffer-local vars but in the coding-system.

I don't quite see the difference to BOM processing, even though the BOM
processing has to happen only once at the start.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum





* Re: utf-16le vs utf-16-le
  2008-04-14 18:46           ` Eli Zaretskii
@ 2008-04-14 21:01             ` Stephen J. Turnbull
  2008-04-14 21:15               ` Andreas Schwab
  2008-04-15  3:25               ` Eli Zaretskii
  0 siblings, 2 replies; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-14 21:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > Which 9 are needed by UTF-8?  I only see 4: the auto-detecting one,
 > then one each for -unix, -dos, and -mac.  What am I missing?

BOM-{prohibited,auto,required}.  Just because you don't see a need for
them doesn't mean that there aren't cases where somebody might want to
force or prohibit BOMs in UTF-8 for compatibility with other apps.

 > Don't forget that en/decoding is used on strings as well, not only on
 > buffers.  Buffer-local variables won't cut it, I think.

Strings don't have encoding signatures or newline variants; those
octet sequences if present in a string are merely binary octet
sequences.  They only have special semantics in external
representations.  Where's the problem?


* Re: utf-16le vs utf-16-le
  2008-04-14 21:01             ` Stephen J. Turnbull
@ 2008-04-14 21:15               ` Andreas Schwab
  2008-04-15  0:22                 ` Stephen J. Turnbull
  2008-04-15  3:25               ` Eli Zaretskii
  1 sibling, 1 reply; 56+ messages in thread
From: Andreas Schwab @ 2008-04-14 21:15 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > Don't forget that en/decoding is used on strings as well, not only on
>  > buffers.  Buffer-local variables won't cut it, I think.
>
> Strings don't have encoding signatures or newline variants; those
> octet sequences if present in a string are merely binary octet
> sequences.  They only have special semantics in external
> representations.  Where's the problem?

The whole point of en/decoding is to convert between internal and
external representation, no matter whether operating on a buffer or a
string.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."





* Re: utf-16le vs utf-16-le
  2008-04-14 20:20           ` Stefan Monnier
  2008-04-14 20:58             ` David Kastrup
@ 2008-04-14 21:35             ` Stephen J. Turnbull
  1 sibling, 0 replies; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-14 21:35 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, emacs-devel

Stefan Monnier writes:

 > > What I proposed was a more generic concept where use of signatures and
 > > the EOL convention would (at least to the user) appear as buffer-local
 > > variables.

Note the *at least to the user*.

 > [For EOLs], I disagree: EOL processing definitely needs to take
 > place when talking to subprocesses,

Yes, it has to *take place* when talking to subprocesses, but I don't
see why it should be controlled by proliferation of (user-visible)
coding systems.





* Re: utf-16le vs utf-16-le
  2008-04-14 20:58             ` David Kastrup
@ 2008-04-14 22:19               ` Stefan Monnier
  2008-04-14 22:26                 ` David Kastrup
  0 siblings, 1 reply; 56+ messages in thread
From: Stefan Monnier @ 2008-04-14 22:19 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stephen J. Turnbull, Eli Zaretskii, emacs-devel

>> For what it's worth, I do think it would make sense to try and move
>> the BOM-processing outside of the coding-system proper.  For me a good
>> test for coding-system-worthiness is "what if I use it for a process
>> rather than a file".  Based on this test, I'm not sure if BOMs really
>> fit in (other than for auto-detection and automatically stripping
>> them, maybe).

> Hm?  I don't see why starting communication with a BOM or not would
> _not_ fit in.

I don't think the notion of "start" is quite the same for process data
as for files.

>>> What I proposed was a more generic concept where use of signatures
>>> and the EOL convention would (at least to the user) appear as
>>> buffer-local variables.
>> 
>> Here, I disagree: EOL processing definitely needs to take place when
>> talking to subprocesses, so EOL-handling doesn't belong in
>> buffer-local vars but in the coding-system.

> I don't quite see the difference to BOM processing, even though the BOM
> processing has to happen only once at the start.

You mean, it's almost exactly the same, except it's completely
different?  Then I agree.


        Stefan





* Re: utf-16le vs utf-16-le
  2008-04-14 22:19               ` Stefan Monnier
@ 2008-04-14 22:26                 ` David Kastrup
  2008-04-14 22:33                   ` Stefan Monnier
  0 siblings, 1 reply; 56+ messages in thread
From: David Kastrup @ 2008-04-14 22:26 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Stephen J. Turnbull, Eli Zaretskii, emacs-devel

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:

>>> For what it's worth, I do think it would make sense to try and move
>>> the BOM-processing outside of the coding-system proper.  For me a good
>>> test for coding-system-worthiness is "what if I use it for a process
>>> rather than a file".  Based on this test, I'm not sure if BOMs really
>>> fit in (other than for auto-detection and automatically stripping
>>> them, maybe).
>
>> Hm?  I don't see why starting communication with a BOM or not would
>> _not_ fit in.
>
> I don't think the notion of "start" is quite the same for process data
> as for files.
>
>>>> What I proposed was a more generic concept where use of signatures
>>>> and the EOL convention would (at least to the user) appear as
>>>> buffer-local variables.
>>> 
>>> Here, I disagree: EOL processing definitely needs to take place when
>>> talking to subprocesses, so EOL-handling doesn't belong in
>>> buffer-local vars but in the coding-system.
>
>> I don't quite see the difference to BOM processing, even though the BOM
>> processing has to happen only once at the start.
>
> You mean, it's almost exactly the same, except it's completely
> different?  Then I agree,

"Start/end of line" and "Start of buffer/communication" are not
"completely different".  Likewise, "\\`" and "^" are not "completely
different" regular expressions.  They are different, yes, but less
different than a lot of other things.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-14 22:26                 ` David Kastrup
@ 2008-04-14 22:33                   ` Stefan Monnier
  2008-04-15  5:44                     ` David Kastrup
  0 siblings, 1 reply; 56+ messages in thread
From: Stefan Monnier @ 2008-04-14 22:33 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stephen J. Turnbull, Eli Zaretskii, emacs-devel

>> You mean, it's almost exactly the same, except it's completely
>> different?  Then I agree,

> "Start/end of line" and "Start of buffer/communication" is not
> "completely different".  Likewise, "\\`" and "^" are not "completely
> different" regular expressions.

But by EOL we don't mean "^" or "$", but "\n": this *is* completely
different from "\\`".


        Stefan




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-14 21:15               ` Andreas Schwab
@ 2008-04-15  0:22                 ` Stephen J. Turnbull
  0 siblings, 0 replies; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-15  0:22 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Eli Zaretskii, emacs-devel

Andreas Schwab writes:

 > The whole point of en/decoding is to convert between internal and
 > external representation, no matter whether operating on a buffer or a
 > string.

Actually, that's an implementation detail.  I'm suggesting that the
world might be a better place if we decomposed the common
transformations we make in a different way.





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-14 21:01             ` Stephen J. Turnbull
  2008-04-14 21:15               ` Andreas Schwab
@ 2008-04-15  3:25               ` Eli Zaretskii
  2008-04-15 16:51                 ` Stephen J. Turnbull
  1 sibling, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-15  3:25 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Tue, 15 Apr 2008 06:01:14 +0900
> Cc: emacs-devel@gnu.org
> 
> Eli Zaretskii writes:
> 
>  > Which 9 are needed by UTF-8?  I only see 4: the auto-detecting one,
>  > then one each for -unix, -dos, and -mac.  What am I missing?
> 
> BOM-{prohibited,auto,required}.

But we don't have these in Emacs, do we?

>  > Don't forget that en/decoding is used on strings as well, not only on
>  > buffers.  Buffer-local variables won't cut it, I think.
> 
> Strings don't have encoding signatures or newline variants

??? Of course, they do.  Emacs has no way of knowing what will be done
with the encoded string; in particular, you might well insert it into
a buffer or append it to a file.  Are you saying that Emacs should, at
the time of actual use of the string, re-en/decode it according to
usage?

> those octet sequences if present in a string are merely binary octet
> sequences.  They only have special semantics in external
> representations.  Where's the problem?

A string can be sent to a process, for example, so we must have some
way of generating an external representation for it.




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-14 22:33                   ` Stefan Monnier
@ 2008-04-15  5:44                     ` David Kastrup
  2008-04-15 15:35                       ` Stefan Monnier
  0 siblings, 1 reply; 56+ messages in thread
From: David Kastrup @ 2008-04-15  5:44 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Stephen J. Turnbull, Eli Zaretskii, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> You mean, it's almost exactly the same, except it's completely
>>> different?  Then I agree,
>
>> "Start/end of line" and "Start of buffer/communication" is not
>> "completely different".  Likewise, "\\`" and "^" are not "completely
>> different" regular expressions.
>
> But by EOL we don't mean "^" or "$", but "\n": this *is* completely
> different from "\\`".

I fail to see anything close to a coherent argument here, and it is
probably not relevant to the issue at hand, anyway.  So we might as well
stop.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-14 17:45   ` Eli Zaretskii
@ 2008-04-15  7:38     ` tomas
  2008-04-15 22:30       ` Juri Linkov
  0 siblings, 1 reply; 56+ messages in thread
From: tomas @ 2008-04-15  7:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On Mon, Apr 14, 2008 at 08:45:32PM +0300, Eli Zaretskii wrote:
> > Date: Mon, 14 Apr 2008 07:02:14 +0000
> > Cc: emacs-devel@gnu.org
> > From: tomas@tuxteam.de
> > 
> > > I tripped over these when I tried to read debugging logs saved by
> > > MS-Windows, which are in UTF-16 without a BOM: [...]
> > 
> > This is courtesy of the same folks who like to put BOMs in UTF-8. I'm
> > speechless (again).
> 
> Actually, I don't necessarily see anything wrong with the lack of BOM
> in this case: these are Windows-internal log files, meant to be read
> by utilities who know the encoding, not by general-purpose text
> editors.  UTF-16 is the native encoding used by Windows low-level APIs
> and the kernel for non-ASCII text, so seeing that in a temporary file
> shouldn't be a surprise.  And of course, Windows doesn't need a BOM
> because it uses only one endianness.

Absolutely. I do agree on all this -- it was the stark contrast between
sometimes not using a BOM in UTF-16 and using a BOM in UTF-8 that caused
ummm... some emotions ;-)

> A BOM in UTF-8 is another matter, of course...

Both things taken together make the work of art.

Regards
-- tomás




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15  5:44                     ` David Kastrup
@ 2008-04-15 15:35                       ` Stefan Monnier
  0 siblings, 0 replies; 56+ messages in thread
From: Stefan Monnier @ 2008-04-15 15:35 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stephen J. Turnbull, Eli Zaretskii, emacs-devel

>>>> You mean, it's almost exactly the same, except it's completely
>>>> different?  Then I agree,
>> 
>>> "Start/end of line" and "Start of buffer/communication" is not
>>> "completely different".  Likewise, "\\`" and "^" are not "completely
>>> different" regular expressions.
>> 
>> But by EOL we don't mean "^" or "$", but "\n": this *is* completely
>> different from "\\`".

> I fail to see anything close to a coherent argument here, and it is
> probably not relevant to the issue at hand, anyway.  So we might as well
> stop.

Look at the src/regex.c code (or any other regex manipulation code);
compare the code needed for "^" and "$" to the code needed for "\n".
"\n" is trivial, just like any other char (which is the key here:
EOL-conversion is just a way to convert the \n to a byte sequence and
vice-versa, which is why it integrates well with the coding-system),
whereas ^ and $ require special handling because they have to look
before or after the matched text.

Try (replace-regexp-in-string "\\>" "toto" "a b c") to get a feeling for
the kinds of problems you can get with regexp elements that look outside
of the matched text.
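
A minimal sketch of the same distinction, using Python's re module as a
stand-in (its syntax and semantics are not those of Emacs's regex.c, and
the variable names are only illustrative).  A literal newline matches
the same way whatever surrounds it, while an anchor gives a different
answer once the surrounding text is cut away:

```python
import re

# "\n" is an ordinary character: matching it needs no surrounding context.
newline = re.compile("\n")
newline_pos = newline.search("a\nb").start()

# "^" is a zero-width assertion: it has to look at what precedes the match.
anchored = re.compile("^b")

# Searching inside the full string from index 1: "b" is not at the start.
in_context = anchored.search("ab", 1)

# Searching a copy with the context cut away: now "b" looks like the start.
sliced = anchored.search("ab"[1:])

print(newline_pos, in_context is None, sliced is not None)
```

The second and third searches look at the same character "b" but
disagree, which is the kind of trouble that context-sensitive regexp
elements can cause when a match is re-run on a substring.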

        Stefan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15  3:25               ` Eli Zaretskii
@ 2008-04-15 16:51                 ` Stephen J. Turnbull
  2008-04-15 20:09                   ` Eli Zaretskii
  0 siblings, 1 reply; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-15 16:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > > BOM-{prohibited,auto,required}.
 > 
 > But we don't have these in Emacs, do we?

Huh?  We don't have the full suite, but we do have -signature variants.

 > >  > Don't forget that en/decoding is used on strings as well, not only on
 > >  > buffers.  Buffer-local variables won't cut it, I think.
 > > 
 > > Strings don't have encoding signatures or newline variants
 > 
 > ??? Of course, they do.

Indeed?  Suppose I have a string as the value of the symbol `s'
containing the octets "\r\n".  Please explain to me how to compute
whether that is the value 0x0D0A from a network stream prepared using
htons(3), or a line ending suitable for appending to a Windows file.
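
To make the ambiguity concrete, here is a small Python sketch (Python
is used only for illustration; the variable names are made up).  The
two values are built for entirely different purposes and come out as
the same two octets:

```python
import struct

# 0x0D0A as a 16-bit integer in network (big-endian) byte order,
# i.e. the octets that htons(3) would arrange for the wire.
as_integer = struct.pack("!H", 0x0D0A)

# A DOS/Windows line ending, encoded as text.
as_eol = "\r\n".encode("ascii")

# The octets are identical; nothing in the data says which was meant.
print(as_integer == as_eol)
```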

As I wrote before:

 > > those octet sequences if present in a string are merely binary octet
 > > sequences.  They only have special semantics in external
 > > representations.  Where's the problem?
 > 
 > A string can be sent to a process, for example, so we must have some
 > way of generating an external representation for it.

Well, of course we must.  But the right generalization of "buffer file
coding system" is not to apply en/decoding to strings, but rather to
give processes and sockets, etc, coding system properties equivalent
to my proposed buffer-local variables.

All I'm trying to say here is that "prepend a signature" and
"translate ?\n to appropriate EOL representation" and their inverses
make sense independently of the text encoding[1], and that the user
interface and API could be greatly clarified if it reflected that
fact.  I suspect bugs like the one you encountered would be a lot less
frequent if the internal architecture reflected it too, but that might
be inefficient.

Footnotes: 
[1]  Obviously "prepend a signature" needs to be parametrized by the
encoding in general, but in the case of Unicode UTFs it's actually
independent of the UTF.
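
The decomposition can be sketched as follows.  This is Python, not
Emacs Lisp, and every name in it (EOL, encode_text, decode_text) is
hypothetical; it only assumes that the codec passed in is a BOM-less
one, such as Python's "utf-8" or "utf-16-le", so that the signature is
added exactly once:

```python
# Hypothetical names throughout; this is not Emacs's coding-system API.
EOL = {"unix": "\n", "dos": "\r\n", "mac": "\r"}


def encode_text(text, codec="utf-8", signature=False, eol="unix"):
    """Encode TEXT by composing three independent choices."""
    # 1. EOL convention: the internal "\n" becomes the external line ending.
    body = text.replace("\n", EOL[eol])
    # 2. Signature: for the Unicode UTFs it is always U+FEFF, whichever
    #    UTF is used, so it can be prepended before the text encoding.
    if signature:
        body = "\ufeff" + body
    # 3. Text encoding proper (CODEC is assumed not to add a BOM itself).
    return body.encode(codec)


def decode_text(data, codec="utf-8", eol="unix"):
    """Inverse of encode_text; strips a signature only if one is present."""
    text = data.decode(codec)
    had_signature = text.startswith("\ufeff")
    if had_signature:
        text = text[1:]
    return text.replace(EOL[eol], "\n"), had_signature


encoded = encode_text("a\nb", codec="utf-16-le", signature=True, eol="dos")
print(encoded)
print(decode_text(encoded, codec="utf-16-le", eol="dos"))
```

Note that the decoder checks for U+FEFF before dropping the first
character, so a file without a signature does not lose any text.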

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15 16:51                 ` Stephen J. Turnbull
@ 2008-04-15 20:09                   ` Eli Zaretskii
  2008-04-15 20:31                     ` Eli Zaretskii
                                       ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-15 20:09 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <turnbull@sk.tsukuba.ac.jp>
> Cc: emacs-devel@gnu.org
> Date: Wed, 16 Apr 2008 01:51:50 +0900
> 
> Eli Zaretskii writes:
> 
>  > > BOM-{prohibited,auto,required}.
>  > 
>  > But we don't have these in Emacs, do we?
> 
> Huh?  We don't have the full suite, but we do have -signature variants.

But for UTF-8, we don't, at least not in GNU Emacs 23.  When I type
"C-x RET c utf-8 TAB TAB TAB", I see this in the completions buffer:

    Click <mouse-2> on a completion to select it.
    In this buffer, type RET to select the completion near point.

    Possible completions are:
    utf-8            utf-8-dos         utf-8-emacs  utf-8-emacs-dos
    utf-8-emacs-mac  utf-8-emacs-unix  utf-8-mac    utf-8-unix

No -signature variants.

>  > >  > Don't forget that en/decoding is used on strings as well, not only on
>  > >  > buffers.  Buffer-local variables won't cut it, I think.
>  > > 
>  > > Strings don't have encoding signatures or newline variants
>  > 
>  > ??? Of course, they do.
> 
> Indeed?  Suppose I have a string as the value of the symbol `s'
> containing the octets "\r\n".  Please explain to me how to compute
> whether that is the value 0x0D0A from a network stream prepared using
> htons(3), or a line ending suitable for appending to a Windows file.

The Lisp code that created the string knows what it is and how to deal
with it.  But you already know that, so I probably simply fail to
follow your reasoning.

>  > A string can be sent to a process, for example, so we must have some
>  > way of generating an external representation for it.
> 
> Well, of course we must.  But the right generalization of "buffer file
> coding system" is not to apply en/decoding to strings, but rather to
> give processes and sockets, etc, coding system properties equivalent
> to my proposed buffer-local variables.

I'm afraid that this will be very hard to implement in Emacs, since
the internals are very much exposed and we are used to copying strings
to and fro freely.  I think we also don't have sockets and other
similar interfaces as Lisp objects to which we could give properties.




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15 20:09                   ` Eli Zaretskii
@ 2008-04-15 20:31                     ` Eli Zaretskii
  2008-04-15 20:35                       ` David Kastrup
  2008-04-16 20:15                     ` Stephen J. Turnbull
  2008-04-17  1:14                     ` Stefan Monnier
  2 siblings, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-15 20:31 UTC (permalink / raw)
  To: emacs-devel

> Date: Tue, 15 Apr 2008 23:09:32 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
> 
>     Possible completions are:
>     utf-8            utf-8-dos         utf-8-emacs  utf-8-emacs-dos
>     utf-8-emacs-mac  utf-8-emacs-unix  utf-8-mac    utf-8-unix

Btw, utf-8-emacs is also an unfortunate name, IMO, because it again
does not explain in any way the difference between utf-8 and
utf-8-emacs.  Since I'm already in Emacs, adding -emacs to an encoding
name adds no information that would help me resolve the ambiguity.

How about emacs-internal instead?  Or anything else with the word
``internal'' as part of it?  (The fact that it is based on UTF-8 is
IMO irrelevant: it's still an internal Emacs representation that is
understood only by Emacs.)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15 20:31                     ` Eli Zaretskii
@ 2008-04-15 20:35                       ` David Kastrup
  0 siblings, 0 replies; 56+ messages in thread
From: David Kastrup @ 2008-04-15 20:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Tue, 15 Apr 2008 23:09:32 +0300
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: emacs-devel@gnu.org
>> 
>>     Possible completions are:
>>     utf-8            utf-8-dos         utf-8-emacs  utf-8-emacs-dos
>>     utf-8-emacs-mac  utf-8-emacs-unix  utf-8-mac    utf-8-unix
>
> Btw, utf-8-emacs is also an unfortunate name, IMO, because it again
> does not explain in any way the difference between utf-8 and
> utf-8-emacs.  Since I'm already in Emacs, adding -emacs to an encoding
> name adds no information that would help me resolve the ambiguity.
>
> How about emacs-internal instead?  Or anything else with the word
> ``internal'' as part of it?  (The fact that it is based on UTF-8 is
> IMO irrelevant: it's still an internal Emacs representation that is
> understood only by Emacs.)

unicode2-internal?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15  7:38     ` tomas
@ 2008-04-15 22:30       ` Juri Linkov
  2008-04-16  3:20         ` Eli Zaretskii
  0 siblings, 1 reply; 56+ messages in thread
From: Juri Linkov @ 2008-04-15 22:30 UTC (permalink / raw)
  To: tomas; +Cc: Eli Zaretskii, emacs-devel

> Absolutely. I do agree on all this -- it was the stark contrast between
> sometimes not using a BOM in UTF-16 and using a BOM in UTF-8 that caused
> ummm... some emotions ;-)
>
>> A BOM in UTF-8 is another matter, of course...
>
> Both things taken together make the work of art.

I have nothing to say about why Windows adds a BOM to UTF-8 files,
but Emacs once saved me much debugging time when mobile terminals
failed to display files saved by users of Windows Notepad, which adds
a BOM to UTF-8 files.  Earlier versions of Emacs (I don't remember
exactly which) displayed the BOM character at the beginning of the
buffer, so it was easy to see where the problem was.

Unfortunately, now in Emacs 23 I see no BOM marks displayed at the
beginning of the buffer.  I think Emacs should have a visual indication
for such hidden characters.

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15 22:30       ` Juri Linkov
@ 2008-04-16  3:20         ` Eli Zaretskii
  2008-04-16  8:12           ` Jason Rumney
  0 siblings, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-16  3:20 UTC (permalink / raw)
  To: Juri Linkov; +Cc: tomas, emacs-devel

> From: Juri Linkov <juri@jurta.org>
> Date: Wed, 16 Apr 2008 01:30:41 +0300
> Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org
> 
> Unfortunately, now in Emacs 23 I see no BOM marks displayed at the
> beginning of the buffer.  I think Emacs should have a visual indication
> for such hidden characters.

Emacs behaves correctly IMO, since its behavior is tuned for reading
text, and BOM is not part of the text.  If you want to debug the
programs that generated that text, you can always use no-conversion or
find-file-literally.
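
Outside Emacs, the same kind of literal inspection can be sketched in a
few lines of Python.  The helper name sniff_signature is hypothetical,
and the sketch deliberately checks only the UTF-8 and UTF-16
signatures (UTF-32 is left out, since its little-endian BOM begins
with the same two octets as the UTF-16 one):

```python
import codecs

# Signatures checked by this sketch, paired with a BOM-less codec name.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(raw):
    """Return the encoding whose BOM starts RAW, or None if there is none."""
    for bom, name in SIGNATURES:
        if raw.startswith(bom):
            return name
    return None


print(sniff_signature(b"\xef\xbb\xbfhello"))   # a Notepad-style UTF-8 file
print(sniff_signature(b"h\x00i\x00"))          # UTF-16LE without a BOM
```

For a file on disk, the bytes would come from reading the file in
binary mode, so that no decoding happens before the check.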




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16  3:20         ` Eli Zaretskii
@ 2008-04-16  8:12           ` Jason Rumney
  2008-04-16 13:35             ` Stefan Monnier
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Rumney @ 2008-04-16  8:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Juri Linkov, tomas, emacs-devel

Eli Zaretskii wrote:

> Emacs behaves correctly IMO, since its behavior is tuned for reading
> text, and BOM is not part of the text.  If you want to debug the
> programs that generated that text, you can always use no-conversion or
> find-file-literally.
>   

But you don't know what you are debugging until Emacs (or something 
else) points out the unexpected BOM.  Indicating the presence of a BOM 
isn't really any different to indicating the encoding, though a better 
(more noticeable) UI might be some indicator in the left fringe on the 
first line of the file, rather than just a change to the character in 
the modeline.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16  8:12           ` Jason Rumney
@ 2008-04-16 13:35             ` Stefan Monnier
  2008-04-16 14:45               ` Jason Rumney
                                 ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Stefan Monnier @ 2008-04-16 13:35 UTC (permalink / raw)
  To: Jason Rumney; +Cc: Juri Linkov, Eli Zaretskii, tomas, emacs-devel

>> Emacs behaves correctly IMO, since its behavior is tuned for reading
>> text, and BOM is not part of the text.  If you want to debug the
>> programs that generated that text, you can always use no-conversion or
>> find-file-literally.

> But you don't know what you are debugging until Emacs (or something else)
> points out the unexpected BOM.  Indicating the presence of a BOM isn't
> really any different to indicating the encoding, though a better (more
> noticeable) UI might be some indicator in the left fringe on the first line
> of the file, rather than just a change to the character in the modeline.

We could use an approach similar to non-breaking space, where the BOM is
made visible just like any other char, with a special face.  Ideally it
would also be somehow protected from accidental removal,


        Stefan




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 13:35             ` Stefan Monnier
@ 2008-04-16 14:45               ` Jason Rumney
  2008-04-16 17:05                 ` Stefan Monnier
  2008-04-16 20:09               ` Stephen J. Turnbull
  2008-04-16 23:17               ` Juri Linkov
  2 siblings, 1 reply; 56+ messages in thread
From: Jason Rumney @ 2008-04-16 14:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Juri Linkov, Eli Zaretskii, tomas, emacs-devel

Stefan Monnier wrote:
> Ideally it would also be somehow protected from accidental removal
>   

Which is why I suggested the fringe.




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 14:45               ` Jason Rumney
@ 2008-04-16 17:05                 ` Stefan Monnier
  0 siblings, 0 replies; 56+ messages in thread
From: Stefan Monnier @ 2008-04-16 17:05 UTC (permalink / raw)
  To: Jason Rumney; +Cc: Juri Linkov, Eli Zaretskii, tomas, emacs-devel

>> Ideally it would also be somehow protected from accidental removal
> Which is why I suggested the fringe.

But that doesn't work on ttys,


        Stefan




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 13:35             ` Stefan Monnier
  2008-04-16 14:45               ` Jason Rumney
@ 2008-04-16 20:09               ` Stephen J. Turnbull
  2008-04-16 23:17               ` Juri Linkov
  2 siblings, 0 replies; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-16 20:09 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Juri Linkov, Eli Zaretskii, tomas, emacs-devel, Jason Rumney

Stefan Monnier writes:

 > We could use an approach similar to non-breaking space, where the BOM is
 > made visible just like any other char, with a special face.

The BOM is not part of the text in a file, it is part of a
higher-level protocol.




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15 20:09                   ` Eli Zaretskii
  2008-04-15 20:31                     ` Eli Zaretskii
@ 2008-04-16 20:15                     ` Stephen J. Turnbull
  2008-04-16 20:32                       ` David Kastrup
  2008-04-16 22:09                       ` Eli Zaretskii
  2008-04-17  1:14                     ` Stefan Monnier
  2 siblings, 2 replies; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-16 20:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > > Huh?  We don't have the full suite, but we do have -signature variants.
 > 
 > But for UTF-8, we don't, at least not in GNU Emacs 23.

We're not talking about GNU Emacs 23, we're talking about what should
be.  What I'm trying to say is that all of these variants are
occasionally useful, and they can be decomposed as text coding +
signature + EOL convention, rather than having a zillion variants with
weird names for the user to keep track of.

 > > Indeed?  Suppose I have a string as the value of the symbol `s'
 > > containing the octets "\r\n".  Please explain to me how to compute
 > > whether that is the value 0x0D0A from a network stream prepared using
 > > htons(3), or a line ending suitable for appending to a Windows file.
 > 
 > The Lisp code that created the string knows what it is and how to deal
 > with it.  But you already know that, so I probably simply fail to
 > follow your reasoning.

Let me quote you:

 > I'm afraid that this will be very hard to implement in Emacs, since
 > the internals are very much exposed and we are used to copying strings
 > to and fro freely.

 > I think we also don't have sockets and other similar interfaces as
 > Lisp objects to which we could give properties.

That's a shame, isn't it?




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 20:15                     ` Stephen J. Turnbull
@ 2008-04-16 20:32                       ` David Kastrup
  2008-04-17  3:23                         ` Stephen J. Turnbull
  2008-04-16 22:09                       ` Eli Zaretskii
  1 sibling, 1 reply; 56+ messages in thread
From: David Kastrup @ 2008-04-16 20:32 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > > Huh?  We don't have the full suite, but we do have -signature variants.
>  > 
>  > But for UTF-8, we don't, at least not in GNU Emacs 23.
>
> We're not talking about GNU Emacs 23, we're talking about what should
> be.  What I'm trying to say is that all of these variants are
> occasionally useful, and they can be decomposed as text coding +
> signature + EOL convention, rather than having a zillion variants with
> weird names for the user to keep track of.

Well, the solution is then systematic names...

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 20:15                     ` Stephen J. Turnbull
  2008-04-16 20:32                       ` David Kastrup
@ 2008-04-16 22:09                       ` Eli Zaretskii
  1 sibling, 0 replies; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-16 22:09 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: emacs-devel@gnu.org
> Date: Thu, 17 Apr 2008 05:15:42 +0900
> 
>  > I'm afraid that this will be very hard to implement in Emacs, since
>  > the internals are very much exposed and we are used to copying strings
>  > to and fro freely.
> 
>  > I think we also don't have sockets and other similar interfaces as
>  > Lisp objects to which we could give properties.
> 
> That's a shame, isn't it?

Maybe, I really can't say.




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 13:35             ` Stefan Monnier
  2008-04-16 14:45               ` Jason Rumney
  2008-04-16 20:09               ` Stephen J. Turnbull
@ 2008-04-16 23:17               ` Juri Linkov
  2008-04-16 23:42                 ` Jason Rumney
  2 siblings, 1 reply; 56+ messages in thread
From: Juri Linkov @ 2008-04-16 23:17 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, tomas, emacs-devel, Jason Rumney

>>> Emacs behaves correctly IMO, since its behavior is tuned for reading
>>> text, and BOM is not part of the text.  If you want to debug the
>>> programs that generated that text, you can always use no-conversion or
>>> find-file-literally.
>
>> But you don't know what you are debugging until Emacs (or something else)
>> points out the unexpected BOM.  Indicating the presence of a BOM isn't
>> really any different to indicating the encoding, though a better (more
>> noticeable) UI might be some indicator in the left fringe on the first line
>> of the file, rather than just a change to the character in the modeline.
>
> We could use an approach similar to non-breaking space, where the BOM is
> made visible just like any other char, with a special face.  Ideally it
> would also be somehow protected from accidental removal,

There is currently one way to display the BOM in Emacs: visiting
a file that contains the BOM with a BOM-less coding (e.g. visiting
a utf-16le-with-signature file forcing the utf-16le coding) displays
at the beginning of the buffer a big ugly character that looks like
some screen garbage.  There is some interesting information about it:

  name: ZERO WIDTH NO-BREAK SPACE
  old-name: BYTE ORDER MARK

It looks like the character was once renamed, and the new name hints
that it should not be displayed, given its supposed zero width.

Maybe then a better indication would be in the modeline by displaying
the name of the coding with signature explicitly like "U(BOM)".
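
The effect is easy to reproduce outside Emacs.  In this Python sketch
the codec names are Python's own and do not correspond to the Emacs
coding systems with similar names: Python's "utf-16-le" never looks for
a BOM, while its "utf-16" uses and consumes one.

```python
import unicodedata

# UTF-16LE "hi", preceded by a byte order mark (FF FE).
data = b"\xff\xfeh\x00i\x00"

# BOM-less codec: the signature survives as a character in the text.
kept = data.decode("utf-16-le")

# BOM-aware codec: the signature is consumed and only the text remains.
stripped = data.decode("utf-16")

print(len(kept), len(stripped))
print(unicodedata.name(kept[0]))   # ZERO WIDTH NO-BREAK SPACE
```

The name reported for the leftover character is the one from the
Unicode character database, the same name shown above.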

-- 
Juri Linkov
http://www.jurta.org/emacs/




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 23:17               ` Juri Linkov
@ 2008-04-16 23:42                 ` Jason Rumney
  2008-04-17  1:03                   ` Kenichi Handa
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Rumney @ 2008-04-16 23:42 UTC (permalink / raw)
  To: Juri Linkov; +Cc: Eli Zaretskii, tomas, Stefan Monnier, emacs-devel

Juri Linkov wrote:
>   name: ZERO WIDTH NO-BREAK SPACE
>   old-name: BYTE ORDER MARK
>   

I think these are reversed from what they should be. The zero width 
no-break space use of that character is deprecated.





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 23:42                 ` Jason Rumney
@ 2008-04-17  1:03                   ` Kenichi Handa
  0 siblings, 0 replies; 56+ messages in thread
From: Kenichi Handa @ 2008-04-17  1:03 UTC (permalink / raw)
  To: Jason Rumney; +Cc: juri, eliz, tomas, monnier, emacs-devel

In article <48068ECC.1090900@gnu.org>, Jason Rumney <jasonr@gnu.org> writes:

> Juri Linkov wrote:
> >   name: ZERO WIDTH NO-BREAK SPACE
> >   old-name: BYTE ORDER MARK
> >   

> I think these are reversed from what they should be. The zero width 
> no-break space use of that character is deprecated.

But they are the names for U+FEFF defined in UnicodeData.txt
distributed by the Unicode Consortium.

---
Kenichi Handa
handa@ni.aist.go.jp




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-15 20:09                   ` Eli Zaretskii
  2008-04-15 20:31                     ` Eli Zaretskii
  2008-04-16 20:15                     ` Stephen J. Turnbull
@ 2008-04-17  1:14                     ` Stefan Monnier
  2 siblings, 0 replies; 56+ messages in thread
From: Stefan Monnier @ 2008-04-17  1:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel

> and fro freely.  I think we also don't have sockets and other similar
> interfaces as Lisp objects to which we could give properties.

I'm not sure I understand: sockets are represented as pseudo async
processes and they do have properties (see process-put/get).


        Stefan




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-16 20:32                       ` David Kastrup
@ 2008-04-17  3:23                         ` Stephen J. Turnbull
  2008-04-17  3:26                           ` Eli Zaretskii
  0 siblings, 1 reply; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-17  3:23 UTC (permalink / raw)
  To: David Kastrup; +Cc: Eli Zaretskii, emacs-devel

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > We're not talking about GNU Emacs 23, we're talking about what should
 > > be.  What I'm trying to say is that all of these variants are
 > > occasionally useful, and they can be decomposed as text coding +
 > > signature + EOL convention, rather than having a zillion variants with
 > > weird names for the user to keep track of.
 > 
 > Well, the solution is then systematic names...

Well, not entirely.  As Eli points out, many of the names won't be
bound to appropriate coding systems as things stand, because they
don't exist.

I also find it faintly unclean when the system has to go around
parsing symbol names to do things like change the EOL convention
preferred for a buffer.




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: utf-16le vs utf-16-le
  2008-04-17  3:23                         ` Stephen J. Turnbull
@ 2008-04-17  3:26                           ` Eli Zaretskii
  2008-04-17  7:44                             ` Stephen J. Turnbull
  0 siblings, 1 reply; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-17  3:26 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,
>     emacs-devel@gnu.org
> Date: Thu, 17 Apr 2008 12:23:37 +0900
> 
> I also find it faintly unclean when the system has to go around
> parsing symbol names to do things like change the EOL convention
> preferred for a buffer.

We don't parse the symbol name at all, AFAIR; instead, the properties
of each symbol are defined in advance by define-coding-system.





* Re: utf-16le vs utf-16-le
  2008-04-17  3:26                           ` Eli Zaretskii
@ 2008-04-17  7:44                             ` Stephen J. Turnbull
  2008-04-17  8:19                               ` Jan Djärv
  0 siblings, 1 reply; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-17  7:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii writes:

 > > From: "Stephen J. Turnbull" <stephen@xemacs.org>
 > > Cc: Eli Zaretskii <eliz@gnu.org>,
 > >     emacs-devel@gnu.org
 > > Date: Thu, 17 Apr 2008 12:23:37 +0900
 > > 
 > > I also find it faintly unclean when the system has to go around
 > > parsing symbol names to do things like change the EOL convention
 > > preferred for a buffer.
 > 
 > We don't parse the symbol name at all, AFAIR; instead, the properties
 > of each symbol are defined in advance by define-coding-system.

OK, so you've got properties which must be defined in correspondence
with the coding system names.  No parsing needed, but this would
bother me, defining NxMxP symbols when I could define N+M+P symbols.

I guess that doesn't bother you, so I'll just leave it at that.
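
To make the NxMxP versus N+M+P point concrete, here is a minimal Python
sketch.  It is not Emacs code and not anyone's proposed implementation;
the names Coding, TEXT_CODINGS, SIGNATURES and EOLS are hypothetical and
exist only for this illustration.  It treats a coding as three
independent parts (text coding, signature, EOL convention), so only
N+M+P components are defined while all NxMxP combinations stay available.

```python
from dataclasses import dataclass, replace

# Hypothetical component tables: N text codings, M signature choices,
# P EOL conventions.  The values are Python codec names or settings.
TEXT_CODINGS = {"utf-8": "utf-8", "utf-16le": "utf-16-le", "utf-16be": "utf-16-be"}
SIGNATURES = {"with-signature": True, "without-signature": False}
EOLS = {"unix": "\n", "dos": "\r\n", "mac": "\r"}

BOM = "\ufeff"


@dataclass(frozen=True)
class Coding:
    """A coding described by three independent components."""
    text: str
    signature: str
    eol: str

    def encode(self, s: str) -> bytes:
        # Apply the EOL convention, then the signature, then the text coding.
        s = s.replace("\n", EOLS[self.eol])
        if SIGNATURES[self.signature]:
            s = BOM + s
        return s.encode(TEXT_CODINGS[self.text])

    def decode(self, data: bytes) -> str:
        s = data.decode(TEXT_CODINGS[self.text])
        # Drop the first character only if it really is a BOM.
        if SIGNATURES[self.signature] and s.startswith(BOM):
            s = s[1:]
        return s.replace(EOLS[self.eol], "\n")


if __name__ == "__main__":
    n, m, p = len(TEXT_CODINGS), len(SIGNATURES), len(EOLS)
    print(n + m + p)   # components defined: 8
    print(n * m * p)   # combinations available: 18

    win = Coding("utf-16le", "with-signature", "dos")
    # Change only the EOL convention; no name parsing is involved.
    unix = replace(win, eol="unix")
    print(unix)
```

Changing one aspect, such as the EOL convention, is then a matter of
replacing one field rather than looking up a differently named symbol.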






* Re: utf-16le vs utf-16-le
  2008-04-17  7:44                             ` Stephen J. Turnbull
@ 2008-04-17  8:19                               ` Jan Djärv
  2008-04-17 12:41                                 ` Eli Zaretskii
  2008-04-17 17:20                                 ` Stephen J. Turnbull
  0 siblings, 2 replies; 56+ messages in thread
From: Jan Djärv @ 2008-04-17  8:19 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel



Stephen J. Turnbull wrote:
> Eli Zaretskii writes:
> 
>  > > From: "Stephen J. Turnbull" <stephen@xemacs.org>
>  > > Cc: Eli Zaretskii <eliz@gnu.org>,
>  > >     emacs-devel@gnu.org
>  > > Date: Thu, 17 Apr 2008 12:23:37 +0900
>  > > 
>  > > I also find it faintly unclean when the system has to go around
>  > > parsing symbol names to do things like change the EOL convention
>  > > preferred for a buffer.
>  > 
>  > We don't parse the symbol name at all, AFAIR; instead, the properties
>  > of each symbol are defined in advance by define-coding-system.
> 
> OK, so you've got properties which must be defined in correspondence
> with the coding system names.  No parsing needed, but this would
> bother me, defining NxMxP symbols when I could define N+M+P symbols.
> 

In recode they have "surfaces".  So charset is separate from surfaces, for 
example EOL convention.  That would be nice to have in Emacs as well.

	Jan D.





* Re: utf-16le vs utf-16-le
  2008-04-17  8:19                               ` Jan Djärv
@ 2008-04-17 12:41                                 ` Eli Zaretskii
  2008-04-17 17:20                                 ` Stephen J. Turnbull
  1 sibling, 0 replies; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-17 12:41 UTC (permalink / raw)
  To: Jan Djärv; +Cc: turnbull, emacs-devel

> Date: Thu, 17 Apr 2008 10:19:56 +0200
> From: Jan Djärv <jan.h.d@swipnet.se>
> Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org
> 
> In recode they have "surfaces".  So charset is separate from surfaces, for 
> example EOL convention.  That would be nice to have in Emacs as well.

We already do, at least for the EOL format.  Check out
coding-system-change-{text,eol}-conversion, coding-system-eol-type,
and coding-system-eol-type-mnemonic.  But I don't think BOM is treated
as a surface by `recode', unless I'm missing something.





* Re: utf-16le vs utf-16-le
  2008-04-17  8:19                               ` Jan Djärv
  2008-04-17 12:41                                 ` Eli Zaretskii
@ 2008-04-17 17:20                                 ` Stephen J. Turnbull
  2008-04-17 18:03                                   ` Eli Zaretskii
  1 sibling, 1 reply; 56+ messages in thread
From: Stephen J. Turnbull @ 2008-04-17 17:20 UTC (permalink / raw)
  To: Jan Djärv; +Cc: Eli Zaretskii, emacs-devel

Jan Djärv writes:

 > In recode they have "surfaces".  So charset is separate from surfaces, for 
 > example EOL convention.  That would be nice to have in Emacs as well.

Do you know how recode's terminology maps to Unicode TR #17?  (If you
don't know what I'm talking about, please disregard; I just never
figured out what "surfaces" were and wouldn't mind a free clue.)





* Re: utf-16le vs utf-16-le
  2008-04-17 17:20                                 ` Stephen J. Turnbull
@ 2008-04-17 18:03                                   ` Eli Zaretskii
  0 siblings, 0 replies; 56+ messages in thread
From: Eli Zaretskii @ 2008-04-17 18:03 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: jan.h.d, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,
>     emacs-devel@gnu.org
> Date: Fri, 18 Apr 2008 02:20:47 +0900
> 
> Jan Djärv writes:
> 
>  > In recode they have "surfaces".  So charset is separate from surfaces, for 
>  > example EOL convention.  That would be nice to have in Emacs as well.
> 
> Do you know how recode's terminology maps to Unicode TR #17?

AFAIU, what `recode' calls ``surface'' is a Transfer Encoding Syntax
(TES) in UTR #17 parlance.






end of thread, other threads:[~2008-04-17 18:03 UTC | newest]

Thread overview: 56+ messages
2008-04-13 14:54 utf-16le vs utf-16-le Eli Zaretskii
2008-04-13 19:32 ` Stefan Monnier
2008-04-14  5:17   ` Kenichi Handa
2008-04-14  6:10     ` David Kastrup
2008-04-14 18:54       ` Eli Zaretskii
2008-04-14 19:04         ` David Kastrup
2008-04-14 17:38     ` Eli Zaretskii
2008-04-14 18:57     ` Eli Zaretskii
2008-04-13 22:23 ` Stephen J. Turnbull
2008-04-14  3:19   ` Eli Zaretskii
2008-04-14  7:32     ` Stephen J. Turnbull
2008-04-14  8:20       ` David Kastrup
2008-04-14 18:25         ` Stephen J. Turnbull
2008-04-14 18:46           ` Eli Zaretskii
2008-04-14 21:01             ` Stephen J. Turnbull
2008-04-14 21:15               ` Andreas Schwab
2008-04-15  0:22                 ` Stephen J. Turnbull
2008-04-15  3:25               ` Eli Zaretskii
2008-04-15 16:51                 ` Stephen J. Turnbull
2008-04-15 20:09                   ` Eli Zaretskii
2008-04-15 20:31                     ` Eli Zaretskii
2008-04-15 20:35                       ` David Kastrup
2008-04-16 20:15                     ` Stephen J. Turnbull
2008-04-16 20:32                       ` David Kastrup
2008-04-17  3:23                         ` Stephen J. Turnbull
2008-04-17  3:26                           ` Eli Zaretskii
2008-04-17  7:44                             ` Stephen J. Turnbull
2008-04-17  8:19                               ` Jan Djärv
2008-04-17 12:41                                 ` Eli Zaretskii
2008-04-17 17:20                                 ` Stephen J. Turnbull
2008-04-17 18:03                                   ` Eli Zaretskii
2008-04-16 22:09                       ` Eli Zaretskii
2008-04-17  1:14                     ` Stefan Monnier
2008-04-14 20:20           ` Stefan Monnier
2008-04-14 20:58             ` David Kastrup
2008-04-14 22:19               ` Stefan Monnier
2008-04-14 22:26                 ` David Kastrup
2008-04-14 22:33                   ` Stefan Monnier
2008-04-15  5:44                     ` David Kastrup
2008-04-15 15:35                       ` Stefan Monnier
2008-04-14 21:35             ` Stephen J. Turnbull
2008-04-14  5:17 ` Kenichi Handa
2008-04-14 13:57   ` Stefan Monnier
2008-04-14  7:02 ` tomas
2008-04-14 17:45   ` Eli Zaretskii
2008-04-15  7:38     ` tomas
2008-04-15 22:30       ` Juri Linkov
2008-04-16  3:20         ` Eli Zaretskii
2008-04-16  8:12           ` Jason Rumney
2008-04-16 13:35             ` Stefan Monnier
2008-04-16 14:45               ` Jason Rumney
2008-04-16 17:05                 ` Stefan Monnier
2008-04-16 20:09               ` Stephen J. Turnbull
2008-04-16 23:17               ` Juri Linkov
2008-04-16 23:42                 ` Jason Rumney
2008-04-17  1:03                   ` Kenichi Handa

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
