utf-8 cut/paste


* utf-8 cut/paste
@ 2004-05-23 18:59 Sam Steingold
  2004-05-24  9:52 ` Benjamin Riefenstahl
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Steingold @ 2004-05-23 18:59 UTC (permalink / raw)


GNU Emacs 21.3.50.1 (i386-msvc-nt5.0.2195)
 of 2004-05-17 on WINSTEINGOLDLAP
--with-msvc (12.00)

I can save a file with utf-8 encoding, open it with notepad, then
select text, copy it and paste into firefox, but when the same utf-8
file is in an Emacs buffer and I try to copy it with
`clipboard-kill-ring-save', I cannot paste into notepad or firefox: the
inserted text looks like "?????????".

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Whom computers would destroy, they must first drive mad.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: utf-8 cut/paste
  2004-05-23 18:59 utf-8 cut/paste Sam Steingold
@ 2004-05-24  9:52 ` Benjamin Riefenstahl
  2004-05-24 16:18   ` Sam Steingold
  0 siblings, 1 reply; 24+ messages in thread
From: Benjamin Riefenstahl @ 2004-05-24  9:52 UTC (permalink / raw)

Hi Sam,

Sam Steingold <sds@gnu.org> writes:
> the same utf-8 file is in an Emacs buffer, and I try to copy it with
> `clipboard-kill-ring-save', I cannot paste into notepad or firefox:
> the inserted text looks like "?????????".

Emacs uses the legacy 8-bit text clipboard format, it doesn't use
Unicode even on NT-based systems.  8-bit text won't work with
arbitrary Unicode characters, naturally.

If the text that you want to cut-and-paste only has characters that
are available in your local 8-bit encoding (windows-1252 typically),
there may be a solution using charset unification (I don't know how
that machinery works exactly).

In any case you probably need to make sure that your
selection-coding-system has the right value.

benny


* Re: utf-8 cut/paste
  2004-05-24  9:52 ` Benjamin Riefenstahl
@ 2004-05-24 16:18   ` Sam Steingold
  2004-05-24 19:19     ` Benjamin Riefenstahl
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Steingold @ 2004-05-24 16:18 UTC (permalink / raw)


Benjamin,
thanks for your reply,

> * Benjamin Riefenstahl <Orawnzva.Evrsrafgnuy@rcbfg.qr> [2004-05-24 11:52:29 +0200]:
>
> Sam Steingold <sds@gnu.org> writes:
>> the same utf-8 file is in an Emacs buffer, and I try to copy it with
>> `clipboard-kill-ring-save', I cannot paste into notepad or firefox:
>> the inserted text looks like "?????????".
>
> Emacs uses the legacy 8-bit text clipboard format, it doesn't use
> Unicode even on NT-based systems.  8-bit text won't work with
> arbitrary Unicode characters, naturally.
>
> If the text that you want to cut-and-paste only has characters that
> are available in your local 8-bit encoding (windows-1252 typically),
> there may be a solution using charset unification (I don't know how
> that machinery works exactly).

what is "charset unification"?
I thought that if I use unicode (utf-8), all characters are already
in one set.

> In any case you probably need to make sure that your
> selection-coding-system has the right value.

what value is right?

selection-coding-system
 ==> cp1252



-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Booze is the answer. I can't remember the question.


* Re: utf-8 cut/paste
  2004-05-24 16:18   ` Sam Steingold
@ 2004-05-24 19:19     ` Benjamin Riefenstahl
  2004-05-24 21:00       ` Sam Steingold
  2004-05-25  6:02       ` Eli Zaretskii
  0 siblings, 2 replies; 24+ messages in thread
From: Benjamin Riefenstahl @ 2004-05-24 19:19 UTC (permalink / raw)
  Cc: emacs-devel

Hi Sam,

Sam Steingold <sds@gnu.org> writes:
> what is "charset unification"?

I meant the functions unify-8859-on-decoding-mode and
unify-8859-on-encoding-mode.  I'm not quite sure if these functions
also cover the Windows codepages like cp1252.

> I thought that if I use unicode (utf-8), all characters are already
> in one set.

In general theory, all the Unicode characters are in the Unicode
(utf-8) set and all the cp1252 characters are in the cp1252 set.  As
far as Emacs is concerned the two sets describe different characters
and they have no relationship.  Unless you use some kind of
translation, which is what the above functions do AFAIK.

>> In any case you probably need to make sure that your
>> selection-coding-system has the right value.
>
> what value is right?
>
> selection-coding-system
>  ==> cp1252

Assuming you are on an English system, that's the right one.

benny


* Re: utf-8 cut/paste
  2004-05-24 19:19     ` Benjamin Riefenstahl
@ 2004-05-24 21:00       ` Sam Steingold
  2004-05-24 23:10         ` Benjamin Riefenstahl
  2004-05-25  6:02       ` Eli Zaretskii
  1 sibling, 1 reply; 24+ messages in thread
From: Sam Steingold @ 2004-05-24 21:00 UTC (permalink / raw)
  Cc: emacs-devel

Hi Benny,

Suppose I have a utf-8 buffer (all my buffers are unicode
because of (prefer-coding-system 'utf-8) and
(modify-coding-system-alist 'file "" 'utf-8))
with cyrillic characters.
what do I do to get them into firefox?
I already do
  (when (fboundp 'utf-translate-cjk-mode) (utf-translate-cjk-mode 1))
  (when (fboundp 'unify-8859-on-decoding-mode) (unify-8859-on-decoding-mode 1))
  (when (fboundp 'unify-8859-on-encoding-mode) (unify-8859-on-encoding-mode 1))

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Even Windows doesn't suck, when you use Common Lisp


* Re: utf-8 cut/paste
  2004-05-24 21:00       ` Sam Steingold
@ 2004-05-24 23:10         ` Benjamin Riefenstahl
  2004-05-25 13:06           ` Sam Steingold
  0 siblings, 1 reply; 24+ messages in thread
From: Benjamin Riefenstahl @ 2004-05-24 23:10 UTC (permalink / raw)
  Cc: Sam Steingold

Hi Sam,

Sam Steingold <sds@gnu.org> writes:
> Suppose I have a utf-8 buffer (all my buffers are unicode because of
> (prefer-coding-system 'utf-8) and (modify-coding-system-alist 'file
> "" 'utf-8)) with cyrillic characters.  what do I do to get them into
> firefox?

You don't, at least not via the Windows clipboard.  Re-read what I
wrote earlier:

>> Emacs uses the legacy 8-bit text clipboard format, it doesn't use
>> Unicode even on NT-based systems.  8-bit text won't work with
>> arbitrary Unicode characters, naturally.

I assume you are on an English system.  So your 8-bit Windows encoding
is cp1252 (Latin-1 plus MS extensions).  That doesn't cover cyrillic,
so you can't cut-and-paste cyrillic.  That's regardless of what Emacs
supports or uses internally.  Emacs would have to use the Unicode
version of the Windows clipboard transfer to do this, but as of yet it
doesn't.

You can of course write the data into a file in UTF-8 and read that in
Firefox or Notepad.
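
The effect can be reproduced outside Emacs.  A minimal sketch in
Python (used here only for illustration; the sample strings and
variable names are assumptions, and this is not what Emacs itself
runs) of Cyrillic text being forced through cp1252:

```python
# Illustrative sketch: what an 8-bit cp1252 transfer can and cannot carry.
text = "Привет"  # hypothetical Cyrillic clipboard contents (6 letters)

# cp1252 has no Cyrillic letters; with errors="replace" each
# unencodable character is replaced by a literal "?".
as_cp1252 = text.encode("cp1252", errors="replace")

# UTF-8 can represent every Unicode character, so the round trip is lossless.
as_utf8 = text.encode("utf-8")
round_trip = as_utf8.decode("utf-8")

# Text limited to the cp1252 repertoire survives the 8-bit path.
latin_bytes = "café".encode("cp1252")

print(as_cp1252)           # b'??????'
print(round_trip == text)  # True
print(latin_bytes)         # b'caf\xe9'
```

This matches the symptom in the original report: the question marks
come from the 8-bit encoding step, not from the receiving application.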

benny


* Re: utf-8 cut/paste
  2004-05-24 19:19     ` Benjamin Riefenstahl
  2004-05-24 21:00       ` Sam Steingold
@ 2004-05-25  6:02       ` Eli Zaretskii
  2004-05-25 10:03         ` Benjamin Riefenstahl
  2004-05-26 15:48         ` Stefan Monnier
  1 sibling, 2 replies; 24+ messages in thread
From: Eli Zaretskii @ 2004-05-25  6:02 UTC (permalink / raw)
  Cc: sds, emacs-devel

> From: Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de>
> Date: Mon, 24 May 2004 21:19:03 +0200
> 
> > I thought that if I use unicode (utf-8), all characters are already
> > in one set.
> 
> In general theory, all the Unicode characters are in the Unicode
> (utf-8) set and all the cp1252 characters are in the cp1252 set.

Actually, cp1252 is not a charset, it's an encoding (a.k.a. ``coding
system'').  The underlying Mule charset is latin-iso8859-1.


* Re: utf-8 cut/paste
  2004-05-25  6:02       ` Eli Zaretskii
@ 2004-05-25 10:03         ` Benjamin Riefenstahl
  2004-05-25 12:36           ` Eli Zaretskii
  2004-05-26 15:48         ` Stefan Monnier
  1 sibling, 1 reply; 24+ messages in thread
From: Benjamin Riefenstahl @ 2004-05-25 10:03 UTC (permalink / raw)
  Cc: sds, emacs-devel

Hi Eli,

"Eli Zaretskii" <eliz@gnu.org> writes:
> Actually, cp1252 is not a charset, it's an encoding (a.k.a. ``coding
> system'').  The underlying Mule charset is latin-iso8859-1.

And augmented by utf-8 characters where Latin-1 doesn't suffice, if I
see this right.  But that doesn't change the basic problem, does it?

benny


* Re: utf-8 cut/paste
  2004-05-25 10:03         ` Benjamin Riefenstahl
@ 2004-05-25 12:36           ` Eli Zaretskii
  2004-05-25 15:41             ` Sam Steingold
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2004-05-25 12:36 UTC (permalink / raw)
  Cc: sds, emacs-devel

> From: Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de>
> Date: Tue, 25 May 2004 12:03:44 +0200
> 
> "Eli Zaretskii" <eliz@gnu.org> writes:
> > Actually, cp1252 is not a charset, it's an encoding (a.k.a. ``coding
> > system'').  The underlying Mule charset is latin-iso8859-1.
> 
> And augmented by utf-8 characters where Latin-1 doesn't suffice, if I
> see this right.

Right (if you use code-pages.el rather than codepage.el, as most
Emacs platforms do; which is why I left it vague).

> But that doesn't change the basic problem, does it?

No, it doesn't.  My comment was a minor one, to help Sam avoid
possible confusion in the future.


* Re: utf-8 cut/paste
  2004-05-24 23:10         ` Benjamin Riefenstahl
@ 2004-05-25 13:06           ` Sam Steingold
  0 siblings, 0 replies; 24+ messages in thread
From: Sam Steingold @ 2004-05-25 13:06 UTC (permalink / raw)
  Cc: emacs-devel

> * Benjamin Riefenstahl <Orawnzva.Evrsrafgnuy@rcbfg.qr> [2004-05-25 01:10:17 +0200]:
>
> Sam Steingold <sds@gnu.org> writes:
>> Suppose I have a utf-8 buffer (all my buffers are unicode because of
>> (prefer-coding-system 'utf-8) and (modify-coding-system-alist 'file
>> "" 'utf-8)) with cyrillic characters.  what do I do to get them into
>> firefox?
>
> You don't, at least not via the Windows clipboard.  Re-read what I
> wrote earlier:
>
>>> Emacs uses the legacy 8-bit text clipboard format, it doesn't use
>>> Unicode even on NT-based systems.  8-bit text won't work with
>>> arbitrary Unicode characters, naturally.
>
> I assume you are on an English system.  So your 8-bit Windows encoding
> is cp1252 (Latin-1 plus MS extensions).  That doesn't cover cyrillic,
> so you can't cut-and-paste cyrillic.  That's regardless of what Emacs
> supports or uses internally.  Emacs would have to use the Unicode
> version of the Windows clipboard transfer to do this, but as of yet it
> doesn't.
>
> You can of course write the data into a file in UTF-8 and read that in
> Firefox or Notepad.

Thanks a lot for clarifying the matter.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
What garlic is to food, insanity is to art.


* Re: utf-8 cut/paste
  2004-05-25 12:36           ` Eli Zaretskii
@ 2004-05-25 15:41             ` Sam Steingold
  2004-05-26  4:22               ` Kenichi Handa
                                 ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Sam Steingold @ 2004-05-25 15:41 UTC (permalink / raw)
  Cc: Benjamin Riefenstahl, emacs-devel

> * Eli Zaretskii <ryvm@tah.bet> [2004-05-25 14:36:07 +0200]:
>
> No, it doesn't.  My comment was a minor one, to help Sam avoid
> possible confusion in the future.

I am sorry, you lost me long ago (when MULE was merged into Emacs).

I understand what a CHARACTER is (a type in CL).
E.g., #\C is a "LATIN CAPITAL LETTER C", or
#\С is a "CYRILLIC CAPITAL LETTER ES" (even though they might look
similar in your font).
I understand that there are many (partial) functions between (subsets of)
(INTEGER 0) and CHARACTER, called "encodings".
I don't know what a "charset" is, but I would guess that it is a subset
of CHARACTERs on which a particular encoding is defined.

I seem to recall that MULE considers characters as elements of these
charsets, not as elements of the class CHARACTER, i.e., each character
comes equipped with its integer encoding, and 2 characters which are
identical elements of CHARACTER, but appear in two different encodings
(e.g., #\Ц encoded in koi8 and in alt) are different characters in MULE.
This is so absurd that I can hardly believe that anyone could ever
conceive of this, let alone implement it.

This reminds me of a story (<http://v2.anekdot.ru/an/an0303/o030321.html#10>):

The Soviet space capsule of the 60s, Soyuz, was supposed to have
been made from titanium, but the titanium turned out to be too hard to
process, so it was made of a heavier aluminum alloy.  This violated
the mass properties and thus aerodynamic stability of the craft.  There
was no time to re-design everything (Moon race!), so the stability was
restored by adding a 150 kilogram lead dead-weight to the construction.
(only Soyuz-TMA in the early 2000s got rid of this thing!)

I hope it will take Emacs less than 30 years to get rid of the MULE
dead-weight.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Save your burned out bulbs for me, I'm building my own dark room.


* Re: utf-8 cut/paste
  2004-05-25 15:41             ` Sam Steingold
@ 2004-05-26  4:22               ` Kenichi Handa
  2004-05-28 17:45                 ` Sam Steingold
  2004-05-26  4:33               ` Miles Bader
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Kenichi Handa @ 2004-05-26  4:22 UTC (permalink / raw)
  Cc: Benjamin.Riefenstahl, eliz, emacs-devel

In article <uoeocikfe.fsf@gnu.org>, Sam Steingold <sds@gnu.org> writes:
> I seem to recall that MULE considers characters as elements of these
> charsets, not as elements of the class CHARACTER, i.e., each character
> comes equipped with its integer encoding, and 2 characters which are
> identical elements of CHARACTER, but appear in two different encodings
> (e.g., #\Ц encoded in koi8 and in alt) are different characters in MULE.

Your understanding is quite correct.

> This is so absurd that I can hardly believe that anyone could ever
> conceive of this, let alone implement it.

When I designed Mule (it was before Unicode was accepted as
widely as now), there was no agreement that a character A1 in
a charset S1 and a character A2 in a charset S2 are actually
the same character.  There was no authority that said that
code point 0xE3 of koi8 and code point 0x96 in alt
represent the same character.
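
Today that mapping is written down in the Unicode conversion tables.
A small sketch in Python (for illustration only; it assumes "koi8"
means KOI8-R and "alt" means the DOS "alternative" code page, cp866,
and the variable names are made up):

```python
# Illustrative sketch: two different byte values, one Unicode character.
koi8_byte = bytes([0xE3])  # code point 0xE3 in KOI8-R
alt_byte = bytes([0x96])   # code point 0x96 in cp866 (assumed to be "alt")

from_koi8 = koi8_byte.decode("koi8_r")
from_alt = alt_byte.decode("cp866")

print(from_koi8, from_alt)        # Ц Ц
print(from_koi8 == from_alt)      # True
print(hex(ord(from_koi8)))        # 0x426
```

Both bytes decode to U+0426, CYRILLIC CAPITAL LETTER TSE, which is the
kind of authoritative statement that did not exist when Mule was
designed.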

> I hope it will take Emacs less than 30 years to get rid of the MULE
> dead-weight.

You don't have to wait that long.  The emacs-unicode (in
emacs-unicode-2 branch of CVS) is already there.

---
Ken'ichi HANDA
handa@m17n.org


* Re: utf-8 cut/paste
  2004-05-25 15:41             ` Sam Steingold
  2004-05-26  4:22               ` Kenichi Handa
@ 2004-05-26  4:33               ` Miles Bader
  2004-05-26 18:11                 ` Sam Steingold
  2004-05-26 11:32               ` Eli Zaretskii
  2004-05-26 12:30               ` Benjamin Riefenstahl
  3 siblings, 1 reply; 24+ messages in thread
From: Miles Bader @ 2004-05-26  4:33 UTC (permalink / raw)
  Cc: Benjamin Riefenstahl, emacs-devel

Sam Steingold <sds@gnu.org> writes:
> I hope it will take Emacs less than 30 years to get rid of the MULE
> dead-weight.

_Then_ what are you going to do when you're looking for an excuse to
flame pointlessly?

-Miles
-- 
The secret to creativity is knowing how to hide your sources.
  --Albert Einstein


* Re: utf-8 cut/paste
  2004-05-25 15:41             ` Sam Steingold
  2004-05-26  4:22               ` Kenichi Handa
  2004-05-26  4:33               ` Miles Bader
@ 2004-05-26 11:32               ` Eli Zaretskii
  2004-05-26 13:31                 ` Sam Steingold
  2004-05-26 12:30               ` Benjamin Riefenstahl
  3 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2004-05-26 11:32 UTC (permalink / raw)
  Cc: emacs-devel

> From: Sam Steingold <sds@gnu.org>
> Date: Tue, 25 May 2004 11:41:09 -0400
> 
> > * Eli Zaretskii <ryvm@tah.bet> [2004-05-25 14:36:07 +0200]:
> >
> > No, it doesn't.  My comment was a minor one, to help Sam avoid
> > possible confusion in the future.
> 
> I am sorry, you lost me long ago (when MULE was merged into Emacs).

I'm not sure what that comment was supposed to tell (I didn't design
MULE, nor did I integrate it into Emacs).  So I will just pretend it was
never written.

I simply tried to help you understand things better, assuming that you
wanted to understand; if not, feel free to disregard what's below.

> I understand what a CHARACTER is (a type in CL).
> E.g., #\C is a "LATIN CAPITAL LETTER C", or
> #\С is a "CYRILLIC CAPITAL LETTER ES" (even though they might look
> similar in your font).
> I understand that there are many (partial) functions between (subsets of)
> (INTEGER 0) and CHARACTER, called "encodings".
> I don't know what a "charset" is, but I would guess that it is a subset
> of CHARACTERs on which a particular encoding is defined.

That is true, but it has no direct relevance to what I was trying to
explain.

What I was trying to explain was that, taking Cyrillic characters as
an example, any single Cyrillic character can be encoded in several
different encodings.  Examples of such encodings include KOI8-R,
ISO-8859-5, and cp1251 (a.k.a. windows-1251).

The set of Cyrillic characters is what MULE calls ``a charset''.  Any
encoding of characters from that charset is what MULE calls ``a coding
system''.

cp1251 is an encoding, not a charset.  It encodes the Cyrillic charset
(MULE calls that charset cyrillic-iso8859-5).  Similarly, cp1252
encodes the latin-iso8859-1 charset, and cp1255 encodes the
hebrew-iso8859-8 charset.
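
The charset/encoding distinction can be sketched in Python (used only
for illustration; the codec names are Python's, not Emacs coding
system names, and the variable names are assumptions).  One character
from the Cyrillic repertoire gets a different byte in each encoding:

```python
# Illustrative sketch: one character, three encodings ("coding systems").
char = "Ц"  # CYRILLIC CAPITAL LETTER TSE, U+0426
encodings = ["koi8_r", "iso8859_5", "cp1251"]

# Encode the same character with each encoding.
encoded = {name: char.encode(name) for name in encodings}

# Decoding each byte with its own encoding gives the character back.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}

for name in encodings:
    print(name, encoded[name].hex(), decoded[name])
# koi8_r e3 Ц
# iso8859_5 c6 Ц
# cp1251 d6 Ц
```

The character is the member of the charset; the byte values e3, c6
and d6 belong to the individual encodings.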

I sincerely hope that helps to make things more clear.


* Re: utf-8 cut/paste
  2004-05-25 15:41             ` Sam Steingold
                                 ` (2 preceding siblings ...)
  2004-05-26 11:32               ` Eli Zaretskii
@ 2004-05-26 12:30               ` Benjamin Riefenstahl
  3 siblings, 0 replies; 24+ messages in thread
From: Benjamin Riefenstahl @ 2004-05-26 12:30 UTC (permalink / raw)
  Cc: emacs-devel

Hi Sam,

Note that your original problem with cyrillic is not actually related
to MULE.  MULE may make things sound a bit more complicated, but the
problem is that Emacs doesn't use the Unicode APIs of Windows, which
it can do fine (and probably will at some point), with or without
MULE.  At least that holds on NT/W2K/XP; I don't know whether the
Unicode clipboard works on 9x/Me.

Sam Steingold <sds@gnu.org> writes:
> each character comes equipped with its integer encoding, and 2
> characters which are identical elements of CHARACTER, but appear in
> two different encodings (e.g., #\Ц encoded in koi8 and in alt) are
> different characters in MULE.  This is so absurd that I can hardly
> believe that anyone could ever conceive of this, let alone implement
> it.

You are presupposing that you know which "2 characters [...] are
identical elements of CHARACTER, but appear in two different
encodings."  While this knowledge seems obvious in theory, in practice
it involves quite a lot of work to formalize this unification for all
relevant charsets (i.e. for the charsets that are actually in use).

After the work has mostly been done in Unicode, this kind of
information is actually one of the major benefits of that standard.
So now, today we have a well-defined reference for things like:

> #\C is a "LATIN CAPITAL LETTER C", or #\С is a "CYRILLIC CAPITAL
> LETTER ES" (even though they might look similar in your font).
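
That reference can be queried directly.  A minimal sketch in Python
using the standard `unicodedata` module (for illustration only; the
variable names are assumptions):

```python
# Illustrative sketch: look-alike letters are distinct Unicode characters.
import unicodedata

latin_c = "C"            # U+0043
cyrillic_es = "\u0421"   # U+0421, looks like "C" in many fonts

latin_name = unicodedata.name(latin_c)
cyrillic_name = unicodedata.name(cyrillic_es)

print(latin_name)              # LATIN CAPITAL LETTER C
print(cyrillic_name)           # CYRILLIC CAPITAL LETTER ES
print(latin_c == cyrillic_es)  # False
```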

But when MULE was first implemented, Unicode was in its infancy, if I
see this right.  So at that time this knowledge wasn't available in
formal terms and in the necessary breadth.  IOW, MULE (building on
ISO-2022) was a solution at the time, while Unicode was still in the
design phase with much work to go.

benny


* Re: utf-8 cut/paste
  2004-05-26 11:32               ` Eli Zaretskii
@ 2004-05-26 13:31                 ` Sam Steingold
  0 siblings, 0 replies; 24+ messages in thread
From: Sam Steingold @ 2004-05-26 13:31 UTC (permalink / raw)
  Cc: emacs-devel

> * Eli Zaretskii <ryvm@tah.bet> [2004-05-26 13:32:26 +0200]:
>
> I sincerely hope that helps to make things more clear.

Yes, it does.  I appreciate your explanations, thank you very much.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
main(a){printf(a,34,a="main(a){printf(a,34,a=%c%s%c,34);}",34);}


* Re: utf-8 cut/paste
  2004-05-25  6:02       ` Eli Zaretskii
  2004-05-25 10:03         ` Benjamin Riefenstahl
@ 2004-05-26 15:48         ` Stefan Monnier
  2004-05-26 18:11           ` Eli Zaretskii
  1 sibling, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2004-05-26 15:48 UTC (permalink / raw)
  Cc: Benjamin Riefenstahl, sds, emacs-devel

>> > I thought that if I use unicode (utf-8), all characters are already
>> > in one set.
>> In general theory, all the Unicode characters are in the Unicode
>> (utf-8) set and all the cp1252 characters are in the cp1252 set.
> Actually, cp1252 is not a charset, it's an encoding (a.k.a. ``coding
> system'').  The underlying Mule charset is latin-iso8859-1.

AFAICT he was talking "in general", not "in Emacs".
Emacs' notion of a charset is a mostly arbitrary internal detail.
cp1252 could have been implemented as another charset rather than being
mapped to a mix of 8859-1 and unicode chars.
IIUC, emacs-unicode does away with this notion of a charset.


        Stefan


* Re: utf-8 cut/paste
  2004-05-26 15:48         ` Stefan Monnier
@ 2004-05-26 18:11           ` Eli Zaretskii
  2004-05-26 20:02             ` Stefan Monnier
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2004-05-26 18:11 UTC (permalink / raw)
  Cc: Benjamin.Riefenstahl, sds, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: 26 May 2004 11:48:32 -0400
> 
> cp1252 could have been implemented as another charset rather than being
> mapped to a mix of 8859-1 and unicode chars.

If we did that, it would make the unfortunate situation, whereby there
are multiple charsets that cover the same characters, even worse.
E.g., in addition to Unicode Cyrillic characters and
cyrillic-iso8859-5 Cyrillic characters, we would also have cp1252
Cyrillic characters.

Repeat after me: "multiple character sets covering the same characters
are BAD."


* Re: utf-8 cut/paste
  2004-05-26  4:33               ` Miles Bader
@ 2004-05-26 18:11                 ` Sam Steingold
  2004-05-26 19:23                   ` David Kastrup
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Steingold @ 2004-05-26 18:11 UTC (permalink / raw)


> * Miles Bader <zvyrf@yfv.arp.pb.wc> [2004-05-26 13:33:23 +0900]:
>
> Sam Steingold <sds@gnu.org> writes:
>> I hope it will take Emacs less than 30 years to get rid of the MULE
>> dead-weight.
>
> _Then_ what are you going to do when you're looking for an excuse to
> flame pointlessly?

same as you do: pick on some specific individual. :-)



-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
A professor is someone who talks in someone else's sleep.


* Re: utf-8 cut/paste
  2004-05-26 18:11                 ` Sam Steingold
@ 2004-05-26 19:23                   ` David Kastrup
  0 siblings, 0 replies; 24+ messages in thread
From: David Kastrup @ 2004-05-26 19:23 UTC (permalink / raw)
  Cc: emacs-devel

Sam Steingold <sds@gnu.org> writes:

> > * Miles Bader <zvyrf@yfv.arp.pb.wc> [2004-05-26 13:33:23 +0900]:
> >
> > Sam Steingold <sds@gnu.org> writes:
> >> I hope it will take Emacs less than 30 years to get rid of the MULE
> >> dead-weight.
> >
> > _Then_ what are you going to do when you're looking for an excuse to
> > flame pointlessly?
> 
> same as you do: pick on some specific individual. :-)

Depending on your Emacsian orientation, that ground has been
sufficiently covered by the seminal work of either RMS or JWZ.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: utf-8 cut/paste
  2004-05-26 18:11           ` Eli Zaretskii
@ 2004-05-26 20:02             ` Stefan Monnier
  2004-05-27  8:10               ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2004-05-26 20:02 UTC (permalink / raw)
  Cc: Benjamin.Riefenstahl, sds, emacs-devel

>> cp1252 could have been implemented as another charset rather than being
>> mapped to a mix of 8859-1 and unicode chars.

> If we did that, it would make the unfortunate situation, whereby there
> are multiple charsets that cover the same characters, even worse.

Isn't it exactly what happened with 8859-15 ?
I never claimed it was desirable.

> Repeat after me: "multiple character sets covering the same characters
> are BAD."

You're just saying that unicode is good.
The only point I originally tried to make is that it's better to avoid
talking about Mule charsets if possible.


        Stefan


* Re: utf-8 cut/paste
  2004-05-26 20:02             ` Stefan Monnier
@ 2004-05-27  8:10               ` Eli Zaretskii
  0 siblings, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2004-05-27  8:10 UTC (permalink / raw)
  Cc: Benjamin.Riefenstahl, sds, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: 26 May 2004 16:02:59 -0400
> 
> > If we did that, it would make the unfortunate situation, whereby there
> > are multiple charsets that cover the same characters, even worse.
> 
> Isn't it exactly what happened with 8859-15 ?
> I never claimed it was desirable.

It was extremely unfortunate, but given that 8859-15 introduced a new
character that everybody and their dog wanted to be able to use, there
was really no other choice, as Unicode support was not ready yet to
become a viable alternative.

With the cpNNN issue, we _could_ afford losing some MS-specific
characters that were not supported by the then existing charsets.
(With the better Unicode support we have now, code-pages fixes even
that problem by mapping cpNNN directly into Unicode.)
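
A sketch in Python of the characters at issue (for illustration only;
it assumes the euro sign is the 8859-15 character meant above, and the
helper `fits` is hypothetical).  It shows one MS-specific cp1252
position and the 8859-15 position that differs from Latin-1:

```python
# Illustrative sketch: where cp1252 and ISO-8859-15 differ from Latin-1.
euro = "\u20ac"  # EURO SIGN

cp1252_80 = bytes([0x80]).decode("cp1252")      # MS extension: euro sign
latin1_80 = bytes([0x80]).decode("latin-1")     # C1 control character U+0080
latin9_a4 = bytes([0xA4]).decode("iso8859_15")  # euro sign in 8859-15
latin1_a4 = bytes([0xA4]).decode("latin-1")     # currency sign U+00A4


def fits(ch: str, encoding: str) -> bool:
    """Return True if `ch` can be encoded in `encoding` (hypothetical helper)."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(cp1252_80 == euro, latin9_a4 == euro)          # True True
print(fits(euro, "latin-1"), fits(euro, "cp1252"))   # False True
```

Mapping such positions to their Unicode characters keeps them usable
without adding another charset that duplicates Latin-1.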

> > Repeat after me: "multiple character sets covering the same characters
> > are BAD."
> 
> You're just saying that unicode is good.

Sure, but when cpNNN was introduced (Emacs 20.2, I think), Unicode
support in Emacs was almost non-existent.  The issue that was before
me was whether to introduce additional charsets covering characters
that already existed, and I felt strongly that we shouldn't do that
without a VERY good reason.


* Re: utf-8 cut/paste
  2004-05-26  4:22               ` Kenichi Handa
@ 2004-05-28 17:45                 ` Sam Steingold
  2004-05-29 10:04                   ` Jason Rumney
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Steingold @ 2004-05-28 17:45 UTC (permalink / raw)


> * Kenichi Handa <unaqn@z17a.bet> [2004-05-26 13:22:38 +0900]:
>
> When I designed Mule (it was before Unicode was accepted as widely as
> now), there was no agreement that a character A1 in a charset S1 and a
> character A2 in a charset S2 are actually the same character.  There
> was no authority that said that code point 0xE3 of koi8 and code
> point 0x96 in alt represent the same character.

Do you need a special authority to say that 42 base 10, XLII / Roman,
and 52 base 8 are the same number?  Same with alt and koi8.

Or maybe this is different in Japan?  (I know nothing about these matters)

At any rate, this is now settled by Unicode.

Thanks to you and everyone else who took the pains to educate me.

PS. please remove me from the CC field.  I _do_ read <emacs-devel>,
    and I do _not_ need e-mail copies.  Thanks.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
To understand recursion, one has to understand recursion first.


* Re: utf-8 cut/paste
  2004-05-28 17:45                 ` Sam Steingold
@ 2004-05-29 10:04                   ` Jason Rumney
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Rumney @ 2004-05-29 10:04 UTC (permalink / raw)

Sam Steingold <sds@gnu.org> writes:

>> * Kenichi Handa <unaqn@z17a.bet> [2004-05-26 13:22:38 +0900]:
>>
>> When I designed Mule (it was before Unicode was accepted as widely as
>> now), there was no agreement that a character A1 in a charset S1 and a
>> character A2 in a charset S2 are actually the same character.  There
>> was no authority that said that code point 0xE3 of koi8 and code
>> point 0x96 in alt represent the same character.
>
> Do you need a special authority to say that 42 base 10, XLII / Roman,
> and 52 base 8 are the same number?  Same with alt and koi8.

I think there is some confusion here. alt (assuming you mean
alternativenyj) and koi8 characters are the same character in Emacs,
in fact they are both internally represented as their respective
Unicode characters in current CVS (in previous versions, they were
mapped to iso-8859-5, but apparently that does not cover the full set
of Cyrillic characters and is seldom used in practice).

It is latin-x characters that are separate.

> PS. please remove me from the CC field.  I _do_ read <emacs-devel>,
>     and I do _not_ need e-mail copies.  Thanks.

If you add the following header to your mail, Gnus and maybe other
mail readers will do this automatically:

Mail-Copies-To: never

