Encoding of etc/HELLO

unofficial mirror of emacs-devel@gnu.org 

* Encoding of etc/HELLO
@ 2018-04-20 13:25 Eli Zaretskii
  2018-04-20 15:34 ` Michael Albinus
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-20 13:25 UTC (permalink / raw)
  To: Michael Albinus; +Cc: emacs-devel

Michael, your recent changes to encode HELLO in UTF-8 are problematic
and AFAIU should be reverted, because they lose the CJK charset
information.  See

  http://lists.gnu.org/archive/html/emacs-devel/2009-08/msg01409.html
  http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00429.html
  http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00475.html

and the surrounding discussions for more about that.




* Re: Encoding of etc/HELLO
  2018-04-20 13:25 Encoding of etc/HELLO Eli Zaretskii
@ 2018-04-20 15:34 ` Michael Albinus
  2018-04-20 16:00   ` Eli Zaretskii
  2018-04-20 16:56   ` Paul Eggert
  0 siblings, 2 replies; 33+ messages in thread
From: Michael Albinus @ 2018-04-20 15:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

Hi Eli,

> Michael, your recent changes to encode HELLO in UTF-8 are problematic
> and AFAIU should be reverted, because they lose the CJK charset
> information.  See
>
>   http://lists.gnu.org/archive/html/emacs-devel/2009-08/msg01409.html
>   http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00429.html
>   http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00475.html
>
> and the surrounding discussions for more about that.

I see. No problem to revert the patch, it isn't important.

However, quoting the last reference above

--8<---------------cut here---------------start------------->8---
When a file is in some legacy encoding such as iso-2022-7bit, Emacs
attached charset properties to proper ranges of text, which works as a
hint for selecting a proper font especially for CJK characters.
--8<---------------cut here---------------end--------------->8---

I'm wondering why it is possible to attach charset properties for
iso-2022-7bit, but not for utf-8. Note that I don't know much about
this topic.

Best regards, Michael.




* Re: Encoding of etc/HELLO
  2018-04-20 15:34 ` Michael Albinus
@ 2018-04-20 16:00   ` Eli Zaretskii
  2018-04-20 16:16     ` Stefan Monnier
  2018-04-20 17:39     ` Michael Albinus
  2018-04-20 16:56   ` Paul Eggert
  1 sibling, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-20 16:00 UTC (permalink / raw)
  To: Michael Albinus; +Cc: emacs-devel

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: emacs-devel@gnu.org
> Date: Fri, 20 Apr 2018 17:34:45 +0200
> 
> --8<---------------cut here---------------start------------->8---
> When a file is in some legacy encoding such as iso-2022-7bit, Emacs
> attached charset properties to proper ranges of text, which works as a
> hint for selecting a proper font especially for CJK characters.
> --8<---------------cut here---------------end--------------->8---
> 
> I'm wondering why it is possible to attach charset properties for
> iso-2022-7bit, but not for utf-8. Note that I don't know much about
> this topic.

Because we don't have infrastructure for tagging sub-ranges of Unicode
with character sets (and in some sense, that would make little sense,
because Unicode is a unifying encoding).

ISO-2022 has built-in features to tag portions of text as belonging to
some specific charset.
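
As a rough illustration of that tagging, here is a sketch in Python (not
Emacs code). It assumes Python's `iso2022_jp` codec as a convenient stand-in
for Emacs's iso-2022-7bit coding system, which covers many more charsets.
The escape sequences that announce each charset are visible in the encoded
bytes, while the UTF-8 form carries no such markers:

```python
# ESC introduces a charset designation in ISO-2022 encodings.
ESC = b"\x1b"

text = "Hello こんにちは"

iso = text.encode("iso2022_jp")   # stand-in codec, an assumption of this sketch
utf8 = text.encode("utf-8")

# ESC $ B designates JIS X 0208; ESC ( B switches back to ASCII.
print(iso)
print(utf8)

has_jis_designation = ESC + b"$B" in iso
utf8_has_escape = ESC in utf8
print(has_jis_designation, utf8_has_escape)
```

A decoder that sees the designation knows which national charset the
following bytes came from; that is the information Emacs turns into a
charset text property.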




* Re: Encoding of etc/HELLO
  2018-04-20 16:00   ` Eli Zaretskii
@ 2018-04-20 16:16     ` Stefan Monnier
  2018-04-20 17:22       ` Eli Zaretskii
  2018-04-20 17:39     ` Michael Albinus
  1 sibling, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2018-04-20 16:16 UTC (permalink / raw)
  To: emacs-devel

> Because we don't have infrastructure for tagging sub-ranges of Unicode
> with character sets (and in some sense, that would make little sense,
> because Unicode is a unifying encoding).

Does Unicode offer a way to do that (i.e. is it a limitation on our
support of Unicode, or is it a limitation in the Unicode spec)?


        Stefan





* Re: Encoding of etc/HELLO
  2018-04-20 15:34 ` Michael Albinus
  2018-04-20 16:00   ` Eli Zaretskii
@ 2018-04-20 16:56   ` Paul Eggert
  2018-04-20 17:37     ` Michael Albinus
  1 sibling, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2018-04-20 16:56 UTC (permalink / raw)
  To: Michael Albinus, Eli Zaretskii; +Cc: emacs-devel

On 04/20/2018 08:34 AM, Michael Albinus wrote:
> No problem to revert the patch, it isn't important.

If you revert it, please also revert commit 
0585bd643dae2592214e77998b875347e6e59bab, which I installed before 
seeing this thread.

It's true that this isn't important. Still, I like the "hello" 
emoji; it's friendly.


* Re: Encoding of etc/HELLO
  2018-04-20 16:16     ` Stefan Monnier
@ 2018-04-20 17:22       ` Eli Zaretskii
  2018-04-20 20:42         ` Stefan Monnier
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-20 17:22 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Fri, 20 Apr 2018 12:16:05 -0400
> 
> > Because we don't have infrastructure for tagging sub-ranges of Unicode
> > with character sets (and in some sense, that would make little sense,
> > because Unicode is a unifying encoding).
> 
> Does Unicode offer a way to do that (i.e. is it a limitation on our
> support of Unicode, or is it a limitation in the Unicode spec)?

Unicode has language tag characters, but they are deprecated and their
use is discouraged.
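
For reference, a minimal Python sketch of what such a tag looks like at the
code point level. The helper name `language_tag` is hypothetical; the
mechanism (U+E0001 followed by tag characters that mirror ASCII at an offset
of 0xE0000) is the deprecated one mentioned above:

```python
import unicodedata

def language_tag(lang: str) -> str:
    """Build a (deprecated) Unicode language tag for an ASCII language code."""
    # U+E0001 LANGUAGE TAG, then one tag character per ASCII character.
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)

tag = language_tag("ja")
names = [unicodedata.name(c) for c in tag]
print(names)
print(len(tag.encode("utf-8")))  # each tag character takes 4 bytes in UTF-8
```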

In any case, I don't think Unicode features are relevant here, because
we already have char-script-table, which is all you can do with a
unified codepoint space.  The whole point of ISO-2022 is that the same
Unicode codepoints can come from different ISO-2022 charsets, and the
ISO-2022 encoding keeps that information in the bytestream.
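
The loss of information can be sketched in Python. This assumes the EUC
codecs as stand-ins for the national charsets (JIS X 0208, GB 2312,
KS X 1001) that ISO-2022 would designate; the character is just an example:

```python
# One ideograph as encoded by three national charsets.
char = "中"  # U+4E2D
codecs = ("euc_jp", "gb2312", "euc_kr")
encoded = {codec: char.encode(codec) for codec in codecs}

# Decoding each byte sequence yields the same unified code point, so the
# originating charset cannot be recovered from the decoded text alone.
decoded = {codec: data.decode(codec) for codec, data in encoded.items()}

for codec in codecs:
    print(codec, encoded[codec].hex(), f"U+{ord(decoded[codec]):04X}")
```

The byte sequences differ per charset, but all three decode to U+4E2D, which
is why the hint has to travel outside the code point itself.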


* Re: Encoding of etc/HELLO
  2018-04-20 16:56   ` Paul Eggert
@ 2018-04-20 17:37     ` Michael Albinus
  2018-04-21 20:31       ` Juri Linkov
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Albinus @ 2018-04-20 17:37 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

>> No problem to revert the patch, it isn't important.
>
> If you revert it, please also revert commit
> 0585bd643dae2592214e77998b875347e6e59bab, which I installed before
> seeing this thread.

Done. I've reverted 0585bd643dae2592214e77998b875347e6e59bab and
c4cfb5d20487f9912f5896b3f1d291fe7ccc9804. I haven't reverted
e2ae724460e6d73d3ddcc6066427471799c4bd57, because Stefan did commit a
better patch on top of this.

> It's true that this isn't important. Still, I like the "hello"
> emoji; it's friendly.

Yes, that was the idea. It's a pity that we cannot add valid utf-8
characters to etc/HELLO, when they are not iso-2022-7bit compatible.

Best regards, Michael.


* Re: Encoding of etc/HELLO
  2018-04-20 16:00   ` Eli Zaretskii
  2018-04-20 16:16     ` Stefan Monnier
@ 2018-04-20 17:39     ` Michael Albinus
  2018-04-21  7:10       ` Eli Zaretskii
  1 sibling, 1 reply; 33+ messages in thread
From: Michael Albinus @ 2018-04-20 17:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Because we don't have infrastructure for tagging sub-ranges of Unicode
> with character sets (and in some sense, that would make little sense,
> because Unicode is a unifying encoding).
>
> ISO-2022 has built-in features to tag portions of text as belonging to
> some specific charset.

Thanks for the explanation. As I said, I have no knowledge of the topic,
but I'm still surprised that something like this isn't possible with utf-8.

Best regards, Michael.




* Re: Encoding of etc/HELLO
  2018-04-20 17:22       ` Eli Zaretskii
@ 2018-04-20 20:42         ` Stefan Monnier
  2018-04-20 21:02           ` Clément Pit-Claudel
                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Stefan Monnier @ 2018-04-20 20:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> Unicode has language tag characters, but they are deprecated and their
> use is discouraged.
>
> In any case, I don't think Unicode features are relevant here, because
> we already have char-script-table, which is all you can do with a
> unified codepoint space.

Yes, I understand this part of the situation.

> The whole point of ISO-2022 is that the same Unicode codepoints can
> come from different ISO-2022 charsets, and the ISO-2022 encoding keeps
> that information in the bytestream.

My question was meant to see if there's a way to encode a similar kind
of charset info into the bytestream.  From what you say above, there is
such a thing but its use is discouraged.

Clearly this problem is not specific to Emacs, so what do people do?
Hold on to iso-2022 for as long as they can (like we do in Emacs)?
Give up on these "details" of rendering for files using a mix of C, J, and K?
Rely on higher-level info (XML tags and friends) to carry the charset info?

        Stefan


* Re: Encoding of etc/HELLO
  2018-04-20 20:42         ` Stefan Monnier
@ 2018-04-20 21:02           ` Clément Pit-Claudel
  2018-04-20 21:26           ` Paul Eggert
  2018-04-21  7:07           ` Eli Zaretskii
  2 siblings, 0 replies; 33+ messages in thread
From: Clément Pit-Claudel @ 2018-04-20 21:02 UTC (permalink / raw)
  To: emacs-devel

On 2018-04-20 16:42, Stefan Monnier wrote:
> Rely on higher-level info (XML tags and friends) to carry the charset info?

I think that's what people typically do, yes.  The table at https://en.wikipedia.org/wiki/Variant_Chinese_character#Usage_in_computing is a good example of using the lang and xml:lang attributes.




* Re: Encoding of etc/HELLO
  2018-04-20 20:42         ` Stefan Monnier
  2018-04-20 21:02           ` Clément Pit-Claudel
@ 2018-04-20 21:26           ` Paul Eggert
  2018-04-21  7:07           ` Eli Zaretskii
  2 siblings, 0 replies; 33+ messages in thread
From: Paul Eggert @ 2018-04-20 21:26 UTC (permalink / raw)
  To: Stefan Monnier, Eli Zaretskii; +Cc: emacs-devel

On 04/20/2018 01:42 PM, Stefan Monnier wrote:
> Clearly this problem is not specific to Emacs, so what do people do?
> Hold on to iso-2022 for as long as they can (like we do in Emacs)?
> Give up on these "details" of rendering for files using a mix of C, J, and K?
> Rely on higher-level info (XML tags and friends) to carry the charset info?

For most uses, people typically just use UTF-8 and give up on the 
details, which tend to be in areas that many users don't care much about 
anyway. In practice if (say) a Japanese reader sees a Chinese quotation 
in a page of Japanese text, there's an excellent chance the reader won't 
much mind that the Chinese characters are rendered in Japanese-style, as 
this has long been common practice in Japanese printing anyway.

There are of course exceptions where it really matters which font you 
use, such as the Wikipedia page on Chinese character variants that 
Clément mentioned. But these are rare, and are typically handled by 
means other than plain text. It's like the Wikipedia page on kerning, 
which uses images rather than plain UTF-8 text to illustrate how to kern 
characters properly.

I mildly prefer multilingual text to be rendered in a consistent style 
for my language, as opposed to having it rendered separately for readers 
of each of its component languages, as this makes the text a bit easier 
for me to read (which is the point of text, isn't it?). But this of 
course is merely a style preference.

For what it's worth, the April 2018 w3techs.com numbers say that UTF-8 
is used by 91.3% of websites whose character encoding they know, and 
that this number is steadily growing (it was 88.9% a year ago). In 
contrast, ISO 2022 usage is declining steadily. Of course the web is not 
the entire universe; still, it's pretty clear which way the world is going.


* Re: Encoding of etc/HELLO
  2018-04-20 20:42         ` Stefan Monnier
  2018-04-20 21:02           ` Clément Pit-Claudel
  2018-04-20 21:26           ` Paul Eggert
@ 2018-04-21  7:07           ` Eli Zaretskii
  2018-04-21 14:58             ` Michael Welsh Duggan
  2 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-21  7:07 UTC (permalink / raw)
  To: Stefan Monnier, Kenichi Handa; +Cc: emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: emacs-devel@gnu.org
> Date: Fri, 20 Apr 2018 16:42:02 -0400
> 
> > The whole point of ISO-2022 is that the same Unicode codepoints can
> > come from different ISO-2022 charsets, and the ISO-2022 encoding keeps
> > that information in the bytestream.
> 
> My question was meant to see if there's a way to encode a similar kind
> of charset info into the bytestream.  From what you say above, there is
> such a thing but its use is discouraged.

If you mean a Unicode-compatible bytestream, then yes, that's the
feature I know of.  But if we want to use it in Emacs, we should
modify the UTF-x decoders to put the charset properties on the decoded
text, or invent a new property (since charset is currently 'unicode'),
and then augment the font selection code to consider that new
property.

> Clearly this problem is not specific to Emacs, so what do people do?
> Hold on to iso-2022 for as long as they can (like we do in Emacs)?
> Give up on these "details" of rendering for files using a mix of C, J, and K?
> Rely on higher-level info (XML tags and friends) to carry the charset info?

I don't know.  Several years ago, I think each vendor used a private
extension of ISO-2022 to support the emoji, not sure if that is still
the case, especially since the number of standardized emoji continues
to grow all the time.  We could perhaps follow one such extension in
our support of ISO-2022.  Or we could decide that the Han unification
has conquered the world, and therefore the CJK charset distinction for
font selection is no longer important enough for us, in which case we
could recode HELLO in UTF-8.

I've added Handa-san to this discussion in the hope that he could
comment on what would be the best way forward.


* Re: Encoding of etc/HELLO
  2018-04-20 17:39     ` Michael Albinus
@ 2018-04-21  7:10       ` Eli Zaretskii
  2018-04-21 14:40         ` Clément Pit-Claudel
  2018-04-23  2:53         ` Stefan Monnier
  0 siblings, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-21  7:10 UTC (permalink / raw)
  To: Michael Albinus; +Cc: emacs-devel

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: emacs-devel@gnu.org
> Date: Fri, 20 Apr 2018 19:39:21 +0200
> 
> I'm still surprised that something like this isn't possible with utf-8.

UTF-8 cannot encode language-specific differences of a given
character, that is something that is against the basic principle of
Unicode: that each character has one and only one encoding.

This is one reason why Emacs uses a superset of Unicode in its
internal representation, btw.


* Re: Encoding of etc/HELLO
  2018-04-21  7:10       ` Eli Zaretskii
@ 2018-04-21 14:40         ` Clément Pit-Claudel
  2018-04-21 15:43           ` Eli Zaretskii
  2018-04-21 15:52           ` Paul Eggert
  2018-04-23  2:53         ` Stefan Monnier
  1 sibling, 2 replies; 33+ messages in thread
From: Clément Pit-Claudel @ 2018-04-21 14:40 UTC (permalink / raw)
  To: emacs-devel

On 2018-04-21 03:10, Eli Zaretskii wrote:
> UTF-8 cannot encode language-specific differences of a given
> character, that is something that is against the basic principle of
> Unicode: that each character has one and only one encoding.

Aren't variation selectors used for a similar purpose, though? (Maybe I'm misunderstanding what they are for).




* Re: Encoding of etc/HELLO
  2018-04-21  7:07           ` Eli Zaretskii
@ 2018-04-21 14:58             ` Michael Welsh Duggan
  2018-05-19 15:23               ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Welsh Duggan @ 2018-04-21 14:58 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
>> Cc: emacs-devel@gnu.org
>> Date: Fri, 20 Apr 2018 16:42:02 -0400
>> 
>> > The whole point of ISO-2022 is that the same Unicode codepoints can
>> > come from different ISO-2022 charsets, and the ISO-2022 encoding keeps
>> > that information in the bytestream.
>> 
>> My question was meant to see if there's a way to encode a similar kind
>> of charset info into the bytestream.  From what you say above, there is
>> such a thing but its use is discouraged.
>
> If you mean a Unicode-compatible bytestream, then yes, that's the
> feature I know of.  But if we want to use it in Emacs, we should
> modify the UTF-x decoders to put the charset properties on the decoded
> text, or invent a new property (since charset is currently 'unicode'),
> and then augment the font selection code to consider that new
> property.
>
>> Clearly this problem is not specific to Emacs, so what do people do?
>> Hold on to iso-2022 for as long as they can (like we do in Emacs)?
>> Give up on these "details" of rendering for files using a mix of C, J, and K?
>> Rely on higher-level info (XML tags and friends) to carry the charset info?
>
> I don't know.  Several years ago, I think each vendor used a private
> extension of ISO-2022 to support the emoji, not sure if that is still
> the case, especially since the number of standardized emoji continues
> to grow all the time.  We could perhaps follow one such extension in
> our support of ISO-2022.  Or we could decide that the Han unification
> has conquered the world, and therefore the CJK charset distinction for
> font selection is no longer important enough for us, in which case we
> could recode HELLO in UTF-8.

I would suppose that the usual way to do this (encode glyph variants in
a Unicode-compatible bytestream) would be to use some form of document
markup.  In Emacs's case, enriched-mode would seem an ideal candidate
for this.  RFC-1896 specifically supports private extensions for
attributes using the "X-" syntax, and enriched.el is small and should be
simple to modify for this purpose.

-- 
Michael Welsh Duggan
(md5i@md5i.com)




* Re: Encoding of etc/HELLO
  2018-04-21 14:40         ` Clément Pit-Claudel
@ 2018-04-21 15:43           ` Eli Zaretskii
  2018-04-21 15:52           ` Paul Eggert
  1 sibling, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-21 15:43 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Sat, 21 Apr 2018 10:40:41 -0400
> 
> On 2018-04-21 03:10, Eli Zaretskii wrote:
> > UTF-8 cannot encode language-specific differences of a given
> > character, that is something that is against the basic principle of
> > Unicode: that each character has one and only one encoding.
> 
> Aren't variation selectors used for a similar purpose, though?

I'm not sure, but I don't think so.  The variation selectors specify
glyphs, not font selection.  But I admit I don't know enough about
this, so I might be mistaken.

(We use variation selectors only in macfont.m.)
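
To make the mechanism concrete, a small Python sketch of what a variation
sequence is at the code point level. The base character is chosen only as an
example, and the sketch makes no claim about which sequences are registered
or how any font renders them:

```python
import unicodedata

base = "\u845B"          # a CJK ideograph, picked for illustration only
selector = "\U000E0100"  # first selector in the ideographic range

sequence = base + selector

# The base code point is unchanged; the selector is a separate code point
# that simply follows it in the text.
code_points = [f"U+{ord(c):04X}" for c in sequence]
print(code_points)
print(unicodedata.name(selector))
print(len(sequence.encode("utf-8")))
```

So a selector requests a particular glyph for one character; it does not
label a run of text as belonging to a charset or language.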




* Re: Encoding of etc/HELLO
  2018-04-21 14:40         ` Clément Pit-Claudel
  2018-04-21 15:43           ` Eli Zaretskii
@ 2018-04-21 15:52           ` Paul Eggert
  1 sibling, 0 replies; 33+ messages in thread
From: Paul Eggert @ 2018-04-21 15:52 UTC (permalink / raw)
  To: Clément Pit-Claudel, emacs-devel

Clément Pit-Claudel wrote:

> Aren't variation selectors used for a similar purpose, though?
They can be used for that, yes, though it's safe to say this would be 
bleeding-edge stuff. As I understand it, Adobe and others use them so that one 
can round-trip from Adobe formats into UTF-8 and back without losing information 
about ideograph variants. However, in practice variation selectors tend to be 
proprietary, so they're a bit of a minefield.

As far as etc/HELLO goes, a couple of years ago Ken Lunde proposed the PanCJKV 
ideographic variation database collection for East Asian variation; see:

https://github.com/adobe-type-tools/pancjkv-ivd-collection

It ran into some roadblocks, though, briefly described here:

http://www.unicodeconference.org/presentations/S8T2-Lunde.pdf




* Re: Encoding of etc/HELLO
  2018-04-20 17:37     ` Michael Albinus
@ 2018-04-21 20:31       ` Juri Linkov
  2018-04-23 16:25         ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Juri Linkov @ 2018-04-21 20:31 UTC (permalink / raw)
  To: Michael Albinus; +Cc: Eli Zaretskii, Paul Eggert, emacs-devel

>> It's true that this isn't important. Still, I like the "hello"
>> emoji; it's friendly.
>
> Yes, that was the idea. It's a pity that we cannot add valid utf-8
> characters to etc/HELLO, when they are not iso-2022-7bit compatible.

I don't understand why it's impossible to create a charset like the
existing mule-unicode-e000-ffff but for character range over U+FFFF
to include such characters as U+1F44B.  Or is this an inherent limitation
of the iso-2022-7bit coding system?
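
The symptom can be reproduced outside Emacs with a short Python sketch. It
assumes Python's `iso2022_jp` codec as a stand-in; Emacs's iso-2022-7bit
supports more charsets than this codec, so this only illustrates the kind of
failure, not Emacs's exact behavior:

```python
import unicodedata

wave = "\U0001F44B"
print(unicodedata.name(wave))
print(wave.encode("utf-8"))   # four bytes in UTF-8

# None of the charsets this codec can designate contains the character.
try:
    wave.encode("iso2022_jp")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```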




* Re: Encoding of etc/HELLO
  2018-04-21  7:10       ` Eli Zaretskii
  2018-04-21 14:40         ` Clément Pit-Claudel
@ 2018-04-23  2:53         ` Stefan Monnier
  2018-04-23 15:07           ` Eli Zaretskii
  1 sibling, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2018-04-23  2:53 UTC (permalink / raw)
  To: emacs-devel

> UTF-8 cannot encode language-specific differences of a given
> character, that is something that is against the basic principle of
> Unicode: that each character has one and only one encoding.

But along the way they discovered that it's sometimes difficult to
decide whether two "things" should be considered as one and the same
character or not.  They ended up with a set of "rules" to make those
decisions, but it's not nearly as simple as "each character has one and
only one encoding".

        Stefan


* Re: Encoding of etc/HELLO
  2018-04-23  2:53         ` Stefan Monnier
@ 2018-04-23 15:07           ` Eli Zaretskii
  2018-04-23 15:23             ` Stefan Monnier
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-23 15:07 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Sun, 22 Apr 2018 22:53:58 -0400
> 
> > UTF-8 cannot encode language-specific differences of a given
> > character, that is something that is against the basic principle of
> > Unicode: that each character has one and only one encoding.
> 
> But along the way they discovered that it's sometimes difficult to
> decide whether two "things" should be considered as one and the same
> character or not.  They ended up with a set of "rules" to make those
> decisions, but it's not nearly as simple as "each character has one and
> only one encoding".

Not sure what you allude to here.  Are you talking about the variation
selectors?




* Re: Encoding of etc/HELLO
  2018-04-23 15:07           ` Eli Zaretskii
@ 2018-04-23 15:23             ` Stefan Monnier
  2018-04-23 16:12               ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2018-04-23 15:23 UTC (permalink / raw)
  To: emacs-devel

>> But along the way they discovered that it's sometimes difficult to
>> decide whether two "things" should be considered as one and the same
>> character or not.  They ended up with a set of "rules" to make those
>> decisions, but it's not nearly as simple as "each character has one and
>> only one encoding".
> Not sure what you allude to here.

For example the fact that some CJK characters should be displayed
differently depending on whether they're part of a C text, or a J text,
or a K text, so are they really "one and the same character"?

Of course, there are other related choices: which versions of β should
be one and the same and which shouldn't (e.g. I currently see in Unicode
a greek and a latin version plus some variants of a math version (tho
none in "roman" shape))?

There are murky areas, with no "one right answer", although Unicode has
had to choose somehow, i.e. doing the best it can with a messy situation.

        Stefan


* Re: Encoding of etc/HELLO
  2018-04-23 15:23             ` Stefan Monnier
@ 2018-04-23 16:12               ` Eli Zaretskii
  0 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-23 16:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 23 Apr 2018 11:23:39 -0400
> 
> >> But along the way they discovered that it's sometimes difficult to
> >> decide whether two "things" should be considered as one and the same
> >> character or not.  They ended up with a set of "rules" to make those
> >> decisions, but it's not nearly as simple as "each character has one and
> >> only one encoding".
> > Not sure what you allude to here.
> 
> For example the fact that some CJK characters should be displayed
> differently depending on whether they're part of a C text, or a J text,
> or a K text, so are they really "one and the same character"?

This situation existed before Unicode.  Unicode tries to overcome it;
thus "Han unification".




* Re: Encoding of etc/HELLO
  2018-04-21 20:31       ` Juri Linkov
@ 2018-04-23 16:25         ` Eli Zaretskii
  2018-04-23 20:05           ` Juri Linkov
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-04-23 16:25 UTC (permalink / raw)
  To: Juri Linkov, Kenichi Handa; +Cc: eggert, michael.albinus, emacs-devel

> From: Juri Linkov <juri@linkov.net>
> Cc: Paul Eggert <eggert@cs.ucla.edu>,  Eli Zaretskii <eliz@gnu.org>,  emacs-devel@gnu.org
> Date: Sat, 21 Apr 2018 23:31:22 +0300
> 
> I don't understand why it's impossible to create a charset like the
> existing mule-unicode-e000-ffff but for character range over U+FFFF
> to include such characters as U+1F44B.  Or is this an inherent limitation
> of the iso-2022-7bit coding system?

I'm not sure I understand your proposal.  Are you suggesting to create
a Mule charset covering just the Emoji block?  That could be possible
(assuming ISO-2022 still has vacant charset slots available, something
that I don't think I know how to determine reliably, and assuming we
decipher the black art of using define-charset).  But is this worth
doing it just for Emoji?

If you mean to add a larger range of characters, then I think a single
ISO-2022 compatible charset can support at most 8192 characters, so we
will need a lot of charsets to cover codepoints between U+10000 and
U+2FFFF, and I'm not sure we have that many vacant slots.

Or did you mean to suggest something else?
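
The slot arithmetic above can be sketched in Python. The 8192 figure is the
per-charset limit assumed in this message; 94*94 and 96*96 (the sizes of
two-byte 94-sets and 96-sets) are added here only for comparison. The helper
name is hypothetical:

```python
import math

first, last = 0x10000, 0x2FFFF
codepoints = last - first + 1

def charsets_needed(capacity: int) -> int:
    """Number of charsets of the given size needed to cover the range."""
    return math.ceil(codepoints / capacity)

for capacity in (8192, 94 * 94, 96 * 96):
    print(capacity, charsets_needed(capacity))
```

Under any of these capacities, covering the whole range takes on the order
of fifteen additional charsets.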


* Re: Encoding of etc/HELLO
  2018-04-23 16:25         ` Eli Zaretskii
@ 2018-04-23 20:05           ` Juri Linkov
  0 siblings, 0 replies; 33+ messages in thread
From: Juri Linkov @ 2018-04-23 20:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Kenichi Handa, eggert, michael.albinus, emacs-devel

>> I don't understand why it's impossible to create a charset like the
>> existing mule-unicode-e000-ffff but for character range over U+FFFF
>> to include such characters as U+1F44B.  Or is this an inherent limitation
>> of the iso-2022-7bit coding system?
>
> I'm not sure I understand your proposal.  Are you suggesting to create
> a Mule charset covering just the Emoji block?  That could be possible
> (assuming ISO-2022 still has vacant charset slots available, something
> that I don't think I know how to determine reliably, and assuming we
> decipher the black art of using define-charset).  But is this worth
> doing it just for Emoji?
>
> If you mean to add a larger range of characters, then I think a single
> ISO-2022 compatible charset can support at most 8192 characters, so we
> will need a lot of charsets to cover codepoints between U+10000 and
> U+2FFFF, and I'm not sure we have that many vacant slots.
>
> Or did you mean to suggest something else?

This is exactly what I meant.  While using ISO-2022 encoding in HELLO
to represent Unicode characters is just an inconvenience, the inability
to encode all Unicode characters in ISO-2022 is a serious limitation.




* Re: Encoding of etc/HELLO
  2018-04-21 14:58             ` Michael Welsh Duggan
@ 2018-05-19 15:23               ` Eli Zaretskii
  2018-05-19 17:17                 ` Paul Eggert
  2018-05-19 17:52                 ` Michael Albinus
  0 siblings, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2018-05-19 15:23 UTC (permalink / raw)
  To: Michael Welsh Duggan, Michael Albinus, Kenichi Handa; +Cc: emacs-devel

> From: Michael Welsh Duggan <mwd@md5i.com>
> Date: Sat, 21 Apr 2018 10:58:53 -0400
> 
> I would suppose that the usual way to do this (encode glyph variants in
> a Unicode-compatible bytestream) would be to use some form of document
> markup.  In Emacs's case, enriched-mode would seem an ideal candidate
> for this.  RFC-1896 specifically supports private extensions for
> attributes using the "X-" syntax, and enriched.el is small and should be
> simple to modify for this purpose.

Thanks, I used this idea to extend Enriched mode with support of
'charset' properties, and then recoded HELLO in UTF-8 and placed it
under Enriched mode.

Michael (Albinus), your Emoji addition is also in.  I hope I
identified the addition correctly; if not, feel free to fix, and
please make a point of marking the Emoji with the 'unicode' charset,
using the new function in facemenu.el.

Thanks to Handa-san for his advice regarding this feature.


* Re: Encoding of etc/HELLO
  2018-05-19 15:23               ` Eli Zaretskii
@ 2018-05-19 17:17                 ` Paul Eggert
  2018-05-19 18:03                   ` Eli Zaretskii
  2018-05-19 17:52                 ` Michael Albinus
  1 sibling, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2018-05-19 17:17 UTC (permalink / raw)
  To: Eli Zaretskii, Michael Welsh Duggan, Michael Albinus,
	Kenichi Handa
  Cc: emacs-devel

Eli Zaretskii wrote:
> Thanks, I used this idea to extend Enriched mode with support of
> 'charset' properties, and then recoded HELLO in UTF-8 and placed it
> under Enriched mode.

Thanks for doing all that.

In looking at the new etc/HELLO, I see many uses of <x-charset><param> that seem 
to be unnecessary when Emacs is viewing the file. For example, the first few 
uses are:

<x-charset><param>latin-iso8859-1</param>¡Hola!, Grüß Gott, Hyvää päivää, Tere 
õhtust, Bon</x-charset><x-charset><param>latin-iso8859-3</param>ġu
           Cze</x-charset><x-charset><param>latin-iso8859-2</param>ść!, Dobrý 
den, </x-charset>

Can't the abovementioned formatting commands be removed without affecting what 
any Emacs user sees, because the corresponding character sets are not unified in 
Unicode? Would it be OK to simplify etc/HELLO to remove unnecessary formatting 
commands, and to keep only the formatting commands that are plausibly needed in 
a Unicode text file? And if so, what heuristic should be used to remove the 
unnecessary formatting commands?

I assume that the formatting commands were done automatically, so perhaps I'm 
talking about potential changes to lisp/textmodes/enriched.el.
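
For readers unfamiliar with the markup under discussion, here is a minimal
sketch, in Python rather than Emacs Lisp, of how the x-charset annotations
quoted above partition the text into runs.  The function name `charset_runs`
and the regular expression are illustrative assumptions; this is not how
enriched.el decodes the file, and it ignores nested annotations and the `<<`
escape of text/enriched.  The sample is abridged from the snippet above.

```python
import re

# One annotated run: <x-charset><param>NAME</param>TEXT</x-charset>.
# Assumption: runs are not nested, which holds for the quoted snippet.
_RUN = re.compile(
    r"<x-charset><param>([^<]*)</param>(.*?)</x-charset>", re.DOTALL
)


def charset_runs(markup):
    """Return (charset_or_None, text) pairs in document order."""
    runs = []
    pos = 0
    for m in _RUN.finditer(markup):
        if m.start() > pos:
            # Text between annotated runs carries no charset.
            runs.append((None, markup[pos:m.start()]))
        runs.append((m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(markup):
        runs.append((None, markup[pos:]))
    return runs


sample = (
    "<x-charset><param>latin-iso8859-1</param>Tere õhtust, Bon</x-charset>"
    "<x-charset><param>latin-iso8859-3</param>ġu</x-charset>"
)

for charset, text in charset_runs(sample):
    print(charset, repr(text))
```

Each pair corresponds to one stretch of text that would carry a single
'charset' value once the file is decoded.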

* Re: Encoding of etc/HELLO
  2018-05-19 15:23               ` Eli Zaretskii
  2018-05-19 17:17                 ` Paul Eggert
@ 2018-05-19 17:52                 ` Michael Albinus
  1 sibling, 0 replies; 33+ messages in thread
From: Michael Albinus @ 2018-05-19 17:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Michael Welsh Duggan, Kenichi Handa, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

Hi Eli,

> Michael (Albinus), your Emoji addition is also in.  I hope I
> identified the addition correctly

"WAVING HAND SIGN" is perfect.

Best regards, Michael.



* Re: Encoding of etc/HELLO
  2018-05-19 17:17                 ` Paul Eggert
@ 2018-05-19 18:03                   ` Eli Zaretskii
  2018-05-19 18:23                     ` Paul Eggert
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-05-19 18:03 UTC (permalink / raw)
  To: Paul Eggert; +Cc: mwd, handa, michael.albinus, emacs-devel

> Cc: emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 19 May 2018 10:17:33 -0700
> 
> In looking at the new etc/HELLO, I see many uses of <x-charset><param> that seem 
> to be unnecessary when Emacs is viewing the file. For example, the first few 
> uses are:
> 
> <x-charset><param>latin-iso8859-1</param>¡Hola!, Grüß Gott, Hyvää päivää, Tere 
> õhtust, Bon</x-charset><x-charset><param>latin-iso8859-3</param>ġu
>            Cze</x-charset><x-charset><param>latin-iso8859-2</param>ść!, Dobrý 
> den, </x-charset>

Which parts seem unnecessary in this snippet?  And why?

> Can't the abovementioned formatting commands be removed without affecting what 
> any Emacs user sees, because the corresponding character sets are not unified in 
> Unicode?

What do you mean by "unified" here?  In modern Emacs, we don't need to
unify the charsets, because they no longer determine the codepoints.
The 'charset' property just tells Emacs to which "culture", so-called,
or, if you want, to which language the greeting belongs, and the
purpose is only one: selection of an appropriate font to display that
greeting.  (In the future we might use that for other
language-dependent features.)

> Would it be OK to simplify etc/HELLO to remove unnecessary formatting 
> commands, and to keep only the formatting commands that are plausibly needed in 
> a Unicode text file? And if so, what heuristic should be used to remove the 
> unnecessary formatting commands?
> 
> I assume that the formatting commands were done automatically, so perhaps I'm 
> talking about potential changes to lisp/textmodes/enriched.el.

Yes, the annotations were produced automatically by enriched.el, but
they simply follow what was already there in the original HELLO.  You
can see that by visiting HELLO on the emacs-26 branch, and then
invoking "M-x describe-text-properties" at various places in the file.
You will see that the annotations start and end where the 'charset'
properties started and ended in the ISO-2022 encoded file.

We could, of course, place the 'charset' properties only on the
greetings and the language names, leaving the rest of the text without
any 'charset' properties.  If that's what you mean, then I'm okay with
doing that; one could use the new facemenu command I added for that
purpose.

* Re: Encoding of etc/HELLO
  2018-05-19 18:03                   ` Eli Zaretskii
@ 2018-05-19 18:23                     ` Paul Eggert
  2018-05-19 18:39                       ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2018-05-19 18:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: mwd, handa, michael.albinus, emacs-devel

Eli Zaretskii wrote:

> What do you mean by "unified" here?

What I meant was that, as far as I know, in Emacs this font selection currently 
does not depend on whether the charset is latin-iso8859-1 or latin-iso8859-3, 
because in UTF-8 text those two charsets are always displayed the same way that 
text sans charsets is displayed. And given the way the world has moved, it's 
hard to imagine any future version of Emacs caring whether the charset is 
latin-iso8859-1 or latin-iso8859-3 in UTF-8 text.

charset properties like japanese-jisx0208 do matter for display of course, and 
so should be kept.

> The 'charset' property just tells Emacs to which "culture", so-called,
> or, if you want, to which language the greeting belongs

RFC 1896 specifies the 'lang' command to specify languages. Shouldn't etc/HELLO 
do that instead of using 'charset'? That would seem to match the intent of 
text/enriched better.

* Re: Encoding of etc/HELLO
  2018-05-19 18:23                     ` Paul Eggert
@ 2018-05-19 18:39                       ` Eli Zaretskii
  2018-05-19 19:38                         ` Paul Eggert
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-05-19 18:39 UTC (permalink / raw)
  To: Paul Eggert; +Cc: mwd, handa, michael.albinus, emacs-devel

> Cc: mwd@md5i.com, michael.albinus@gmx.de, handa@gnu.org, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 19 May 2018 11:23:01 -0700
> 
> Eli Zaretskii wrote:
> 
> > What do you mean by "unified" here?
> 
> What I meant was that, as far as I know, in Emacs this font selection currently 
> does not depend on whether the charset is latin-iso8859-1 or latin-iso8859-3, 
> because in UTF-8 text those two charsets are always displayed the same way that 
> text sans charsets is displayed.

The codepoints are unified, of course, but that's not the whole story
as far as font selection goes.  See the documentation of
set-fontset-font, where it says that you can define a certain font to
be used for a specific charset: the charset information comes from the
text property.

> And given the way the world has moved, it's hard to imagine any
> future version of Emacs caring whether the charset is
> latin-iso8859-1 or latin-iso8859-3 in UTF-8 text.

Emacs doesn't care, but users might.  I agree that it is unlikely in
European cultures, but it isn't impossible.  And what do we lose by
leaving the information in the file?

> > The 'charset' property just tells Emacs to which "culture", so-called,
> > or, if you want, to which language the greeting belongs
> 
> RFC 1896 specifies the 'lang' command to specify languages. Shouldn't etc/HELLO 
> do that instead of using 'charset'? That would seem to match the intent of 
> text/enriched better.

We need to have the corresponding property in Emacs first, and we need
to have infrastructure for letting 'lang' affect what we want it to
affect, at least font selection.  Only after that can we implement
this in enriched.el.  I stuck with 'charset' because all the necessary
infrastructure is already in place.  Yes, 'charset' is ISO-2022
legacy, but it doesn't mean it's necessarily useless in modern Emacs.

* Re: Encoding of etc/HELLO
  2018-05-19 18:39                       ` Eli Zaretskii
@ 2018-05-19 19:38                         ` Paul Eggert
  2018-05-19 20:03                           ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2018-05-19 19:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: mwd, handa, michael.albinus, emacs-devel

Eli Zaretskii wrote:
> I agree that it is unlikely in
> European cultures, but it isn't impossible.  And what do we lose by
> leaving the information in the file?

We lose simplicity and stability, because the etc/HELLO European charset 
information is often wrong for the purposes of display.

For example, currently etc/HELLO has a charset transition in the middle of the 
Maltese word “Bonġu”. If somebody actually specified a different font for 
iso-8859-3 because they wanted to display Maltese+Esperanto differently from 
English etc. (which as you note is unlikely, but suppose someone does it anyway 
to show off this Emacs feature), then their display would be glitched up in the 
middle of “Bonġu”. So as things stand this is a lurking bug in etc/HELLO. If we 
omitted needless charset transitions we wouldn't have to worry about correcting 
bugs like this one.

> We need to have the corresponding property in Emacs first, and we need
> to have infrastructure for letting 'lang' affect what we want it to
> affect, at least font selection.  Only after that can we implement
> this in enriched.el.

Thanks, I wasn't aware of this issue. I don't see a bug report for it; should I 
add an enhancement request?

In the meantime how about if we mark up etc/HELLO with lang commands instead of 
x-charset commands, except that we also keep x-charset commands that actually 
affect Emacs display in common use (e.g., CJK charsets)? When we get lang 
working, we can then remove the remaining x-charset commands.
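
The mid-word transition described in this message can be checked
mechanically.  The following is a minimal sketch in Python (not Emacs Lisp,
and not part of Emacs) that scans text/enriched markup for adjacent x-charset
runs whose boundary falls between two letters.  The name
`mid_word_transitions` and the regular expression are illustrative
assumptions; nested annotations and the `<<` escape are ignored.  The sample
is abridged from the snippet quoted earlier in the thread.

```python
import re

# One annotated run: <x-charset><param>NAME</param>TEXT</x-charset>.
_RUN = re.compile(
    r"<x-charset><param>([^<]*)</param>(.*?)</x-charset>", re.DOTALL
)


def mid_word_transitions(markup):
    """Return (left_charset, right_charset, word) for every pair of
    directly adjacent runs whose boundary has a letter on both sides."""
    found = []
    prev = None
    for m in _RUN.finditer(markup):
        if prev is not None and prev.end() == m.start():
            left, right = prev.group(2), m.group(2)
            if left and right and left[-1].isalpha() and right[0].isalpha():
                # \w also matches digits and underscore, which is
                # good enough for this sketch.
                head = re.search(r"\w+$", left).group(0)
                tail = re.search(r"^\w+", right).group(0)
                found.append((prev.group(1), m.group(1), head + tail))
        prev = m
    return found


sample = (
    "<x-charset><param>latin-iso8859-1</param>Tere õhtust, Bon</x-charset>"
    "<x-charset><param>latin-iso8859-3</param>ġu\n"
    "           Cze</x-charset>"
    "<x-charset><param>latin-iso8859-2</param>ść!, Dobrý den, </x-charset>"
)

for left_cs, right_cs, word in mid_word_transitions(sample):
    print(f"{word}: {left_cs} -> {right_cs}")
```

On this sample the check reports the split in "Bonġu" and a second one in
"Cześć", which is divided between the latin-iso8859-3 and latin-iso8859-2
runs in the quoted snippet.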

* Re: Encoding of etc/HELLO
  2018-05-19 19:38                         ` Paul Eggert
@ 2018-05-19 20:03                           ` Eli Zaretskii
  2018-05-20  8:56                             ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2018-05-19 20:03 UTC (permalink / raw)
  To: Paul Eggert; +Cc: mwd, handa, michael.albinus, emacs-devel

> Cc: mwd@md5i.com, michael.albinus@gmx.de, handa@gnu.org, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 19 May 2018 12:38:50 -0700
> 
> For example, currently etc/HELLO has a charset transition in the middle of the 
> Maltese word “Bonġu”. If somebody actually specified a different font for 
> iso-8859-3 because they wanted to display Maltese+Esperanto differently from 
> English etc. (which as you note is unlikely, but suppose someone does it anyway 
> to show off this Emacs feature), then their display would be glitched up in the 
> middle of “Bonġu”. So as things stand this is a lurking bug in etc/HELLO. If we 
> omitted needless charset transitions we wouldn't have to worry about correcting 
> bugs like this one.

That's the bad part of the ISO-2022 legacy: the charset transitions
don't always make sense.  It should be easy to fix this, though, by
placing the charset properties on complete greetings rather than on
more or less arbitrary substrings.

> > We need to have the corresponding property in Emacs first, and we need
> > to have infrastructure for letting 'lang' affect what we want it to
> > affect, at least font selection.  Only after that can we implement
> > this in enriched.el.
> 
> Thanks, I wasn't aware of this issue. I don't see a bug report for it; should I 
> add an enhancement request?

Fine by me, but Emacs lacks good support for language-specific
handling of text in general.  We had a few discussions in the past
about that.

> In the meantime how about if we mark up etc/HELLO with lang commands instead of 
> x-charset commands, except that we also keep x-charset commands that actually 
> affect Emacs display in common use (e.g., CJK charsets)? When we get lang 
> working, we can then remove the remaining x-charset commands.

I'd rather we first fixed the charset properties coverage in the file
so that they make more sense.
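
One way to picture "fixing the coverage" is to move each run boundary that
falls inside a word back to the start of that word.  The sketch below is in
Python, not Emacs Lisp, and is purely hypothetical: the function name
`move_split_words`, the rule of assigning a split word to the later run, and
the assumption that the input consists only of directly adjacent x-charset
runs are all illustrative choices, not a description of how etc/HELLO was
actually edited.

```python
import re

# One annotated run: <x-charset><param>NAME</param>TEXT</x-charset>.
_RUN = re.compile(
    r"<x-charset><param>([^<]*)</param>(.*?)</x-charset>", re.DOTALL
)


def move_split_words(markup):
    """Re-emit the runs so that no boundary falls between two letters.

    Assumption: a word split across two runs belongs to the later run.
    Text outside x-charset runs is not handled by this sketch.
    """
    runs = [[m.group(1), m.group(2)] for m in _RUN.finditer(markup)]
    for left, right in zip(runs, runs[1:]):
        if right[1][:1].isalpha():
            # Trailing letters of the left run (possibly none).
            frag = re.search(r"[^\W\d_]*$", left[1]).group(0)
            left[1] = left[1][: len(left[1]) - len(frag)]
            right[1] = frag + right[1]
    return "".join(
        f"<x-charset><param>{name}</param>{text}</x-charset>"
        for name, text in runs
        if text  # drop runs emptied by the move
    )


sample = (
    "<x-charset><param>latin-iso8859-1</param>Tere õhtust, Bon</x-charset>"
    "<x-charset><param>latin-iso8859-3</param>ġu\n"
    "           Cze</x-charset>"
    "<x-charset><param>latin-iso8859-2</param>ść!, Dobrý den, </x-charset>"
)

fixed = move_split_words(sample)
print(fixed)
```

With this rule "Bonġu" ends up entirely in the latin-iso8859-3 run and
"Cześć" entirely in the latin-iso8859-2 run.  A real fix would still need a
human to decide which charset each complete greeting should carry.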



* Re: Encoding of etc/HELLO
  2018-05-19 20:03                           ` Eli Zaretskii
@ 2018-05-20  8:56                             ` Eli Zaretskii
  0 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2018-05-20  8:56 UTC (permalink / raw)
  To: eggert; +Cc: handa, emacs-devel

> Date: Sat, 19 May 2018 23:03:17 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: mwd@md5i.com, handa@gnu.org, michael.albinus@gmx.de, emacs-devel@gnu.org
> 
> > In the meantime how about if we mark up etc/HELLO with lang commands instead of 
> > x-charset commands, except that we also keep x-charset commands that actually 
> > affect Emacs display in common use (e.g., CJK charsets)? When we get lang 
> > working, we can then remove the remaining x-charset commands.
> 
> I'd rather we first fixed the charset properties coverage in the file
> so that they make more sense.

Now done on the master branch.

As for 'lang', it sounds strange to me to use a text property that has
absolutely no meaning in the current Emacs.  I think we should first
develop at least some initial support for this property, and only then
start using it in files we distribute.



end of thread

Thread overview: 33+ messages
2018-04-20 13:25 Encoding of etc/HELLO Eli Zaretskii
2018-04-20 15:34 ` Michael Albinus
2018-04-20 16:00   ` Eli Zaretskii
2018-04-20 16:16     ` Stefan Monnier
2018-04-20 17:22       ` Eli Zaretskii
2018-04-20 20:42         ` Stefan Monnier
2018-04-20 21:02           ` Clément Pit-Claudel
2018-04-20 21:26           ` Paul Eggert
2018-04-21  7:07           ` Eli Zaretskii
2018-04-21 14:58             ` Michael Welsh Duggan
2018-05-19 15:23               ` Eli Zaretskii
2018-05-19 17:17                 ` Paul Eggert
2018-05-19 18:03                   ` Eli Zaretskii
2018-05-19 18:23                     ` Paul Eggert
2018-05-19 18:39                       ` Eli Zaretskii
2018-05-19 19:38                         ` Paul Eggert
2018-05-19 20:03                           ` Eli Zaretskii
2018-05-20  8:56                             ` Eli Zaretskii
2018-05-19 17:52                 ` Michael Albinus
2018-04-20 17:39     ` Michael Albinus
2018-04-21  7:10       ` Eli Zaretskii
2018-04-21 14:40         ` Clément Pit-Claudel
2018-04-21 15:43           ` Eli Zaretskii
2018-04-21 15:52           ` Paul Eggert
2018-04-23  2:53         ` Stefan Monnier
2018-04-23 15:07           ` Eli Zaretskii
2018-04-23 15:23             ` Stefan Monnier
2018-04-23 16:12               ` Eli Zaretskii
2018-04-20 16:56   ` Paul Eggert
2018-04-20 17:37     ` Michael Albinus
2018-04-21 20:31       ` Juri Linkov
2018-04-23 16:25         ` Eli Zaretskii
2018-04-23 20:05           ` Juri Linkov
