* Encoding of etc/HELLO @ 2018-04-20 13:25 Eli Zaretskii 2018-04-20 15:34 ` Michael Albinus 0 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-04-20 13:25 UTC (permalink / raw) To: Michael Albinus; +Cc: emacs-devel Michael, your recent changes to encode HELLO in UTF-8 are problematic and AFAIU should be reverted, because they lose the CJK charset information. See http://lists.gnu.org/archive/html/emacs-devel/2009-08/msg01409.html http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00429.html http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00475.html and the surrounding discussions for more about that. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 13:25 Encoding of etc/HELLO Eli Zaretskii @ 2018-04-20 15:34 ` Michael Albinus 2018-04-20 16:00 ` Eli Zaretskii 2018-04-20 16:56 ` Paul Eggert 0 siblings, 2 replies; 33+ messages in thread From: Michael Albinus @ 2018-04-20 15:34 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: Hi Eli, > Michael, your recent changes to encode HELLO in UTF-8 are problematic > and AFAIU should be reverted, because they lose the CJK charset > information. See > > http://lists.gnu.org/archive/html/emacs-devel/2009-08/msg01409.html > http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00429.html > http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00475.html > > and the surrounding discussions for more about that. I see. No problem to revert the patch, it isn't important. However, quoting the last reference above --8<---------------cut here---------------start------------->8--- When a file is in some legacy encoding such as iso-2022-7bit, Emacs attached charset properties to proper ranges of text, which works as a hint for selecting a proper font especially for CJK characters. --8<---------------cut here---------------end--------------->8--- I'm wondering why it is possible to attach charset properties for iso-2022-7bit, but not for utf-8. Note, that I don't know too much about this topic. Best regards, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 15:34 ` Michael Albinus @ 2018-04-20 16:00 ` Eli Zaretskii 2018-04-20 16:16 ` Stefan Monnier 2018-04-20 17:39 ` Michael Albinus 2018-04-20 16:56 ` Paul Eggert 1 sibling, 2 replies; 33+ messages in thread From: Eli Zaretskii @ 2018-04-20 16:00 UTC (permalink / raw) To: Michael Albinus; +Cc: emacs-devel > From: Michael Albinus <michael.albinus@gmx.de> > Cc: emacs-devel@gnu.org > Date: Fri, 20 Apr 2018 17:34:45 +0200 > > --8<---------------cut here---------------start------------->8--- > When a file is in some legacy encoding such as iso-2022-7bit, Emacs > attached charset properties to proper ranges of text, which works as a > hint for selecting a proper font especially for CJK characters. > --8<---------------cut here---------------end--------------->8--- > > I'm wondering why it is possible to attach charset properties for > iso-2022-7bit, but not for utf-8. Note, that I don't know too much about > this topic. Because we don't have infrastructure for tagging sub-ranges of Unicode with character sets (and in some sense, that would make little sense, because Unicode is a unifying encoding). ISO-2022 has built-in features to tag portions of text as belonging to some specific charset. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 16:00 ` Eli Zaretskii @ 2018-04-20 16:16 ` Stefan Monnier 2018-04-20 17:22 ` Eli Zaretskii 2018-04-20 17:39 ` Michael Albinus 1 sibling, 1 reply; 33+ messages in thread From: Stefan Monnier @ 2018-04-20 16:16 UTC (permalink / raw) To: emacs-devel > Because we don't have infrastructure for tagging sub-ranges of Unicode > with character sets (and in some sense, that would make little sense, > because Unicode is a unifying encoding). Does Unicode offer a way to do that (i.e. is it a limitation on our support of Unicode, or is it a limitation in the Unicode spec)? Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 16:16 ` Stefan Monnier @ 2018-04-20 17:22 ` Eli Zaretskii 2018-04-20 20:42 ` Stefan Monnier 0 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-04-20 17:22 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Fri, 20 Apr 2018 12:16:05 -0400 > > > Because we don't have infrastructure for tagging sub-ranges of Unicode > > with character sets (and in some sense, that would make little sense, > > because Unicode is a unifying encoding). > > Does Unicode offer a way to do that (i.e. is it a limitation on our > support of Unicode, or is it a limitation in the Unicode spec)? Unicode has language tag characters, but they are deprecated and their use is discouraged. In any case, I don't think Unicode features are relevant here, because we already have char-script-table, which is all you can do with a unified codepoint space. The whole point of ISO-2022 is that the same Unicode codepoints can come from different ISO-2022 charsets, and the ISO-2022 encoding keeps that information in the bytestream. ^ permalink raw reply [flat|nested] 33+ messages in thread
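To make the point about the bytestream concrete, here is a minimal Python sketch. It is not Emacs code: it relies on Python's standard `iso2022_jp` and `iso2022_kr` codecs, and the choice of the character 中 is only an illustration. It suggests how the same Unicode code point yields different ISO-2022 byte sequences depending on the charset it is drawn from, with an escape sequence naming that charset, whereas UTF-8 produces a single byte sequence with no charset marker.

```python
# Illustration only: one code point, U+4E2D, encoded through two
# ISO-2022 variants and through UTF-8.
ch = "\u4e2d"  # 中, present in both JIS X 0208 and KS X 1001

jp = ch.encode("iso2022_jp")  # stream designates a Japanese charset
kr = ch.encode("iso2022_kr")  # stream designates a Korean charset
u8 = ch.encode("utf-8")       # no charset designation at all

ESC = b"\x1b"  # ISO-2022 designations are introduced by the ESC byte

for name, data in (("iso-2022-jp", jp), ("iso-2022-kr", kr), ("utf-8", u8)):
    print(f"{name:12} {data!r}  has-escape={ESC in data}")

# All three byte streams decode to the same code point; only the
# ISO-2022 streams record which national charset it came from.
decoded = {jp.decode("iso2022_jp"), kr.decode("iso2022_kr"), u8.decode("utf-8")}
print(decoded == {ch})
```

Once the text is decoded to Unicode, the distinction between the two ISO-2022 streams is gone unless the decoder records it somewhere else, which is the role the 'charset' text property plays in Emacs.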
* Re: Encoding of etc/HELLO 2018-04-20 17:22 ` Eli Zaretskii @ 2018-04-20 20:42 ` Stefan Monnier 2018-04-20 21:02 ` Clément Pit-Claudel ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Stefan Monnier @ 2018-04-20 20:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > Unicode has language tag characters, but they are deprecated and their > use is discouraged. > > In any case, I don't think Unicode features are relevant here, because > we already have char-script-table, which is all you can do with a > unified codepoint space. Yes, I understand this part of the situation. > The whole point of ISO-2022 is that the same Unicode codepoints can > come from different ISO-2022 charsets, and the ISO-2022 encoding keeps > that information in the bytestream. My question was meant to see if there's a way to encode a similar kind of charset info into the bytestream. From what you say above, there is such a thing but its use is discouraged. Clearly this problem is not specific to Emacs, so what do people do? Hold on to iso-2022 for as long as they can (like we do in Emacs)? Give up on these "details" of rendering for files using a mix of C, J, and K? Rely on higher-level info (XML tags and friends) to carry the charset info? Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
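For reference, the deprecated in-band mechanism mentioned above can be sketched in a few lines of Python. This is only an illustration of how a language tag is spelled (U+E0001 LANGUAGE TAG followed by tag characters that mirror ASCII at an offset of 0xE0000); the helper names `tag_chars` and `with_language_tag` are hypothetical, and nothing here implies that Emacs or any renderer acts on such tags.

```python
import unicodedata

LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG (deprecated by Unicode)


def tag_chars(ascii_text: str) -> str:
    """Map printable ASCII onto the tag-character block (U+E0020..U+E007E)."""
    if not all(0x20 <= ord(c) <= 0x7E for c in ascii_text):
        raise ValueError("only printable ASCII can be spelled with tag characters")
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)


def with_language_tag(lang: str, text: str) -> str:
    """Prefix text with an in-band language tag (hypothetical helper)."""
    return LANGUAGE_TAG + tag_chars(lang) + text


# U+76F4 is a unified ideograph whose preferred shape differs by locale.
tagged = with_language_tag("ja", "\u76f4")
for c in tagged:
    print(f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}")
```

The tag travels inside the plain text, so it survives UTF-8 encoding, but a consumer that does not know about tag characters will simply see extra invisible code points.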
* Re: Encoding of etc/HELLO 2018-04-20 20:42 ` Stefan Monnier @ 2018-04-20 21:02 ` Clément Pit-Claudel 2018-04-20 21:26 ` Paul Eggert 2018-04-21 7:07 ` Eli Zaretskii 2 siblings, 0 replies; 33+ messages in thread From: Clément Pit-Claudel @ 2018-04-20 21:02 UTC (permalink / raw) To: emacs-devel On 2018-04-20 16:42, Stefan Monnier wrote: > Rely on higher-level info (XML tags and friends) to carry the charset info? I think that's what people typically do, yes. The table at https://en.wikipedia.org/wiki/Variant_Chinese_character#Usage_in_computing is a good example of using the lang and xml:lang attributes. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 20:42 ` Stefan Monnier 2018-04-20 21:02 ` Clément Pit-Claudel @ 2018-04-20 21:26 ` Paul Eggert 2018-04-21 7:07 ` Eli Zaretskii 2 siblings, 0 replies; 33+ messages in thread From: Paul Eggert @ 2018-04-20 21:26 UTC (permalink / raw) To: Stefan Monnier, Eli Zaretskii; +Cc: emacs-devel On 04/20/2018 01:42 PM, Stefan Monnier wrote: > Clearly this problem is not specific to Emacs, so what do people do? > Hold on to iso-2022 for as long as they can (like we do in Emacs)? > Give up on these "details" of rendering for files using a mix of C, J, and K? > Rely on higher-level info (XML tags and friends) to carry the charset info? For most uses, people typically just use UTF-8 and give up on the details, which tend to be in areas that many users don't care much about anyway. In practice if (say) a Japanese reader sees a Chinese quotation in a page of Japanese text, there's an excellent chance the reader won't much mind that the Chinese characters are rendered in Japanese-style, as this has long been common practice in Japanese printing anyway. There are of course exceptions where it really matters which font you use, such as the Wikipedia page on Chinese character variants that Clément mentioned. But these are rare, and are typically handled by means other than plain text. It's like the Wikipedia page on kerning, which uses images rather than plain UTF-8 text to illustrate how to kern characters properly. I mildly prefer multilingual text to be rendered in a consistent style for my language, as opposed to having it rendered separately for readers of each of its component languages, as this makes the text a bit easier for me to read (which is the point of text, isn't it?). But this of course is merely a style preference. For what it's worth, the April 2018 w3techs.com numbers say that UTF-8 is used by 91.3% of websites whose character encoding they know, and that this number is steadily growing (it was 88.9% a year ago). 
In contrast, ISO 2022 usage is declining steadily. Of course the web is not the entire universe; still, it's pretty clear which way the world is going. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 20:42 ` Stefan Monnier 2018-04-20 21:02 ` Clément Pit-Claudel 2018-04-20 21:26 ` Paul Eggert @ 2018-04-21 7:07 ` Eli Zaretskii 2018-04-21 14:58 ` Michael Welsh Duggan 2 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-04-21 7:07 UTC (permalink / raw) To: Stefan Monnier, Kenichi Handa; +Cc: emacs-devel > From: Stefan Monnier <monnier@IRO.UMontreal.CA> > Cc: emacs-devel@gnu.org > Date: Fri, 20 Apr 2018 16:42:02 -0400 > > > The whole point of ISO-2022 is that the same Unicode codepoints can > > come from different ISO-2022 charsets, and the ISO-2022 encoding keeps > > that information in the bytestream. > > My question was meant to see if there's a way to encode a similar kind > of charset info into the bytestream. From what you say above, there is > such a thing but its use is discouraged. If you mean a Unicode-compatible bytestream, then yes, that's the feature I know of. But if we want to use it in Emacs, we should modify the UTF-x decoders to put the charset properties on the decoded text, or invent a new property (since charset is currently 'unicode'), and then augment the font selection code to consider that new property. > Clearly this problem is not specific to Emacs, so what do people do? > Hold on to iso-2022 for as long as they can (like we do in Emacs)? > Give up on these "details" of rendering for files using a mix of C, J, and K? > Rely on higher-level info (XML tags and friends) to carry the charset info? I don't know. Several years ago, I think each vendor used a private extension of ISO-2022 to support the emoji, not sure if that is still the case, especially since the number of standardized emoji continues to grow all the time. We could perhaps follow one such extension in our support of ISO-2022. 
Or we could decide that the Han unification has conquered the world, and therefore the CJK charset distinction for font selection is no longer important enough for us, in which case we could recode HELLO in UTF-8. I've added Handa-san to this discussion in the hope that he could comment on what would be the best way forward. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-21 7:07 ` Eli Zaretskii @ 2018-04-21 14:58 ` Michael Welsh Duggan 2018-05-19 15:23 ` Eli Zaretskii 0 siblings, 1 reply; 33+ messages in thread From: Michael Welsh Duggan @ 2018-04-21 14:58 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stefan Monnier <monnier@IRO.UMontreal.CA> >> Cc: emacs-devel@gnu.org >> Date: Fri, 20 Apr 2018 16:42:02 -0400 >> >> > The whole point of ISO-2022 is that the same Unicode codepoints can >> > come from different ISO-2022 charsets, and the ISO-2022 encoding keeps >> > that information in the bytestream. >> >> My question was meant to see if there's a way to encode a similar kind >> of charset info into the bytestream. From what you say above, there is >> such a thing but its use is discouraged. > > If you mean a Unicode-compatible bytestream, then yes, that's the > feature I know of. But if we want to use it in Emacs, we should > modify the UTF-x decoders to put the charset properties on the decoded > text, or invent a new property (since charset is currently 'unicode'), > and then augment the font selection code to consider that new > property. > >> Clearly this problem is not specific to Emacs, so what do people do? >> Hold on to iso-2022 for as long as they can (like we do in Emacs)? >> Give up on these "details" of rendering for files using a mix of C, J, and K? >> Rely on higher-level info (XML tags and friends) to carry the charset info? > > I don't know. Several years ago, I think each vendor used a private > extension of ISO-2022 to support the emoji, not sure if that is still > the case, especially since the number of standardized emoji continues > to grow all the time. We could perhaps follow one such extension in > our support of ISO-2022. 
Or we could decide that the Han unification > has conquered the world, and therefore the CJK charset distinction for > font selection is no longer important enough for us, in which case we > could recode HELLO in UTF-8. I would suppose that the usual way to do this (encode glyph variants in a Unicode-compatible bytestream) would be to use some form of document markup. In Emacs's case, enriched-mode would seem an ideal candidate for this. RFC-1896 specifically supports private extensions for attributes using the "X-" syntax, and enriched.el is small and should be simple to modify for this purpose. -- Michael Welsh Duggan (md5i@md5i.com) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-21 14:58 ` Michael Welsh Duggan @ 2018-05-19 15:23 ` Eli Zaretskii 2018-05-19 17:17 ` Paul Eggert 2018-05-19 17:52 ` Michael Albinus 0 siblings, 2 replies; 33+ messages in thread From: Eli Zaretskii @ 2018-05-19 15:23 UTC (permalink / raw) To: Michael Welsh Duggan, Michael Albinus, Kenichi Handa; +Cc: emacs-devel > From: Michael Welsh Duggan <mwd@md5i.com> > Date: Sat, 21 Apr 2018 10:58:53 -0400 > > I would suppose that the usual way to do this (encode glyph variants in > a Unicode-compatible bytestream) would be to use some form of document > markup. In Emacs's case, enriched-mode would seem an ideal candidate > for this. RFC-1896 specifically supports private extensions for > attributes using the "X-" syntax, and enriched.el is small and should be > simple to modify for this purpose. Thanks, I used this idea to extend Enriched mode with support of 'charset' properties, and then recoded HELLO in UTF-8 and placed it under Enriched mode. Michael (Albinus), your Emoji addition is also in. I hope I identified the addition correctly; if not, feel free to fix, and please make a point of marking the Emoji with the 'unicode' charset, using the new function in facemenu.el. Thanks to Handa-san for his advice regarding this feature. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-05-19 15:23 ` Eli Zaretskii @ 2018-05-19 17:17 ` Paul Eggert 2018-05-19 18:03 ` Eli Zaretskii 2018-05-19 17:52 ` Michael Albinus 1 sibling, 1 reply; 33+ messages in thread From: Paul Eggert @ 2018-05-19 17:17 UTC (permalink / raw) To: Eli Zaretskii, Michael Welsh Duggan, Michael Albinus, Kenichi Handa Cc: emacs-devel Eli Zaretskii wrote: > Thanks, I used this idea to extend Enriched mode with support of > 'charset' properties, and then recoded HELLO in UTF-8 and placed it > under Enriched mode. Thanks for doing all that. In looking at the new etc/HELLO, I see many uses of <x-charset><param> that seem to be unnecessary when Emacs is viewing the file. For example, the first few uses are: <x-charset><param>latin-iso8859-1</param>¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bon</x-charset><x-charset><param>latin-iso8859-3</param>ġu Cze</x-charset><x-charset><param>latin-iso8859-2</param>ść!, Dobrý den, </x-charset> Can't the abovementioned formatting commands be removed without affecting what any Emacs user sees, because the corresponding character sets are not unified in Unicode? Would it be OK to simplify /etc/HELLO to remove unnecessary formatting commands, and to keep only the formatting commands that are plausibly needed in a Unicode text file? And if so, what heuristic should be used to remove the unnecessary formatting commands? I assume that the formatting commands were done automatically, so perhaps I'm talking about potential changes to lisp/textmodes/enriched.el. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-05-19 17:17 ` Paul Eggert @ 2018-05-19 18:03 ` Eli Zaretskii 2018-05-19 18:23 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-05-19 18:03 UTC (permalink / raw) To: Paul Eggert; +Cc: mwd, handa, michael.albinus, emacs-devel > Cc: emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sat, 19 May 2018 10:17:33 -0700 > > In looking at the new etc/HELLO, I see many uses of <x-charset><param> that seem > to be unnecessary when Emacs is viewing the file. For example, the first few > uses are: > > <x-charset><param>latin-iso8859-1</param>¡Hola!, Grüß Gott, Hyvää päivää, Tere > õhtust, Bon</x-charset><x-charset><param>latin-iso8859-3</param>ġu > Cze</x-charset><x-charset><param>latin-iso8859-2</param>ść!, Dobrý > den, </x-charset> Which parts seem unnecessary in this snippet? And why? > Can't the abovementioned formatting commands be removed without affecting what > any Emacs user sees, because the corresponding character sets are not unified in > Unicode? What do you mean by "unified" here? In modern Emacs, we don't need to unify the charsets, because they no longer determine the codepoints. The 'charset' property just tells Emacs to which "culture", so-called, or, if you want, to which language the greeting belongs, and the purpose is only one: selection of an appropriate font to display that greeting. (In the future we might use that for other language-dependent features.) > Would it be OK to simplify /etc/HELLO to remove unnecessary formatting > commands, and to keep only the formatting commands that are plausibly needed in > a Unicode text file? And if so, what heuristic should be used to remove the > unnecessary formatting commands? > > I assume that the formatting commands were done automatically, so perhaps I'm > talking about potential changes to lisp/textmodes/enriched.el. 
Yes, the annotations were produced automatically by enriched.el, but they simply follow what was already there in the original HELLO. You can see that by visiting HELLO on the emacs-26 branch, and then invoking "M-x describe-text-properties" at various places in the file. You will see that the annotations start and end where the 'charset' properties started and ended in the ISO-2022 encoded file. We could, of course, place the 'charset' properties only on the greetings and the language names, leaving the rest of the text without any 'charset' properties. If that's what you mean, then I'm okay with doing that; one could use the new facemenu command I added for that purpose. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-05-19 18:03 ` Eli Zaretskii @ 2018-05-19 18:23 ` Paul Eggert 2018-05-19 18:39 ` Eli Zaretskii 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2018-05-19 18:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: mwd, handa, michael.albinus, emacs-devel Eli Zaretskii wrote: > What do you mean by "unified" here? What I meant was that, as far as I know, in Emacs this font selection currently does not depend on whether the charset is latin-iso8859-1 or latin-iso8859-3, because in UTF-8 text those two charsets are always displayed the same way that text sans charsets is displayed. And given the way the world has moved, it's hard to imagine any future version of Emacs caring whether the charset is latin-iso8859-1 or latin-iso8859-3 in UTF-8 text. charset properties like japanese-jisx0208 do matter for display of course, and so should be kept. > The 'charset' property just tells Emacs to which "culture", so-called, > or, if you want, to which language the greeting belongs RFC 1896 specifies the 'lang' command to specify languages. Shouldn't etc/HELLO do that instead of using 'charset'? That would seem to match the intent of text/enriched better. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-05-19 18:23 ` Paul Eggert @ 2018-05-19 18:39 ` Eli Zaretskii 2018-05-19 19:38 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-05-19 18:39 UTC (permalink / raw) To: Paul Eggert; +Cc: mwd, handa, michael.albinus, emacs-devel > Cc: mwd@md5i.com, michael.albinus@gmx.de, handa@gnu.org, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sat, 19 May 2018 11:23:01 -0700 > > Eli Zaretskii wrote: > > > What do you mean by "unified" here? > > What I meant was that, as far as I know, in Emacs this font selection currently > does not depend on whether the charset is latin-iso8859-1 or latin-iso8859-3, > because in UTF-8 text those two charsets are always displayed the same way that > text sans charsets is displayed. The codepoints are unified, of course, but that's not the whole story as far as font selection goes. See the documentation of set-fontset-font, where it says that you can define a certain font to be used for a specific charset: the charset information comes from the text property. > And given the way the world has moved, it's hard to imagine any > future version of Emacs caring whether the charset is > latin-iso8859-1 or latin-iso8859-3 in UTF-8 text. Emacs doesn't care, but users might. I agree that it is unlikely in European cultures, but it isn't impossible. And what do we lose by leaving the information in the file? > > The 'charset' property just tells Emacs to which "culture", so-called, > > or, if you want, to which language the greeting belongs > > RFC 1896 specifies the 'lang' command to specify languages. Shouldn't etc/HELLO > do that instead of using 'charset'? That would seem to match the intent of > text/enriched better. We need to have the corresponding property in Emacs first, and we need to have infrastructure for letting 'lang' affect what we want it to affect, at least font selection. Only after that we can implement this in enriched.el. 
I stuck with 'charset' because all the necessary infrastructure is already in place. Yes, 'charset' is ISO-2022 legacy, but it doesn't mean it's necessarily useless in modern Emacs. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-05-19 18:39 ` Eli Zaretskii @ 2018-05-19 19:38 ` Paul Eggert 2018-05-19 20:03 ` Eli Zaretskii 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2018-05-19 19:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: mwd, handa, michael.albinus, emacs-devel Eli Zaretskii wrote: > I agree that it is unlikely in > European cultures, but it isn't impossible. And what do we lose by > leaving the information in the file? We lose simplicity and stability, because the etc/HELLO European charset information is often wrong for the purposes of display. For example, currently etc/HELLO has a charset transition in the middle of the Maltese word “Bonġu”. If somebody actually specified a different font for iso-8859-3 because they wanted to display Maltese+Esperanto differently from English etc. (which as you note is unlikely, but suppose someone does it anyway to show off this Emacs feature), then their display would be glitched up in the middle of “Bonġu”. So as things stand this is a lurking bug in etc/HELLO. If we omitted needless charset transitions we wouldn't have to worry about correcting bugs like this one. > We need to have the corresponding property in Emacs first, and we need > to have infrastructure for letting 'lang' affect what we want it to > affect, at least font selection. Only after that we can implement > this in enriched.el. Thanks, I wasn't aware of this issue. I don't see a bug report for it; should I add an enhancement request? In the meantime how about if we mark up etc/HELLO with lang commands instead of x-charset commands, except that we also keep x-charset commands that actually affect Emacs display in common use (e.g., CJK charsets)? When we get lang working, we can then remove the remaining x-charset commands. ^ permalink raw reply [flat|nested] 33+ messages in thread
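The mid-word transition described above can be spotted mechanically. The following Python sketch is not part of Emacs or enriched.el: it uses a simple regular expression that assumes flat, non-nested `<x-charset><param>…</param>…</x-charset>` runs, and the helper names are hypothetical. It is applied to the snippet quoted earlier in the thread, with whitespace normalized, and reports adjacent runs where a word straddles a charset boundary.

```python
import re

# Snippet quoted earlier in the thread from the new etc/HELLO.
SNIPPET = (
    "<x-charset><param>latin-iso8859-1</param>¡Hola!, Grüß Gott, Hyvää päivää, "
    "Tere õhtust, Bon</x-charset>"
    "<x-charset><param>latin-iso8859-3</param>ġu Cze</x-charset>"
    "<x-charset><param>latin-iso8859-2</param>ść!, Dobrý den, </x-charset>"
)

# Assumes flat runs; real text/enriched markup can nest other commands.
RUN = re.compile(r"<x-charset><param>([^<]*)</param>(.*?)</x-charset>", re.S)


def charset_runs(text):
    """Return a list of (charset, text) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in RUN.finditer(text)]


def mid_word_transitions(runs):
    """Return (charset_a, charset_b, word) where a word spans two runs."""
    hits = []
    for (cs_a, a), (cs_b, b) in zip(runs, runs[1:]):
        tail = re.search(r"\w+$", a)  # letters ending the first run
        head = re.match(r"\w+", b)    # letters starting the next run
        if cs_a != cs_b and tail and head:
            hits.append((cs_a, cs_b, tail.group() + head.group()))
    return hits


runs = charset_runs(SNIPPET)
hits = mid_word_transitions(runs)
for cs_a, cs_b, word in hits:
    print(f"{word}: {cs_a} -> {cs_b}")
```

On this snippet the check flags both the Maltese greeting and the Polish one that follows it, which is consistent with the suggestion to place charset properties on complete greetings.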
* Re: Encoding of etc/HELLO 2018-05-19 19:38 ` Paul Eggert @ 2018-05-19 20:03 ` Eli Zaretskii 2018-05-20 8:56 ` Eli Zaretskii 0 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-05-19 20:03 UTC (permalink / raw) To: Paul Eggert; +Cc: mwd, handa, michael.albinus, emacs-devel > Cc: mwd@md5i.com, michael.albinus@gmx.de, handa@gnu.org, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sat, 19 May 2018 12:38:50 -0700 > > For example, currently etc/HELLO has a charset transition in the middle of the > Maltese word “Bonġu”. If somebody actually specified a different font for > iso-8859-3 because they wanted to display Maltese+Esperanto differently from > English etc. (which as you note is unlikely, but suppose someone does it anyway > to show off this Emacs feature), then their display would be glitched up in the > middle of “Bonġu”. So as things stand this is a lurking bug in etc/HELLO. If we > omitted needless charset transitions we wouldn't have to worry about correcting > bugs like this one. That's the bad part of the ISO-2022 legacy: the charset transitions don't always make sense. It should be easy to fix this, though, by placing the charset properties on complete greetings rather than on more or less arbitrary substrings. > > We need to have the corresponding property in Emacs first, and we need > > to have infrastructure for letting 'lang' affect what we want it to > > affect, at least font selection. Only after that we can implement > > this in enriched.el. > > Thanks, I wasn't aware of this issue. I don't see a bug report for it; should I > add an enhancement request? Fine by me, but Emacs lacks good support for language-specific handling of text in general. We had a few discussions in the past about that. > In the meantime how about if we mark up etc/HELLO with lang commands instead of > x-charset commands, except that we also keep x-charset commands that actually > affect Emacs display in common use (e.g., CJK charsets)? 
When we get lang > working, we can then remove the remaining x-charset commands. I'd rather we first fixed the charset properties coverage in the file so that they make more sense. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-05-19 20:03 ` Eli Zaretskii @ 2018-05-20 8:56 ` Eli Zaretskii 0 siblings, 0 replies; 33+ messages in thread From: Eli Zaretskii @ 2018-05-20 8:56 UTC (permalink / raw) To: eggert; +Cc: handa, emacs-devel > Date: Sat, 19 May 2018 23:03:17 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: mwd@md5i.com, handa@gnu.org, michael.albinus@gmx.de, emacs-devel@gnu.org > > > In the meantime how about if we mark up etc/HELLO with lang commands instead of > > x-charset commands, except that we also keep x-charset commands that actually > > affect Emacs display in common use (e.g., CJK charsets)? When we get lang > > working, we can then remove the remaining x-charset commands. > > I'd rather we first fixed the charset properties coverage in the file > so that they make more sense. Now done on the master branch. As for 'lang', it sounds strange to me to use a text property that has absolutely no meaning in the current Emacs. I think we should first develop at least some initial support for this property, and only then start using it in files we distribute. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-05-19 15:23 ` Eli Zaretskii 2018-05-19 17:17 ` Paul Eggert @ 2018-05-19 17:52 ` Michael Albinus 1 sibling, 0 replies; 33+ messages in thread From: Michael Albinus @ 2018-05-19 17:52 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Michael Welsh Duggan, Kenichi Handa, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: Hi Eli, > Michael (Albinus), your Emoji addition is also in. I hope I > identified the addition correctly "WAVING HAND SIGN" is perfect. Best regards, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 16:00 ` Eli Zaretskii 2018-04-20 16:16 ` Stefan Monnier @ 2018-04-20 17:39 ` Michael Albinus 2018-04-21 7:10 ` Eli Zaretskii 1 sibling, 1 reply; 33+ messages in thread From: Michael Albinus @ 2018-04-20 17:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > Because we don't have infrastructure for tagging sub-ranges of Unicode > with character sets (and in some sense, that would make little sense, > because Unicode is a unifying encoding). > > ISO-2022 has built-in features to tag portions of text as belonging to > some specific charset. Thanks for the explanation. As said I have no knowledge about the topic, but I'm still surprised that something like this isn't possible with utf-8. Best regards, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 17:39 ` Michael Albinus @ 2018-04-21 7:10 ` Eli Zaretskii 2018-04-21 14:40 ` Clément Pit-Claudel 2018-04-23 2:53 ` Stefan Monnier 0 siblings, 2 replies; 33+ messages in thread From: Eli Zaretskii @ 2018-04-21 7:10 UTC (permalink / raw) To: Michael Albinus; +Cc: emacs-devel > From: Michael Albinus <michael.albinus@gmx.de> > Cc: emacs-devel@gnu.org > Date: Fri, 20 Apr 2018 19:39:21 +0200 > > I'm still surprised that something like this isn't possible with utf-8. UTF-8 cannot encode language-specific differences of a given character, that is something that is against the basic principle of Unicode: that each character has one and only one encoding. This is one reason why Emacs uses a superset of Unicode in its internal representation, btw. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-21 7:10 ` Eli Zaretskii @ 2018-04-21 14:40 ` Clément Pit-Claudel 2018-04-21 15:43 ` Eli Zaretskii 2018-04-21 15:52 ` Paul Eggert 2018-04-23 2:53 ` Stefan Monnier 1 sibling, 2 replies; 33+ messages in thread From: Clément Pit-Claudel @ 2018-04-21 14:40 UTC (permalink / raw) To: emacs-devel On 2018-04-21 03:10, Eli Zaretskii wrote: > UTF-8 cannot encode language-specific differences of a given > character, that is something that is against the basic principle of > Unicode: that each character has one and only one encoding. Aren't variation selectors used for a similar purpose, though? (Maybe I'm misunderstanding what they are for). ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-21 14:40 ` Clément Pit-Claudel @ 2018-04-21 15:43 ` Eli Zaretskii 2018-04-21 15:52 ` Paul Eggert 1 sibling, 0 replies; 33+ messages in thread From: Eli Zaretskii @ 2018-04-21 15:43 UTC (permalink / raw) To: Clément Pit-Claudel; +Cc: emacs-devel > From: Clément Pit-Claudel <cpitclaudel@gmail.com> > Date: Sat, 21 Apr 2018 10:40:41 -0400 > > On 2018-04-21 03:10, Eli Zaretskii wrote: > > UTF-8 cannot encode language-specific differences of a given > > character, that is something that is against the basic principle of > > Unicode: that each character has one and only one encoding. > > Aren't variation selectors used for a similar purpose, though? I'm not sure, but I don't think so. The variation selectors specify glyphs, not font selection. But I admit I don't know enough about this, so I might be mistaken. (We use variation selectors only in macfont.m.) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-21 14:40 ` Clément Pit-Claudel 2018-04-21 15:43 ` Eli Zaretskii @ 2018-04-21 15:52 ` Paul Eggert 1 sibling, 0 replies; 33+ messages in thread From: Paul Eggert @ 2018-04-21 15:52 UTC (permalink / raw) To: Clément Pit-Claudel, emacs-devel Clément Pit-Claudel wrote: > Aren't variation selectors used for a similar purpose, though? They can be used for that, yes, though it's safe to say this would be bleeding-edge stuff. As I understand it, Adobe and others use them so that one can round-trip from Adobe formats into UTF-8 and back without losing information about ideograph variants. However, in practice variation selectors tend to be proprietary, so they're a bit of a minefield. As far as etc/HELLO goes, a couple of years ago Ken Lunde proposed the PanCJKV ideographic variation database collection for East Asian variation; see: https://github.com/adobe-type-tools/pancjkv-ivd-collection It ran into some roadblocks, though, briefly described here: http://www.unicodeconference.org/presentations/S8T2-Lunde.pdf ^ permalink raw reply [flat|nested] 33+ messages in thread
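To make the variation-selector mechanism concrete, here is a small Python sketch. The base character U+845B and the selector U+E0100 are chosen for illustration only; whether a given variation database registers this exact pair, and whether any font honours it, is outside what the sketch shows.

```python
import unicodedata

base = "\u845b"          # a CJK unified ideograph (illustrative choice)
selector = "\U000e0100"  # first selector of the Variation Selectors Supplement
sequence = base + selector

print(len(sequence))               # the sequence is two code points, not one
print(unicodedata.name(selector))
print(sequence.encode("utf-8"))    # base bytes followed by selector bytes
```

The variant is requested by appending a second code point to the unchanged base character, so the base character itself still has a single encoding.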
* Re: Encoding of etc/HELLO 2018-04-21 7:10 ` Eli Zaretskii 2018-04-21 14:40 ` Clément Pit-Claudel @ 2018-04-23 2:53 ` Stefan Monnier 2018-04-23 15:07 ` Eli Zaretskii 1 sibling, 1 reply; 33+ messages in thread From: Stefan Monnier @ 2018-04-23 2:53 UTC (permalink / raw) To: emacs-devel > UTF-8 cannot encode language-specific differences of a given > character, that is something that is against the basic principle of > Unicode: that each character has one and only one encoding. But along the way they discovered that it's sometimes difficult to decide whether two "things" should be considered one and the same character or not. They ended up with a set of "rules" to make those decisions, but it's not nearly as simple as "each character has one and only one encoding". Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-23 2:53 ` Stefan Monnier @ 2018-04-23 15:07 ` Eli Zaretskii 2018-04-23 15:23 ` Stefan Monnier 0 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-04-23 15:07 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Sun, 22 Apr 2018 22:53:58 -0400 > > > UTF-8 cannot encode language-specific differences of a given > > character, that is something that is against the basic principle of > > Unicode: that each character has one and only one encoding. > > But along the way they discovered that it's sometimes difficult to > decide whether two "things" should be consider as one and the same > character or not. They ended up with a set of "rules" to make those > decisions, but it's not nearly as simple as "each character has one and > only one encoding". Not sure what you allude to here. Are you talking about the variation selectors? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-23 15:07 ` Eli Zaretskii @ 2018-04-23 15:23 ` Stefan Monnier 2018-04-23 16:12 ` Eli Zaretskii 0 siblings, 1 reply; 33+ messages in thread From: Stefan Monnier @ 2018-04-23 15:23 UTC (permalink / raw) To: emacs-devel >> But along the way they discovered that it's sometimes difficult to >> decide whether two "things" should be consider as one and the same >> character or not. They ended up with a set of "rules" to make those >> decisions, but it's not nearly as simple as "each character has one and >> only one encoding". > Not sure what you allude to here. For example the fact that some CJK characters should be displayed differently depending on whether they're part of a C text, or a J text, or a K text, so are they really "one and the same character"? Of course, there are other related choices: which versions of β should be one and the same and which shouldn't (e.g. I currently see in Unicode a greek and a latin version plus some variants of a math version (tho none in "roman" shape))? There are murky areas, with no "one right answer", although Unicode has had to choose somehow, i.e. doing the best it can with a messy situation. Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-23 15:23 ` Stefan Monnier @ 2018-04-23 16:12 ` Eli Zaretskii 0 siblings, 0 replies; 33+ messages in thread From: Eli Zaretskii @ 2018-04-23 16:12 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Mon, 23 Apr 2018 11:23:39 -0400 > > >> But along the way they discovered that it's sometimes difficult to > >> decide whether two "things" should be consider as one and the same > >> character or not. They ended up with a set of "rules" to make those > >> decisions, but it's not nearly as simple as "each character has one and > >> only one encoding". > > Not sure what you allude to here. > > For example the fact that some CJK characters should be displayed > differently depending on whether they're part of a C text, or a J text, > or a K text, so are they really "one and the same character"? This situation existed before Unicode. Unicode tries to overcome it; thus "Han unification". ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 15:34 ` Michael Albinus 2018-04-20 16:00 ` Eli Zaretskii @ 2018-04-20 16:56 ` Paul Eggert 2018-04-20 17:37 ` Michael Albinus 1 sibling, 1 reply; 33+ messages in thread From: Paul Eggert @ 2018-04-20 16:56 UTC (permalink / raw) To: Michael Albinus, Eli Zaretskii; +Cc: emacs-devel On 04/20/2018 08:34 AM, Michael Albinus wrote: > No problem to revert the patch, it isn't important. If you revert it, please also revert commit 0585bd643dae2592214e77998b875347e6e59bab, which I installed before seeing this thread. It's true that this isn't important. Still, I like the "hello" emoji; it's friendly. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 16:56 ` Paul Eggert @ 2018-04-20 17:37 ` Michael Albinus 2018-04-21 20:31 ` Juri Linkov 0 siblings, 1 reply; 33+ messages in thread From: Michael Albinus @ 2018-04-20 17:37 UTC (permalink / raw) To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel Paul Eggert <eggert@cs.ucla.edu> writes: >> No problem to revert the patch, it isn't important. > > If you revert it, please also revert commit > 0585bd643dae2592214e77998b875347e6e59bab, which I installed before > seeing this thread. Done. I've reverted 0585bd643dae2592214e77998b875347e6e59bab and c4cfb5d20487f9912f5896b3f1d291fe7ccc9804. I haven't reverted e2ae724460e6d73d3ddcc6066427471799c4bd57, because Stefan did commit a better patch on top of this. > It's true that this isn't important. Still, I like the the "hello" > emoji; it's friendly. Yes, that was the idea. It's a pity that we cannot add valid utf-8 characters to etc/HELLO, when they are not iso-2022-7bit compatible. Best regards, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Encoding of etc/HELLO 2018-04-20 17:37 ` Michael Albinus @ 2018-04-21 20:31 ` Juri Linkov 2018-04-23 16:25 ` Eli Zaretskii 0 siblings, 1 reply; 33+ messages in thread From: Juri Linkov @ 2018-04-21 20:31 UTC (permalink / raw) To: Michael Albinus; +Cc: Eli Zaretskii, Paul Eggert, emacs-devel >> It's true that this isn't important. Still, I like the the "hello" >> emoji; it's friendly. > > Yes, that was the idea. It's a pity that we cannot add valid utf-8 > characters to etc/HELLO, when they are not iso-2022-7bit compatible. I don't understand why it's impossible to create a charset like the existing mule-unicode-e000-ffff but for character range over U+FFFF to include such characters as U+1F44B. Or is this an inherent limitation of the iso-2022-7bit coding system? ^ permalink raw reply [flat|nested] 33+ messages in thread
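For reference, a short Python sketch of where U+1F44B sits. It uses Python's `iso2022_jp` codec as a stand-in, which is an assumption for illustration; Emacs's iso-2022-7bit supports additional charsets, so the failure below only shows that a standard ISO-2022-JP charset set does not contain the character.

```python
import unicodedata

wave = "\U0001F44B"
utf8 = wave.encode("utf-8")

print(unicodedata.name(wave))
print(ord(wave) > 0xFFFF)  # True: outside the Basic Multilingual Plane
print(utf8)                # four bytes in UTF-8

try:
    wave.encode("iso2022_jp")
    encodable = True
except UnicodeEncodeError:
    # None of the charsets this codec can designate contains the character.
    encodable = False
print(encodable)
```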
* Re: Encoding of etc/HELLO 2018-04-21 20:31 ` Juri Linkov @ 2018-04-23 16:25 ` Eli Zaretskii 2018-04-23 20:05 ` Juri Linkov 0 siblings, 1 reply; 33+ messages in thread From: Eli Zaretskii @ 2018-04-23 16:25 UTC (permalink / raw) To: Juri Linkov, Kenichi Handa; +Cc: eggert, michael.albinus, emacs-devel > From: Juri Linkov <juri@linkov.net> > Cc: Paul Eggert <eggert@cs.ucla.edu>, Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org > Date: Sat, 21 Apr 2018 23:31:22 +0300 > > I don't understand why it's impossible to create a charset like the > existing mule-unicode-e000-ffff but for character range over U+FFFF > to include such characters as U+1F44B. Or is this an inherent limitation > of the iso-2022-7bit coding system? I'm not sure I understand your proposal. Are you suggesting to create a Mule charset covering just the Emoji block? That could be possible (assuming ISO-2022 still has vacant charset slots available, something that I don't think I know how to determine reliably, and assuming we decipher the black art of using define-charset). But is this worth doing just for Emoji? If you mean to add a larger range of characters, then I think a single ISO-2022 compatible charset can support at most 8192 characters, so we will need a lot of charsets to cover codepoints between U+10000 and U+2FFFF, and I'm not sure we have that many vacant slots. Or did you mean to suggest something else? ^ permalink raw reply [flat|nested] 33+ messages in thread
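The "lot of charsets" estimate can be worked out with a few lines of Python. The 8192 figure is taken from the message above as given; the 94x94 capacity of a two-byte ISO-2022 graphic set is added here for comparison and is not a statement about what Emacs would actually use.

```python
import math

first, last = 0x10000, 0x2FFFF
code_points = last - first + 1     # size of the range named in the message

per_charset_thread = 8192          # per-charset figure quoted in the thread
per_charset_94x94 = 94 * 94        # two-byte 94x94 set, for comparison

needed_thread = math.ceil(code_points / per_charset_thread)
needed_94x94 = math.ceil(code_points / per_charset_94x94)

print(code_points)
print(needed_thread)
print(needed_94x94)
```

Either way the range needs on the order of fifteen additional charsets, which is what makes the availability of vacant slots the deciding question.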
* Re: Encoding of etc/HELLO 2018-04-23 16:25 ` Eli Zaretskii @ 2018-04-23 20:05 ` Juri Linkov 0 siblings, 0 replies; 33+ messages in thread From: Juri Linkov @ 2018-04-23 20:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Kenichi Handa, eggert, michael.albinus, emacs-devel >> I don't understand why it's impossible to create a charset like the >> existing mule-unicode-e000-ffff but for character range over U+FFFF >> to include such characters as U+1F44B. Or is this an inherent limitation >> of the iso-2022-7bit coding system? > > I'm not sure I understand your proposal. Are you suggesting to create > a Mule charset covering just the Emoji block? That could be possible > (assuming ISO-2022 still has vacant charset slots available, something > that I don't think I know how to determine reliably, and assuming we > decipher the black art of using define-charset). But is this worth > doing it just for Emoji? > > If you mean to add a larger range of characters, then I think a single > ISO-2022 compatible charset can support at most 8192 character, so we > will need a lot of charsets to cover codepoints between U+10000 and > U+2FFFF, and I'm not sure we have that many vacant slots. > > Or did you mean to suggest something else? This is exactly what I meant. While using ISO-2022 encoding in HELLO to represent Unicode characters is just an inconvenience, the inability to encode all Unicode characters in ISO-2022 is a serious limitation. ^ permalink raw reply [flat|nested] 33+ messages in thread