bug#33796: 27.0.50; Use utf-8 is all our Elisp files

all messages for Emacs-related lists mirrored at yhetil.org

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
@ 2018-12-18 18:46 Stefan Monnier
  2018-12-18 19:22 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Stefan Monnier @ 2018-12-18 18:46 UTC (permalink / raw)
  To: 33796

[-- Attachment #1: Type: text/plain, Size: 996 bytes --]

Package: Emacs
Version: 27.0.50

Since Emacs-25, UTF-8 is the standard/default encoding for Elisp files.
The attached patch changes the few non-utf-8 Elisp files to use utf-8.

AFAICT, this patch is safe in the sense that the resulting .elc files
are identical (except for titdic-cnv.elc obviously, since I not only
changed the encoding but also the code, but I also checked that the
change of encoding itself does not affect the resulting .elc file).
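
As a rough illustration of that last point (a Python sketch, not part of
the patch; the sample string is made up): re-encoding a file changes the
bytes on disk but not the characters obtained by decoding it, which is
presumably why the compiled output can stay the same.

```python
# Hypothetical sample text; any string representable in both encodings works.
text = "日本語のコメント"

legacy_bytes = text.encode("euc_jp")   # bytes as stored in a legacy-encoded file
utf8_bytes = text.encode("utf-8")      # bytes after conversion to UTF-8

# The on-disk bytes differ...
bytes_differ = legacy_bytes != utf8_bytes
# ...but decoding each file with its own codec yields the same characters.
same_text = legacy_bytes.decode("euc_jp") == utf8_bytes.decode("utf-8")

print(bytes_differ, same_text)
```

This says nothing about characters that only utf-8-emacs can represent;
those have no counterpart in Python's codecs.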

In this patch, I made titdic-cnv.el use utf-8-emacs instead of utf-8
since it includes chars that can't be encoded with utf-8.  I'm not sure
why the same does not apply to the files it generates, but in my tests all
the quail files it generates can use utf-8 (rather than utf-8-emacs)
without affecting the generated .elc files (although the non-utf-8
chars of titdic-cnv.el seem to be inserted into some of the generated files
according to my reading of the code).

Any comments on the patch, or objection to installing it?

        Stefan

[-- Attachment #2: 0001-Convert-remaining-non-utf-8-Elisp-files-to-utf-8.patch --]
[-- Type: application/octet-stream, Size: 56269 bytes --]

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-18 18:46 bug#33796: 27.0.50; Use utf-8 is all our Elisp files Stefan Monnier
@ 2018-12-18 19:22 ` Eli Zaretskii
  2018-12-18 19:46   ` Stefan Monnier
  2018-12-19 17:54 ` Paul Eggert
  2019-01-08  2:20 ` Stefan Monnier
  2 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-18 19:22 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 33796

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Tue, 18 Dec 2018 13:46:45 -0500
> 
> Since Emacs-25, UTF-8 is the standard/default encoding for Elisp files.
> The attached patch changes the few non-utf-8 Elisp files to use utf-8.
> 
> AFAICT, this patch is safe in the sense that the resulting .elc files
> are identical (except for titdic-cnv.elc obviously, since I not only
> changed the encoding but also the code, but I also checked that the
> change of encoding itself does not affect the resulting .elc file).

The .elc files are identical, but visiting the .el files will (or
might) use different fonts, because the charset information is lost.
(You will see that I jumped through some hoops to do something similar
with etc/HELLO.)

So I don't think we should make this change without considering
whether the charset information is as important nowadays as it was
back then.  And I'm not really sure who to ask about this.

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-18 19:22 ` Eli Zaretskii
@ 2018-12-18 19:46   ` Stefan Monnier
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Monnier @ 2018-12-18 19:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 33796

> The .elc files are identical, but visiting the .el files will (or
> might) use different fonts, because the charset information is lost.
> (You will see that I jumped through some hoops to do something similar
> with etc/HELLO.)

That's indeed what I understand of the situation.  But I don't think
it's a good reason to keep supporting non-utf-8 encoding for ever
(many/most programming languages only support a single encoding,
typically ASCII or utf-8 nowadays).
Part of the purpose of this bug-report is to try and come up with a plan ;-)

Hence, there are some questions:
- Do those people who edit those files really care about the difference?
  After all, IIUC utf-8 is becoming standard even in the CJK world so
  maybe the change is not that terrible (or at least, users have gotten
  used to lowering their expectations in this respect).
- If the change is indeed problematic, can we adjust it by using
  a file-global language tag?
- If that's not sufficient, can we use a scheme like that
  of etc/HELLO but to keep the files directly usable as Elisp (so as to
  have our cake and eat it too)?

> So I don't think we should make this change without considering
> whether the charset information is as important nowadays as it was
> back then.

How 'bout installing the titdic-cnv.el part which changes the coding
system used for the generated quail files (being auto-generated, their
rendering as source files shouldn't matter nearly as much since no one
should edit them)?

> And I'm not really sure who to ask about this.

I added Handa in the Cc, since I had forgotten to add him to the
X-Debbugs-Cc.

        Stefan

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-18 18:46 bug#33796: 27.0.50; Use utf-8 is all our Elisp files Stefan Monnier
  2018-12-18 19:22 ` Eli Zaretskii
@ 2018-12-19 17:54 ` Paul Eggert
  2018-12-19 18:11   ` Eli Zaretskii
  2018-12-19 21:16   ` bug#33796: 27.0.50; Use utf-8 is all our Elisp files Stefan Monnier
  2019-01-08  2:20 ` Stefan Monnier
  2 siblings, 2 replies; 36+ messages in thread
From: Paul Eggert @ 2018-12-19 17:54 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 33796

 > I'm not really sure who to ask about this.

You can ask me (:-). Although I can't read east-Asian languages I do 
have significant experience with CJK text as my previous (15-year) job 
was in a company whose customers were almost all CJK and where CJK 
internationalization was essential and where I regularly dealt with 
weird encodings and displays. And this one is an easy call: for 
maintaining these particular files, UTF-8 is an improvement and this 
patch should go in.

To take just one example, titdic-cnv.el: people who are seriously 
maintaining it and who need to read the Chinese text will almost surely 
have their environment set up to display UTF-8 Chinese text well 
already. Furthermore, if you take a look at all the changes made to this 
file in the last decade, here are the statistics:

   edits contributor
      15 Author: Paul Eggert <eggert@cs.ucla.edu>
      10 Author: Glenn Morris <rgm@gnu.org>
       2 Author: Stefan Monnier <monnier@iro.umontreal.ca>
       2 Author: Juanma Barranquero <lekktu@gmail.com>
       1 Author: Phillip Lord <phillip.lord@russet.org.uk>
       1 Author: Kenichi Handa <handa@m17n.org>
       1 Author: Andreas Schwab <schwab@linux-m68k.org>

Only one edit was made by a CJK user, and Handa's edit involved only 
ASCII characters. Switching this file to UTF-8 would not have made any 
of our maintenance any more difficult in the last decade.

Conversely, I commonly use tools like 'git grep' to look for issues in 
the code, and these tools mishandle non-UTF-8 files and I see mojibake 
on my screen because of this. So it will be a significant win for me 
(and I suspect others) when we switch these files to UTF-8.
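
A small Python sketch of the kind of mojibake meant here (the sample
text and the choice of EUC-JP are assumptions for illustration, not
taken from any particular Emacs file): a tool that assumes UTF-8 cannot
decode legacy-encoded bytes cleanly.

```python
text = "日本語"                      # hypothetical file content
euc_bytes = text.encode("euc_jp")    # how a EUC-JP file would store it

# A tool that assumes UTF-8 hits invalid byte sequences; with
# errors="replace" each of them is shown as U+FFFD.
as_seen_by_utf8_tool = euc_bytes.decode("utf-8", errors="replace")

text_is_garbled = as_seen_by_utf8_tool != text
has_replacement_char = "\ufffd" in as_seen_by_utf8_tool
print(text_is_garbled, has_replacement_char)
```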

To try to answer Stefan's questions:

 > - Do those people who edit those files really care about the difference?

No, almost always: see above.

 >   utf-8 is becoming standard even in the CJK world so
 >   maybe the change is not that terrible (or at least, users have gotten
 >   used to lowering their expectations in this respect).

Yes, that’s happened. I looked for recent reports about this, and it 
appears that the controversy is mostly over. For example, 
<https://gihyo.jp/lifestyle/serial/01/ganshiki-soushi/0069> (dated 2015) 
lamented the demise of Japanese Knoppix and said that Plamo Linux had 
problems with EUC-JP and suggested users switch to UTF-8. More recently 
<https://qiita.com/tenforward/items/5e353f290f0b401139cb> (dated this 
year) says that the choice of EUC-JP or UTF-8 is user-specific for Plamo 
Linux, and that applications like Firefox have problems with EUC-JP so 
discretion is advised if you choose EUC-JP. If even hardcore holdouts 
like Plamo are folding....

 > - If the change is indeed problematic, can we adjust it by using
 >   a file-global language tag?

I hope that’s not necessary, but it’d be OK if we have to do it.

 > - If that's not sufficient, can we use a scheme like that
 >   of etc/HELLO but to keep the files directly usable as Elisp (so as to
 >   have our cake and eat it too)?

etc/HELLO is pretty much a disaster for me now, as I can’t use any tool 
other than Emacs to look at it, and even Emacs screws up if I do 
something like 'M-x grep RET hello etc/HELLO RET'. I’d rather not extend 
this disaster to other files.

PS. One minor suggestion for your patch: please also update the list of 
files in admin/notes/unicode to remove mention of the files in question.

PPS. How about also converting etc/tutorials/TUTORIAL.ja, 
lisp/leim/quail/hanja-jis.el, lisp/leim/quail/japanese.el, 
lisp/leim/quail/py-punct.el, and lisp/leim/quail/pypunct-b5.el?

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-19 17:54 ` Paul Eggert
@ 2018-12-19 18:11   ` Eli Zaretskii
  2018-12-19 22:13     ` Paul Eggert
  2018-12-19 21:16   ` bug#33796: 27.0.50; Use utf-8 is all our Elisp files Stefan Monnier
  1 sibling, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-19 18:11 UTC (permalink / raw)
  To: Paul Eggert; +Cc: monnier, 33796

> Cc: 33796@debbugs.gnu.org, Eli Zaretskii <eliz@gnu.org>
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Wed, 19 Dec 2018 09:54:40 -0800
> 
>  > I'm not really sure who to ask about this.
> 
> You can ask me (:-). Although I can't read east-Asian languages I do 
> have significant experience with CJK text as my previous (15-year) job 
> was in a company whose customers were almost all CJK and where CJK 
> internationalization was essential and where I regularly dealt with 
> weird encodings and displays. And this one is an easy call: for 
> maintaining these particular files, UTF-8 is an improvement and this 
> patch should go in.

Thanks.

I could predict your answers in advance.  I need to hear a second
opinion, from someone who does read these languages, because the issue
at hand is how the charset information affects the font(s) selected
for displaying the text, and how important the differences in
those fonts are to CJK users.

> etc/HELLO is pretty much a disaster for me now, as I can’t use any tool 
> other than Emacs to look at it

??? It's a UTF-8 file with markup.  Do you have the same problems with
HTML and XML files?

(I'm not saying that we should use the same technique for Lisp files,
of course.)

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-19 18:11   ` Eli Zaretskii
@ 2018-12-19 22:13     ` Paul Eggert
  2018-12-20 16:06       ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Eggert @ 2018-12-19 22:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, 33796

On 12/19/18 10:11 AM, Eli Zaretskii wrote:
 > I need to hear a second opinion,

That would actually be a third opinion, as Stefan's opinion surely 
counts too and he has good reasons to prefer UTF-8 here. And to some 
extent opinions should be weighted for the kind of maintenance that is 
actually done with these files as opposed to the rare cases where the 
font's style might annoy a language-expert developer if the wrong 
language environment were used.

 >> etc/HELLO is pretty much a disaster for me now, as I can’t use any tool
 >> other than Emacs to look at it
 >
 > ??? It's a UTF-8 file with markup.  Do you have the same problems with
 > HTML and XML files?

No, because when I visit those files I see the same thing in my Emacs 
editing buffer that I see after using common keystrokes like 'C-x v =' 
or standard tools like "git diff", and it's easy to use Emacs to edit 
these files in the usual way without becoming expert in html-mode etc. 
In contrast, with etc/HELLO standard tools and common keystrokes give me 
gibberish, and one must gain expertise in enriched-mode to make 
nontrivial changes.

A primary goal of Emacs is to have source code that the user can change 
easily, and using enriched-text mode in etc/HELLO works against this. It 
might be OK just for that one file (as a demonstration of enriched-text 
mode perhaps) but as things stand we shouldn't let these issues infect 
the rest of the Emacs sources.

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-19 22:13     ` Paul Eggert
@ 2018-12-20 16:06       ` Eli Zaretskii
  2018-12-20 21:49         ` Paul Eggert
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-20 16:06 UTC (permalink / raw)
  To: Paul Eggert, Kenichi Handa; +Cc: monnier, 33796

> Cc: monnier@iro.umontreal.ca, 33796@debbugs.gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Wed, 19 Dec 2018 14:13:59 -0800
> 
> On 12/19/18 10:11 AM, Eli Zaretskii wrote:
>  > I need to hear a second opinion,
> 
> That would actually be a third opinion, as Stefan's opinion surely 
> counts too and he has good reasons to prefer UTF-8 here.

Technically, it's the fourth, because my opinion should also count,
right?

But this is beside the point, because we need the opinion of people
who might be actually affected by the proposed change, and none of us
qualify.  All 3 of us simply don't care, because we don't read these
scripts and don't distinguish the various fonts used to display the
same Unicode codepoints under different cultural conventions.  At some
point in the past that distinction was very important.  If nowadays it
no longer is, then I see no problems making the change.  Otherwise,
the change will lose information important to some of our users.
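
To make the lost information concrete, here is a hedged Python sketch
(the codec names are Python's and merely stand in for the legacy
charsets; the sample character is an arbitrary choice): the same Han
character has a different byte form in each legacy encoding, but all of
them decode to one Unicode code point, so once a file is stored as UTF-8
nothing in the bytes says which charset the text came from.

```python
char = "中"  # U+4E2D, present in Chinese and Japanese legacy charsets

# Python codec names, each standing in for one legacy charset.
legacy_codecs = ["gb2312", "big5", "euc_jp"]

encoded = {name: char.encode(name) for name in legacy_codecs}
decoded = {name: data.decode(name) for name, data in encoded.items()}

# Three different byte sequences on disk...
distinct_byte_forms = len(set(encoded.values()))
# ...but a single code point after decoding.
distinct_code_points = {ord(c) for c in decoded.values()}

print(distinct_byte_forms, [hex(cp) for cp in distinct_code_points])
# prints: 3 ['0x4e2d']
```

Which font is then picked for that code point is a display decision made
elsewhere; the sketch only shows that the byte-level distinction is gone.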

We need someone to advise us what the actual state of affairs is.
I hope Handa-san will (please don't drop him from the CC list).  Or
maybe someone here can propose other experts or even just users with
relevant experience.

> And to some extent opinions should be weighted for the kind of
> maintenance that is actually done with these files as opposed to the
> rare cases where the font's style might annoy a language-expert
> developer if the wrong language environment were used.

This is also beside the point, because we have nothing to weigh this
against for now.  When we do, we will.

>  >> etc/HELLO is pretty much a disaster for me now, as I can’t use any tool
>  >> other than Emacs to look at it
>  >
>  > ??? It's a UTF-8 file with markup.  Do you have the same problems with
>  > HTML and XML files?
> 
> No, because when I visit those files I see the same thing in my Emacs 
> editing buffer that I see after using common keystrokes like 'C-x v =' 
> or standard tools like "git diff", and it's easy to use Emacs to edit 
> these files in the usual way without becoming expert in html-mode etc. 
> In contrast, with etc/HELLO standard tools and common keystrokes give me 
> gibberish, and one must gain expertise in enriched-mode to make 
> nontrivial changes.

This line of reasoning makes little sense to me:

 . Displaying HELLO doesn't show "gibberish", it shows UTF-8 encoded
   text with pure-ASCII markup.  If your terminal can display these
   characters, you should see legible marked-up text, whereas the
   ISO-2022 encoded file of yore would display as illegible escape
   sequences.  But since in your opinion the current situation is a
   "disaster", you seem to be saying that we should go back to
   ISO-2022?
 . By the above reasoning, if Emacs is enhanced to interpret HTML/XML
   and show typefaces instead of markup, you will see that as a
   regression and complain that raw HTML files are "gibberish"?
 . You have find-file-literally to show you HELLO exactly as any
   text-mode tool will see it, if you really need that.
 . No experience in Enriched mode is needed to edit HELLO, you just
   need to apply text properties (via facemenu.el commands or the
   menu-bar's Edit->Text Properties menu).  And these properties are
   optional.

> A primary goal of Emacs is to have source code that the user can change 
> easily, and using enriched-text mode in etc/HELLO works against this. It 
> might be OK just for that one file (as a demonstration of enriched-text 
> mode perhaps) but as things stand we shouldn't let these issues infect 
> the rest of the Emacs sources.

etc/HELLO is not a demonstration of Enriched mode, it is a
demonstration of facilities to edit and display many different scripts
and character sets in the same buffer.  We use Enriched mode there
because we have no other feature which allows us to save 'charset'
text property to a disk file.

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-20 16:06       ` Eli Zaretskii
@ 2018-12-20 21:49         ` Paul Eggert
  2018-12-21  7:29           ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Eggert @ 2018-12-20 21:49 UTC (permalink / raw)
  To: Eli Zaretskii, Kenichi Handa; +Cc: monnier, 33796

On 12/20/18 8:06 AM, Eli Zaretskii wrote:

 > my opinion should also count, right?

Of course, although my impression was that you weren't expressing an 
opinion and were soliciting opinions. If your opinion is that we should 
not make the change, then of course that counts.

 > we need the opinion of people
 > who might be actually affected by the proposed change,

I assume you mean that we need the opinion of people who would be 
affected _negatively_. Stefan and I would actually be affected 
_positively_ by the proposed change, for the reasons we stated.

 > All 3 of us simply don't care,

No, actually I do care. Non-UTF-8 source files are a real annoyance for 
me, on a fairly regular basis. Stefan seems to care too, though I 
suspect he doesn't care as much as I do.

 >  . Displaying HELLO doesn't show "gibberish", it shows UTF-8 encoded
 >    text with pure-ASCII markup.

You're right. My apologies: when I wrote "gibberish" I was looking at 
the output of "git diff emacs-26..master etc/HELLO", which does indeed 
display gibberish but that's not the current encoding's fault.

 > But since in your opinion the current situation is a
 >    "disaster", you seem to be saying that we should go back to ISO-2022?

Not at all, but I do think we should cut down on the unnecessary markup 
in that file. The markup should be used only when it helps. Text like 
"<x-charset><param>mule-unicode-0100-24ff</param> </x-charset>" is not 
helping anybody; the file should just contain " " there. Most of the 
markup in that file is not necessary for proper display, and just gets 
in the way when using tools other than Emacs.

 >  . By the above reasoning, if Emacs is enhanced to interpret HTML/XML
 >    and show typefaces instead of markup, you will see that as a
 >    regression and complain that raw HTML files are "gibberish"?

I hope Emacs doesn't do any such thing by default. I often use Emacs to 
edit .html and .xml files, and if it attempted to render these files by 
default I would be inconvenienced. Presumably there would be an option 
to keep the old behavior, and I'd use that option.

 >  . You have find-file-literally to show you HELLO exactly as any
 >    text-mode tool will see it

No, because find-file-literally shows hard-to-read stuff like this:

</x-charset><x-charset><param>greek-iso8859-7</param>Greek 
(\316\265\316\273\316\273\316\267\316\275\316\271\316\272\316\254) 
\316\223\316\265\316\271\316\254 \317\203\316\261\317\202

which differs from (and is even worse than) what an ordinary tool like 
git or cat shows:

</x-charset><x-charset><param>greek-iso8859-7</param>Greek (ελληνικά)   
Γειά σας

It would be better to remove this particular markup, so that git etc. 
would show this:

Greek (ελληνικά)    Γειά σας

which is what Emacs ordinarily shows.
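
The octal escapes above are simply the UTF-8 bytes of the Greek text
shown one byte at a time. A minimal Python sketch (the helper name
octal_escape is made up for this illustration, and it mimics only the
notation, not how Emacs produces it) reproduces them:

```python
def octal_escape(text: str) -> str:
    """Show each non-ASCII UTF-8 byte as a backslash plus three octal digits."""
    out = []
    for b in text.encode("utf-8"):
        # ASCII bytes are kept as they are; every other byte is escaped.
        out.append(chr(b) if b < 0x80 else "\\%03o" % b)
    return "".join(out)


word = "ελληνικά"
escaped = octal_escape(word)
print(escaped)
# prints: \316\265\316\273\316\273\316\267\316\275\316\271\316\272\316\254
```

Each Greek letter takes two bytes in UTF-8, hence the pairs of escapes.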

 >  . No experience in Enriched mode is needed to edit HELLO, you just
 >    need to apply text properties (via facemenu.el commands or the
 >    menu-bar's Edit->Text Properties menu).  And these properties are
 >    optional.

Let's leave most of them out then, as they're not working well in 
etc/HELLO. I don't use that menu, but I took your hint and just now 
tried it, by selecting the abovementioned word "ελληνικά" and menuing to 
Edit > Text Properties > Describe Properties, but all it said was 'Text 
content at position 1530: There are text properties here: unknown 
("x-charset")'. This missed the point that the word's character set is 
greek-iso8859-7 which is a special hack that hints to Emacs (and nobody 
else, I guess? I couldn't find documentation for this stuff even in the 
Emacs manuals) that the text should be displayed with a Greek font 
instead of the same Greek font that Emacs would be using anyway. And I 
didn't see an easy way to see visually that this (unnecessary) 
<x-charset> hint is misplaced, since it should be placed so that it 
applies only to the Greek text and not to the surrounding English text 
in the same line.

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-20 21:49         ` Paul Eggert
@ 2018-12-21  7:29           ` Eli Zaretskii
  2018-12-21 13:46             ` Stefan Monnier
                               ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-21  7:29 UTC (permalink / raw)
  To: Paul Eggert; +Cc: monnier, 33796

> Cc: monnier@iro.umontreal.ca, 33796@debbugs.gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Thu, 20 Dec 2018 13:49:44 -0800
> 
> On 12/20/18 8:06 AM, Eli Zaretskii wrote:
> 
>  > my opinion should also count, right?
> 
> Of course, although my impression was that you weren't expressing an 
> opinion and were soliciting opinions.

Same as Stefan, actually: he asked whether there were objections.

>  > we need the opinion of people
>  > who might be actually affected by the proposed change,
> 
> I assume you mean that we need the opinion of people who would be 
> affected _negatively_.

Not necessarily.  I would actually like to hear opinions from people
who read CJK scripts who think the distinction no longer matters, not
these days.

>  > All 3 of us simply don't care,
> 
> No, actually I do care. Non-UTF-8 source files are a real annoyance for 
> me

This is a misunderstanding: by "don't care" I meant we don't care
which font is used to display a particular Unicode codepoint in the
Han area.

> I do think we should cut down on the unnecessary markup 
> in that file.

Agreed.

> The markup should be used only when it helps. Text like 
> "<x-charset><param>mule-unicode-0100-24ff</param> </x-charset>" is not 
> helping anybody; the file should just contain " " there.

There are only 2 such occurrences, so this isn't a grave problem.  I
will take a look when I have time.

> Most of the markup in that file is not necessary for proper display,
> and just gets in the way when using tools other than Emacs.

Which markup is not necessary for display, in your opinion?  I'm
surprised to hear that "most of it" is unnecessary, but maybe I'm
missing something.

>  >  . By the above reasoning, if Emacs is enhanced to interpret HTML/XML
>  >    and show typefaces instead of markup, you will see that as a
>  >    regression and complain that raw HTML files are "gibberish"?
> 
> I hope Emacs doesn't do any such thing by default.

Really?  Quite a few Emacs users think that it should, and that the
fact it doesn't is one of the significant deficiencies in Emacs, as
compared to other popular editors.

> </x-charset><x-charset><param>greek-iso8859-7</param>Greek (ελληνικά)   
> Γειά σας
> 
> It would be better to remove this particular markup, so that git etc. 
> would show this:
> 
> Greek (ελληνικά)    Γειά σας
> 
> which is what Emacs ordinarily shows.

That markup is precisely what keeps the charset properties on the
corresponding greetings.  Removing it would be losing information that
HELLO is trying to preserve.

> I don't use that menu, but I took your hint and just now 
> tried it, by selecting the abovementioned word "ελληνικά" and menuing to 
> Edit > Text Properties > Describe Properties, but all it said was 'Text 
> content at position 1530: There are text properties here: unknown 
> ("x-charset")'. This missed the point that the word's character set is 
> greek-iso8859-7

I cannot reproduce this.  That menu item invokes the command
describe-text-properties, which pops up the *Help* buffer, and the
text there says:

  Text content at position 1530:

  There are text properties here:
    charset              greek-iso8859-7

I wonder why you don't see that.  Is it possible that you are looking
at a file/buffer that was modified from its original contents?

> which is a special hack that hints to Emacs (and nobody else, I
> guess? I couldn't find documentation for this stuff even in the 
> Emacs manuals) that the text should be displayed with a Greek font 
> instead of the same Greek font that Emacs would be using anyway.

The charset property allows us to have a fontset that directs Emacs to
use specific fonts for specific character ranges.  See set-fontset-font.
I do agree that these issues are notoriously under-documented.

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-21  7:29           ` Eli Zaretskii
@ 2018-12-21 13:46             ` Stefan Monnier
  2018-12-21 15:54               ` Eli Zaretskii
  2018-12-21 13:55             ` Eli Zaretskii
  2018-12-21 21:07             ` Paul Eggert
  2 siblings, 1 reply; 36+ messages in thread
From: Stefan Monnier @ 2018-12-21 13:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Paul Eggert, 33796

> Not necessarily.  I would actually like to hear opinions from people
> who read CJK scripts who think the distinction no longer matters, not
> these days.

BTW, while looking closer, I'm inclined to think that maybe their
opinion doesn't matter that much: while the general issue of font choice
for CJK text in Elisp files might really affect some users, in the
specific case of the files affected by this patch I believe this likely
isn't the case, because while there are affected *chars*, there is no
affected *text*.  More specifically, AFAICT the affected chars are all
part of the code and they represent themselves rather than being used as
a carrier for a specific meaning in a text (because all this code is
about how to insert specific chars).

[ Snipped the rest about etc/HELLO.  ]

        Stefan "I asked Chong what he thought about it but he said that
                he's not using CJK enough to be a good source of opinion"

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-21 13:46             ` Stefan Monnier
@ 2018-12-21 15:54               ` Eli Zaretskii
  0 siblings, 0 replies; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-21 15:54 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: eggert, 33796

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: Paul Eggert <eggert@cs.ucla.edu>, handa@gnu.org, 33796@debbugs.gnu.org
> Date: Fri, 21 Dec 2018 08:46:11 -0500
> 
> BTW, while looking closer, I'm inclined to think that maybe their
> opinion doesn't matter that much: while the general issue of font choice
> for CJK text in Elisp files might really affect some users, in the
> specific case of the files affected by this patch I believe this likely
> isn't the case, because while there are affected *chars*, there is no
> affected *text*.

Maybe.  But I wouldn't jump to conclusions: it could be that the
aversion is (or was) to how the glyphs look, regardless of whether
they are part of meaningful text.

* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-21  7:29           ` Eli Zaretskii
  2018-12-21 13:46             ` Stefan Monnier
@ 2018-12-21 13:55             ` Eli Zaretskii
  2018-12-21 21:07             ` Paul Eggert
  2 siblings, 0 replies; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-21 13:55 UTC (permalink / raw)
  To: eggert; +Cc: monnier, 33796

> Date: Fri, 21 Dec 2018 09:29:36 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: monnier@iro.umontreal.ca, 33796@debbugs.gnu.org
> 
> > I don't use that menu, but I took your hint and just now 
> > tried it, by selecting the abovementioned word "ελληνικά" and menuing to 
> > Edit > Text Properties > Describe Properties, but all it said was 'Text 
> > content at position 1530: There are text properties here: unknown 
> > ("x-charset")'. This missed the point that the word's character set is 
> > greek-iso8859-7
> 
> I cannot reproduce this.  That menu item invokes the command
> describe-text-properties, which pops up the *Help* buffer, and the
> text there says:
> 
>   Text content at position 1530:
> 
> 
>   There are text properties here:
>     charset              greek-iso8859-7
> 
> I wonder why you don't see that.

I think I know the answer to that: you use Emacs 26 or older to look
at the file.  Only Emacs 27 supports the x-charset property in
Enriched mode.

* Re: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-21  7:29           ` Eli Zaretskii
  2018-12-21 13:46             ` Stefan Monnier
  2018-12-21 13:55             ` Eli Zaretskii
@ 2018-12-21 21:07             ` Paul Eggert
  2018-12-22  1:19               ` Eric Lindblad
  2018-12-22  8:12               ` etc/HELLO markup etc Eli Zaretskii
  2 siblings, 2 replies; 36+ messages in thread
From: Paul Eggert @ 2018-12-21 21:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, monnier, Emacs Development

[removing 33796@debbugs.gnu.org and adding emacs-devel@gnu.org to cc list]

Eli Zaretskii wrote:
> Which markup is not necessary for display, in your opinion?

At most all that's useful is markup that distinguishes Chinese and Japanese 
variants of Han characters; this might also include hanja (Korean) and Chữ Nôm 
(Vietnamese) variants if we ever added such characters to etc/HELLO. Such markup 
might be useful because a significant set of east Asian users dislike Unicode's 
Han unification and prefer specific variants of Han characters. I'm not aware of 
any other set of users who dislike unification in that way.

> That markup is precisely what keeps the charset properties on the
> corresponding greetings.  Removing it would be losing information that
> HELLO is trying to preserve.

Although the etc/HELLO markup might be of interest to those who care about 
annotating languages in the text, it's irrelevant to the ordinary purpose of 
that file, which is to show textual translations of "Hello", as examples, to an 
audience that doesn't know all those languages, but who can easily see the 
language names in the English (or native-language) parts of the text without 
involving any of the markup.

It's a bit like reading a translation of (say) "War and Peace". Most people just 
want to read the translated text. A small fraction might want to know which part 
of the original was written in Russian, which in French, which in English, etc. 
Markup can help that small fraction, but just gets in the way of the primary use.

> Is it possible that you are looking
> at a file/buffer that was modified from its original contents?

No, I was using Emacs 26 by mistake. Sorry about the noise.

It's still not a good user interface, though, as it is difficult to see the 
markup's effect when visiting etc/HELLO in the usual way, and this makes it hard 
to see mistakes in the markup. etc/HELLO is littered with so much useless 
markup, and the effect of markup errors is so subtle, and it's so much of a pain 
to edit the markup in its ordinary form of display, that the file is not a good 
showroom for how to maintain multilingual text. It's not a good sign that there 
seem to be errors in the possibly-useful (i.e., CJ) markup that nobody has 
noticed since the markup was introduced in May, and that I noticed these errors 
now only because I was visiting the file literally.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-21 21:07             ` Paul Eggert
@ 2018-12-22  1:19               ` Eric Lindblad
  2018-12-22  7:56                 ` etc/HELLO markup etc. (Was: 27.0.50; Use utf-8 is all our Elisp files) Eli Zaretskii
  2018-12-22  8:12               ` etc/HELLO markup etc Eli Zaretskii
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Lindblad @ 2018-12-22  1:19 UTC (permalink / raw)
  To: Emacs-devel

[-- Attachment #1: Type: text/html, Size: 450 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc. (Was: 27.0.50; Use utf-8 is all our Elisp files)
  2018-12-22  1:19               ` Eric Lindblad
@ 2018-12-22  7:56                 ` Eli Zaretskii
  0 siblings, 0 replies; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-22  7:56 UTC (permalink / raw)
  To: Eric Lindblad; +Cc: Emacs-devel

> From: "Eric Lindblad" <lindblad@gmx.com>
> Date: Sat, 22 Dec 2018 02:19:47 +0100
> Sensitivity: Normal
> 
> Would there be any sympathy to adding a link to this webpage in the etc/HELLO file?
>  
> See also: UTF-8 SAMPLER
> http://kermitproject.org/utf8.html

Thanks, I looked at that file when I added a few scripts to HELLO.

The goals of that file are different from what we try doing in HELLO.
Our goal is to show the different scripts, not different languages or
fonts.  For that reason, many languages are absent from HELLO if they
use the same scripts which are already present in the file (for other
languages).  IOW, the different languages in HELLO are just the means
to a certain end: we need a language using a script to say "hello" for
that script.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-21 21:07             ` Paul Eggert
  2018-12-22  1:19               ` Eric Lindblad
@ 2018-12-22  8:12               ` Eli Zaretskii
  2018-12-22 19:41                 ` Paul Eggert
                                   ` (3 more replies)
  1 sibling, 4 replies; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-22  8:12 UTC (permalink / raw)
  To: Paul Eggert; +Cc: handa, monnier, Emacs-devel

> Cc: handa@gnu.org, monnier@iro.umontreal.ca,
>  Emacs Development <Emacs-devel@gnu.org>
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Fri, 21 Dec 2018 13:07:09 -0800
> 
> [removing 33796@debbugs.gnu.org and adding emacs-devel@gnu.org to cc list]

I've changed the Subject, as the original one was too similar to the
bug report.

> Eli Zaretskii wrote:
> > Which markup is not necessary for display, in your opinion?
> 
> At most all that's useful is markup that distinguishes Chinese and Japanese 
> variants of Han characters; this might also include hanja (Korean) and Chữ Nôm 
> (Vietnamese) variants if we ever added such characters to etc/HELLO. Such markup 
> might be useful because a significant set of east Asian users dislike Unicode's 
> Han unification and prefer specific variants of Han characters. I'm not aware of 
> any other set of users who dislike unification in that way.

I'm not yet sure this is only about Han unification.  Using charsets
for specifying fonts is a general feature in Emacs, which can be used
to control which fonts are selected independently of what the OS
facilities such as fontconfig do.

I hope Handa-san will be able to comment on this stuff.

If Han unification is the only important user of the charset property,
then yes, we could remove the rest of the charset info from HELLO.
But please realize that the current HELLO just keeps the information
that was there before recoding it in UTF-8, nothing was added.  It is
just kept in a different form, which makes the charset info
human-readable, where previously it was encoded in the ISO 2022
sequences.

> > That markup is precisely what keeps the charset properties on the
> > corresponding greetings.  Removing it would be losing information that
> > HELLO is trying to preserve.
> 
> Although the etc/HELLO markup might be of interest to those who care about 
> annotating languages in the text, it's irrelevant to the ordinary purpose of 
> that file, which is to show textual translations of "Hello"

That's not the original purpose of that file.  The purpose is to show
scripts, not languages, and to show how we display different scripts
in the same buffer.

> It's still not a good user interface, though, as it is difficult to see the 
> markup's effect when visiting etc/HELLO in the usual way

If the usual way is via find-file and its ilk, then you should see the
same results as with "C-h h", so I'm not sure I understand what you
mean here.

> etc/HELLO is littered with so much useless markup

I disagree that it's useless.  Most of it is useful.

> the effect of markup errors is so subtle, and it's so much of a pain
> to edit the markup in its ordinary form of display

If you mean manually editing the markup, then you aren't supposed to
do that.

In what way is most of what you say not applicable to etc/enriched.txt
in general?  If you just dislike what Enriched mode produces on disk,
then let's stop this argument, as you seem to be arguing against files
with markup in general, and that's a non-starter for me.

> the file is not a good showroom for how to maintain multilingual
> text.

What other facilities are you aware of or can suggest for showing
multilingual text with such level of detail and precision?

> It's not a good sign that there seem to be errors in the
> possibly-useful (i.e., CJ) markup that nobody has noticed since the
> markup was introduced in May, and that I noticed these errors now
> only because I was visiting the file literally.

Which errors?  I don't think we discovered any errors.  We may have
discovered some markup on whitespace where we perhaps could do without
it (I'm not yet sure of that), but that's all, and is not necessarily
an error.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-22  8:12               ` etc/HELLO markup etc Eli Zaretskii
@ 2018-12-22 19:41                 ` Paul Eggert
  2018-12-22 20:42                   ` Eli Zaretskii
  2018-12-23  7:47                 ` Yuri Khan
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Paul Eggert @ 2018-12-22 19:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, monnier, Emacs-devel

Eli Zaretskii wrote:

> If Han unification is the only important user of the charset property,
> then yes, we could remove the rest of the charset info from HELLO.

Yes, that's the case.

> the current HELLO just keeps the information
> that was there before recoding it in UTF-8, nothing was added.

Sure, but the non-Han markup is merely a relic of that file's old method of 
encoding, which avoided Unicode and instead used ISO 2022 escape sequences to 
switch among various 8- and 16-bit encodings, as that was the only way to show 
text in (say) Russian under the constraints of the old method. The non-Han 
markup is completely unnecessary now that the file uses UTF-8. (The Han markup 
probably isn't needed either, though I also would like Handa's opinion on that.)
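
To make the contrast concrete, here is a small sketch that uses Python's `iso2022_jp` codec as a stand-in for the ISO 2022 family. That choice is an assumption for illustration only: the old etc/HELLO used Emacs's own ISO 2022 based coding system, which designates many more charsets. The sketch shows the escape sequences that switch charsets inside the byte stream, and that the UTF-8 form needs none.

```python
ESC = b"\x1b"

text = "Hello こんにちは"  # ASCII followed by Japanese

iso2022 = text.encode("iso2022_jp")
utf8 = text.encode("utf-8")

# ISO 2022 switches charsets in-band: ESC $ B designates JIS X 0208,
# and ESC ( B switches back to ASCII.
print(iso2022)
print("escape sequences in ISO 2022 form:", iso2022.count(ESC))
print("escape sequences in UTF-8 form:", utf8.count(ESC))

# Both byte streams decode to the same text.
assert iso2022.decode("iso2022_jp") == utf8.decode("utf-8") == text
```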

>> Although the etc/HELLO markup might be of interest to those who care about
>> annotating languages in the text, it's irrelevant to the ordinary purpose of
>> that file, which is to show textual translations of "Hello"
> 
> That's not the original purpose of that file.  The purpose is to show
> scripts, not languages, and to show how we display different scripts
> in the same buffer.

OK, but either way the non-Han markup is irrelevant to the ordinary purpose of 
the file.

>> It's still not a good user interface, though, as it is difficult to see the
>> markup's effect when visiting etc/HELLO in the usual way
> 
> If the usual way is via find-file and its ilk, then you should see the
> same results as with "C-h h", so I'm not sure I understand what you
> mean here.

I meant that one cannot see the markup's effect when visiting the file with 
either C-h h or find-file in the usual way. It's useless markup.

> In what way is most of what you say not applicable to etc/enriched.txt
> in general?

Other forms of enriched-text markup are typically easily visible. If I visit 
etc/enriched.txt I can easily see which parts are marked white on blue 
background, which parts are marked italic, etc. Invisible enriched-text markup 
is much harder to deal with when editing an enriched-text file.

>> the file is not a good showroom for how to maintain multilingual
>> text.
> 
> What other facilities are you aware of or can suggest for showing
> multilingual text with such level of detail and precision?

In practice the most common and often the best way to deal with the situation is 
to do what the non-markup part of etc/HELLO is already doing: indicate within 
the text itself what language or script is being used, to help the reader who 
may be unacquainted with them, and with enough punctuation within the text so 
that the reader can easily see what's going on. This technique has been used for 
centuries, it's by far the most popular technique in common practice today, and 
it suffices for this particular application (with the possible exception of its 
Chinese and Japanese text).

>> It's not a good sign that there seem to be errors in the
>> possibly-useful (i.e., CJ) markup that nobody has noticed since the
>> markup was introduced in May, and that I noticed these errors now
>> only because I was visiting the file literally.
> 
> Which errors?  I don't think we discovered any errors.

Yes, and that's the point! The approach we're taking is not good for dealing 
with the situation.

One example of such an error is that "日本語" has no charset properties even 
though it's obviously intended to use a Japanese script (since it follows the 
word "Japanese"). I'm sure there are others.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-22 19:41                 ` Paul Eggert
@ 2018-12-22 20:42                   ` Eli Zaretskii
  0 siblings, 0 replies; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-22 20:42 UTC (permalink / raw)
  To: Paul Eggert; +Cc: handa, monnier, Emacs-devel

> Cc: handa@gnu.org, monnier@iro.umontreal.ca, Emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 22 Dec 2018 11:41:05 -0800
> 
> Eli Zaretskii wrote:
> 
> > If Han unification is the only important user of the charset property,
> > then yes, we could remove the rest of the charset info from HELLO.
> 
> Yes, that's the case.

Says you.  The issue at hand is precisely whether that is so, or just
your opinion and tendency.

> the non-Han markup is merely a relic of that file's old method of 
> encoding

It could be both a relic and an important piece of information.

> one cannot see the markup's effect when visiting the file with
> either C-h h or find-file in the usual way.

Of course, one can: via the fonts used to display the various scripts.

> > In what way is most of what you say not applicable to etc/enriched.txt
> > in general?
> 
> Other forms of enriched-text markup are typically easily visible.

Typically, but not exclusively.  There's read-only property, there's
the 'display' property, and to some extent even the "fixed" face.

> > What other facilities are you aware of or can suggest for showing
> > multilingual text with such level of detail and precision?
> 
> In practice the most common and often the best way to deal with the situation is 
> to do what the non-markup part of etc/HELLO is already doing: indicate within 
> the text itself what language or script is being used, to help the reader who 
> may be unacquainted with them, and with enough punctuation within the text so 
> that the reader can easily see what's going on.

That's useless for preserving text properties, so won't fit the bill.

> One example of such an error is that "日本語" has no charset properties even 
> though it's obviously intended to use a Japanese script (since it follows the 
> word "Japanese").

Thanks, I fixed that.

> I'm sure there are others.

Please report them if you find them.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-22  8:12               ` etc/HELLO markup etc Eli Zaretskii
  2018-12-22 19:41                 ` Paul Eggert
@ 2018-12-23  7:47                 ` Yuri Khan
  2018-12-23 15:42                   ` Eli Zaretskii
  2018-12-28  7:10                 ` Eli Zaretskii
  2018-12-29  7:23                 ` handa
  3 siblings, 1 reply; 36+ messages in thread
From: Yuri Khan @ 2018-12-23  7:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, Paul Eggert, Stefan Monnier, Emacs developers

On Sat, Dec 22, 2018 at 3:13 PM Eli Zaretskii <eliz@gnu.org> wrote:

> I'm not yet sure this is only about Han unification.  Using charsets
> for specifying fonts is a general feature in Emacs, which can be used
> to control which fonts are selected independently of what the OS
> facilities such as fontconfig do.

There is at least one more situation where different glyphs
could/should be selected for the same Unicode code points, which
charset markup does not solve.

I’m talking about italic shapes of Cyrillic letters. For some of them,
Russian and Bulgarian use one shape but Serbian and Macedonian use
another shape[1]. There are no examples of Bulgarian, Serbian, or
Macedonian in HELLO, but Russian, Ukrainian and Mongolian examples are
all marked up as “cyrillic-iso8859-5”, which is an encoding that does
not carry language information.

So: charset markup is not the right solution to the problem of
rendering the same Unicode code point with different glyphs.
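
A short sketch of the point about encodings, using Python's `iso8859_5` codec (the sample greetings are illustrative and not taken from HELLO). Russian and Serbian text pass through the same charset, and a shared letter is the same byte and the same code point in both, so nothing in the encoded text says which italic shape is wanted.

```python
russian = "добрый день"  # illustrative Russian greeting
serbian = "добар дан"    # illustrative Serbian greeting

for phrase in (russian, serbian):
    encoded = phrase.encode("iso8859_5")
    # One byte per character, and a lossless round trip for both languages.
    assert len(encoded) == len(phrase)
    assert encoded.decode("iso8859_5") == phrase

# Letters common to both phrases: identical code points, identical bytes.
# б and д are among the letters usually cited as having different
# italic shapes in Serbian and Russian typography.
shared = sorted(ch for ch in set(russian) & set(serbian) if ch.isalpha())
for ch in shared:
    print(ch, f"U+{ord(ch):04X}", ch.encode("iso8859_5").hex())
```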

[1]: https://en.wikipedia.org/wiki/Cyrillic_script#/media/File:Cyrillic_cursive.svg

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-23  7:47                 ` Yuri Khan
@ 2018-12-23 15:42                   ` Eli Zaretskii
  2018-12-23 15:53                     ` Werner LEMBERG
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-23 15:42 UTC (permalink / raw)
  To: Yuri Khan; +Cc: handa, eggert, monnier, Emacs-devel

> From: Yuri Khan <yurivkhan@gmail.com>
> Date: Sun, 23 Dec 2018 14:47:39 +0700
> Cc: Paul Eggert <eggert@cs.ucla.edu>, handa@gnu.org, 
> 	Stefan Monnier <monnier@iro.umontreal.ca>, Emacs developers <Emacs-devel@gnu.org>
> 
> There is at least one more situation where different glyphs
> could/should be selected for the same Unicode code points, which
> charset markup does not solve.
> 
> I’m talking about italic shapes of Cyrillic letters. For some of them,
> Russian and Bulgarian use one shape but Serbian and Macedonian use
> another shape[1]. There are no examples of Bulgarian, Serbian, or
> Macedonian in HELLO, but Russian, Ukrainian and Mongolian examples are
> all marked up as “cyrillic-iso8859-5”, which is an encoding that does
> not carry language information.
> 
> So: charset markup is not the right solution to the problem of
> rendering the same Unicode code point with different glyphs.

You mean, it's not a perfect solution, right?  Because in the "good"
department, it's "good enough" to solve at least part of the problem.
No one says we need to reject a solution because it is only partial.

I would also like to point out that, as far as the 'charset' property
is concerned, HELLO is just an example of what _can_ be done, it
doesn't pretend to show _everything_ that you could do.  E.g., if it's
important to be able to display Ukrainian in a font different from
that used for Russian, we could use the koi8-u charset for the
Ukrainian greeting, and tweak our default fontset to use special fonts
for that.  We could even invent additional charsets (see
define-charset) and then use them for some greetings.  Of course, this
machinery works best when a charset is unequivocally determined by the
prevalent encoding used for text that uses that charset, and that
isn't always the case.  But still, the feature is there, and it can be
extended if needed.

Finally, regarding the special handling of italics in Serbian: is
there _any_ application out there that solves this problem
satisfactorily in multilingual environment?  I'm not sure how you
could go about that, since fonts generally cover scripts, and there's
no special Serbian Cyrillic script, there's just Cyrl to cover them
all.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-23 15:42                   ` Eli Zaretskii
@ 2018-12-23 15:53                     ` Werner LEMBERG
  2018-12-23 16:04                       ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: Werner LEMBERG @ 2018-12-23 15:53 UTC (permalink / raw)
  To: eliz; +Cc: yurivkhan, eggert, Emacs-devel, monnier, handa


>> So: charset markup is not the right solution to the problem of
>> rendering the same Unicode code point with different glyphs.
>
> Finally, regarding the special handling of italics in Serbian: is
> there _any_ application out there that solves this problem
> satisfactorily in multilingual environment?  I'm not sure how you
> could go about that, since fonts generally cover scripts, and
> there's no special Serbian Cyrillic script, there's just Cyrl to
> cover them all.

OpenType fonts provide a language tag (in addition to a script tag) to
handle this.  XeTeX and luatex support language tags – I don't know
whether there is an editor with such a capability.


    Werner

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-23 15:53                     ` Werner LEMBERG
@ 2018-12-23 16:04                       ` Eli Zaretskii
  2018-12-23 21:11                         ` Werner LEMBERG
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-23 16:04 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: yurivkhan, eggert, Emacs-devel, monnier, handa

> Date: Sun, 23 Dec 2018 16:53:14 +0100 (CET)
> Cc: yurivkhan@gmail.com, handa@gnu.org, eggert@cs.ucla.edu,
>  monnier@iro.umontreal.ca, Emacs-devel@gnu.org
> From: Werner LEMBERG <wl@gnu.org>
> 
> > Finally, regarding the special handling of italics in Serbian: is
> > there _any_ application out there that solves this problem
> > satisfactorily in multilingual environment?  I'm not sure how you
> > could go about that, since fonts generally cover scripts, and
> > there's no special Serbian Cyrillic script, there's just Cyrl to
> > cover them all.
> 
> OpenType fonts provide a language tag (in addition to a script tag) to
> handle this.

Yes, but aren't these tags used only to select fonts that have
features required by the language's shaping requirements?  That's what
Emacs does with those.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-23 16:04                       ` Eli Zaretskii
@ 2018-12-23 21:11                         ` Werner LEMBERG
  0 siblings, 0 replies; 36+ messages in thread
From: Werner LEMBERG @ 2018-12-23 21:11 UTC (permalink / raw)
  To: eliz; +Cc: yurivkhan, eggert, Emacs-devel, monnier, handa

>> > Finally, regarding the special handling of italics in Serbian: is
>> > there _any_ application out there that solves this problem
>> > satisfactorily in multilingual environment?  I'm not sure how you
>> > could go about that, since fonts generally cover scripts, and
>> > there's no special Serbian Cyrillic script, there's just Cyrl to
>> > cover them all.
>> 
>> OpenType fonts provide a language tag (in addition to a script tag)
>> to handle this.
> 
> Yes, but aren't these tags used only to select fonts that have
> features required by the language's shaping requirements?  That's
> what Emacs does with those.

Well, I could imagine the following use case: Within Emacs, you
activate a Serbian language environment.  This passes the script tag
`Cyrl' and the language tag `SRB' to the current font (which must be
reloaded).

Within a document, the language tag must be explicitly passed to the
text snippet in question (using some sort of markup or text
properties); while it might be possible to algorithmically deduce a
language tag for longer texts, this certainly doesn't work for just a
few characters.
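
The idea of passing a language tag explicitly with each text snippet can be sketched as a data structure. This is a hypothetical model in Python, not Emacs's text-property API or any shaping library's interface; the `Run` class and the `ot_tag` helper are invented for illustration. The facts it relies on are that OpenType tags are four bytes long, padded with trailing spaces, and that OpenType spells the Cyrillic script tag in lowercase (`cyrl`).

```python
from dataclasses import dataclass


def ot_tag(name: str) -> bytes:
    """Return NAME as a 4-byte, space-padded OpenType tag (hypothetical helper)."""
    if not 1 <= len(name) <= 4:
        raise ValueError("OpenType tags have 1 to 4 characters")
    return name.ljust(4).encode("ascii")


@dataclass
class Run:
    """A snippet of text plus the tags a shaper would need (hypothetical)."""
    text: str
    script: bytes
    language: bytes


# Same script, same code points for the shared letters; only the
# explicit language tag distinguishes the two runs.
runs = [
    Run("добрый день", ot_tag("cyrl"), ot_tag("RUS")),
    Run("добар дан", ot_tag("cyrl"), ot_tag("SRB")),
]
for run in runs:
    print(run.language, run.script, run.text)
```

Keeping the tag on the run, rather than guessing it from the characters, matches the observation that a few characters are too little to deduce a language from.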

    Werner

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-22  8:12               ` etc/HELLO markup etc Eli Zaretskii
  2018-12-22 19:41                 ` Paul Eggert
  2018-12-23  7:47                 ` Yuri Khan
@ 2018-12-28  7:10                 ` Eli Zaretskii
  2018-12-29  7:23                 ` handa
  3 siblings, 0 replies; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-28  7:10 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: eggert, monnier, Emacs-devel

Ping!

Kenichi, could you please comment on this issue?  TIA.




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-22  8:12               ` etc/HELLO markup etc Eli Zaretskii
                                   ` (2 preceding siblings ...)
  2018-12-28  7:10                 ` Eli Zaretskii
@ 2018-12-29  7:23                 ` handa
  2018-12-29  7:37                   ` Eli Zaretskii
  3 siblings, 1 reply; 36+ messages in thread
From: handa @ 2018-12-29  7:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, monnier, Emacs-devel

In article <838t0iasju.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > Eli Zaretskii wrote:
> > > Which markup is not necessary for display, in your opinion?
> > 
> > At most all that's useful is markup that distinguishes Chinese and Japanese 
> > variants of Han characters; this might also include hanja (Korean) and Chữ Nôm 
> > (Vietnamese) variants if we ever added such characters to etc/HELLO. Such markup 
> > might be useful because a significant set of east Asian users dislike Unicode's 
> > Han unification and prefer specific variants of Han characters. I'm not aware of 
> > any other set of users who dislike unification in that way.

> I'm not yet sure this is only about Han unification.  Using charsets
> for specifying fonts is a general feature in Emacs, which can be used
> to control which fonts are selected independently of what the OS
> facilities such as fontconfig do.

> I hope Handa-san will be able to comment on this stuff.

> If Han unification is the only important user of the charset property,
> then yes, we could remove the rest of the charset info from HELLO.

Long ago, the quality of fonts designed for a specific legacy charset
was far better than so-called Unicode fonts even for non-Han characters.
So, the charset information for non-Han charsets did have some meaning.
But, I don't know the current situation.  Perhaps, it is good to remove
them and wait for complaints from users.

---
K. Handa
handa@gnu.org



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-29  7:23                 ` handa
@ 2018-12-29  7:37                   ` Eli Zaretskii
  2019-01-06 12:06                     ` handa
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-29  7:37 UTC (permalink / raw)
  To: handa; +Cc: eggert, monnier, Emacs-devel

> From: handa <handa@gnu.org>
> Cc: eggert@cs.ucla.edu, monnier@iro.umontreal.ca, Emacs-devel@gnu.org
> Date: Sat, 29 Dec 2018 16:23:24 +0900
> 
> > I'm not yet sure this is only about Han unification.  Using charsets
> > for specifying fonts is a general feature in Emacs, which can be used
> > to control which fonts are selected independently of what the OS
> > facilities such as fontconfig do.
> 
> > I hope Handa-san will be able to comment on this stuff.
> 
> > If Han unification is the only important user of the charset property,
> > then yes, we could remove the rest of the charset info from HELLO.
> 
> Long ago, the quality of fonts designed for a specific legacy charset
> was far better than so-called Unicode fonts even for non-Han characters.
> So, the charset information for non-Han charsets did have some meaning.
> But, I don't know the current situation.  Perhaps, it is good to remove
> them and wait for complaints from users.

Thanks.

What about using the charset information in general for font
selection?  Do you think this is a valuable feature, or was it again
designed only due to the issues you mention above with fonts designed
for legacy charsets?



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2018-12-29  7:37                   ` Eli Zaretskii
@ 2019-01-06 12:06                     ` handa
  2019-01-06 15:29                       ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: handa @ 2019-01-06 12:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, monnier, Emacs-devel

In article <83lg486awy.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> What about using the charset information in general for font
> selection?  Do you think this is a valuable feature, or was it again
> designed only due to the issues you mention above with fonts designed
> for legacy charsets?

The latter.  As an OpenType font has shaping rules for script and/or
language, script and language information is more useful than charset.

---
K. Handa
handa@gnu.org



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2019-01-06 12:06                     ` handa
@ 2019-01-06 15:29                       ` Eli Zaretskii
  2019-01-06 17:26                         ` Stefan Monnier
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2019-01-06 15:29 UTC (permalink / raw)
  To: handa; +Cc: eggert, monnier, Emacs-devel

> From: handa <handa@gnu.org>
> Cc: eggert@cs.ucla.edu, monnier@iro.umontreal.ca, Emacs-devel@gnu.org
> Date: Sun, 06 Jan 2019 21:06:22 +0900
> 
> In article <83lg486awy.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > What about using the charset information in general for font
> > selection?  Do you think this is a valuable feature, or was it again
> > designed only due to the issues you mention above with fonts designed
> > for legacy charsets?
> 
> The latter.  As an OpenType font has shaping rules for script and/or
> language, script and language information is more useful than charset.

Thanks.  I guess we can remove most of charset markup from HELLO,
leaving only one or two as an example of the facility.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2019-01-06 15:29                       ` Eli Zaretskii
@ 2019-01-06 17:26                         ` Stefan Monnier
  2019-01-06 17:39                           ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Monnier @ 2019-01-06 17:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, eggert, Emacs-devel

> Thanks.  I guess we can remove most of charset markup from HELLO,
> leaving only one or two as an example of the facility.

And to get back to bug#33796: does that mean I can install a change to
convert those Elisp files to utf-8?


        Stefan



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: etc/HELLO markup etc.
  2019-01-06 17:26                         ` Stefan Monnier
@ 2019-01-06 17:39                           ` Eli Zaretskii
  2019-01-06 18:08                             ` Stefan Monnier
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2019-01-06 17:39 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: handa, eggert, Emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: handa <handa@gnu.org>, eggert@cs.ucla.edu, Emacs-devel@gnu.org
> Date: Sun, 06 Jan 2019 12:26:39 -0500
> 
> > Thanks.  I guess we can remove most of charset markup from HELLO,
> > leaving only one or two as an example of the facility.
> 
> And to get back to bug#33796: does that mean I can install a change to
> convert those Elisp files to utf-8?

Yes, I think so.  Except that I'd prefer not to mix code changes and
encoding changes.  Can you do that in two separate patches?




* Re: etc/HELLO markup etc.
  2019-01-06 17:39                           ` Eli Zaretskii
@ 2019-01-06 18:08                             ` Stefan Monnier
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Monnier @ 2019-01-06 18:08 UTC (permalink / raw)
  To: emacs-devel

>> > Thanks.  I guess we can remove most of charset markup from HELLO,
>> > leaving only one or two as an example of the facility.
>> 
>> And to get back to bug#33796: does that mean I can install a change to
>> convert those Elisp files to utf-8?
>
> Yes, I think so.  Except that I'd prefer not to mix code changes and
> encoding changes.  Can you do that in two separate patches?

Yes, of course,


        Stefan





* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-19 17:54 ` Paul Eggert
  2018-12-19 18:11   ` Eli Zaretskii
@ 2018-12-19 21:16   ` Stefan Monnier
  1 sibling, 0 replies; 36+ messages in thread
From: Stefan Monnier @ 2018-12-19 21:16 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 33796

> PPS. How about also converting etc/tutorials/TUTORIAL.ja,
> lisp/leim/quail/hanja-jis.el, lisp/leim/quail/japanese.el,
> lisp/leim/quail/py-punct.el, and lisp/leim/quail/pypunct-b5.el?

I don't see how we'll ever get rid of support for iso-2022 encoding, so
I'm not terribly concerned about converting files like TUTORIAL.ja.
If you think it's a good idea, of course, I'm very much in favor of such
a change, but I focused on .el files because I'm interested in
standardizing Elisp files to utf-8 and getting rid of
load-with-code-conversion (a distant target, admittedly, but at least
I can see a path that can get us there).

I missed the above 4 Elisp files because my regexp fu was too weak.
I'll update my patch, thanks,

        Stefan


* bug#33796: 27.0.50; Use utf-8 is all our Elisp files
  2018-12-18 18:46 bug#33796: 27.0.50; Use utf-8 is all our Elisp files Stefan Monnier
  2018-12-18 19:22 ` Eli Zaretskii
  2018-12-19 17:54 ` Paul Eggert
@ 2019-01-08  2:20 ` Stefan Monnier
  2 siblings, 0 replies; 36+ messages in thread
From: Stefan Monnier @ 2019-01-08  2:20 UTC (permalink / raw)
  To: 33796-done

Installed,


        Stefan






* Re: etc/HELLO markup etc.
@ 2018-12-29  5:32 Van L
  2018-12-29  7:33 ` Eli Zaretskii
  0 siblings, 1 reply; 36+ messages in thread
From: Van L @ 2018-12-29  5:32 UTC (permalink / raw)
  To: Emacs developers

>> Although the etc/HELLO markup might be of interest to those who care about 
>> annotating languages in the text, it's irrelevant to the ordinary purpose of 
>> that file, which is to show textual translations of "Hello"

> That's not the original purpose of that file. The purpose is to show scripts, 
> not languages, and to show how we display different scripts in the same buffer.

The descriptive text accompanying (view-hello-file) says the following,
which needs to swap scripts for languages where that is.

: Display the HELLO file, which lists many languages and characters.



* Re: etc/HELLO markup etc.
  2018-12-29  5:32 etc/HELLO markup etc Van L
@ 2018-12-29  7:33 ` Eli Zaretskii
  2018-12-30  6:51   ` Van L
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Zaretskii @ 2018-12-29  7:33 UTC (permalink / raw)
  To: Van L; +Cc: emacs-devel

> From: Van L <van@scratch.space>
> Date: Sat, 29 Dec 2018 16:32:26 +1100
> 
> >> Although the etc/HELLO markup might be of interest to those who care about 
> >> annotating languages in the text, it's irrelevant to the ordinary purpose of 
> >> that file, which is to show textual translations of "Hello"
> 
> > That's not the original purpose of that file. The purpose is to show scripts, 
> > not languages, and to show how we display different scripts in the same buffer.
> 
> The descriptive text accompanying (view-hello-file) says the following,
> which needs to swap scripts for languages where that is.
> 
> : Display the HELLO file, which lists many languages and characters.

I'm not sure.  This discussion has been very technical, and presumably
the participants are well aware of what a script is, in this context.
By contrast, a random reader of the doc string doesn't necessarily
know what a script is.  Saying "many languages and characters" is
vaguely similar, while using only terminology most people understand.




* Re: etc/HELLO markup etc.
  2018-12-29  7:33 ` Eli Zaretskii
@ 2018-12-30  6:51   ` Van L
  0 siblings, 0 replies; 36+ messages in thread
From: Van L @ 2018-12-30  6:51 UTC (permalink / raw)
  To: Emacs developers

>>> That's not the original purpose of that file. The purpose is to show scripts, 
>>> not languages, and to show how we display different scripts in the same buffer.
>> 
>> The descriptive text accompanying (view-hello-file) says the following,
>> which needs to swap scripts for languages where that is.
>> 
>> : Display the HELLO file, which lists many languages and characters.
> 
> I'm not sure.  This discussion has been very technical, and presumably

Yes. It is well and truly deep in the weeds.

> the participants are well aware of what a script is, in this context.
> By contrast, a random reader of the doc string doesn't necessarily
> know what a script is.  Saying "many languages and characters" is
> vaguely similar, while using only terminology most people understand.

A random reader may have in mind the Rosetta Stone, or ESA’s Rosetta
mission, which will be surpassed by NASA’s New Horizons at Ultima Thule
very soon, to bring in the New Year.

You have to think that all the grade school students following the
Rosetta mission were taught what a language is, as distinct from a
script and characters, at least among EU nationals when the UK was
still a member before Brexit.

How about the following? It is 74 columns wide, anyway.

: Display HELLO file, a short sample of some languages, scripts, characters.


Thread overview: 36+ messages
2018-12-18 18:46 bug#33796: 27.0.50; Use utf-8 is all our Elisp files Stefan Monnier
2018-12-18 19:22 ` Eli Zaretskii
2018-12-18 19:46   ` Stefan Monnier
2018-12-19 17:54 ` Paul Eggert
2018-12-19 18:11   ` Eli Zaretskii
2018-12-19 22:13     ` Paul Eggert
2018-12-20 16:06       ` Eli Zaretskii
2018-12-20 21:49         ` Paul Eggert
2018-12-21  7:29           ` Eli Zaretskii
2018-12-21 13:46             ` Stefan Monnier
2018-12-21 15:54               ` Eli Zaretskii
2018-12-21 13:55             ` Eli Zaretskii
2018-12-21 21:07             ` Paul Eggert
2018-12-22  1:19               ` Eric Lindblad
2018-12-22  7:56                 ` etc/HELLO markup etc. (Was: 27.0.50; Use utf-8 is all our Elisp files) Eli Zaretskii
2018-12-22  8:12               ` etc/HELLO markup etc Eli Zaretskii
2018-12-22 19:41                 ` Paul Eggert
2018-12-22 20:42                   ` Eli Zaretskii
2018-12-23  7:47                 ` Yuri Khan
2018-12-23 15:42                   ` Eli Zaretskii
2018-12-23 15:53                     ` Werner LEMBERG
2018-12-23 16:04                       ` Eli Zaretskii
2018-12-23 21:11                         ` Werner LEMBERG
2018-12-28  7:10                 ` Eli Zaretskii
2018-12-29  7:23                 ` handa
2018-12-29  7:37                   ` Eli Zaretskii
2019-01-06 12:06                     ` handa
2019-01-06 15:29                       ` Eli Zaretskii
2019-01-06 17:26                         ` Stefan Monnier
2019-01-06 17:39                           ` Eli Zaretskii
2019-01-06 18:08                             ` Stefan Monnier
2018-12-19 21:16   ` bug#33796: 27.0.50; Use utf-8 is all our Elisp files Stefan Monnier
2019-01-08  2:20 ` Stefan Monnier
  -- strict thread matches above, loose matches on Subject: below --
2018-12-29  5:32 etc/HELLO markup etc Van L
2018-12-29  7:33 ` Eli Zaretskii
2018-12-30  6:51   ` Van L
