Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files

unofficial mirror of emacs-devel@gnu.org 

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
       [not found] ` <E1Ze4K3-0005KC-5U@vcs.savannah.gnu.org>
@ 2015-09-21 19:57   ` Stefan Monnier
  2015-09-21 20:07     ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Stefan Monnier @ 2015-09-21 19:57 UTC (permalink / raw)
  To: emacs-devel; +Cc: Eli Zaretskii

>     Don't rely on defaults in decoding UTF-8 encoded Lisp files

FWIW, I've removed the "coding: utf-8" thingy on a bunch of files in the
last year.

Why not?  Since Emacs-24.4 the coding-system Emacs uses for .el files is
`prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
valid" and the user's locale/settings are only used as a fallback.

        Stefan

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-21 19:57   ` [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files Stefan Monnier
@ 2015-09-21 20:07     ` Eli Zaretskii
  2015-09-24 16:44       ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-21 20:07 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: Eli Zaretskii <eliz@gnu.org>
> Date: Mon, 21 Sep 2015 15:57:48 -0400
> 
> >     Don't rely on defaults in decoding UTF-8 encoded Lisp files
> 
> FWIW, I've removed the "coding: utf-8" thingy on a bunch of files in the
> last year.
> 
> Why not?

Because I'm tired of hunting problems with raw bytes being displayed,
just to learn yet another deficiency in our guesswork.

> Since Emacs-24.4 the coding-system Emacs uses for .el files is
> `prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
> valid"

I don't think prefer-utf-8 does what you say here.  In any case, I've
seen the default decoding do incorrect things, and I see no reason to
risk that in files we control.




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-21 20:07     ` Eli Zaretskii
@ 2015-09-24 16:44       ` Eli Zaretskii
  2015-09-24 21:29         ` Stefan Monnier
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-24 16:44 UTC (permalink / raw)
  To: monnier; +Cc: emacs-devel

> Date: Mon, 21 Sep 2015 23:07:52 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
> 
> > Since Emacs-24.4 the coding-system Emacs uses for .el files is
> > `prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
> > valid"
> 
> I don't think prefer-utf-8 does what you say here.  In any case, I've
> seen the default decoding do incorrect things, and I see no reason to
> risk that in files we control.

Here's an example, btw:

  emacs -Q
  M-x set-locale-environment RET he_IL.ISO-8859-8 RET
  C-x C-f doc/lispref/tips.texi RET

The encoding-detection guesswork fails even with the current Git
master, let alone Emacs 24.5.  Use '(skip-chars-forward "\000-\177")'
to find the mess it produces.
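
The check above can be wrapped in a small command.  This is only a
sketch: `my-goto-first-non-ascii' is a made-up name, not something that
exists in Emacs, and it relies on nothing beyond the
`skip-chars-forward' idiom already shown.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-goto-first-non-ascii ()
  "Move point to the first non-ASCII character in the buffer.
Return that character, or nil if the buffer is pure ASCII."
  (interactive)
  (goto-char (point-min))
  ;; Skip over everything in the ASCII range (octal 000 through 177).
  (skip-chars-forward "\000-\177")
  (unless (eobp)
    (message "Non-ASCII character code %d at position %d"
             (char-after) (point))
    (char-after)))
```

Run in a buffer visiting the file, it leaves point on the first
character that is either genuinely non-ASCII or a raw byte left over
from a failed decoding.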




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-24 16:44       ` Eli Zaretskii
@ 2015-09-24 21:29         ` Stefan Monnier
  2015-09-25  7:55           ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Stefan Monnier @ 2015-09-24 21:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>> > Since Emacs-24.4 the coding-system Emacs uses for .el files is
>> > `prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
>> > valid"
>> I don't think prefer-utf-8 does what you say here.  In any case, I've
>> seen the default decoding do incorrect things, and I see no reason to
>> risk that in files we control.
> Here's an example, btw:

>   emacs -Q
>   M-x set-locale-environment RET he_IL.ISO-8859-8 RET
>   C-x C-f doc/lispref/tips.texi RET

Hmm.... I don't think this is using prefer-utf-8.  `prefer-utf-8' is
used for *.el files via file-coding-system-alist.
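
One way to check which entry applies to a given file name is to ask
`find-operation-coding-system', which consults
`file-coding-system-alist'.  The wrapper name `my-coding-for-reading'
is hypothetical, and the comments describe what a stock configuration
is expected to give, not guaranteed output.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-coding-for-reading (file)
  "Return the coding system `file-coding-system-alist' selects to decode FILE.
Return nil if no entry matches."
  (car-safe (find-operation-coding-system 'insert-file-contents file)))

(my-coding-for-reading "foo.el")     ; expected: prefer-utf-8 since Emacs 24.4
(my-coding-for-reading "tips.texi")  ; the *.el entry does not apply here
```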


        Stefan




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-24 21:29         ` Stefan Monnier
@ 2015-09-25  7:55           ` Eli Zaretskii
  2015-09-25 12:21             ` Stefan Monnier
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-25  7:55 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Thu, 24 Sep 2015 17:29:38 -0400
> 
> >   emacs -Q
> >   M-x set-locale-environment RET he_IL.ISO-8859-8 RET
> >   C-x C-f doc/lispref/tips.texi RET
> 
> Hmm.... I don't think this is using prefer-utf-8.  `prefer-utf-8' is
> used for *.el files via file-coding-system-alist.

So we now agree that at least non-*.el files should have the coding
cookie, yes?

As for *.el files: prefer-utf-8 is too easily duped for us to have
such infinite faith in it.  I can easily force a .el file to be saved
in non-UTF-8 encoding, and then it will be decoded incorrectly when
visited, if it doesn't have a coding cookie.  E.g., try saving a
foo.el with the following contents:

  (setq string "א“”")

using cp1255, then kill the buffer and visit it again.  You will see
this instead:

  (setq string "Ӕ")
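
A possible explanation for that particular character, assuming the
decoder accepted an overlong UTF-8 sequence: in cp1255 the three
characters are the bytes E0 93 94, which have the shape of a three-byte
UTF-8 sequence.  Strict UTF-8 rejects it, because after E0 the second
byte must be at least A0, but combining the payload bits anyway gives
U+04D4.  The sketch below does that arithmetic by hand;
`my-lenient-utf8-3' is an invented name.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-lenient-utf8-3 (b1 b2 b3)
  "Combine the payload bits of the 3-byte UTF-8-shaped sequence B1 B2 B3.
Deliberately performs no check for overlong forms."
  (logior (ash (logand b1 #x0F) 12)
          (ash (logand b2 #x3F) 6)
          (logand b3 #x3F)))

;; The string from the example, written with escapes so that this snippet
;; does not itself depend on how the file is decoded.
(let ((bytes (append (encode-coding-string "\u05d0\u201c\u201d" 'cp1255) nil)))
  (format "bytes %s read as one character: U+%04X"
          (mapconcat (lambda (b) (format "%02X" b)) bytes " ")
          (apply #'my-lenient-utf8-3 bytes)))
```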

Bottom line: we use prefer-utf-8 for *.el files so that the
probability of such catastrophic errors be minimized when the lazy
maintainers couldn't be bothered to add a cookie.  But we don't want
to be lazy ourselves, with the files we own and control.

More generally, I think we should require any text file in the Emacs
repository that includes non-ASCII characters to have an explicit
coding cookie, so that these subtle problems don't lie low because
most Emacs contributors live in UTF-8 locales.
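
For reference, an explicit cookie in a Lisp file takes one of two usual
forms.  The file name and summary line here are placeholders.

```elisp
;;; foo.el --- placeholder summary  -*- coding: utf-8 -*-

;; ... or a local variables block near the end of the file:

;; Local Variables:
;; coding: utf-8
;; End:
```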


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-25  7:55           ` Eli Zaretskii
@ 2015-09-25 12:21             ` Stefan Monnier
  2015-09-25 13:37               ` Eli Zaretskii
  2015-09-25 22:32               ` Paul Eggert
  0 siblings, 2 replies; 70+ messages in thread
From: Stefan Monnier @ 2015-09-25 12:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> E.g., try saving a foo.el with the following contents:
>
>   (setq string "א“”")
>
> using cp1255, then kill the buffer and visit it again.

AFAIK saving in this way requires very explicit action on the part of
the user.  She gets what she asked for.  But yes, we should probably
make it even harder (i.e. disallow it altogether as long as there's no
"coding:cp1255" tag).

> So we now agree that at least non-*.el files should have the coding
> cookie, yes?

Yes, definitely.

> Bottom line: we use prefer-utf-8 for *.el files so that the
> probability of such catastrophic errors be minimized when the lazy
> maintainers couldn't be bothered to add a cookie.

No.  I pushed for prefer-utf-8 because I want Elisp source code to be
declared to use utf-8 encoding.  I can imagine a future where we don't
even support Elisp files using another coding system (i.e. throw away the
load-with-code-conversion machinery).

> More generally, I think we should require any text file in the Emacs
> repository that includes non-ASCII characters to have an explicit
> coding cookie, so that these subtle problems don't lie low because
> most Emacs contributors live in UTF-8 locales.

My view OTOH is that the future is utf-8 only, and in that future we
won't want to have redundant "coding:utf-8" tags everywhere, so we need
to find ways to go from here (i.e. "need a coding: tag for any non-ASCII
file") to there.  I don't have an answer in general, but prefer-utf-8 is
a step in that direction, which can be used for some class of files
(e.g. Elisp).

        Stefan


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-25 12:21             ` Stefan Monnier
@ 2015-09-25 13:37               ` Eli Zaretskii
  2015-09-25 22:32               ` Paul Eggert
  1 sibling, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-25 13:37 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Fri, 25 Sep 2015 08:21:59 -0400
> 
> > E.g., try saving a foo.el with the following contents:
> >
> >   (setq string "א“”")
> >
> > using cp1255, then kill the buffer and visit it again.
> 
> AFAIK saving in this way requires very explicit action on the part of
> the user.  She gets what she asked for.

Who does?  We are talking about 2 different people here, the one who
was sloppy forgetting the coding cookie, and another who visited it.

> I can imagine a future where we don't even support Elisp files using
> another coding system (i.e. throw away the load-with-code-conversion
> machinery).

I'm not sure this can be done.  AFAIK, a few files under leim/quail
are encoded with non-UTF encoding, and for a good reason.

> > More generally, I think we should require any text file in the Emacs
> > repository that includes non-ASCII characters to have an explicit
> > coding cookie, so that these subtle problems don't lie low because
> > most Emacs contributors live in UTF-8 locales.
> 
> My view OTOH is that the future is utf-8 only

If you know the future, perhaps you could suggest which shares of what
companies I should invest in?  Why waste such an important insight on
some insignificant piece of software?

> we need to find ways to go from here (i.e. "need a coding: tag for
> any non-ASCII file") to there.  I don't have an answer in general,
> but prefer-utf-8 is a step in that direction, which can be used for
> some class of files (e.g. Elisp).

I think there's no way from here to there, not as long as our encoding
detection's reliability is what it is.





* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-25 12:21             ` Stefan Monnier
  2015-09-25 13:37               ` Eli Zaretskii
@ 2015-09-25 22:32               ` Paul Eggert
  2015-09-26  6:27                 ` Eli Zaretskii
  1 sibling, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-25 22:32 UTC (permalink / raw)
  To: Stefan Monnier, Eli Zaretskii; +Cc: emacs-devel

Stefan Monnier wrote:
> we won't want to have redundant "coding:utf-8" tags everywhere, so we need
> to find ways to go from here (i.e. "need a coding: tag for any non-ASCII
> file") to there.

Yes, requiring coding: cookies for every UTF-8 file is error-prone.  We 
can't easily put cookies into every such file, as some of them are 
copied verbatim from other sources.  And even for our own files, it's 
too easy to add a bit of UTF-8 text to a cookieless file and forget to 
add a cookie.

Here's a better idea.  Developers that use a UTF-8 locale are OK 
already.  Let's suggest to the remaining developers that they put 
something like the following into their .emacs:

   (add-hook 'auto-coding-functions (lambda (size) 'utf-8))

This will let Emacs default to UTF-8 for files that don't already have a 
coding cookie.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-25 22:32               ` Paul Eggert
@ 2015-09-26  6:27                 ` Eli Zaretskii
  2015-09-26  6:32                   ` Eli Zaretskii
  2015-09-26 14:31                   ` Paul Eggert
  0 siblings, 2 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-26  6:27 UTC (permalink / raw)
  To: Paul Eggert; +Cc: monnier, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Cc: emacs-devel@gnu.org
> Date: Fri, 25 Sep 2015 15:32:11 -0700
> 
> Here's a better idea.  Developers that use a UTF-8 locale are OK 
> already.  Let's suggest to the remaining developers that they put 
> something like the following into their .emacs:
> 
>    (add-hook 'auto-coding-functions (lambda (size) 'utf-8))
> 
> This will let Emacs default to UTF-8 for files that don't already have a 
> coding cookie.

You are assuming that those "remaining developers" use Emacs only for
working on Emacs, is that right?




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26  6:27                 ` Eli Zaretskii
@ 2015-09-26  6:32                   ` Eli Zaretskii
  2015-09-26 14:31                   ` Paul Eggert
  1 sibling, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-26  6:32 UTC (permalink / raw)
  To: eggert; +Cc: monnier, emacs-devel

> Date: Sat, 26 Sep 2015 09:27:11 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > From: Paul Eggert <eggert@cs.ucla.edu>
> > Cc: emacs-devel@gnu.org
> > Date: Fri, 25 Sep 2015 15:32:11 -0700
> > 
> > Here's a better idea.  Developers that use a UTF-8 locale are OK 
> > already.  Let's suggest to the remaining developers that they put 
> > something like the following into their .emacs:
> > 
> >    (add-hook 'auto-coding-functions (lambda (size) 'utf-8))
> > 
> > This will let Emacs default to UTF-8 for files that don't already have a 
> > coding cookie.
> 
> You are assuming that those "remaining developers" use Emacs only for
> working on Emacs, is that right?

And I fail to see how that's less prone to errors, anyway: it still
requires something to be added manually to some file.  The only
solution which will make the current situation better is one that will
work in "emacs -Q".




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26  6:27                 ` Eli Zaretskii
  2015-09-26  6:32                   ` Eli Zaretskii
@ 2015-09-26 14:31                   ` Paul Eggert
  2015-09-26 15:15                     ` Eli Zaretskii
  1 sibling, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-26 14:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii wrote:
> You are assuming that those "remaining developers" use Emacs only for
> working on Emacs, is that right?

No, I am assuming that the typical default nowadays, for text that is not 
otherwise labeled, is to use UTF-8.  This is a reasonable assumption.  It's not 
always correct, but exceptions can be handled.

I see that you have added more coding: cookies.  Oh well.  I do take your point 
that we need a better solution than what we have.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 14:31                   ` Paul Eggert
@ 2015-09-26 15:15                     ` Eli Zaretskii
  2015-09-26 16:01                       ` Paul Eggert
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-26 15:15 UTC (permalink / raw)
  To: Paul Eggert; +Cc: monnier, emacs-devel

> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 26 Sep 2015 07:31:36 -0700
> 
> Eli Zaretskii wrote:
> > You are assuming that those "remaining developers" use Emacs only for
> > working on Emacs, is that right?
> 
> No, I am assuming that the typical default nowadays, for text that is not 
> otherwise labeled, is to use UTF-8.  This is a reasonable assumption.  It's not 
> always correct, but exceptions can be handled.

So you are, in effect, saying that it is incorrect to derive the
default encodings from the locale's codeset?  I'm not sure about that,
but if so, the issue is much broader than just what was discussed
here, it touches a lot of other defaults as well, and a lot of code
that supports those defaults.

> I see that you have added more coding: cookies.  Oh well.  I do take your point 
> that we need a better solution than what we have.

I don't enjoy adding those cookies, but I enjoy even less seeing those
"8" indications in the mode line when I know there's not a chance in
the world the file was encoded in ISO-8859-8.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 15:15                     ` Eli Zaretskii
@ 2015-09-26 16:01                       ` Paul Eggert
  2015-09-26 16:09                         ` David Kastrup
                                           ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-26 16:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii wrote:
> So you are, in effect, saying that it is incorrect to derive the
> default encodings from the locale's codeset?

Yes, for Emacs developers.  And come to think of it, for most Emacs users. 
Nowadays in my experience most non-ASCII text files use UTF-8, regardless of 
locale.  The old days of having to guess encoding from the locale are passing 
away.  This is partly due to UTF-8 being the encoding of choice for HTML and 
XML, where UTF-8 overtook the older 8-bit encodings in 2008 and now is by far 
the dominant encoding.

One way to accommodate the new reality would be to change Emacs so that by 
default the system locale does not affect Emacs's guess of a file's encoding if 
the file's initial sample is valid UTF-8.  Users could set a variable to 
re-enable the old behavior.  If we did this, we wouldn't have the error-prone 
process of sprinkling 'coding: utf-8' cookies all over the place.
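
The rule proposed above could look roughly like this.  It is a sketch
of the decision only, not of how Emacs's detection is implemented:
`my-valid-utf-8-p' and `my-guess-coding' are invented names, and the
validity test assumes that undecodable input survives
`decode-coding-string' as characters in the `eight-bit' charset.

```elisp
;; Hypothetical helpers, not part of Emacs.
(defun my-valid-utf-8-p (bytes)
  "Non-nil if unibyte string BYTES decodes as UTF-8 with no raw bytes left."
  (let ((decoded (decode-coding-string bytes 'utf-8))
        (ok t))
    (dolist (ch (append decoded nil) ok)
      ;; Bytes the decoder could not interpret remain as raw 8-bit chars.
      (when (eq (char-charset ch) 'eight-bit)
        (setq ok nil)))))

(defun my-guess-coding (sample locale-coding)
  "Return `utf-8' if SAMPLE is valid UTF-8, otherwise LOCALE-CODING."
  (if (my-valid-utf-8-p sample) 'utf-8 locale-coding))

;; A UTF-8 sample wins regardless of the locale's codeset:
(my-guess-coding (encode-coding-string "\u00e9" 'utf-8) 'iso-8859-8)
;; Bytes that cannot be UTF-8 fall back to the locale:
(my-guess-coding "\377\376" 'iso-8859-8)
```

A file in a legacy encoding can still happen to be valid UTF-8, so a
rule like this narrows the guesswork rather than removing it.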


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 16:01                       ` Paul Eggert
@ 2015-09-26 16:09                         ` David Kastrup
  2015-09-26 17:26                           ` Eli Zaretskii
  2015-09-26 18:53                           ` Paul Eggert
  2015-09-26 17:25                         ` Eli Zaretskii
  2015-09-27  0:12                         ` stephen
  2 siblings, 2 replies; 70+ messages in thread
From: David Kastrup @ 2015-09-26 16:09 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, monnier, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> Eli Zaretskii wrote:
>> So you are, in effect, saying that it is incorrect to derive the
>> default encodings from the locale's codeset?
>
> Yes, for Emacs developers.  And come to think of it, for most Emacs
> users.

If the answer is "most" rather than "all", it would be absurd if Emacs
developers were not to use circumstances which they are supposed to
support.

> Nowadays in my experience most non-ASCII text files use UTF-8,
> regardless of locale.

How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
Korean texts?  How relevant is your experience?

-- 
David Kastrup




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 16:09                         ` David Kastrup
@ 2015-09-26 17:26                           ` Eli Zaretskii
  2015-09-26 18:53                           ` Paul Eggert
  1 sibling, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-26 17:26 UTC (permalink / raw)
  To: David Kastrup; +Cc: eggert, monnier, emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Sat, 26 Sep 2015 18:09:36 +0200
> Cc: Eli Zaretskii <eliz@gnu.org>, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > Nowadays in my experience most non-ASCII text files use UTF-8,
> > regardless of locale.
> 
> How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
> Korean texts?  How relevant is your experience?

Indeed, I think Far Eastern locales frequently use non-UTF-8
encodings.




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 16:09                         ` David Kastrup
  2015-09-26 17:26                           ` Eli Zaretskii
@ 2015-09-26 18:53                           ` Paul Eggert
  2015-09-26 19:35                             ` Eli Zaretskii
  1 sibling, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-26 18:53 UTC (permalink / raw)
  To: David Kastrup; +Cc: Eli Zaretskii, monnier, emacs-devel

David Kastrup wrote:
> How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
> Korean texts?  How relevant is your experience?

Hebrew, not so much -- Eli has far more experience with that.  Arabic I was just 
reading last week (not natively; I use a translator).  This week I was reading a 
lot of Turkish.  In all cases I was looking at text prepared by others.  In all 
cases my sources used UTF-8 -- not due to my influence, but because that's 
what's typical these days.

In my previous job I routinely had to deal with CJK text, and did so with lots 
of different encodings, including monstrosities such as DBCS-Host that Emacs 
doesn't even support.  So my experience is reasonably good in this area -- 
better than the average random hacker anyway.  If you go back 20 years, 
non-UTF-8 encodings such as Shift-JIS and EUC were by far the most popular in 
Japan.  Nowadays?  Sure, Shift-JIS and EUC are still used, but they're going 
downhill.  Of the top 20 web sites in Japan (according to Alexa), 18 use UTF-8, 
one uses Shift-JIS, and one uses EUC on their home pages.  In the w3techs survey 
of world web sites, 85% use UTF-8; the second most-popular encoding, ISO-8859-1, 
is at only 7.5%, and it's that high only because the old HTML standard made 
ISO-8859-1 the default.

So in practice, defaulting to UTF-8 is quite a good choice nowadays.  Of course 
if we can get the proper encoding from the document or its envelope we should 
prefer that, and that should let us deal with web documents and email.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 18:53                           ` Paul Eggert
@ 2015-09-26 19:35                             ` Eli Zaretskii
  2015-09-26 20:26                               ` Chad Brown
  2015-09-26 20:32                               ` Paul Eggert
  0 siblings, 2 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-26 19:35 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dak, monnier, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 26 Sep 2015 11:53:09 -0700
> Cc: Eli Zaretskii <eliz@gnu.org>, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> Of the top 20 web sites in Japan (according to Alexa), 18 use UTF-8,
> one uses Shift-JIS, and one uses EUC on their home pages.  In the
> w3techs survey of world web sites, 85% use UTF-8; the second
> most-popular encoding, ISO-8859-1, is at only 7.5%, and it's that
> high only because the old HTML standard made ISO-8859-1 the default.

The relevant statistics for Emacs is of source files, not of HTML
pages.




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 19:35                             ` Eli Zaretskii
@ 2015-09-26 20:26                               ` Chad Brown
  2015-09-26 21:50                                 ` David Kastrup
  2015-09-27  7:34                                 ` Eli Zaretskii
  2015-09-26 20:32                               ` Paul Eggert
  1 sibling, 2 replies; 70+ messages in thread
From: Chad Brown @ 2015-09-26 20:26 UTC (permalink / raw)
  To: Eli Zaretskii, emacs-devel


> On 26 Sep 2015, at 12:35, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> The relevant statistics for Emacs is of source files, not of HTML
> pages.

The default for GCC is UTF-8. Python requires a coding cookie (intentionally similar to Emacs’) to get away from Latin-1. Java is UTF-8. Javascript, roughly speaking, tracks HTML. Which other languages did you have in mind?

~Chad



* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 20:26                               ` Chad Brown
@ 2015-09-26 21:50                                 ` David Kastrup
  2015-09-27  4:44                                   ` Paul Eggert
  2015-09-27  7:34                                 ` Eli Zaretskii
  1 sibling, 1 reply; 70+ messages in thread
From: David Kastrup @ 2015-09-26 21:50 UTC (permalink / raw)
  To: Chad Brown; +Cc: Eli Zaretskii, emacs-devel

Chad Brown <yandros@gmail.com> writes:

>> On 26 Sep 2015, at 12:35, Eli Zaretskii <eliz@gnu.org> wrote:
>> 
>> The relevant statistics for Emacs is of source files, not of HTML
>> pages.
>
> The default for GCC is UTF-8.

How so?  The default is defined by the compiled language.  For C, it is
essentially 8-bit bytes where the meaning-carrying subset is ASCII.
Everything else is just replication.

GCC communicates on the terminal with compiler diagnostics.  For that it
uses the current locale.

> Python requires a coding cookie (intentionally similar to Emacs’) to
> get away from Latin-1. Java is UTF-8. Javascript, roughly speaking,
> tracks HTML. Which other languages did you have in mind?

Emacs is, not least of all, a text editor.  I am currently using it to
write this Email reply.  Not everything that one uses Emacs for has a
well-defined default encoding.

-- 
David Kastrup


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 21:50                                 ` David Kastrup
@ 2015-09-27  4:44                                   ` Paul Eggert
  2015-09-27  5:29                                     ` David Kastrup
  2015-09-27  7:39                                     ` Eli Zaretskii
  0 siblings, 2 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  4:44 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup wrote:
> The default is defined by the compiled language.  For C, it is
> essentially 8-bit bytes where the meaning-carrying subset is ASCII.

That was true for C99 and earlier, but it stopped being true in C11, where the 
source-file encoding does matter and where UTF-8 is the only sane default nowadays.





* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  4:44                                   ` Paul Eggert
@ 2015-09-27  5:29                                     ` David Kastrup
  2015-09-27  7:38                                       ` Paul Eggert
                                                         ` (2 more replies)
  2015-09-27  7:39                                     ` Eli Zaretskii
  1 sibling, 3 replies; 70+ messages in thread
From: David Kastrup @ 2015-09-27  5:29 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> David Kastrup wrote:
>> The default is defined by the compiled language.  For C, it is
>> essentially 8-bit bytes where the meaning-carrying subset is ASCII.
>
> That was true for C99 and earlier, but it stopped being true in C11,
> where the source-file encoding does matter and where UTF-8 is the only
> sane default nowadays.

"stopped being true in C11" suggests that the world moved on.  Here is
the manual extract from the GCC delivered in the latest Ubuntu
distribution (the most commonly used GNU/Linux system):

     A fourth version of the C standard, known as "C11", was published
    in 2011 as ISO/IEC 9899:2011.  GCC has substantially complete
    support for this standard, enabled with '-std=c11' or
    '-std=iso9899:2011'.  (While in development, drafts of this standard
    version were referred to as "C1X".)

It is not even accepted without using extra options.  And we are not
talking anyway about the encoding Emacs is to choose for new files but
rather about the encoding for opening existing files.

How are you going to magically eradicate all pre-C11 files from Earth?
Wouldn't it be convenient to actually load them into an editor for doing
the conversion?

-- 
David Kastrup


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  5:29                                     ` David Kastrup
@ 2015-09-27  7:38                                       ` Paul Eggert
  2015-09-27  7:46                                         ` David Kastrup
  2015-09-27  9:47                                       ` Andreas Schwab
  2015-09-27 22:48                                       ` Richard Stallman
  2 siblings, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  7:38 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup wrote:
> How are you going to magically eradicate all pre-C11 files from Earth?

Old C files will build just fine with newer GCC.  And GCC can support UTF-8 in 
strings even if you don't use the -std=c11 option.  So this is not a problem.




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:38                                       ` Paul Eggert
@ 2015-09-27  7:46                                         ` David Kastrup
  2015-09-27  7:52                                           ` Paul Eggert
  0 siblings, 1 reply; 70+ messages in thread
From: David Kastrup @ 2015-09-27  7:46 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> David Kastrup wrote:
>> How are you going to magically eradicate all pre-C11 files from Earth?
>
> Old C files will build just fine with newer GCC.  And GCC can support
> UTF-8 in strings even if you don't use the -std=c11 option.  So this
> is not a problem.

Are we still talking about the defaults Emacs chooses when detecting the
file encoding of C files?  You seem about as likely to argue
against your own proposals as you are to argue for them.

-- 
David Kastrup




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:46                                         ` David Kastrup
@ 2015-09-27  7:52                                           ` Paul Eggert
  0 siblings, 0 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  7:52 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup wrote:
> Are we still talking about the defaults Emacs chooses when detecting the
> file encoding of C files?

Yes, of course.




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  5:29                                     ` David Kastrup
  2015-09-27  7:38                                       ` Paul Eggert
@ 2015-09-27  9:47                                       ` Andreas Schwab
  2015-09-27  9:54                                         ` David Kastrup
  2015-09-27 22:48                                       ` Richard Stallman
  2 siblings, 1 reply; 70+ messages in thread
From: Andreas Schwab @ 2015-09-27  9:47 UTC (permalink / raw)
  To: David Kastrup; +Cc: Paul Eggert, emacs-devel

David Kastrup <dak@gnu.org> writes:

>      A fourth version of the C standard, known as "C11", was published
>     in 2011 as ISO/IEC 9899:2011.  GCC has substantially complete
>     support for this standard, enabled with '-std=c11' or
>     '-std=iso9899:2011'.  (While in development, drafts of this standard
>     version were referred to as "C1X".)
>
> It is not even accepted without using extra options.

The latest release of gcc has C11 as the default standard.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  9:47                                       ` Andreas Schwab
@ 2015-09-27  9:54                                         ` David Kastrup
  2015-09-27 10:03                                           ` Andreas Schwab
  0 siblings, 1 reply; 70+ messages in thread
From: David Kastrup @ 2015-09-27  9:54 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Paul Eggert, emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>>      A fourth version of the C standard, known as "C11", was published
>>     in 2011 as ISO/IEC 9899:2011.  GCC has substantially complete
>>     support for this standard, enabled with '-std=c11' or
>>     '-std=iso9899:2011'.  (While in development, drafts of this standard
>>     version were referred to as "C1X".)
>>
>> It is not even accepted without using extra options.
>
> The latest release of gcc has C11 as the default standard.

You just got to love the "creative editing" culture on this mailing
list.  First edit a posting into what you would rather want to reply to,
then pretend the stuff you elided was not there in the first place.

I was _very_ _explicitly_ _not_ talking about the "latest release of
gcc" but rather the latest release of GCC in the most wide-spread
production GNU/Linux distribution.

Can we please stop this silly gamesmanship?  It very much contributes to
"discussions" going in circles.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  9:54                                         ` David Kastrup
@ 2015-09-27 10:03                                           ` Andreas Schwab
  2015-09-27 10:12                                             ` David Kastrup
  0 siblings, 1 reply; 70+ messages in thread
From: Andreas Schwab @ 2015-09-27 10:03 UTC (permalink / raw)
  To: David Kastrup; +Cc: Paul Eggert, emacs-devel

David Kastrup <dak@gnu.org> writes:

> I was _very_ _explicitly_ _not_ talking about the "latest release of
> gcc" but rather the latest release of GCC in the most wide-spread
> production GNU/Linux distribution.

Many distributions already ship gcc5, some as default even.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 10:03                                           ` Andreas Schwab
@ 2015-09-27 10:12                                             ` David Kastrup
  2015-09-27 11:10                                               ` Andreas Schwab
  0 siblings, 1 reply; 70+ messages in thread
From: David Kastrup @ 2015-09-27 10:12 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Paul Eggert, emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> I was _very_ _explicitly_ _not_ talking about the "latest release of
>> gcc" but rather the latest release of GCC in the most wide-spread
>> production GNU/Linux distribution.
>
> Many distributions already ship gcc5, some as default even.

That's nice but I was talking about the latest release of GCC in the
most wide-spread production GNU/Linux distribution.  That's kind of a
relevant counterexample to Paul's generalizations.  It's not an obscure
corner case.

Apart from which, I am still waiting for an explanation of just why Emacs
should stop supporting non-UTF-8 C source files _because_ the C11
standard now provides the means to place UTF-8 strings in executables
when using non-UTF-8 source files (previously, you needed to have a
UTF-8 encoded source file to do that).

Emacs should support non-UTF-8 source files worse because C11 makes it
more convenient to use them?

It's bad enough that we are arguing straw men all the time, but these
straw men are upside down.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 10:12                                             ` David Kastrup
@ 2015-09-27 11:10                                               ` Andreas Schwab
  0 siblings, 0 replies; 70+ messages in thread
From: Andreas Schwab @ 2015-09-27 11:10 UTC (permalink / raw)
  To: David Kastrup; +Cc: Paul Eggert, emacs-devel

David Kastrup <dak@gnu.org> writes:

> It's bad enough that we are arguing straw men all the time, but these
> straw men are upside down.

So please go ahead.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  5:29                                     ` David Kastrup
  2015-09-27  7:38                                       ` Paul Eggert
  2015-09-27  9:47                                       ` Andreas Schwab
@ 2015-09-27 22:48                                       ` Richard Stallman
  2015-09-28  2:41                                         ` Paul Eggert
  2 siblings, 1 reply; 70+ messages in thread
From: Richard Stallman @ 2015-09-27 22:48 UTC (permalink / raw)
  To: David Kastrup; +Cc: eggert, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

Could someone tell me what issue is now under discussion?

The Subject line seems to refer to Lisp files, and yet here people
are talking about changes in C as of C11.

-- 
Dr Richard Stallman
President, Free Software Foundation (gnu.org, fsf.org)
Internet Hall-of-Famer (internethalloffame.org)
Skype: No way! See stallman.org/skype.html.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 22:48                                       ` Richard Stallman
@ 2015-09-28  2:41                                         ` Paul Eggert
  2015-09-28  6:53                                           ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-28  2:41 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel

Richard Stallman wrote:
> The Subject line seems to refer to Lisp files, and yet here people
> are talking about changes in C as of C11.

The subject line comes from a commit to Emacs master that added coding-cookie 
lines like the following to some .el files that had UTF-8 text:

   ;; Local Variables:
   ;; coding: utf-8
   ;; End:

Lines like these are no longer needed with current Emacs, which prefers UTF-8 
for .el files regardless of the system locale.  This can be a win, as people 
often forget to insert coding cookies and the cookies are a bit awkward anyway.

The discussion has morphed into the possibility of a similar facility for files 
other than .el files, and what the defaults for such a facility should be.  The 
idea is to somehow avoid the need for UTF-8 coding cookies for users who prefer 
Emacs to default to UTF-8 for text files regardless of system locale.
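
The "UTF-8 if the bytes are valid UTF-8, otherwise fall back" policy
described above can be sketched roughly as follows.  This is only an
illustration of the idea, not Emacs's actual detection code; the names
valid_utf8 and guess_coding are hypothetical, and the fallback string
stands in for whatever the locale or user settings would supply.

```c
#include <stddef.h>

/* Return 1 if the N bytes at S form well-formed UTF-8, else 0.
   Rejects overlong forms, surrogates, and values above U+10FFFF.  */
int
valid_utf8 (const unsigned char *s, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = s[i];
      size_t len;
      unsigned long cp, min;

      if (c < 0x80)
        {
          i++;                  /* plain ASCII byte */
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { len = 2; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { len = 3; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
        { len = 4; cp = c & 0x07; min = 0x10000; }
      else
        return 0;               /* stray continuation or invalid lead byte */

      if (n - i < len)
        return 0;               /* sequence truncated at end of input */

      for (size_t k = 1; k < len; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;               /* overlong, out of range, or surrogate */
      i += len;
    }
  return 1;
}

/* Hypothetical helper: pick "utf-8" when the bytes validate, otherwise
   the caller-supplied FALLBACK (e.g. the locale's preferred coding).  */
const char *
guess_coding (const unsigned char *s, size_t n, const char *fallback)
{
  return valid_utf8 (s, n) ? "utf-8" : fallback;
}
```

With this sketch, the five bytes 63 61 66 C3 A9 are classified as
"utf-8", while the Latin-1 bytes 63 61 66 E9 fall through to the
fallback, because E9 announces a three-byte sequence that never
arrives.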

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-28  2:41                                         ` Paul Eggert
@ 2015-09-28  6:53                                           ` Eli Zaretskii
  2015-09-28 15:08                                             ` Paul Eggert
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-28  6:53 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rms, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 27 Sep 2015 19:41:36 -0700
> Cc: emacs-devel@gnu.org
> 
> The discussion has morphed into the possibility of a similar facility for files 
> other than .el files, and what the defaults for such a facility should be.  The 
> idea is to somehow avoid the need for UTF-8 coding cookies for users who prefer 
> Emacs to default to UTF-8 for text files regardless of system locale.

I think

  (prefer-coding-system 'utf-8)

is what those users should do, but I'm not sure.  It definitely sets
the defaults for new files and for communicating with sub-processes
(the latter part might not be what you want, btw), but its effect on
decoding existing files might sometimes surprise, due to the way the
priority of trying various decoders is implemented.  (Hint: look at
the implementation of set-coding-system-priority, the function that is
called under the hood by prefer-coding-system.)

Btw, the issue under discussion, as I perceive it, was somewhat
different: how to ensure correct decoding of UTF-8 encoded files
(other than *.el) in Emacs source tree _regardless_ of whether the
user in question wants to prefer UTF-8 outside of Emacs tree.  The
best solution we have now is to have a coding cookie in each such
file, and the question is how can that be avoided.

IOW, the solution should IMO be independent of user's preferences.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-28  6:53                                           ` Eli Zaretskii
@ 2015-09-28 15:08                                             ` Paul Eggert
  2015-09-28 15:58                                               ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-28 15:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rms, emacs-devel

On 09/27/2015 11:53 PM, Eli Zaretskii wrote:
> The best solution we have now is to have a coding cookie in each such
> file, and the question is how can that be avoided.
>
> IOW, the solution should IMO be independent of user's preferences.

Here's an idea: improve the handling of .dir-locals.el so that it could 
contain something like this:

   ((nil . ((coding . utf-8)
        (tab-width . 8)
        (fill-column . 70)))
    (c-mode . ((c-file-style . "GNU"))))

This specification for "coding" would take precedence over coding 
inferred from environment settings.  It would not take precedence over 
an explicit coding cookie in the file.

Currently .dir-locals.el cannot specify 'coding', but I suspect that's 
mostly just due to the intricacies of the current implementation, not 
due to any specific desire to reject the use of 'coding' in .dir-locals.el.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-28 15:08                                             ` Paul Eggert
@ 2015-09-28 15:58                                               ` Eli Zaretskii
  0 siblings, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-28 15:58 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rms, emacs-devel

> Cc: rms@gnu.org, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Mon, 28 Sep 2015 08:08:32 -0700
> 
> On 09/27/2015 11:53 PM, Eli Zaretskii wrote:
> > The best solution we have now is to have a coding cookie in each such
> > file, and the question is how can that be avoided.
> >
> > IOW, the solution should IMO be independent of user's preferences.
> 
> Here's an idea: improve the handling of .dir-locals.el so that it could 
> contain something like this:
> 
>    ((nil . ((coding . utf-8)
>         (tab-width . 8)
>         (fill-column . 70)))
>     (c-mode . ((c-file-style . "GNU"))))

That's better than having to specify encoding in individual files, so
I think this would be progress.  Thanks.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  4:44                                   ` Paul Eggert
  2015-09-27  5:29                                     ` David Kastrup
@ 2015-09-27  7:39                                     ` Eli Zaretskii
  2015-09-27  7:52                                       ` Paul Eggert
  1 sibling, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27  7:39 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dak, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 26 Sep 2015 21:44:19 -0700
> Cc: emacs-devel <emacs-devel@gnu.org>
> 
> David Kastrup wrote:
> > The default is defined by the compiled language.  For C, it is
> > essentially 8-bit bytes where the meaning-carrying subset is ASCII.
> 
> That was true for C99 and earlier, but it stopped being true in C11, where the 
> source-file encoding does matter and where UTF-8 is the only sane default nowadays.

I don't see any language to that effect in the C11 Final Draft I have
here.  AFAICT, non-UTF-8 multibyte sequences are still supported by
C11.  Can you show the text on which you based the above assertion?

Maybe you are talking about encoding of the identifier names.  What I
had in mind was comments and strings, not identifier names.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:39                                     ` Eli Zaretskii
@ 2015-09-27  7:52                                       ` Paul Eggert
  2015-09-27  8:00                                         ` David Kastrup
  2015-09-27  8:03                                         ` Eli Zaretskii
  0 siblings, 2 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  7:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

Eli Zaretskii wrote:
> I don't see any language to that effect in the C11 Final Draft I have
> here.  AFAICT, non-UTF-8 multibyte sequences are still supported by
> C11.

Of course; that part didn't change.  I was talking about C11's new UTF-8 string 
literals, e.g., u8"Emacsの主要操作(早見表)".  There is no similar notation for 
Shift-JIS, etc.  Of course implementations can support legacy encodings, and 
some legacy C programs are written that way, but the only portable way to go in 
the future is Unicode.
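
A minimal sketch of what the u8 prefix does, assuming a compiler in C11
mode or later.  The function name u8_sample is hypothetical, and the
sample string spells U+00E9 with a universal character name so that the
source file itself stays pure ASCII and the result does not depend on
how the file is encoded.

```c
#include <stddef.h>
#include <string.h>

/* Copy the bytes of a UTF-8 string literal into OUT (capacity CAP) and
   return the number of bytes copied.  */
size_t
u8_sample (unsigned char *out, size_t cap)
{
  /* The cast keeps this valid whether the literal's element type is
     char (C11) or char8_t, i.e. unsigned char (C23).  */
  const char *s = (const char *) u8"caf\u00e9";
  size_t len = strlen (s);

  if (len > cap)
    len = cap;
  memcpy (out, s, len);
  return len;
}
```

Given a large enough buffer, the literal occupies five bytes, 63 61 66
C3 A9: the u8 prefix fixes the encoding of the string in the executable
to UTF-8, independently of the execution character set.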

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:52                                       ` Paul Eggert
@ 2015-09-27  8:00                                         ` David Kastrup
  2015-09-27  8:03                                         ` Eli Zaretskii
  1 sibling, 0 replies; 70+ messages in thread
From: David Kastrup @ 2015-09-27  8:00 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> Eli Zaretskii wrote:
>> I don't see any language to that effect in the C11 Final Draft I have
>> here.  AFAICT, non-UTF-8 multibyte sequences are still supported by
>> C11.
>
> Of course; that part didn't change.  I was talking about C11's new
> UTF-8 string literals, e.g., u8"Emacsの主要操作(早見表)".

Again, are you arguing for or against your own proposals?  The _only_
purpose of such string literals is to support generating UTF-8 encoded
strings in the executable even when the source file is _not_ encoded in
UTF-8.

So you argue because C11 contains a feature for supporting source files
_not_ encoded in UTF-8, Emacs should support only source files encoded
in UTF-8?

If anything, this is somewhat of an argument for GDB to preferably
interpret C strings as being encoded in UTF-8 even when the source code
encoding of a C file appears to be different.

We are not talking about editing executables here.  We are talking about
editing source files.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:52                                       ` Paul Eggert
  2015-09-27  8:00                                         ` David Kastrup
@ 2015-09-27  8:03                                         ` Eli Zaretskii
  2015-09-27  8:29                                           ` Paul Eggert
  1 sibling, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27  8:03 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dak, emacs-devel

> Cc: dak@gnu.org, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 27 Sep 2015 00:52:08 -0700
> 
> Eli Zaretskii wrote:
> > I don't see any language to that effect in the C11 Final Draft I have
> > here.  AFAICT, non-UTF-8 multibyte sequences are still supported by
> > C11.
> 
> Of course; that part didn't change.  I was talking about C11's new UTF-8 string 
> literals, e.g., u8"Emacsの主要操作(早見表)".

That's indeed a new feature of C11, but it doesn't disallow using
arbitrary byte sequences in otherwise C11-compliant sources.

> Of course implementations can support legacy encodings, and some
> legacy C programs are written that way, but the only portable way to
> go in the future is Unicode.

Not sure what kind of "portability" you had in mind here.  If
that's portability between locales, then our solution of having a
coding cookie is better for Emacs, because it supports more use cases
than just assuming UTF-8 would.




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:03                                         ` Eli Zaretskii
@ 2015-09-27  8:29                                           ` Paul Eggert
  2015-09-27  8:37                                             ` David Kastrup
  2015-09-27  8:57                                             ` Eli Zaretskii
  0 siblings, 2 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  8:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

Eli Zaretskii wrote:
> If
> that's portability between locales, then our solution of having a
> coding cookie is better for Emacs, because it supports more use cases
> than just assuming UTF-8 would.

Sure, but the point is that we shouldn't need a cookie for UTF-8.  Cookies are 
awkward, and should be inserted only when needed; they shouldn't be needed for 
the typical case.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:29                                           ` Paul Eggert
@ 2015-09-27  8:37                                             ` David Kastrup
  2015-09-27  8:40                                               ` Paul Eggert
  2015-09-27  8:57                                             ` Eli Zaretskii
  1 sibling, 1 reply; 70+ messages in thread
From: David Kastrup @ 2015-09-27  8:37 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> Eli Zaretskii wrote:
>> If
>> that's portability between locales, then our solution of having a
>> coding cookie is better for Emacs, because it supports more use cases
>> than just assuming UTF-8 would.
>
> Sure, but the point is that we shouldn't need a cookie for UTF-8.

Is this the majestic "we" or are you talking about Emacs development in
particular?  If the latter, why not set a directory-wide variable for
the Emacs project (namely in the repository) for making the
Emacs-internal Elisp files default to utf-8?

That should cater for "us" without enforcing encodings in other people's
projects.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:37                                             ` David Kastrup
@ 2015-09-27  8:40                                               ` Paul Eggert
  2015-09-27  8:50                                                 ` David Kastrup
  2015-09-27 10:14                                                 ` Eli Zaretskii
  0 siblings, 2 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  8:40 UTC (permalink / raw)
  To: David Kastrup; +Cc: Eli Zaretskii, emacs-devel

David Kastrup wrote:
> why not set a directory-wide variable for
> the Emacs project (namely in the repository) for making the
> Emacs-internal Elisp files default to utf-8?

Great idea!  One that has been suggested multiple times.  Unfortunately it's a 
bit trickier to implement than one might think.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:40                                               ` Paul Eggert
@ 2015-09-27  8:50                                                 ` David Kastrup
  2015-09-27 10:14                                                 ` Eli Zaretskii
  1 sibling, 0 replies; 70+ messages in thread
From: David Kastrup @ 2015-09-27  8:50 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> David Kastrup wrote:
>> why not set a directory-wide variable for
>> the Emacs project (namely in the repository) for making the
>> Emacs-internal Elisp files default to utf-8?
>
> Great idea!  One that has been suggested multiple times.
> Unfortunately it's a bit trickier to implement than one might think.

"Well, I can't seem to find them either.  Did you really lose your keys
over here?"  "No, down that alley.  But I'd rather search here since the
light is much better."

So because the right solution for Emacs is a bit trickier to implement
than one might think, we pick something else making life harder for
everybody else?

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:40                                               ` Paul Eggert
  2015-09-27  8:50                                                 ` David Kastrup
@ 2015-09-27 10:14                                                 ` Eli Zaretskii
  1 sibling, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 10:14 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dak, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 27 Sep 2015 01:40:06 -0700
> Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org
> 
> David Kastrup wrote:
> > why not set a directory-wide variable for
> > the Emacs project (namely in the repository) for making the
> > Emacs-internal Elisp files default to utf-8?
> 
> Great idea!  One that has been suggested multiple times.  Unfortunately it's a 
> bit trickier to implement than one might think.

Yes, there are no simple and easy solutions for these issues.  But
that doesn't mean we shouldn't look for them.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:29                                           ` Paul Eggert
  2015-09-27  8:37                                             ` David Kastrup
@ 2015-09-27  8:57                                             ` Eli Zaretskii
  1 sibling, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27  8:57 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dak, emacs-devel

> Cc: dak@gnu.org, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 27 Sep 2015 01:29:57 -0700
> 
> Eli Zaretskii wrote:
> > If
> > that's portability between locales, then our solution of having a
> > coding cookie is better for Emacs, because it supports more use cases
> > than just assuming UTF-8 would.
> 
> Sure, but the point is that we shouldn't need a cookie for UTF-8.  Cookies are 
> awkward, and should be inserted only when needed; they shouldn't be needed for 
> the typical case.

Our experience since Emacs 20 is that the "typical case" is not a good
guideline for implementing multilingual tools.  Not unless the typical
case becomes the only case.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 20:26                               ` Chad Brown
  2015-09-26 21:50                                 ` David Kastrup
@ 2015-09-27  7:34                                 ` Eli Zaretskii
  2015-09-27 16:03                                   ` Chad Brown
  1 sibling, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27  7:34 UTC (permalink / raw)
  To: Chad Brown; +Cc: emacs-devel

> From: Chad Brown <yandros@gmail.com>
> Date: Sat, 26 Sep 2015 13:26:52 -0700
> 
> 
> > On 26 Sep 2015, at 12:35, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> > The relevant statistics for Emacs is of source files, not of HTML
> > pages.
> 
> The default for GCC is UTF-8.

GCC doesn't write C sources, so its defaults are not very relevant,
even if you are right in the above assessment (and I don't think you
are).

> Python requires a coding cookie (intentionally similar to Emacs’) to
> get away from Latin-1. Java is UTF-8. Javascript, roughly speaking,
> tracks HTML. Which other languages did you have in mind?

All the rest of them.




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:34                                 ` Eli Zaretskii
@ 2015-09-27 16:03                                   ` Chad Brown
  2015-09-27 18:41                                     ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Chad Brown @ 2015-09-27 16:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel


> On 27 Sep 2015, at 00:34, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Chad Brown <yandros@gmail.com>
>> Date: Sat, 26 Sep 2015 13:26:52 -0700
>> 
>> The default for GCC is UTF-8.
> 
> GCC doesn't write C sources, so its defaults are not very relevant,
> even if you are right in the above assessment (and I don't think you
> are).

I took the information from the GCC 4.7 documentation:

  -finput-charset=charset
  Set the input character set, used for translation from the character
  set of the input file to the source character set used by GCC. If
  the locale does not specify, or GCC cannot get this information
  from the locale, the default is UTF-8. This can be overridden by
  either the locale or this command line option. Currently the command
  line option takes precedence if there's a conflict. charset can be
  any encoding supported by the system's iconv library routine.

I saw almost identical text in the 4.2.4 documentation, and didn’t go
back further.
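
The rule in that paragraph (use the locale's character set, and fall
back to UTF-8 only when the locale gives no answer) can be sketched as
below.  This is not GCC's code; it is a rough illustration that assumes
a POSIX system providing nl_langinfo, and input_charset_default is a
hypothetical name.

```c
#include <langinfo.h>
#include <locale.h>

/* Return the codeset named by the environment's locale, or "UTF-8"
   when the locale does not specify one.  Note that this switches the
   process to the environment's locale as a side effect.  */
const char *
input_charset_default (void)
{
  const char *codeset;

  setlocale (LC_ALL, "");            /* adopt LANG / LC_* settings */
  codeset = nl_langinfo (CODESET);   /* e.g. "UTF-8" or "ISO-8859-1" */
  if (codeset == NULL || codeset[0] == '\0')
    return "UTF-8";                  /* locale gave no answer */
  return codeset;
}
```

In this sketch the UTF-8 branch is reached only when nl_langinfo
returns nothing usable; whenever the locale names a codeset, that
codeset is what gets returned.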

~Chad




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 16:03                                   ` Chad Brown
@ 2015-09-27 18:41                                     ` Eli Zaretskii
  2015-09-27 19:52                                       ` Chad Brown
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 18:41 UTC (permalink / raw)
  To: Chad Brown; +Cc: emacs-devel

> From: Chad Brown <yandros@gmail.com>
> Date: Sun, 27 Sep 2015 09:03:54 -0700
> Cc: emacs-devel@gnu.org
> 
>   -finput-charset=charset
>   Set the input character set, used for translation from the character
>   set of the input file to the source character set used by GCC. If
>   the locale does not specify, or GCC cannot get this information
>   from the locale, the default is UTF-8. This can be overridden by
>   either the locale or this command line option. Currently the command
>   line option takes precedence if there's a conflict. charset can be
>   any encoding supported by the system's iconv library routine.

Note the "if the locale does not specify" clause.  That should almost
never happen.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 18:41                                     ` Eli Zaretskii
@ 2015-09-27 19:52                                       ` Chad Brown
  2015-09-27 20:52                                         ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Chad Brown @ 2015-09-27 19:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> On 27 Sep 2015, at 11:41, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: Chad Brown <yandros@gmail.com>
>> Date: Sun, 27 Sep 2015 09:03:54 -0700
>> Cc: emacs-devel@gnu.org
>> 
>>  -finput-charset=charset
>>  Set the input character set, used for translation from the character
>>  set of the input file to the source character set used by GCC. If
>>  the locale does not specify, or GCC cannot get this information
>>  from the locale, the default is UTF-8. This can be overridden by
>>  either the locale or this command line option. Currently the command
>>  line option takes precedence if there's a conflict. charset can be
>>  any encoding supported by the system's iconv library routine.
> 
> Note the "if the locale does not specify" clause.  That should almost
> never happen.

Sure. I almost mentioned that, but at the time it seemed clear
to me that we were talking about the defaults for each. I used to
deal with this issue ‘back in the day’, so it provoked my curiosity 
enough to look. Roughly speaking, the modern ‘programming
languages’ these days are UTF-8, while a decent chunk of the 
‘scripting languages’ seem to be in a messier state, but with 
established methods (coding cookies, odd quoting, ascii by fiat, 
try not to look at comments, etc).

Since then, exchanges on this thread have suggested that maybe I
was wrong about the topic at hand, but the data still seemed useful,
so I pushed it along, with the full quote for context. Sorry if it caused
confusion.

Thanks,
~Chad

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 19:52                                       ` Chad Brown
@ 2015-09-27 20:52                                         ` Eli Zaretskii
  0 siblings, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 20:52 UTC (permalink / raw)
  To: Chad Brown; +Cc: emacs-devel

> From: Chad Brown <yandros@gmail.com>
> Date: Sun, 27 Sep 2015 12:52:15 -0700
> Cc: emacs-devel@gnu.org
> 
> 
> > On 27 Sep 2015, at 11:41, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> >> From: Chad Brown <yandros@gmail.com>
> >> Date: Sun, 27 Sep 2015 09:03:54 -0700
> >> Cc: emacs-devel@gnu.org
> >> 
> >>  -finput-charset=charset
> >>  Set the input character set, used for translation from the character
> >>  set of the input file to the source character set used by GCC. If
> >>  the locale does not specify, or GCC cannot get this information
> >>  from the locale, the default is UTF-8. This can be overridden by
> >>  either the locale or this command line option. Currently the command
> >>  line option takes precedence if there's a conflict. charset can be
> >>  any encoding supported by the system's iconv library routine.
> > 
> > Note the "if the locale does not specify" clause.  That should almost
> > never happen.
> 
> Sure. I almost mentioned that, but at the time it seemed clear
> to me that we were talking about the defaults for each.

The issue at hand is whether Emacs should favor UTF-8 _before_ the
locale-derived defaults.  What happens when the locale cannot be
queried wasn't touched at all.  I don't think such a situation is a
real possibility in the first place, and if it is, I don't object if
we'd use UTF-8 in that case.
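
To make the two policies concrete, here is a minimal Python sketch (not
anything Emacs actually does). The function name `decode_prefer_utf8` and
its `fallback` parameter are hypothetical, and the locale lookup merely
stands in for Emacs's locale-derived defaults.

```python
import locale

def decode_prefer_utf8(data, fallback=None):
    """Try strict UTF-8 first; fall back to the locale's codeset.

    Returns (text, codec_used).  `fallback` is a hypothetical override
    for the locale-derived default, mainly useful for testing.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        codec = fallback or locale.getpreferredencoding(False)
        # errors="replace" keeps the sketch from raising on bad input.
        return data.decode(codec, errors="replace"), codec

# Valid UTF-8 is taken as UTF-8 regardless of the fallback.
print(decode_prefer_utf8("é".encode("utf-8"), "latin-1"))
# A lone 0xE9 byte is not valid UTF-8, so the fallback is used.
print(decode_prefer_utf8(b"caf\xe9", "latin-1"))
# Latin-1 text whose bytes happen to form valid UTF-8 is misread.
print(decode_prefer_utf8("Ã©".encode("latin-1"), "latin-1"))
```

The last case is the catch: a Latin-1 byte pair that happens to be valid
UTF-8 is silently decoded as UTF-8, which is the kind of misdetection a
UTF-8-first policy risks in a non-UTF-8 locale.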




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 19:35                             ` Eli Zaretskii
  2015-09-26 20:26                               ` Chad Brown
@ 2015-09-26 20:32                               ` Paul Eggert
  2015-09-27  7:27                                 ` Eli Zaretskii
  1 sibling, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-26 20:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, monnier, emacs-devel

Eli Zaretskii wrote:
> The relevant statistics for Emacs is of source files, not of HTML
> pages.

Sure, and source files are how this thread got started: nowadays in GNU projects 
they're typically UTF-8 regardless of system locale settings, and Emacs should 
be better about supporting this typical situation.  UTF-8 is common partly 
because source files are shared widely via the Internet, on sites like Savannah.

The days of lonely hackers writing code in their own private Shift-JIS 
directories are largely over.  Of course Emacs can still support such users, but 
the default should be tailored to what's more typical nowadays.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 20:32                               ` Paul Eggert
@ 2015-09-27  7:27                                 ` Eli Zaretskii
  2015-09-27  7:42                                   ` David Kastrup
  2015-09-27  8:22                                   ` Paul Eggert
  0 siblings, 2 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27  7:27 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dak, monnier, emacs-devel

> Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 26 Sep 2015 13:32:33 -0700
> 
> Eli Zaretskii wrote:
> > The relevant statistics for Emacs is of source files, not of HTML
> > pages.
> 
> Sure, and source files are how this thread got started: nowadays in GNU projects 
> they're typically UTF-8 regardless of system locale settings, and Emacs should 
> be better about supporting this typical situation.  UTF-8 is common partly 
> because source files are shared widely via the Internet, on sites like Savannah.
> 
> The days of lonely hackers writing code in their own private Shift-JIS 
> directories are largely over.  Of course Emacs can still support such users, but 
> the default should be tailored to what's more typical nowadays.

Emacs supports the typical situation quite well already, definitely so
in a typical (i.e. UTF-8) locale.  The issue at hand is not how to
support the typical situation, it's whether that typical situation is
the _only_ situation that matters, so much so that we can ignore the
locale-derived defaults.

In any case, I said we needed _statistics_, i.e. numbers, not just
impressions and opinions.

I don't know how to find a representative set of C sources, not even
for European locales.  I looked at the C files of GNU projects from
recent years on my main development system, which is probably not
very representative.  There are more than 142,000 C files there.
Using the 'file' utility, I found about 1.8% of UTF-8 encoded files
and about 0.2% ISO-8859 encoded files (the vast majority was US ASCII,
of course).  That's still more than 250 ISO-8859 encoded files.

I've also looked at the *.po files in the latest releases of GNU Make,
Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
such files still use non-UTF-8 encodings.  I see similar figures for
the txi-*.tex files that came with Texinfo 6.0.  Presumably, that
follows the default conventions of the respective locales.

So, while I agree with you that UTF-8 encoded files are the majority
among non-ASCII files (and Emacs development aligns itself with that
fact very well), the non-UTF-8 minority, even in the Posix world, is
still significant enough, and we cannot possibly ignore it.
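
A tally like the one above can be approximated with a short script.  This
is a minimal Python sketch, not what the 'file' utility does internally:
`classify` and `tally` are hypothetical names, and anything that is
neither ASCII nor valid UTF-8 is lumped together as "other-8bit" without
trying to tell ISO-8859 apart from other legacy encodings.

```python
from collections import Counter

def classify(data):
    """Return 'ascii', 'utf-8' or 'other-8bit' for a file's raw bytes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other-8bit"

def tally(blobs):
    """Return the percentage of each class over an iterable of byte strings."""
    counts = Counter(classify(blob) for blob in blobs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * n / total for kind, n in counts.items()}

# Four tiny stand-ins for source files.
files = [b"int main(void);", b"/* ok */",
         "/* na\u00efve */".encode("utf-8"),
         "/* na\u00efve */".encode("latin-1")]
print(tally(files))
```

In real use the byte strings would come from reading each file in binary
mode; the in-memory list here just keeps the sketch self-contained.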


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:27                                 ` Eli Zaretskii
@ 2015-09-27  7:42                                   ` David Kastrup
  2015-09-27  9:20                                     ` Rustom Mody
  2015-09-27  8:22                                   ` Paul Eggert
  1 sibling, 1 reply; 70+ messages in thread
From: David Kastrup @ 2015-09-27  7:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Paul Eggert, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> I've also looked at the *.po files in the latest releases of GNU Make,
> Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
> such files still use non-UTF-8 encodings.

Which, btw, I consider crazy.  It's one thing to pick an encoding for
local language processing and display.  But for an internationalization
system, it does not really make sense to venture to local encodings
outside of I/O.  There is a really strong case for using only UTF-8 in
PO files instead of juggling with many-to-many encoding setups.

> I see similar figures for the txi-*.tex files that came with Texinfo
> 6.0.  Presumably, that follows the default conventions of the
> respective locales.

Texinfo uses PDFTeX for its encoding processing, and PDFTeX is firmly an
8-bit system.  TeX wouldn't be TeX if it wasn't macroprogrammed to deal
with that, but Texinfo being a rather low-level format, UTF-8 processing
time dwarfs anything else.

So if you have, say, a German input file for Texinfo and can process it
either in Latin-1 or UTF-8, chances are that the Latin-1 version runs
more than twice as fast.

Now that's of course just the processing in printed form.  Thanks to
Texinfo now being written in Perl, the PDFTeX backend is likely the
fastest right now either way so it may not be as much of a concern.

But many Texinfo sources originate from a time when UTF-8 was either
not supported at all or was a major contributor to conversion time.

-- 
David Kastrup


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:42                                   ` David Kastrup
@ 2015-09-27  9:20                                     ` Rustom Mody
  2015-09-27 10:13                                       ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Rustom Mody @ 2015-09-27  9:20 UTC (permalink / raw)
  To: emacs-devel

On Sun, Sep 27, 2015 at 1:12 PM, David Kastrup <dak@gnu.org> wrote:
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > I've also looked at the *.po files in the latest releases of GNU Make,
> > Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
> > such files still use non-UTF-8 encodings.
>
> Which, btw, I consider crazy.
>


I've been trying to understand this stuff and was looking at, e.g.,
lisp/language/indian.el

In there I find that:
(defconst bengali-composable-pattern
  (let ((table
     '(("a" . "\u0981")        ; SIGN CANDRABINDU
       ("A" . "[\u0982-\u0983]")    ; SIGN ANUSVARA .. VISARGA
       ("V" . "[\u0985-\u0994\u09E0-\u09E1]") ; independent vowel
       ("C" . "[\u0995-\u09B9\u09DC-\u09DF\u09F1]") ; consonant
       ("B" . "[\u09AC\u09AF-\u09B0\u09F0]")        ; BA, YA, RA
       ("R" . "[\u09B0\u09F0]")        ; RA
       ("n" . "\u09BC")        ; NUKTA
       ("v" . "[\u09BE-\u09CC\u09D7\u09E2-\u09E3]") ; vowel sign
       ("H" . "\u09CD")        ; HALANT
       ("T" . "\u09CE")        ; KHANDA TA
       ("N" . "\u200C")        ; ZWNJ
       ("J" . "\u200D")        ; ZWJ
       ("X" . "[\u0980-\u09FF]"))))    ; all coverage
etc etc

This is repeated with small variations for Devanagari, Tamil, Telugu, etc.
It would surely help a native speaker if the strings used the actual
characters, with the UCS hex codes moved into the comments instead.

So then I checked why the file was showing as UTF-8 encoded.

Found this one non-ASCII line:

(set-language-info-alist
 "Kannada" '((charset unicode)
         (coding-system mule-utf-8)
         (coding-priority mule-utf-8)
         (input-method . "kannada-itrans")
         (sample-text . "Kannada (ಕನ್ನಡ)    ನಮಸ್ಕಾರ")
         (documentation . "\
Kannada language and script is supported in this language
environment."))
 '("Indian"))

It strikes me that this sample text should be there for the other
languages also, but it does not seem to be.

Just for context: if I can understand what's going on, I would like to
help improve these docs:


(info "(elisp)input methods")

  | How to define input methods is not yet documented in this manual,
  | but here we describe how to use them.




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  9:20                                     ` Rustom Mody
@ 2015-09-27 10:13                                       ` Eli Zaretskii
  2015-09-27 20:21                                         ` Paul Eggert
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 10:13 UTC (permalink / raw)
  To: Rustom Mody; +Cc: emacs-devel

> From: Rustom Mody <rustompmody@gmail.com>
> Date: Sun, 27 Sep 2015 14:50:48 +0530
> 
> I've been trying to understand this stuff and was looking at, e.g.,
> lisp/language/indian.el
> 
> In there I find that:
> (defconst bengali-composable-pattern
>   (let ((table
>      '(("a" . "\u0981")        ; SIGN CANDRABINDU
>        ("A" . "[\u0982-\u0983]")    ; SIGN ANUSVARA .. VISARGA
>        ("V" . "[\u0985-\u0994\u09E0-\u09E1]") ; independent vowel
>        ("C" . "[\u0995-\u09B9\u09DC-\u09DF\u09F1]") ; consonant
>        ("B" . "[\u09AC\u09AF-\u09B0\u09F0]")        ; BA, YA, RA
>        ("R" . "[\u09B0\u09F0]")        ; RA
>        ("n" . "\u09BC")        ; NUKTA
>        ("v" . "[\u09BE-\u09CC\u09D7\u09E2-\u09E3]") ; vowel sign
>        ("H" . "\u09CD")        ; HALANT
>        ("T" . "\u09CE")        ; KHANDA TA
>        ("N" . "\u200C")        ; ZWNJ
>        ("J" . "\u200D")        ; ZWJ
>        ("X" . "[\u0980-\u09FF]"))))    ; all coverage
> etc etc

This is unrelated: it specifies which character sequences should be
composed and displayed as a single grapheme cluster.

> So then I checked why the file was showing as UTF-8 encoded.
> 
> Found this one non-ASCII line:
> 
> (set-language-info-alist
>  "Kannada" '((charset unicode)
>          (coding-system mule-utf-8)
>          (coding-priority mule-utf-8)
>          (input-method . "kannada-itrans")
>          (sample-text . "Kannada (ಕನ್ನಡ)    ನಮಸ್ಕಾರ")
>          (documentation . "\
> Kannada language and script is supported in this language
> environment."))
>  '("Indian"))
> 
> It strikes me that this sample text should be there for the other
> languages also but it does not seem to be there

You cannot base encoding decisions on the language or script alone,
unless that language exists in a single locale.  Many languages and
scripts serve several different locales with several different default
encodings.

> Just for context: if I can understand what's going on, I would like to
> help improve these docs:
> 
> 
> (info "(elisp)input methods")
> 
>   | How to define input methods is not yet documented in this manual,
>   | but here we describe how to use them.

Again unrelated.  Input methods are about typing characters not
directly supported by the user's keyboard.





* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 10:13                                       ` Eli Zaretskii
@ 2015-09-27 20:21                                         ` Paul Eggert
  2015-09-27 21:04                                           ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-27 20:21 UTC (permalink / raw)
  To: Eli Zaretskii, Rustom Mody; +Cc: emacs-devel

Eli Zaretskii wrote:
> This is unrelated: it specifies which character sequences should be
> composed and displayed as a single grapheme cluster.

Yes.  It might be reasonable to replace some of those \u instances for 
readability, e.g.:

-	   ("V" . "[\u0904-\u0914\u0960-\u0961\u0972]") ; independent vowel
+	   ("V" . "[ऄ-औॠ-ॡॲ]") ; independent vowel

But replacements would not be such a good idea for some of this code, e.g.:

-	   ("H" . "\u094D")		; HALANT
+	   ("H" . "्")		; HALANT

as standalone combining characters are problematic on display, and here:

-	   ("J" . "\u200D")		; ZWJ
+	   ("J" . "‍")		; ZWJ

where one can't easily see a zero width joiner when editing the source file.  I 
expect that whoever wrote that code felt more comfortable sticking with \u 
escapes uniformly, rather than using \u sometimes and not other times.
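
The distinction can be expressed as a simple rule of thumb.  The following
is a minimal Python sketch of that heuristic, not a proposal for the Emacs
sources: `keep_as_escape` and `render` are hypothetical names, and the set
of Unicode general categories treated as "hard to see" is an assumption.

```python
import unicodedata

# General categories that are awkward as literal characters in source:
# Mn/Mc/Me are combining marks, Cf covers invisible format controls
# such as ZWJ and ZWNJ.  (The exact set is an assumption.)
HARD_TO_SEE = {"Mn", "Mc", "Me", "Cf"}

def keep_as_escape(ch):
    """True if CH is better left as a \\uXXXX escape in source code."""
    return unicodedata.category(ch) in HARD_TO_SEE

def render(ch):
    """Return CH itself, or its \\uXXXX spelling if it is hard to see."""
    if keep_as_escape(ch):
        return "\\u%04X" % ord(ch)
    return ch

# U+0904 is an ordinary letter, U+094D a combining halant,
# U+200D the zero width joiner.
for ch in ("\u0904", "\u094d", "\u200d"):
    print("%04X" % ord(ch), unicodedata.category(ch), render(ch))
```

Under this rule the independent vowels could be written literally, while
the halant and the joiners would stay as escapes, at the cost of the
mixed style mentioned above.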




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 20:21                                         ` Paul Eggert
@ 2015-09-27 21:04                                           ` Eli Zaretskii
  0 siblings, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 21:04 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rustompmody, emacs-devel

> Cc: emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 27 Sep 2015 13:21:51 -0700
> 
> Eli Zaretskii wrote:
> > This is unrelated: it specifies which character sequences should be
> > composed and displayed as a single grapheme cluster.
> 
> Yes.  It might be reasonable to replace some of those \u instances for 
> readability, e.g.:
> 
> -	   ("V" . "[\u0904-\u0914\u0960-\u0961\u0972]") ; independent vowel
> +	   ("V" . "[ऄ-औॠ-ॡॲ]") ; independent vowel

I'm not so sure this is a good idea: since most of us don't read Indic
scripts, leaving the codepoints there makes it easier to compare these
patterns with various relevant publications and standards on the
Internet.  If we make them characters instead, most of us will have to
use "C-x =" to see the codepoints anyway.

> But replacements would not be such a good idea for some of this code, e.g.:
> 
> -	   ("H" . "\u094D")		; HALANT
> +	   ("H" . "्")		; HALANT
> 
> as standalone combining characters are problematic on display, and here:
> 
> -	   ("J" . "\u200D")		; ZWJ
> +	   ("J" . "‍")		; ZWJ
> 
> where one can't easily see a zero width joiner when editing the
> source file.

Indeed.





* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  7:27                                 ` Eli Zaretskii
  2015-09-27  7:42                                   ` David Kastrup
@ 2015-09-27  8:22                                   ` Paul Eggert
  2015-09-27  8:55                                     ` Eli Zaretskii
  2015-09-27  9:56                                     ` Andreas Schwab
  1 sibling, 2 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  8:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, monnier, emacs-devel

Eli Zaretskii wrote:
> I've also looked at the *.po files in the latest releases of GNU Make,
> Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
> such files still use non-UTF-8 encodings.

Yes, and those files are a pain to look at with Emacs now, since it typically 
misguesses their encodings.  Presumably Emacs should be looking at .po files' 
charset= decorations.

What's likely happening with those files is that they were originally created 
long ago in an 8-bit locale, and nobody has bothered to update their encodings 
since then.  Many of the files haven't been changed in ages (about half of them 
have revision dates before 2010), and of course the older files will prefer 
legacy encodings.  These older files are not a particularly good match for text 
that people edit today.

> while I agree with you that UTF-8 encoded files are the majority
> among non-ASCII files (and Emacs development aligns itself with that
> fact very well), the non-UTF-8 minority, even in the Posix world, is
> still significant enough, and we cannot possibly ignore it.

Naturally we cannot ignore it.  All I'm suggesting is that we change the default 
behavior so that it's more UTF-8 friendly, since that's the way the world is 
going.  The old Emacs behavior should still be available, for people who need it.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:22                                   ` Paul Eggert
@ 2015-09-27  8:55                                     ` Eli Zaretskii
  2015-09-27  9:56                                     ` Andreas Schwab
  1 sibling, 0 replies; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27  8:55 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dak, monnier, emacs-devel

> Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 27 Sep 2015 01:22:48 -0700
> 
> Eli Zaretskii wrote:
> > I've also looked at the *.po files in the latest releases of GNU Make,
> > Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
> > such files still use non-UTF-8 encodings.
> 
> Yes, and those files are a pain to look at with Emacs now, since it typically 
> misguesses their encodings.  Presumably Emacs should be looking at .po files' 
> charset= decorations.

You need to install the po-mode.

But anyway, that's not the issue at hand.  I just used those files as
indicators of preferences of some locales.

> > while I agree with you that UTF-8 encoded files are the majority
> > among non-ASCII files (and Emacs development aligns itself with that
> > fact very well), the non-UTF-8 minority, even in the Posix world, is
> > still significant enough, and we cannot possibly ignore it.
> 
> Naturally we cannot ignore it.  All I'm suggesting is that we change the default 
> behavior so that it's more UTF-8 friendly, since that's the way the world is 
> going.  The old Emacs behavior should still be available, for people who need it.

You use "default" here in a sense that is different from what the Mule
stuff does.  Since Emacs attempts to support i18n, not just l10n, it
cannot ask users to modify their defaults whenever they meet a file
that's decoded incorrectly.  Emacs uses the defaults in this area as
the last resort, when no other information is available in the file
itself or its accompanying meta-data.  That default is already as
friendly to UTF-8 as possible: UTF-8 is used in any locale where
that's the default.  Going further, i.e. preferring UTF-8 in locales
whose preferences are different, will simply bring back the old bugs
and misfeatures of Emacs 20 and 21 which we worked so hard to
eradicate.

IMO, the _only_ sane way forward is to introduce more reliable ways of
detecting the encoding, whether by using some new kinds of meta-data
or by more extensive analysis of the text itself.  (The latter
solution will probably have difficulties with decoding sub-process
output, but it could be very efficient with disk files and large
bodies of text made available to Emacs at once.)

IOW, I don't think we will be able to change our locale-derived
defaults any time soon.  What we can do is minimize the probability of
having to fall back on those defaults.  But this requires that
Someone™ volunteers to revamp our detect_coding_* implementations in
that direction.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  8:22                                   ` Paul Eggert
  2015-09-27  8:55                                     ` Eli Zaretskii
@ 2015-09-27  9:56                                     ` Andreas Schwab
  2015-09-27 10:04                                       ` David Kastrup
  1 sibling, 1 reply; 70+ messages in thread
From: Andreas Schwab @ 2015-09-27  9:56 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, dak, monnier, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> Yes, and those files are a pain to look at with Emacs now, since it
> typically misguesses their encodings.  Presumably Emacs should be looking
> at .po files' charset= decorations.

It does already if you use the po-mode distributed with gettext.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  9:56                                     ` Andreas Schwab
@ 2015-09-27 10:04                                       ` David Kastrup
  2015-09-27 10:16                                         ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: David Kastrup @ 2015-09-27 10:04 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Eli Zaretskii, Paul Eggert, monnier, emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> Paul Eggert <eggert@cs.ucla.edu> writes:
>
>> Yes, and those files are a pain to look at with Emacs now, since it
>> typically misguesses their encodings.  Presumably Emacs should be looking
>> at .po files' charset= decorations.
>
> It does already if you use the po-mode distributed with gettext.

gettext being the standard GNU i18n mechanism, wouldn't it make sense to
keep the latest version distributed with Emacs rather than requiring
users to manually install them?

-- 
David Kastrup




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 10:04                                       ` David Kastrup
@ 2015-09-27 10:16                                         ` Eli Zaretskii
  2015-09-27 10:36                                           ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 10:16 UTC (permalink / raw)
  To: David Kastrup; +Cc: eggert, schwab, monnier, emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: Paul Eggert <eggert@cs.ucla.edu>,  Eli Zaretskii <eliz@gnu.org>,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Sun, 27 Sep 2015 12:04:45 +0200
> 
> Andreas Schwab <schwab@linux-m68k.org> writes:
> 
> > Paul Eggert <eggert@cs.ucla.edu> writes:
> >
> >> Yes, and those files are a pain to look at with Emacs now, since it
> >> typically misguesses their encodings.  Presumably Emacs should be looking
> >> at .po files' charset= decorations.
> >
> > It does already if you use the po-mode distributed with gettext.
> 
> gettext being the standard GNU i18n mechanism, wouldn't it make sense to
> keep the latest version distributed with Emacs rather than requiring
> users to manually install them?

We discussed that at some point in the past.  I don't remember why we
decided not to do that, but a search in the archives might tell.
Maybe those reasons are no longer relevant.




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 10:16                                         ` Eli Zaretskii
@ 2015-09-27 10:36                                           ` Eli Zaretskii
  2015-09-27 10:59                                             ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 10:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, dak, schwab, monnier, eggert

> Date: Sun, 27 Sep 2015 13:16:18 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: eggert@cs.ucla.edu, schwab@linux-m68k.org, monnier@iro.umontreal.ca,
> 	emacs-devel@gnu.org
> 
> > From: David Kastrup <dak@gnu.org>
> > Cc: Paul Eggert <eggert@cs.ucla.edu>,  Eli Zaretskii <eliz@gnu.org>,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> > Date: Sun, 27 Sep 2015 12:04:45 +0200
> > 
> > Andreas Schwab <schwab@linux-m68k.org> writes:
> > 
> > > Paul Eggert <eggert@cs.ucla.edu> writes:
> > >
> > >> Yes, and those files are a pain to look at with Emacs now, since it
> > >> typically misguesses their encodings.  Presumably Emacs should be looking
> > >> at .po files' charset= decorations.
> > >
> > > It does already if you use the po-mode distributed with gettext.
> > 
> > gettext being the standard GNU i18n mechanism, wouldn't it make sense to
> > keep the latest version distributed with Emacs rather than requiring
> > users to manually install them?
> 
> We discussed that at some point in the past.  I don't remember why we
> decided not to do that, but a search in the archives might tell.
> Maybe those reasons are no longer relevant.

I've misremembered.  The discussion is here:

  http://lists.gnu.org/archive/html/emacs-devel/2002-03/msg00167.html

and, more importantly, its result is already in Emacs:

file-coding-system-alist is a variable defined in ‘C source code’.
Its value is shown below.

[...]
Value: (("\\.dz\\'" no-conversion . no-conversion)
 ("\\.txz\\'" no-conversion . no-conversion)
 ("\\.xz\\'" no-conversion . no-conversion)
 ("\\.lzma\\'" no-conversion . no-conversion)
 ("\\.lz\\'" no-conversion . no-conversion)
 ("\\.g?z\\'" no-conversion . no-conversion)
 ("\\.\\(?:tgz\\|svgz\\|sifz\\)\\'" no-conversion . no-conversion)
 ("\\.tbz2?\\'" no-conversion . no-conversion)
 ("\\.bz2\\'" no-conversion . no-conversion)
 ("\\.Z\\'" no-conversion . no-conversion)
 ("\\.elc\\'" . utf-8-emacs)
 ("\\.el\\'" . prefer-utf-8)
 ("\\.utf\\(-8\\)?\\'" . utf-8)
 ("\\.xml\\'" . xml-find-file-coding-system)
 ("\\(\\`\\|/\\)loaddefs.el\\'" raw-text . raw-text-unix)
 ("\\.tar\\'" no-conversion . no-conversion)
 ("\\.po[tx]?\\'\\|\\.po\\." . po-find-file-coding-system)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ("\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'" . latexenc-find-file-coding-system)
 ("" undecided))

And the bundled po.el already defines po-find-file-coding-system.

So it sounds like we simply have a bug here.

But once again, the handling of *.po files is not the issue here.  The
issue is whether we can ignore the possibility of non-UTF-8 encodings
in locales whose codeset is not UTF-8.





* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 10:36                                           ` Eli Zaretskii
@ 2015-09-27 10:59                                             ` Eli Zaretskii
  2015-09-27 20:05                                               ` Paul Eggert
  0 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-27 10:59 UTC (permalink / raw)
  To: eggert; +Cc: dak, schwab, monnier, emacs-devel

> Date: Sun, 27 Sep 2015 13:36:08 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org, dak@gnu.org, schwab@linux-m68k.org,
> 	monnier@iro.umontreal.ca, eggert@cs.ucla.edu
> 
>  ("\\.po[tx]?\\'\\|\\.po\\." . po-find-file-coding-system)
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  ("\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'" . latexenc-find-file-coding-system)
>  ("" undecided))
> 
> And the bundled po.el already defines po-find-file-coding-system.
> 
> So it sounds like we simply have a bug here.

Ehm.. what bug?  AFAICS, the encoding is correctly detected and used
when I visit *.po files, no matter what is their encoding.

So I'm not sure why Paul said:

 >> Yes, and those files are a pain to look at with Emacs now, since it
 >> typically misguesses their encodings.  Presumably Emacs should be looking
 >> at .po files' charset= decorations.

as I see no such problems.  Maybe Paul has some customizations that
somehow disable po.el's detection?
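
For readers unfamiliar with the charset= declaration in PO headers, here
is a minimal Python sketch of how it can be read.  It is not how po.el's
po-find-file-coding-system is implemented; `po_charset` is a hypothetical
helper, and it assumes the declaration has the usual
"Content-Type: text/plain; charset=NAME" form.

```python
import codecs
import re

# Charset names are plain ASCII, so the header can be searched as bytes
# before the file's encoding is known.
_CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)", re.IGNORECASE)

def po_charset(data, default=None):
    """Return a codec name for the PO file bytes DATA, or DEFAULT."""
    match = _CHARSET_RE.search(data)
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        # Normalize the name and reject placeholders such as "CHARSET".
        return codecs.lookup(name).name
    except LookupError:
        return default

sample = (b'msgid ""\nmsgstr ""\n'
          b'"Content-Type: text/plain; charset=ISO-8859-1\\n"\n\n'
          b'msgid "yes"\nmsgstr "s\xed"\n')
print(sample.decode(po_charset(sample)))
```

Untranslated templates carry the literal placeholder charset=CHARSET,
which is why the sketch falls back to `default` when the name is not a
known codec.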




* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27 10:59                                             ` Eli Zaretskii
@ 2015-09-27 20:05                                               ` Paul Eggert
  0 siblings, 0 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27 20:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, schwab, monnier, emacs-devel

Eli Zaretskii wrote:
> Maybe Paul has some customizations that
> somehow disable po.el's detection?

Yes, sorry, false alarm; I had put a hack a while ago into my .emacs file 
temporarily for testing coding systems, and forgot that it was there.  When I 
removed that hack, most of the problem went away.

po-mode still has a coding-system problem with ASCII files (of all things!).  I 
just now filed a bug report for it (Bug#21574).  Surely this is low priority, as 
I expect hardly anybody uses ASCII .po files nowadays.


* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 16:01                       ` Paul Eggert
  2015-09-26 16:09                         ` David Kastrup
@ 2015-09-26 17:25                         ` Eli Zaretskii
  2015-09-26 18:51                           ` Paul Eggert
  2015-09-27  0:12                         ` stephen
  2 siblings, 1 reply; 70+ messages in thread
From: Eli Zaretskii @ 2015-09-26 17:25 UTC (permalink / raw)
  To: Paul Eggert; +Cc: monnier, emacs-devel

> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 26 Sep 2015 09:01:04 -0700
> 
> Eli Zaretskii wrote:
> > So you are, in effect, saying that it is incorrect to derive the
> > default encodings from the locale's codeset?
> 
> Yes, for Emacs developers.  And come to think of it, for most Emacs users. 
> Nowadays in my experience most non-ASCII text files use UTF-8, regardless of 
> locale.

Are you sure your experience isn't biased by the fact you mostly work
in UTF-8 locales?

> The old days of having to guess encoding from the locale are passing
> away.  This is partly due to UTF-8 being the encoding of choice for
> HTML and XML, where UTF-8 overtook the older 8-bit encodings in 2008
> and now is by far the dominant encoding.

We already DTRT with XML files, and should be doing TRT with any file
format that includes the specification of the encoding in it.

The problem, IMO, is not only with disk files.  It is also with email
messages, output from processes, etc.  E.g., I routinely get Latin-1
encoded email from people whose platform is GNU/Linux.  IOW, non-UTF
encodings are far from being dead yet.

Using UTF-8 by default is certainly wrong on MS-Windows.

> One way to accommodate the new reality would be to change Emacs so that by 
> default the system locale does not affect Emacs's guess of a file's encoding if 
> the file's initial sample is valid UTF-8.  Users could set a variable to 
> re-enable the old behavior.

The problem with this line of thought is that "initial sample" part --
how far into the file should we look, how far is far enough?  E.g.,
tips.texi has its first non-ASCII character at character position
25353.  We've been there before, and found this not reliable enough.
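
To illustrate the sampling problem, here is a minimal Python sketch.  It
is not Emacs's actual detection code: the function name and the
4096-byte sample size are assumptions made up for this example, and the
test data merely imitates a file whose first non-ASCII byte sits at
position 25353.

```python
import codecs


def sample_is_valid_utf8(data: bytes, sample_size: int) -> bool:
    """Return True if the first sample_size bytes of data decode as UTF-8.

    Hypothetical helper for illustration only.  An incremental decoder is
    used so that a sample cut in the middle of a multibyte sequence is
    not reported as invalid.
    """
    sample = data[:sample_size]
    # Only treat the sample as complete if it covers the whole input.
    final = sample_size >= len(data)
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=final)
    except UnicodeDecodeError:
        return False
    return True


# 25352 ASCII bytes, then a Latin-1 encoded e-acute (0xE9) as byte 25353,
# followed by more ASCII text.
data = b"x" * 25352 + "\u00e9".encode("latin-1") + b" more text\n"

short_sample_ok = sample_is_valid_utf8(data, 4096)
whole_file_ok = sample_is_valid_utf8(data, len(data))
print(short_sample_ok, whole_file_ok)
```

The short sample is pure ASCII and therefore passes, while the file as a
whole is not valid UTF-8, so any fixed sample size can be defeated by a
file that is long enough.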

Anyway, doesn't "(prefer-coding-system 'utf-8)" already do what you
want us to offer?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 17:25                         ` Eli Zaretskii
@ 2015-09-26 18:51                           ` Paul Eggert
  0 siblings, 0 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-26 18:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii wrote:
> Anyway, doesn't "(prefer-coding-system 'utf-8)" already do what you
> want us to offer?

If that works, then let's make it the default, at least on non-MS-Windows 
platforms.  I normally work in a UTF-8 locale, so I assume it'd be a no-op for 
me, but perhaps it would help for others.



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-26 16:01                       ` Paul Eggert
  2015-09-26 16:09                         ` David Kastrup
  2015-09-26 17:25                         ` Eli Zaretskii
@ 2015-09-27  0:12                         ` stephen
  2015-09-27  4:44                           ` Paul Eggert
  2 siblings, 1 reply; 70+ messages in thread
From: stephen @ 2015-09-27  0:12 UTC (permalink / raw)
  To: emacs-devel

>>>>> Paul Eggert writes:
 > Eli Zaretskii wrote:

 >> So you are, in effect, saying that it is incorrect to derive the
 >> default encodings from the locale's codeset?

 > Yes, for Emacs developers.

I think this makes sense.  IIUC Emacs already uses characters outside
of the Unicode repertoire, so it shouldn't be too hard to replicate
any Emacs capabilities that require non-Unicode characters or charsets
*inside* Emacs by using such characters.  Assuming there are any; I
suspect even HELLO doesn't actually need them.  There's no "gaiji"
problem of how to tell Emacs what to do with those characters; the
developer who introduces them into Emacs is responsible for adding
them to Emacs's non-Unicode repertoire.

 > And come to think of it, for most Emacs users.

I hope not, because that would imply that Emacs users in China, Japan,
probably Korea, and Taiwan are becoming a decreasing rather than
increasing fraction of Emacs users.

 > Nowadays in my experience most non-ASCII text files use UTF-8,
 > regardless of locale.

Toto, I don't think we're in Kansas any more.

 > The old days of having to guess encoding from the locale are
 > passing away.  This is partly due to UTF-8 being the encoding of
 > choice for HTML and XML, where UTF-8 overtook the older 8-bit
 > encodings in 2008 and now is by far the dominant encoding.

On the commercial internet, yes, but not for government and academic
sites in Japan and China.

 > One way to accommodate the new reality would be to

Recognize that it's probably due to insufficient experience?

 > change Emacs so that by default the system locale does not affect
 > Emacs's guess of a file's encoding if the file's initial sample is
 > valid UTF-8.

"Not affect" is probably a bad idea.  Giving UTF-8 too strong
preference on Windows is a bad idea, because there are a lot of
Windows coding systems that use UTF-8 trailing bytes to represent
characters; it's occasionally possible to run into UTF-8-conforming
files that are intended to be something else.  This isn't true for
ISO-8859 coding systems.
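
A small Python sketch of the ambiguity, using byte strings chosen for
this example (they do not come from any real file).  The codec names
are Python's; cp1252 and cp932 stand in for "Windows coding systems"
here as an assumption about which ones are meant.

```python
import unicodedata

# The UTF-8 encoding of U+201C LEFT DOUBLE QUOTATION MARK.
raw = b"\xe2\x80\x9c"

as_utf8 = raw.decode("utf-8")      # one curly quote
as_cp1252 = raw.decode("cp1252")   # three printable characters
as_latin1 = raw.decode("latin-1")  # one letter plus two C1 controls


def count_controls(text: str) -> int:
    """Count characters in Unicode general category Cc (control)."""
    return sum(unicodedata.category(ch) == "Cc" for ch in text)


# In windows-1252 the UTF-8 trailing bytes 0x80 and 0x9C are ordinary
# printable characters, so the bytes are plausible text in both encodings.
# In ISO-8859-1 the same bytes are control characters, which real text
# rarely contains.
controls_in_cp1252 = count_controls(as_cp1252)
controls_in_latin1 = count_controls(as_latin1)

# A two-byte UTF-8 sequence that is also two half-width katakana in cp932.
kana_raw = b"\xc3\xa9"
kana_as_utf8 = kana_raw.decode("utf-8")
kana_as_cp932 = kana_raw.decode("cp932")

print(repr(as_utf8), repr(as_cp1252), controls_in_cp1252, controls_in_latin1)
print(repr(kana_as_utf8), repr(kana_as_cp932))
```

Both byte strings decode without error under more than one coding
system, so validity as UTF-8 alone does not prove that UTF-8 was
intended.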

 > Users could set a variable to re-enable the old behavior.  If we
 > did this, we wouldn't have the error-prone process of sprinkling
 > 'coding: utf-8' cookies all over the place.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  0:12                         ` stephen
@ 2015-09-27  4:44                           ` Paul Eggert
  2015-09-27  6:20                             ` stephen
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  4:44 UTC (permalink / raw)
  To: stephen, emacs-devel

stephen@xemacs.org wrote:
>   > This is partly due to UTF-8 being the encoding of
>   > choice for HTML and XML, where UTF-8 overtook the older 8-bit
>   > encodings in 2008 and now is by far the dominant encoding.
>
> On the commercial internet, yes, but not for government and academic
> sites in Japan and China.

I think your information is out of date.  Yes, ten years ago there was a lot of 
non-UTF-8 out there, but nowadays they've largely moved on to UTF-8.

For fun I just now visited a few of the top government and academic websites in 
Japan:

http://www.japan.go.jp/
http://www.mofa.go.jp/
http://nettv.gov-online.go.jp/
http://www.e-kokusei.go.jp/
https://www.env.go.jp/
http://www.u-tokyo.ac.jp/
http://www.kyoto-u.ac.jp/
http://www.osaka-u.ac.jp/
http://www.keio.ac.jp/

I configured my browser to say that I preferred Japanese text.  All ten web 
sites gave me UTF-8.  Feel free to canvass China, but I daresay you'll find the 
same.

Of course one can still find a few web sites using other encodings, but like it 
or not, UTF-8 dominates now.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  4:44                           ` Paul Eggert
@ 2015-09-27  6:20                             ` stephen
  2015-09-27  8:34                               ` Paul Eggert
  0 siblings, 1 reply; 70+ messages in thread
From: stephen @ 2015-09-27  6:20 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Paul Eggert writes:

 > I think your information is out of date.

Rather, I think that yours is superficial.  Really, you should listen
to those of us who live and work outside of the ASCII hemisphere.

I live and teach in Japan (a stone's throw from ETL, as it happens),
and most of the students I supervise are Chinese.  I regularly need to
access Chinese and Japanese government and corporate data, and
retrieve preprints and data (and sometimes code) from the personal
pages of other scholars.  Mojibake in the HTML pages is frequent, in
both Firefox and Chrome (of course it's almost always easy to guess
the actual coded character set in use, but it is mojibake).  A
frequent cause is webservers configured to send "Content-Type:
text/html; charset=utf-8" but the page is encoded in something else.
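
As a rough Python sketch of that failure mode (the sample text and
variable names are made up for this example, and no real server or
browser is involved): bytes written as Shift JIS are decoded using the
charset the header declares.

```python
# Page body actually written in Shift JIS ("nihongo", i.e. "Japanese").
original_text = "\u65e5\u672c\u8a9e"
page_bytes = original_text.encode("shift_jis")

# Charset the (hypothetical) server declares in its Content-Type header.
declared_charset = "utf-8"

# A strict decode with the declared charset fails outright.
try:
    page_bytes.decode(declared_charset)
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# A lenient decode, roughly what a reader ends up seeing, substitutes
# U+FFFD replacement characters for the undecodable bytes.
garbled = page_bytes.decode(declared_charset, errors="replace")

# Decoding with the charset the author really used recovers the text.
recovered = page_bytes.decode("shift_jis")

print(strict_decode_ok, repr(garbled), recovered == original_text)
```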

 > Yes, ten years ago there was a lot of non-UTF-8 out there, but
 > nowadays they've largely moved on to UTF-8.

"Beauty is only skin-deep."  The *top* pages, and some whole sites,
have moved on, because having beautiful (if mostly useless) top pages
is a matter of "face", so they buy new ones from companies with fancy
up-to-date web design software every couple of years.  Perhaps most
recently authored pages are UTF-8.  But the data sets themselves are
typically flat files, either CSV or plaintext.  The explanatory pages,
even if in HTML, often haven't been revised in decades.  Such useful
content is typically in a national standard coded character set rather
than Unicode.

And Emacs is hardly limited to the web.  In practice, almost all mail
I receive from Chinese (even when it is in English or Japanese) is
labelled GB2312, GBK, or GB18030.  The great majority of Japanese mail
is either Shift JIS or ISO 2022 JP (sometimes with "OEM characters"
that even today aren't in Unicode because they're not in JIS).

 > Of course one can still find a few web sites using other encodings,
 > but like it or not, UTF-8 dominates now.

What's not to like about UTF-8?!  I *wish* non-UTF-8 was a matter of
information archaeology and Buddhist scholarship!  I'm sad to say, it
is not: GB variants, Big5, and JIS variants are the *majority* of the
non-ASCII data I handle every day in my Emacs.  (It's not the "great
majority" only because about 30% of the non-ASCII text I handle in
Emacs is authored by me, in UTF-8, of course.)

Regards,

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
  2015-09-27  6:20                             ` stephen
@ 2015-09-27  8:34                               ` Paul Eggert
  0 siblings, 0 replies; 70+ messages in thread
From: Paul Eggert @ 2015-09-27  8:34 UTC (permalink / raw)
  To: stephen; +Cc: emacs-devel

stephen@xemacs.org wrote:
> Perhaps most
> recently authored pages are UTF-8.  But the data sets themselves are
> typically flat files, either CSV or plaintext.  The explanatory pages,
> even if in HTML, often haven't been revised in decades.

Yes, that's pretty much my experience.  In Japan older stuff is mostly 
Shift-JIS, EUC, or maybe ISO-2022-JP.  New stuff is mostly UTF-8.  People using 
old email software send old encodings because that's what they've been doing for 
decades.  Normally it works, because the email envelope tells you the encoding.
But sometimes people screw up and you get mojibake.

But this situation is not an argument for having the locale determine encoding 
when visiting random imported files that lack envelopes.  For such files, it 
often doesn't work to set LC_ALL=ja_JP.ujis and expect Emacs to get things 
right.  (This is one of the things that Eli has noted multiple times, and he's right.)

Of course if one is working in a conservative Japanese government ministry that 
standardized on Shift-JIS back in 1992 and hasn't changed since then, then 
things are different, and Emacs should support such users.  But typical Emacs 
users are not in this situation, and the Emacs default should cater to the 
more-typical case today.

To narrow things down a bit I briefly looked for .jp websites that talk about 
Emacs.  Google reported the following first page's worth of hits (I list year of 
composition, encoding, and URL).  Again, the new stuff is mostly UTF-8, and the 
old stuff is a mishmash, so it's another data point suggesting that defaulting 
to UTF-8 would not be such a bad thing for editing today's text.

2002 Shift-JIS   http://www.rsch.tuis.ac.jp/~ohmi/literacy/emacs/quick.html
2008 ISO-2022-JP http://www.wakayama-u.ac.jp/~takehiko/webprg/03.html
2015 EUC-JP      http://d.hatena.ne.jp/tarao/20150221/1424518030
2015 UTF-8       http://uguisu.skr.jp/Windows/emacs.html
2015 UTF-8       http://www.amazon.co.jp/Emacs%E5%AE%9F%E8%B7%B5%E5%85%A5%E9%96%80-%EF%BD%9E%E6%80%9D%E8%80%83%E3%82%92%E7%9B%B4%E6%84%9F%E7%9A%84%E3%81%AB%E3%82%B3%E3%83%BC%E3%83%89%E5%8C%96%E3%81%97%E3%80%81%E9%96%8B%E7%99%BA%E3%82%92%E5%8A%A0%E9%80%9F%E3%81%99%E3%82%8B-WEB-DB-PRESS-plus/dp/4774150029
2015 UTF-8       http://www.sigasi.jp/better-emacs-vhdl-mode
2006 Shift-JIS   http://www.math.kobe-u.ac.jp/icms2006/icms2006-video/slides/grayson/share/doc/Macaulay2/Macaulay2/html/_teaching_spemacs_sphow_spto_spfind_sp__M2.html
2015 UTF-8       https://osdn.jp/projects/gnupack/

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2015-09-28 15:58 UTC | newest]

Thread overview: 70+ messages
     [not found] <20150921165211.20434.28114@vcs.savannah.gnu.org>
     [not found] ` <E1Ze4K3-0005KC-5U@vcs.savannah.gnu.org>
2015-09-21 19:57   ` [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files Stefan Monnier
2015-09-21 20:07     ` Eli Zaretskii
2015-09-24 16:44       ` Eli Zaretskii
2015-09-24 21:29         ` Stefan Monnier
2015-09-25  7:55           ` Eli Zaretskii
2015-09-25 12:21             ` Stefan Monnier
2015-09-25 13:37               ` Eli Zaretskii
2015-09-25 22:32               ` Paul Eggert
2015-09-26  6:27                 ` Eli Zaretskii
2015-09-26  6:32                   ` Eli Zaretskii
2015-09-26 14:31                   ` Paul Eggert
2015-09-26 15:15                     ` Eli Zaretskii
2015-09-26 16:01                       ` Paul Eggert
2015-09-26 16:09                         ` David Kastrup
2015-09-26 17:26                           ` Eli Zaretskii
2015-09-26 18:53                           ` Paul Eggert
2015-09-26 19:35                             ` Eli Zaretskii
2015-09-26 20:26                               ` Chad Brown
2015-09-26 21:50                                 ` David Kastrup
2015-09-27  4:44                                   ` Paul Eggert
2015-09-27  5:29                                     ` David Kastrup
2015-09-27  7:38                                       ` Paul Eggert
2015-09-27  7:46                                         ` David Kastrup
2015-09-27  7:52                                           ` Paul Eggert
2015-09-27  9:47                                       ` Andreas Schwab
2015-09-27  9:54                                         ` David Kastrup
2015-09-27 10:03                                           ` Andreas Schwab
2015-09-27 10:12                                             ` David Kastrup
2015-09-27 11:10                                               ` Andreas Schwab
2015-09-27 22:48                                       ` Richard Stallman
2015-09-28  2:41                                         ` Paul Eggert
2015-09-28  6:53                                           ` Eli Zaretskii
2015-09-28 15:08                                             ` Paul Eggert
2015-09-28 15:58                                               ` Eli Zaretskii
2015-09-27  7:39                                     ` Eli Zaretskii
2015-09-27  7:52                                       ` Paul Eggert
2015-09-27  8:00                                         ` David Kastrup
2015-09-27  8:03                                         ` Eli Zaretskii
2015-09-27  8:29                                           ` Paul Eggert
2015-09-27  8:37                                             ` David Kastrup
2015-09-27  8:40                                               ` Paul Eggert
2015-09-27  8:50                                                 ` David Kastrup
2015-09-27 10:14                                                 ` Eli Zaretskii
2015-09-27  8:57                                             ` Eli Zaretskii
2015-09-27  7:34                                 ` Eli Zaretskii
2015-09-27 16:03                                   ` Chad Brown
2015-09-27 18:41                                     ` Eli Zaretskii
2015-09-27 19:52                                       ` Chad Brown
2015-09-27 20:52                                         ` Eli Zaretskii
2015-09-26 20:32                               ` Paul Eggert
2015-09-27  7:27                                 ` Eli Zaretskii
2015-09-27  7:42                                   ` David Kastrup
2015-09-27  9:20                                     ` Rustom Mody
2015-09-27 10:13                                       ` Eli Zaretskii
2015-09-27 20:21                                         ` Paul Eggert
2015-09-27 21:04                                           ` Eli Zaretskii
2015-09-27  8:22                                   ` Paul Eggert
2015-09-27  8:55                                     ` Eli Zaretskii
2015-09-27  9:56                                     ` Andreas Schwab
2015-09-27 10:04                                       ` David Kastrup
2015-09-27 10:16                                         ` Eli Zaretskii
2015-09-27 10:36                                           ` Eli Zaretskii
2015-09-27 10:59                                             ` Eli Zaretskii
2015-09-27 20:05                                               ` Paul Eggert
2015-09-26 17:25                         ` Eli Zaretskii
2015-09-26 18:51                           ` Paul Eggert
2015-09-27  0:12                         ` stephen
2015-09-27  4:44                           ` Paul Eggert
2015-09-27  6:20                             ` stephen
2015-09-27  8:34                               ` Paul Eggert

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
