* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
       [not found] ` <E1Ze4K3-0005KC-5U@vcs.savannah.gnu.org>
@ 2015-09-21 19:57 ` Stefan Monnier
  2015-09-21 20:07   ` Eli Zaretskii
  0 siblings, 1 reply; 70+ messages in thread
From: Stefan Monnier @ 2015-09-21 19:57 UTC
To: emacs-devel; +Cc: Eli Zaretskii

>     Don't rely on defaults in decoding UTF-8 encoded Lisp files

FWIW, I've removed the "coding: utf-8" thingy on a bunch of files in the
last year.

Why not?

Since Emacs-24.4 the coding-system Emacs uses for .el files is
`prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
valid" and the user's locale/settings is only used as a fallback.

        Stefan

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-21 20:07 UTC
To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: Eli Zaretskii <eliz@gnu.org>
> Date: Mon, 21 Sep 2015 15:57:48 -0400
>
> >     Don't rely on defaults in decoding UTF-8 encoded Lisp files
>
> FWIW, I've removed the "coding: utf-8" thingy on a bunch of files in the
> last year.
>
> Why not?

Because I'm tired of hunting problems with raw bytes being displayed,
just to learn yet another deficiency in our guesswork.

> Since Emacs-24.4 the coding-system Emacs uses for .el files is
> `prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
> valid"

I don't think prefer-utf-8 does what you say here. In any case, I've
seen the default decoding do incorrect things, and I see no reason to
risk that in files we control.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-24 16:44 UTC
To: monnier; +Cc: emacs-devel

> Date: Mon, 21 Sep 2015 23:07:52 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
>
> > Since Emacs-24.4 the coding-system Emacs uses for .el files is
> > `prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
> > valid"
>
> I don't think prefer-utf-8 does what you say here. In any case, I've
> seen the default decoding do incorrect things, and I see no reason to
> risk that in files we control.

Here's an example, btw:

  emacs -Q
  M-x set-locale-environment RET he_IL.ISO-8859-8 RET
  C-x C-f doc/lispref/tips.texi RET

The encoding-detection guesswork fails even with the current Git
master, let alone Emacs 24.5. Use '(skip-chars-forward "\000-\177")'
to find the mess it produces.

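For readers without an Emacs session at hand, the check that '(skip-chars-forward "\000-\177")' performs (move past ASCII until the first non-ASCII character) can be approximated outside Emacs. The following is only a minimal Python sketch; the helper name `first_non_ascii` is hypothetical and does not come from the thread or from Emacs.

```python
from typing import Optional


def first_non_ascii(text: str) -> Optional[int]:
    """Return the index of the first character outside 0..127, or None.

    Hypothetical helper: it mirrors the idea of skipping over the
    range \\000-\\177 and stopping at the first character beyond it.
    """
    for index, char in enumerate(text):
        if ord(char) > 0x7F:  # octal 177 == 0x7F, the last ASCII code point
            return index
    return None


if __name__ == "__main__":
    print(first_non_ascii("plain ASCII"))
    print(first_non_ascii("curly \u201cquote\u201d"))
```

Run over a wrongly decoded buffer's text, the returned index points at the first place where the decoding guess had to deal with a non-ASCII byte.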
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Stefan Monnier @ 2015-09-24 21:29 UTC
To: Eli Zaretskii; +Cc: emacs-devel

>> > Since Emacs-24.4 the coding-system Emacs uses for .el files is
>> > `prefer-utf-8', i.e. it is explicitly defined to be "utf-8 if it is
>> > valid"
>> I don't think prefer-utf-8 does what you say here. In any case, I've
>> seen the default decoding do incorrect things, and I see no reason to
>> risk that in files we control.
> Here's an example, btw:
>   emacs -Q
>   M-x set-locale-environment RET he_IL.ISO-8859-8 RET
>   C-x C-f doc/lispref/tips.texi RET

Hmm.... I don't think this is using prefer-utf-8. `prefer-utf-8' is
used for *.el files via file-coding-system-alist.

        Stefan

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-25 7:55 UTC
To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Thu, 24 Sep 2015 17:29:38 -0400
>
> >   emacs -Q
> >   M-x set-locale-environment RET he_IL.ISO-8859-8 RET
> >   C-x C-f doc/lispref/tips.texi RET
>
> Hmm.... I don't think this is using prefer-utf-8. `prefer-utf-8' is
> used for *.el files via file-coding-system-alist.

So we now agree that at least non-*.el files should have the coding
cookie, yes?

As for *.el files: prefer-utf-8 is too easily duped for us to have
such infinite faith in it. I can easily force a .el file to be saved
in non-UTF-8 encoding, and then it will be decoded incorrectly when
visited, if it doesn't have a coding cookie.

E.g., try saving a foo.el with the following contents:

  (setq string "א“”")

using cp1255, then kill the buffer and visit it again. You will see
this instead:

  (setq string "Ӕ")

Bottom line: we use prefer-utf-8 for *.el files so that the
probability of such catastrophic errors be minimized when the lazy
maintainers couldn't be bothered to add a cookie. But we don't want
to be lazy ourselves, with the files we own and control.

More generally, I think we should require any text file in the Emacs
repository that includes non-ASCII characters to have an explicit
coding cookie, so that these subtle problems don't lie low because
most Emacs contributors live in UTF-8 locales.

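The byte arithmetic behind the cp1255 example can be sketched in Python. In cp1255 the three characters encode to the bytes E0 93 94. A strict UTF-8 decoder rejects that sequence, because E0 must be followed by a byte in A0..BF. A decoder that tolerates the overlong form and just combines the payload bits arrives at U+04D4, which is the "Ӕ" shown above. That a lenient decode of this kind explains what Emacs displayed is an assumption made here for illustration; the thread does not say so, and the helper names are hypothetical.

```python
# The three characters from the example: HEBREW LETTER ALEF and a pair
# of curly double quotes, written as escapes to keep this file ASCII.
ORIGINAL = "\u05d0\u201c\u201d"

# cp1255 stores each of them in a single byte.
raw = ORIGINAL.encode("cp1255")


def strict_utf8_ok(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_three_byte_leniently(data: bytes) -> str:
    """Combine the payload bits of a 3-byte sequence with no validity checks.

    Illustrative only: a conforming UTF-8 decoder must not do this,
    since it accepts overlong encodings.
    """
    lead, second, third = data
    code_point = ((lead & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)
    return chr(code_point)


if __name__ == "__main__":
    print(raw.hex())
    print(strict_utf8_ok(raw))
    print(hex(ord(decode_three_byte_leniently(raw))))
```

The point of the sketch is that the same three bytes are a complete, meaningful string in one encoding and an almost-valid sequence in another, which is why a missing cookie can go unnoticed.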
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Stefan Monnier @ 2015-09-25 12:21 UTC
To: Eli Zaretskii; +Cc: emacs-devel

> E.g., try saving a foo.el with the following contents:
>
>   (setq string "א“”")
>
> using cp1255, then kill the buffer and visit it again.

AFAIK saving in this way requires very explicit action on the part of
the user. She gets what she asked for.

But yes, we should probably make it even harder (i.e. disallow it
altogether as long as there's no "coding:cp1255" tag).

> So we now agree that at least non-*.el files should have the coding
> cookie, yes?

Yes, definitely.

> Bottom line: we use prefer-utf-8 for *.el files so that the
> probability of such catastrophic errors be minimized when the lazy
> maintainers couldn't be bothered to add a cookie.

No. I pushed for prefer-utf-8 because I want Elisp source code to be
declared to use utf-8 encoding.

I can imagine a future where we don't even support Elisp files using
another coding system (i.e. throw away the load-with-code-conversion
machinery).

> More generally, I think we should require any text file in the Emacs
> repository that includes non-ASCII characters to have an explicit
> coding cookie, so that these subtle problems don't lie low because
> most Emacs contributors live in UTF-8 locales.

My view OTOH is that the future is utf-8 only, and in that future we
won't want to have redundant "coding:utf-8" tags everywhere, so
we need to find ways to go from here (i.e. "need a coding: tag for
any non-ASCII file") to there. I don't have an answer in general,
but prefer-utf-8 is a step in that direction, which can be used for
some class of files (e.g. Elisp).

        Stefan

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-25 13:37 UTC
To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Fri, 25 Sep 2015 08:21:59 -0400
>
> > E.g., try saving a foo.el with the following contents:
> >
> >   (setq string "א“”")
> >
> > using cp1255, then kill the buffer and visit it again.
>
> AFAIK saving in this way requires very explicit action on the part of
> the user. She gets what she asked for.

Who does? We are talking about 2 different people here, the one who
was sloppy forgetting the coding cookie, and another who visited it.

> I can imagine a future where we don't even support Elisp files using
> another coding system (i.e. throw away the load-with-code-conversion
> machinery).

I'm not sure this can be done. AFAIK, a few files under leim/quail
are encoded with non-UTF encoding, and for a good reason.

> > More generally, I think we should require any text file in the Emacs
> > repository that includes non-ASCII characters to have an explicit
> > coding cookie, so that these subtle problems don't lie low because
> > most Emacs contributors live in UTF-8 locales.
>
> My view OTOH is that the future is utf-8 only

If you know the future, perhaps you could suggest which shares of what
companies I should invest in? Why waste such an important insight on
some insignificant piece of software?

> we need to find ways to go from here (i.e. "need a coding: tag for
> any non-ASCII file") to there. I don't have an answer in general,
> but prefer-utf-8 is a step in that direction, which can be used for
> some class of files (e.g. Elisp).

I think there's no way from here to there, not as long as our encoding
detection's reliability is what it is.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Paul Eggert @ 2015-09-25 22:32 UTC
To: Stefan Monnier, Eli Zaretskii; +Cc: emacs-devel

Stefan Monnier wrote:
> we won't want to have redundant "coding:utf-8" tags everywhere, so we need
> to find ways to go from here (i.e. "need a coding: tag for any non-ASCII
> file") to there.

Yes, requiring coding: cookies for every UTF-8 file is error-prone. We
can't easily put cookies into every such file, as some of them are
copied verbatim from other sources. And even for our own files, it's
too easy to add a bit of UTF-8 text to a cookieless file and forget to
add a cookie.

Here's a better idea. Developers that use a UTF-8 locale are OK
already. Let's suggest to the remaining developers that they put
something like the following into their .emacs:

  (add-hook 'auto-coding-functions (lambda (size) 'utf-8))

This will let Emacs default to UTF-8 for files that don't already have
a coding cookie.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-26 6:27 UTC
To: Paul Eggert; +Cc: monnier, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Cc: emacs-devel@gnu.org
> Date: Fri, 25 Sep 2015 15:32:11 -0700
>
> Here's a better idea. Developers that use a UTF-8 locale are OK
> already. Let's suggest to the remaining developers that they put
> something like the following into their .emacs:
>
>   (add-hook 'auto-coding-functions (lambda (size) 'utf-8))
>
> This will let Emacs default to UTF-8 for files that don't already have a
> coding cookie.

You are assuming that those "remaining developers" use Emacs only for
working on Emacs, is that right?

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-26 6:32 UTC
To: eggert; +Cc: monnier, emacs-devel

> Date: Sat, 26 Sep 2015 09:27:11 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> > From: Paul Eggert <eggert@cs.ucla.edu>
> > Cc: emacs-devel@gnu.org
> > Date: Fri, 25 Sep 2015 15:32:11 -0700
> >
> > Here's a better idea. Developers that use a UTF-8 locale are OK
> > already. Let's suggest to the remaining developers that they put
> > something like the following into their .emacs:
> >
> >   (add-hook 'auto-coding-functions (lambda (size) 'utf-8))
> >
> > This will let Emacs default to UTF-8 for files that don't already have a
> > coding cookie.
>
> You are assuming that those "remaining developers" use Emacs only for
> working on Emacs, is that right?

And I fail to see how's that less prone to errors, anyway: it still
requires something to be added manually to some file.

The only solution which will make the current situation better is one
that will work in "emacs -Q".

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Paul Eggert @ 2015-09-26 14:31 UTC
To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii wrote:
> You are assuming that those "remaining developers" use Emacs only for
> working on Emacs, is that right?

No, I am assuming that the typical default nowadays, for text that is
not otherwise labeled, is to use UTF-8. This is a reasonable
assumption. It's not always correct, but exceptions can be handled.

I see that you have added more coding: cookies. Oh well. I do take
your point that we need a better solution than what we have.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-26 15:15 UTC
To: Paul Eggert; +Cc: monnier, emacs-devel

> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 26 Sep 2015 07:31:36 -0700
>
> Eli Zaretskii wrote:
> > You are assuming that those "remaining developers" use Emacs only for
> > working on Emacs, is that right?
>
> No, I am assuming that the typical default nowadays, for text that is not
> otherwise labeled, is to use UTF-8. This is a reasonable assumption. It's not
> always correct, but exceptions can be handled.

So you are, in effect, saying that it is incorrect to derive the
default encodings from the locale's codeset? I'm not sure about that,
but if so, the issue is much broader than just what was discussed
here, it touches a lot of other defaults as well, and a lot of code
that supports those defaults.

> I see that you have added more coding: cookies. Oh well. I do take your point
> that we need a better solution than what we have.

I don't enjoy adding those cookies, but I enjoy even less seeing those
"8" indications in the mode line when I know there's not a chance in
the world the file was encoded in ISO-8859-8.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Paul Eggert @ 2015-09-26 16:01 UTC
To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii wrote:
> So you are, in effect, saying that it is incorrect to derive the
> default encodings from the locale's codeset?

Yes, for Emacs developers. And come to think of it, for most Emacs
users. Nowadays in my experience most non-ASCII text files use UTF-8,
regardless of locale. The old days of having to guess encoding from
the locale are passing away. This is partly due to UTF-8 being the
encoding of choice for HTML and XML, where UTF-8 overtook the older
8-bit encodings in 2008 and now is by far the dominant encoding.

One way to accommodate the new reality would be to change Emacs so
that by default the system locale does not affect Emacs's guess of a
file's encoding if the file's initial sample is valid UTF-8. Users
could set a variable to re-enable the old behavior. If we did this,
we wouldn't have the error-prone process of sprinkling 'coding: utf-8'
cookies all over the place.

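The detection rule proposed above (use UTF-8 whenever the file's initial sample is valid UTF-8, and consult the locale only otherwise) can be sketched in Python. This is an illustration of the rule only, not of how Emacs implements or would implement it; the function name `guess_encoding` and the `locale_encoding` parameter are hypothetical. An incremental decoder is used so that a sample cut off in the middle of a multi-byte character is not mistaken for invalid input.

```python
import codecs


def guess_encoding(sample: bytes, locale_encoding: str) -> str:
    """Pick "utf-8" if the sample is valid UTF-8, else the locale's encoding.

    Hypothetical helper sketching the rule from the message above.
    With final=False the decoder buffers an incomplete trailing
    sequence instead of raising, so a truncated sample still passes.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return locale_encoding
    return "utf-8"


if __name__ == "__main__":
    utf8_sample = "\u05d0\u201c\u201d".encode("utf-8")
    cp1255_sample = "\u05d0\u201c\u201d".encode("cp1255")
    print(guess_encoding(utf8_sample, "iso8859_8"))
    print(guess_encoding(cp1255_sample, "cp1255"))
```

Note that a pure-ASCII sample is also reported as "utf-8" here, and that a file whose first non-UTF-8 byte lies beyond the sample would be misclassified; both are limits of the sketch, and the second is the kind of guesswork failure debated in this thread.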
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: David Kastrup @ 2015-09-26 16:09 UTC
To: Paul Eggert; +Cc: Eli Zaretskii, monnier, emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> Eli Zaretskii wrote:
>> So you are, in effect, saying that it is incorrect to derive the
>> default encodings from the locale's codeset?
>
> Yes, for Emacs developers. And come to think of it, for most Emacs
> users.

If the answer is "most" rather than "all", it would be absurd if Emacs
developers were not to use circumstances which they are supposed to
support.

> Nowadays in my experience most non-ASCII text files use UTF-8,
> regardless of locale.

How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
Korean texts? How relevant is your experience?

-- 
David Kastrup

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-26 17:26 UTC
To: David Kastrup; +Cc: eggert, monnier, emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Sat, 26 Sep 2015 18:09:36 +0200
> Cc: Eli Zaretskii <eliz@gnu.org>, monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> > Nowadays in my experience most non-ASCII text files use UTF-8,
> > regardless of locale.
>
> How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
> Korean texts? How relevant is your experience?

Indeed, I think Far Eastern locales frequently use non-UTF-8
encodings.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Paul Eggert @ 2015-09-26 18:53 UTC
To: David Kastrup; +Cc: Eli Zaretskii, monnier, emacs-devel

David Kastrup wrote:
> How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
> Korean texts? How relevant is your experience?

Hebrew, not so much -- Eli has far more experience with that. Arabic
I was just reading last week (not natively; I use a translator). This
week I was reading a lot of Turkish. In all cases I was looking at
text prepared by others. In all cases my sources used UTF-8 -- not
due to my influence, but because that's what's typical these days.

In my previous job I routinely had to deal with CJK text, and did so
with lots of different encodings, including monstrosities such as
DBCS-Host that Emacs doesn't even support. So my experience is
reasonably good in this area -- better than the average random hacker
anyway.

If you go back 20 years, non-UTF-8 encodings such as Shift-JIS and EUC
were by far the most popular in Japan. Nowadays? Sure, Shift-JIS and
EUC are still used, but they're going downhill. Of the top 20 web
sites in Japan (according to Alexa), 18 use UTF-8, one uses Shift-JIS,
and one uses EUC on their home pages. In the w3techs survey of world
web sites, 85% use UTF-8; the second most-popular encoding,
ISO-8859-1, is at only 7.5%, and it's that high only because the old
HTML standard made ISO-8859-1 the default.

So in practice, defaulting to UTF-8 is quite a good choice nowadays.
Of course if we can get the proper encoding from the document or its
envelope we should prefer that, and that should let us deal with web
documents and email.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Eli Zaretskii @ 2015-09-26 19:35 UTC
To: Paul Eggert; +Cc: dak, monnier, emacs-devel

> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sat, 26 Sep 2015 11:53:09 -0700
> Cc: Eli Zaretskii <eliz@gnu.org>, monnier@iro.umontreal.ca, emacs-devel@gnu.org
>
> Of the top 20 web sites in Japan (according to Alexa), 18 use UTF-8,
> one uses Shift-JIS, and one uses EUC on their home pages. In the
> w3techs survey of world web sites, 85% use UTF-8; the second
> most-popular encoding, ISO-8859-1, is at only 7.5%, and it's that
> high only because the old HTML standard made ISO-8859-1 the default.

The relevant statistics for Emacs is of source files, not of HTML
pages.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Chad Brown @ 2015-09-26 20:26 UTC
To: Eli Zaretskii, emacs-devel

> On 26 Sep 2015, at 12:35, Eli Zaretskii <eliz@gnu.org> wrote:
>
> The relevant statistics for Emacs is of source files, not of HTML
> pages.

The default for GCC is UTF-8. Python requires a coding cookie
(intentionally similar to Emacs’) to get away from Latin-1. Java is
UTF-8. Javascript, roughly speaking, tracks HTML. Which other
languages did you have in mind?

~Chad

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: David Kastrup @ 2015-09-26 21:50 UTC
To: Chad Brown; +Cc: Eli Zaretskii, emacs-devel

Chad Brown <yandros@gmail.com> writes:

>> On 26 Sep 2015, at 12:35, Eli Zaretskii <eliz@gnu.org> wrote:
>>
>> The relevant statistics for Emacs is of source files, not of HTML
>> pages.
>
> The default for GCC is UTF-8.

How so? The default is defined by the compiled language. For C, it is
essentially 8-bit bytes where the meaning-carrying subset is ASCII.
Everything else is just replication.

GCC communicates on the terminal with compiler diagnostics. For that
it uses the current locale.

> Python requires a coding cookie (intentionally similar to Emacs’) to
> get away from Latin-1. Java is UTF-8. Javascript, roughly speaking,
> tracks HTML. Which other languages did you have in mind?

Emacs is, not least of all, a text editor. I am currently using it to
write this Email reply. Not everything that one uses Emacs for has a
well-defined default encoding.

-- 
David Kastrup

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Paul Eggert @ 2015-09-27 4:44 UTC
To: David Kastrup; +Cc: emacs-devel

David Kastrup wrote:
> The default is defined by the compiled language. For C, it is
> essentially 8-bit bytes where the meaning-carrying subset is ASCII.

That was true for C99 and earlier, but it stopped being true in C11,
where the source-file encoding does matter and where UTF-8 is the only
sane default nowadays.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: David Kastrup @ 2015-09-27 5:29 UTC
To: Paul Eggert; +Cc: emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> David Kastrup wrote:
>> The default is defined by the compiled language. For C, it is
>> essentially 8-bit bytes where the meaning-carrying subset is ASCII.
>
> That was true for C99 and earlier, but it stopped being true in C11,
> where the source-file encoding does matter and where UTF-8 is the only
> sane default nowadays.

"stopped being true in C11" suggests that the world moved on. Here is
the manual extract from the GCC delivered in the latest Ubuntu
distribution (the most commonly used GNU/Linux system):

     A fourth version of the C standard, known as "C11", was published
     in 2011 as ISO/IEC 9899:2011. GCC has substantially complete
     support for this standard, enabled with '-std=c11' or
     '-std=iso9899:2011'. (While in development, drafts of this standard
     version were referred to as "C1X".)

It is not even accepted without using extra options.

And we are not talking anyway about the encoding Emacs is to choose for
new files but rather about the encoding for opening existing files.

How are you going to magically eradicate all pre-C11 files from Earth?
Wouldn't it be convenient to actually load them into an editor for
doing the conversion?

-- 
David Kastrup

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Paul Eggert @ 2015-09-27 7:38 UTC
To: David Kastrup; +Cc: emacs-devel

David Kastrup wrote:
> How are you going to magically eradicate all pre-C11 files from Earth?

Old C files will build just fine with newer GCC. And GCC can support
UTF-8 in strings even if you don't use the -std=c11 option. So this
is not a problem.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: David Kastrup @ 2015-09-27 7:46 UTC
To: Paul Eggert; +Cc: emacs-devel

Paul Eggert <eggert@cs.ucla.edu> writes:

> David Kastrup wrote:
>> How are you going to magically eradicate all pre-C11 files from Earth?
>
> Old C files will build just fine with newer GCC. And GCC can support
> UTF-8 in strings even if you don't use the -std=c11 option. So this
> is not a problem.

Are we still talking about the defaults Emacs chooses when detecting the
file encoding of C files? You seem about as likely to argue against
your own proposals as to argue for them.

-- 
David Kastrup

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Paul Eggert @ 2015-09-27 7:52 UTC
To: David Kastrup; +Cc: emacs-devel

David Kastrup wrote:
> Are we still talking about the defaults Emacs chooses when detecting the
> file encoding of C files?

Yes, of course.

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Andreas Schwab @ 2015-09-27 9:47 UTC
To: David Kastrup; +Cc: Paul Eggert, emacs-devel

David Kastrup <dak@gnu.org> writes:

>      A fourth version of the C standard, known as "C11", was published
>      in 2011 as ISO/IEC 9899:2011. GCC has substantially complete
>      support for this standard, enabled with '-std=c11' or
>      '-std=iso9899:2011'. (While in development, drafts of this standard
>      version were referred to as "C1X".)
>
> It is not even accepted without using extra options.

The latest release of gcc has C11 as the default standard.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: David Kastrup @ 2015-09-27 9:54 UTC
To: Andreas Schwab; +Cc: Paul Eggert, emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>>      A fourth version of the C standard, known as "C11", was published
>>      in 2011 as ISO/IEC 9899:2011. GCC has substantially complete
>>      support for this standard, enabled with '-std=c11' or
>>      '-std=iso9899:2011'. (While in development, drafts of this standard
>>      version were referred to as "C1X".)
>>
>> It is not even accepted without using extra options.
>
> The latest release of gcc has C11 as the default standard.

You just got to love the "creative editing" culture on this mailing
list. First edit a posting into what you would rather want to reply
to, then pretend the stuff you elided was not there in the first place.

I was _very_ _explicitly_ _not_ talking about the "latest release of
gcc" but rather the latest release of GCC in the most wide-spread
production GNU/Linux distribution.

Can we please stop this silly gamesmanship? It very much contributes
to "discussions" going in circles.

-- 
David Kastrup

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
From: Andreas Schwab @ 2015-09-27 10:03 UTC
To: David Kastrup; +Cc: Paul Eggert, emacs-devel

David Kastrup <dak@gnu.org> writes:

> I was _very_ _explicitly_ _not_ talking about the "latest release of
> gcc" but rather the latest release of GCC in the most wide-spread
> production GNU/Linux distribution.

Many distributions already ship gcc5, some as default even.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 10:03 ` Andreas Schwab @ 2015-09-27 10:12 ` David Kastrup 2015-09-27 11:10 ` Andreas Schwab 0 siblings, 1 reply; 70+ messages in thread From: David Kastrup @ 2015-09-27 10:12 UTC (permalink / raw) To: Andreas Schwab; +Cc: Paul Eggert, emacs-devel Andreas Schwab <schwab@linux-m68k.org> writes: > David Kastrup <dak@gnu.org> writes: > >> I was _very_ _explicitly_ _not_ talking about the "latest release of >> gcc" but rather the latest release of GCC in the most wide-spread >> production GNU/Linux distribution. > > Many distributions already ship gcc5, some as default even. That's nice but I was talking about the latest release of GCC in the most wide-spread production GNU/Linux distribution. That's kind of a relevant counterexample to Paul's generalizations. It's not an obscure corner case. Apart from which I am still waiting for an explanation of just why Emacs should stop supporting non-UTF-8 C source files _because_ the C11 standard now provides the means to place UTF-8 strings in executables when using non-UTF-8 source files (previously, you needed to have a UTF-8 encoded source file to do that). Emacs should support non-UTF-8 source files worse because C11 makes it more convenient to use them? It's bad enough that we are arguing straw men all the time, but these straw men are upside down. -- David Kastrup ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 10:12 ` David Kastrup @ 2015-09-27 11:10 ` Andreas Schwab 0 siblings, 0 replies; 70+ messages in thread From: Andreas Schwab @ 2015-09-27 11:10 UTC (permalink / raw) To: David Kastrup; +Cc: Paul Eggert, emacs-devel David Kastrup <dak@gnu.org> writes: > It's bad enough that we are arguing straw men all the time, but these > straw men are upside down. So please go ahead. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 5:29 ` David Kastrup 2015-09-27 7:38 ` Paul Eggert 2015-09-27 9:47 ` Andreas Schwab @ 2015-09-27 22:48 ` Richard Stallman 2015-09-28 2:41 ` Paul Eggert 2 siblings, 1 reply; 70+ messages in thread From: Richard Stallman @ 2015-09-27 22:48 UTC (permalink / raw) To: David Kastrup; +Cc: eggert, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Could someone tell me what issue is now under discussion? The Subject line seems to refer to Lisp files, and yet here people are talking about changes in C as of C11. -- Dr Richard Stallman President, Free Software Foundation (gnu.org, fsf.org) Internet Hall-of-Famer (internethalloffame.org) Skype: No way! See stallman.org/skype.html. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 22:48 ` Richard Stallman @ 2015-09-28 2:41 ` Paul Eggert 2015-09-28 6:53 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Paul Eggert @ 2015-09-28 2:41 UTC (permalink / raw) To: rms; +Cc: emacs-devel Richard Stallman wrote: > The Subject line seems to refer to Lisp files, and yet here people > are talking about changes in C as of C11. The subject line comes from a commit to Emacs master that added coding-cookie lines like the following to some .el files that had UTF-8 text: ;; Local Variables: ;; coding: utf-8 ;; End: Lines like these are no longer needed with current Emacs, which prefers UTF-8 for .el files regardless of the system locale. This can be a win, as people often forget to insert coding cookies and the cookies are a bit awkward anyway. The discussion has morphed into the possibility of a similar facility for files other than .el files, and what the defaults for such a facility should be. The idea is to somehow avoid the need for UTF-8 coding cookies for users who prefer Emacs to default to UTF-8 for text files regardless of system locale. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-28 2:41 ` Paul Eggert @ 2015-09-28 6:53 ` Eli Zaretskii 2015-09-28 15:08 ` Paul Eggert 0 siblings, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-28 6:53 UTC (permalink / raw) To: Paul Eggert; +Cc: rms, emacs-devel > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sun, 27 Sep 2015 19:41:36 -0700 > Cc: emacs-devel@gnu.org > > The discussion has morphed into the possibility of a similar facility for files > other than .el files, and what the defaults for such a facility should be. The > idea is to somehow avoid the need for UTF-8 coding cookies for users who prefer > Emacs to default to UTF-8 for text files regardless of system locale. I think (prefer-coding-system 'utf-8) is what those users should do, but I'm not sure. It definitely sets the defaults for new files and for communicating with sub-processes (the latter part might not be what you want, btw), but its effect on decoding existing files might sometimes surprise, due to the way the priority of trying various decoders is implemented. (Hint: look at the implementation of set-coding-priority, the function that is called under the hood by prefer-coding-system.) Btw, the issue under discussion, as I perceive it, was somewhat different: how to ensure correct decoding of UTF-8 encoded files (other than *.el) in Emacs source tree _regardless_ of whether the user in question wants to prefer UTF-8 outside of Emacs tree. The best solution we have now is to have a coding cookie in each such file, and the question is how can that be avoided. IOW, the solution should IMO be independent of user's preferences. ^ permalink raw reply [flat|nested] 70+ messages in thread
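For readers following along, the user-level setting mentioned in the message above can be sketched as an init-file fragment. This is only an illustration, not something posted in the thread: the `modify-coding-system-alist' line and its regexp are assumptions chosen for the example, and, as the message warns, the effect on decoding existing files depends on how the priority list is consulted.

```elisp
;; Sketch for a user's init file; nothing here is specific to the Emacs tree.

;; Put UTF-8 first in the priority order used when guessing encodings.
;; As noted above, this also changes the defaults for new files and for
;; communicating with subprocesses.
(prefer-coding-system 'utf-8)

;; Inspect the resulting priority order.
(coding-system-priority-list)

;; A narrower alternative (an assumption, not proposed in this thread):
;; apply `prefer-utf-8' only to C sources and headers, leaving the other
;; defaults alone.  The regexp is only an example.
(modify-coding-system-alist 'file "\\.[ch]\\'" 'prefer-utf-8)
```

Neither form addresses the second point in the message: both are user preferences, whereas the goal stated there is correct decoding of files in the Emacs tree independent of such preferences.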
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-28 6:53 ` Eli Zaretskii @ 2015-09-28 15:08 ` Paul Eggert 2015-09-28 15:58 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Paul Eggert @ 2015-09-28 15:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: rms, emacs-devel On 09/27/2015 11:53 PM, Eli Zaretskii wrote: > The best solution we have now is to have a coding cookie in each such > file, and the question is how can that be avoided. > > IOW, the solution should IMO be independent of user's preferences. Here's an idea: improve the handling of .dir-locals.el so that it could contain something like this: ((nil . ((coding . 'utf-8) (tab-width . 8) (fill-column . 70))) (c-mode . ((c-file-style . "GNU")))) This specification for "coding" would take precedence over coding inferred from environment settings. It would not take precedence over an explicit coding cookie in the file. Currently .dir-locals.el cannot specify 'coding', but I suspect that's mostly just due to the intricacies of the current implementation, not due to any specific desire to reject the use of 'coding' in .dir-locals.el. ^ permalink raw reply [flat|nested] 70+ messages in thread
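Laid out as it would appear in a file, the proposed .dir-locals.el might look like the sketch below. It is a hedged rendering of the proposal, not a working configuration: by the message's own account `coding' is not honored in .dir-locals.el at the time of this thread. One detail differs from the message: values in .dir-locals.el are read but not evaluated, so the symbol is written here without a quote; with the quote, the stored value would be the list (quote utf-8) rather than the symbol utf-8.

```elisp
;; Hypothetical .dir-locals.el for the proposal above.  The `coding'
;; entry is the proposed extension and is not supported as of this thread.
;; Values are not evaluated, hence `utf-8' rather than `'utf-8'.
((nil . ((coding . utf-8)
         (tab-width . 8)
         (fill-column . 70)))
 (c-mode . ((c-file-style . "GNU"))))
```

Under the precedence described in the message, an explicit coding cookie in a file would still override this entry, and this entry would override anything inferred from the environment.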
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-28 15:08 ` Paul Eggert @ 2015-09-28 15:58 ` Eli Zaretskii 0 siblings, 0 replies; 70+ messages in thread From: Eli Zaretskii @ 2015-09-28 15:58 UTC (permalink / raw) To: Paul Eggert; +Cc: rms, emacs-devel > Cc: rms@gnu.org, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Mon, 28 Sep 2015 08:08:32 -0700 > > On 09/27/2015 11:53 PM, Eli Zaretskii wrote: > > The best solution we have now is to have a coding cookie in each such > > file, and the question is how can that be avoided. > > > > IOW, the solution should IMO be independent of user's preferences. > > Here's an idea: improve the handling of .dir-locals.el so that it could > contain something like this: > > ((nil . ((coding . 'utf-8) > (tab-width . 8) > (fill-column . 70))) > (c-mode . ((c-file-style . "GNU")))) That's better than having to specify encoding in individual files, so I think this would be progress. Thanks. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 4:44 ` Paul Eggert 2015-09-27 5:29 ` David Kastrup @ 2015-09-27 7:39 ` Eli Zaretskii 2015-09-27 7:52 ` Paul Eggert 1 sibling, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 7:39 UTC (permalink / raw) To: Paul Eggert; +Cc: dak, emacs-devel > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sat, 26 Sep 2015 21:44:19 -0700 > Cc: emacs-devel <emacs-devel@gnu.org> > > David Kastrup wrote: > > The default is defined by the compiled language. For C, it is > > essentially 8-bit bytes where the meaning-carrying subset is ASCII. > > That was true for C99 and earlier, but it stopped being true in C11, where the > source-file encoding does matter and where UTF-8 is the only sane default nowadays. I don't see any language to that effect in the C11 Final Draft I have here. AFAICT, non-UTF-8 multibyte sequences are still supported by C11. Can you show the text on which you based the above assertion? Maybe you are talking about encoding of the identifier names. What I had in mind was comments and strings, not identifier names. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 7:39 ` Eli Zaretskii @ 2015-09-27 7:52 ` Paul Eggert 2015-09-27 8:00 ` David Kastrup 2015-09-27 8:03 ` Eli Zaretskii 0 siblings, 2 replies; 70+ messages in thread From: Paul Eggert @ 2015-09-27 7:52 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, emacs-devel Eli Zaretskii wrote: > I don't see any language to that effect in the C11 Final Draft I have > here. AFAICT, non-UTF-8 multibyte sequences are still supported by > C11. Of course; that part didn't change. I was talking about C11's new UTF-8 string literals, e.g., u8"Emacsの主要操作(早見表)". There is no similar notation for Shift-JIS, etc. Of course implementations can support legacy encodings, and some legacy C programs are written that way, but the only portable way to go in the future is Unicode. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 7:52 ` Paul Eggert @ 2015-09-27 8:00 ` David Kastrup 2015-09-27 8:03 ` Eli Zaretskii 1 sibling, 0 replies; 70+ messages in thread From: David Kastrup @ 2015-09-27 8:00 UTC (permalink / raw) To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel Paul Eggert <eggert@cs.ucla.edu> writes: > Eli Zaretskii wrote: >> I don't see any language to that effect in the C11 Final Draft I have >> here. AFAICT, non-UTF-8 multibyte sequences are still supported by >> C11. > > Of course; that part didn't change. I was talking about C11's new > UTF-8 string literals, e.g., u8"Emacsの主要操作(早見表)". Again, are you arguing for or against your own proposals? The _only_ purpose of such string literals is to support generating UTF-8 encoded strings in the executable even when the source file is _not_ encoded in UTF-8. So you argue because C11 contains a feature for supporting source files _not_ encoded in UTF-8, Emacs should support only source files encoded in UTF-8? If anything, this is somewhat of an argument for GDB to preferably interpret C strings as being encoded in UTF-8 even when the source code encoding of a C file appears to be different. We are not talking about editing executables here. We are talking about editing source files. -- David Kastrup ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 7:52 ` Paul Eggert 2015-09-27 8:00 ` David Kastrup @ 2015-09-27 8:03 ` Eli Zaretskii 2015-09-27 8:29 ` Paul Eggert 1 sibling, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 8:03 UTC (permalink / raw) To: Paul Eggert; +Cc: dak, emacs-devel > Cc: dak@gnu.org, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sun, 27 Sep 2015 00:52:08 -0700 > > Eli Zaretskii wrote: > > I don't see any language to that effect in the C11 Final Draft I have > > here. AFAICT, non-UTF-8 multibyte sequences are still supported by > > C11. > > Of course; that part didn't change. I was talking about C11's new UTF-8 string > literals, e.g., u8"Emacsの主要操作(早見表)". That's indeed a new feature of C11, but it doesn't disallow using arbitrary byte sequences in otherwise C11-compliant sources. > Of course implementations can support legacy encodings, and some > legacy C programs are written that way, but the only portable way to > go in the future is Unicode. Not sure what kind of "portability" did you have in mind here. If that's portability between locales, then our solution of having a coding cookie is better for Emacs, because it supports more use cases than just assuming UTF-8 would. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:03 ` Eli Zaretskii @ 2015-09-27 8:29 ` Paul Eggert 2015-09-27 8:37 ` David Kastrup 2015-09-27 8:57 ` Eli Zaretskii 0 siblings, 2 replies; 70+ messages in thread From: Paul Eggert @ 2015-09-27 8:29 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, emacs-devel Eli Zaretskii wrote: > If > that's portability between locales, then our solution of having a > coding cookie is better for Emacs, because it supports more use cases > than just assuming UTF-8 would. Sure, but the point is that we shouldn't need a cookie for UTF-8. Cookies are awkward, and should be inserted only when needed; they shouldn't be needed for the typical case. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:29 ` Paul Eggert @ 2015-09-27 8:37 ` David Kastrup 2015-09-27 8:40 ` Paul Eggert 2015-09-27 8:57 ` Eli Zaretskii 1 sibling, 1 reply; 70+ messages in thread From: David Kastrup @ 2015-09-27 8:37 UTC (permalink / raw) To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel Paul Eggert <eggert@cs.ucla.edu> writes: > Eli Zaretskii wrote: >> If >> that's portability between locales, then our solution of having a >> coding cookie is better for Emacs, because it supports more use cases >> than just assuming UTF-8 would. > > Sure, but the point is that we shouldn't need a cookie for UTF-8. Is this the majestic "we" or are you talking about Emacs development in particular? If the latter, why not set a directory-wide variable for the Emacs project (namely in the repository) for making the Emacs-internal Elisp files default to utf-8? That should cater for "us" without enforcing encodings in other people's projects. -- David Kastrup ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:37 ` David Kastrup @ 2015-09-27 8:40 ` Paul Eggert 2015-09-27 8:50 ` David Kastrup 2015-09-27 10:14 ` Eli Zaretskii 0 siblings, 2 replies; 70+ messages in thread From: Paul Eggert @ 2015-09-27 8:40 UTC (permalink / raw) To: David Kastrup; +Cc: Eli Zaretskii, emacs-devel David Kastrup wrote: > why not set a directory-wide variable for > the Emacs project (namely in the repository) for making the > Emacs-internal Elisp files default to utf-8? Great idea! One that has been suggested multiple times. Unfortunately it's a bit trickier to implement than one might think. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:40 ` Paul Eggert @ 2015-09-27 8:50 ` David Kastrup 2015-09-27 10:14 ` Eli Zaretskii 1 sibling, 0 replies; 70+ messages in thread From: David Kastrup @ 2015-09-27 8:50 UTC (permalink / raw) To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel Paul Eggert <eggert@cs.ucla.edu> writes: > David Kastrup wrote: >> why not set a directory-wide variable for >> the Emacs project (namely in the repository) for making the >> Emacs-internal Elisp files default to utf-8? > > Great idea! One that has been suggested multiple times. > Unfortunately it's a bit trickier to implement than one might think. "Well, I can't seem to find them either. Did you really lose your keys over here?" "No, down that alley. But I'd rather search here since the light is much better." So because the right solution for Emacs is a bit trickier to implement than one might think, we pick something else making life harder for everybody else? -- David Kastrup ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:40 ` Paul Eggert 2015-09-27 8:50 ` David Kastrup @ 2015-09-27 10:14 ` Eli Zaretskii 1 sibling, 0 replies; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 10:14 UTC (permalink / raw) To: Paul Eggert; +Cc: dak, emacs-devel > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sun, 27 Sep 2015 01:40:06 -0700 > Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org > > David Kastrup wrote: > > why not set a directory-wide variable for > > the Emacs project (namely in the repository) for making the > > Emacs-internal Elisp files default to utf-8? > > Great idea! One that has been suggested multiple times. Unfortunately it's a > bit trickier to implement than one might think. Yes, there are no simple and easy solutions for these issues. But that doesn't mean we shouldn't look for them. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:29 ` Paul Eggert 2015-09-27 8:37 ` David Kastrup @ 2015-09-27 8:57 ` Eli Zaretskii 1 sibling, 0 replies; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 8:57 UTC (permalink / raw) To: Paul Eggert; +Cc: dak, emacs-devel > Cc: dak@gnu.org, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sun, 27 Sep 2015 01:29:57 -0700 > > Eli Zaretskii wrote: > > If > > that's portability between locales, then our solution of having a > > coding cookie is better for Emacs, because it supports more use cases > > than just assuming UTF-8 would. > > Sure, but the point is that we shouldn't need a cookie for UTF-8. Cookies are > awkward, and should be inserted only when needed; they shouldn't be needed for > the typical case. Our experience since Emacs 20 is that the "typical case" is not a good guideline for implementing multilingual tools. Not unless the typical case becomes the only case. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-26 20:26 ` Chad Brown 2015-09-26 21:50 ` David Kastrup @ 2015-09-27 7:34 ` Eli Zaretskii 2015-09-27 16:03 ` Chad Brown 1 sibling, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 7:34 UTC (permalink / raw) To: Chad Brown; +Cc: emacs-devel > From: Chad Brown <yandros@gmail.com> > Date: Sat, 26 Sep 2015 13:26:52 -0700 > > > > On 26 Sep 2015, at 12:35, Eli Zaretskii <eliz@gnu.org> wrote: > > > > The relevant statistics for Emacs is of source files, not of HTML > > pages. > > The default for GCC is UTF-8. GCC doesn't write C sources, so its defaults are not very relevant, even if you are right in the above assessment (and I don't think you are). > Python requires a coding cookie (intentionally similar to Emacs’) to get away from Latin-1. Java is UTF-8. Javascript, roughly speaking, tracks HTML. Which other languages did you have in mind? All the rest of them. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 7:34 ` Eli Zaretskii @ 2015-09-27 16:03 ` Chad Brown 2015-09-27 18:41 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Chad Brown @ 2015-09-27 16:03 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > On 27 Sep 2015, at 00:34, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Chad Brown <yandros@gmail.com> >> Date: Sat, 26 Sep 2015 13:26:52 -0700 >> >> The default for GCC is UTF-8. > > GCC doesn't write C sources, so its defaults are not very relevant, > even if you are right in the above assessment (and I don't think you > are). I took the information from the GCC 4.7 documentation: -finput-charset=charset Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine. I saw almost identical text in the 4.2.4 documentation, and didn’t go back further. ~Chad ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 16:03 ` Chad Brown @ 2015-09-27 18:41 ` Eli Zaretskii 2015-09-27 19:52 ` Chad Brown 0 siblings, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 18:41 UTC (permalink / raw) To: Chad Brown; +Cc: emacs-devel > From: Chad Brown <yandros@gmail.com> > Date: Sun, 27 Sep 2015 09:03:54 -0700 > Cc: emacs-devel@gnu.org > > -finput-charset=charset > Set the input character set, used for translation from the character > set of the input file to the source character set used by GCC. If > the locale does not specify, or GCC cannot get this information > from the locale, the default is UTF-8. This can be overridden by > either the locale or this command line option. Currently the command > line option takes precedence if there's a conflict. charset can be > any encoding supported by the system's iconv library routine. Note the "if the locale does not specify" clause. That should almost never happen. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 18:41 ` Eli Zaretskii @ 2015-09-27 19:52 ` Chad Brown 2015-09-27 20:52 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Chad Brown @ 2015-09-27 19:52 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > On 27 Sep 2015, at 11:41, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: Chad Brown <yandros@gmail.com> >> Date: Sun, 27 Sep 2015 09:03:54 -0700 >> Cc: emacs-devel@gnu.org >> >> -finput-charset=charset >> Set the input character set, used for translation from the character >> set of the input file to the source character set used by GCC. If >> the locale does not specify, or GCC cannot get this information >> from the locale, the default is UTF-8. This can be overridden by >> either the locale or this command line option. Currently the command >> line option takes precedence if there's a conflict. charset can be >> any encoding supported by the system's iconv library routine. > > Note the "if the locale does not specify" clause. That should almost > never happen. Sure. I almost mentioned that, but at the time it seemed clear to me that we were talking about the defaults for each. I used to deal with this issue ‘back in the day’, so it provoked my curiosity enough to look. Roughly speaking, the modern ‘programming languages’ these days are UTF-8, while a decent chunk of the ‘scripting languages’ seem to be in a messier state, but with established methods (coding cookies, odd quoting, ascii by fiat, try not to look at comments, etc). Since then, exchanges on this thread have suggested that maybe I was wrong about the topic at hand, but the data still seemed useful, so I pushed it along, with the full quote for context. Sorry if it caused confusion. Thanks, ~Chad ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 19:52 ` Chad Brown @ 2015-09-27 20:52 ` Eli Zaretskii 0 siblings, 0 replies; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 20:52 UTC (permalink / raw) To: Chad Brown; +Cc: emacs-devel > From: Chad Brown <yandros@gmail.com> > Date: Sun, 27 Sep 2015 12:52:15 -0700 > Cc: emacs-devel@gnu.org > > > > On 27 Sep 2015, at 11:41, Eli Zaretskii <eliz@gnu.org> wrote: > > > >> From: Chad Brown <yandros@gmail.com> > >> Date: Sun, 27 Sep 2015 09:03:54 -0700 > >> Cc: emacs-devel@gnu.org > >> > >> -finput-charset=charset > >> Set the input character set, used for translation from the character > >> set of the input file to the source character set used by GCC. If > >> the locale does not specify, or GCC cannot get this information > >> from the locale, the default is UTF-8. This can be overridden by > >> either the locale or this command line option. Currently the command > >> line option takes precedence if there's a conflict. charset can be > >> any encoding supported by the system's iconv library routine. > > > > Note the "if the locale does not specify" clause. That should almost > > never happen. > > Sure. I almost mentioned that, but at the time it seemed clear > to me that we were talking about the defaults for each. The issue at hand is whether Emacs should favor UTF-8 _before_ the locale-derived defaults. What happens when the locale cannot be queried wasn't touched at all. I don't think such a situation is a real possibility in the first place, and if it is, I don't object if we'd use UTF-8 in that case. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-26 19:35 ` Eli Zaretskii 2015-09-26 20:26 ` Chad Brown @ 2015-09-26 20:32 ` Paul Eggert 2015-09-27 7:27 ` Eli Zaretskii 1 sibling, 1 reply; 70+ messages in thread From: Paul Eggert @ 2015-09-26 20:32 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, monnier, emacs-devel Eli Zaretskii wrote: > The relevant statistics for Emacs is of source files, not of HTML > pages. Sure, and source files are how this thread got started: nowadays in GNU projects they're typically UTF-8 regardless of system locale settings, and Emacs should be better about supporting this typical situation. UTF-8 is common partly because source files are shared widely via the Internet, on sites like Savannah. The days of lonely hackers writing code in their own private Shift-JIS directories are largely over. Of course Emacs can still support such users, but the default should be tailored to what's more typical nowadays. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-26 20:32 ` Paul Eggert @ 2015-09-27 7:27 ` Eli Zaretskii 2015-09-27 7:42 ` David Kastrup 2015-09-27 8:22 ` Paul Eggert 0 siblings, 2 replies; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 7:27 UTC (permalink / raw) To: Paul Eggert; +Cc: dak, monnier, emacs-devel > Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sat, 26 Sep 2015 13:32:33 -0700 > > Eli Zaretskii wrote: > > The relevant statistics for Emacs is of source files, not of HTML > > pages. > > Sure, and source files are how this thread got started: nowadays in GNU projects > they're typically UTF-8 regardless of system locale settings, and Emacs should > be better about supporting this typical situation. UTF-8 is common partly > because source files are shared widely via the Internet, on sites like Savannah. > > The days of lonely hackers writing code in their own private Shift-JIS > directories are largely over. Of course Emacs can still support such users, but > the default should be tailored to what's more typical nowadays. Emacs supports the typical situation quite well already, definitely so in a typical (i.e. UTF-8) locale. The issue at hand is not how to support the typical situation, it's whether that typical situation is the _only_ situation that matters, so much so that we can ignore the locale-derived defaults. In any case, I said we needed _statistics_, i.e. numbers, not just impressions and opinions. I don't know how to find a representative set of C sources, not even for European locales. I looked at the C files of GNU projects from the last years on my main development system, which is probably not very representative. There are more than 142,000 C files there. Using the 'file' utility, I found about 1.8% of UTF-8 encoded files and about 0.2% ISO-8859 encoded files (the vast majority was US ASCII, of course). 
That's still more than 250 ISO-8859 encoded files. I've also looked at the *.po files in the latest releases of GNU Make, Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of such files still use non-UTF-8 encodings. I see similar figures for the txi-*.tex files that came with Texinfo 6.0. Presumably, that follows the default conventions of the respective locales. So, while I agree with you that UTF-8 encoded files are the majority among non-ASCII files (and Emacs development aligns itself with that fact very well), the non-UTF-8 minority, even in the Posix world, is still significant enough, and we cannot possibly ignore it. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 7:27 ` Eli Zaretskii @ 2015-09-27 7:42 ` David Kastrup 2015-09-27 9:20 ` Rustom Mody 2015-09-27 8:22 ` Paul Eggert 1 sibling, 1 reply; 70+ messages in thread From: David Kastrup @ 2015-09-27 7:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Paul Eggert, monnier, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > I've also looked at the *.po files in the latest releases of GNU Make, > Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of > such files still use non-UTF-8 encodings. Which, btw, I consider crazy. It's one thing to pick an encoding for local language processing and display. But for an internationalization system, it does not really make sense to venture to local encodings outside of I/O. There is a really strong case for using only UTF-8 in PO files instead of juggling with many-to-many encoding setups. > I see similar figures for the txi-*.tex files that came with Texinfo > 6.0. Presumably, that follows the default conventions of the > respective locales. Texinfo uses PDFTeX for its encoding processing, and PDFTeX is firmly an 8-bit system. TeX wouldn't be TeX if it wasn't macroprogrammed to deal with that, but Texinfo being a rather low-level format, UTF-8 processing time dwarfs anything else. So if you have, say, a German input file for Texinfo and can process it either in Latin-1 or UTF-8, chances are that the Latin-1 version runs more than twice as fast. Now that's of course just the processing in printed form. Thanks to Texinfo now being written in Perl, the PDFTeX backend is likely the fastest right now either way so it may not be as much of a concern. But many Texinfo sources originate from a time when UTF-8 was either not supported at all or was a major contributor to conversion time. -- David Kastrup ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 7:42 ` David Kastrup @ 2015-09-27 9:20 ` Rustom Mody 2015-09-27 10:13 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Rustom Mody @ 2015-09-27 9:20 UTC (permalink / raw) To: emacs-devel On Sun, Sep 27, 2015 at 1:12 PM, David Kastrup <dak@gnu.org> wrote: > > Eli Zaretskii <eliz@gnu.org> writes: > > > I've also looked at the *.po files in the latest releases of GNU Make, > > Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of > > such files still use non-UTF-8 encodings. > > Which, btw, I consider crazy. > Ive been trying to understand this stuff and was looking at eg. lisp/language/indian.el In there I find that: (defconst bengali-composable-pattern (let ((table '(("a" . "\u0981") ; SIGN CANDRABINDU ("A" . "[\u0982-\u0983]") ; SIGN ANUSVARA .. VISARGA ("V" . "[\u0985-\u0994\u09E0-\u09E1]") ; independent vowel ("C" . "[\u0995-\u09B9\u09DC-\u09DF\u09F1]") ; consonant ("B" . "[\u09AC\u09AF-\u09B0\u09F0]") ; BA, YA, RA ("R" . "[\u09B0\u09F0]") ; RA ("n" . "\u09BC") ; NUKTA ("v" . "[\u09BE-\u09CC\u09D7\u09E2-\u09E3]") ; vowel sign ("H" . "\u09CD") ; HALANT ("T" . "\u09CE") ; KHANDA TA ("N" . "\u200C") ; ZWNJ ("J" . "\u200D") ; ZWJ ("X" . "[\u0980-\u09FF]")))) ; all coverage etc etc And repeated with small variations for devanagari, tamil, telugu etc It would sure help a native speaker if the comment and the ucs-hex were interchanged with the actual chars used instead. So then I checked why the file was showing as UTF-8 encoded. Found this one non-ASCII line: (set-language-info-alist "Kannada" '((charset unicode) (coding-system mule-utf-8) (coding-priority mule-utf-8) (input-method . "kannada-itrans") (sample-text . "Kannada (ಕನ್ನಡ) ನಮಸ್ಕಾರ") (documentation . 
"\ Kannada language and script is supported in this language environment.")) '("Indian")) It strikes me that this sample text should be there for the other languages also but it does not seem to be there Just for context if I can understand whats going on, I would like to help improve this/these docs: (info "(elisp)input methods") | How to define input methods is not yet documented in this manual, but here we | describe how to use them. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 9:20 ` Rustom Mody @ 2015-09-27 10:13 ` Eli Zaretskii 2015-09-27 20:21 ` Paul Eggert 0 siblings, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 10:13 UTC (permalink / raw) To: Rustom Mody; +Cc: emacs-devel > From: Rustom Mody <rustompmody@gmail.com> > Date: Sun, 27 Sep 2015 14:50:48 +0530 > > Ive been trying to understand this stuff and was looking at eg. > lisp/language/indian.el > > In there I find that: > (defconst bengali-composable-pattern > (let ((table > '(("a" . "\u0981") ; SIGN CANDRABINDU > ("A" . "[\u0982-\u0983]") ; SIGN ANUSVARA .. VISARGA > ("V" . "[\u0985-\u0994\u09E0-\u09E1]") ; independent vowel > ("C" . "[\u0995-\u09B9\u09DC-\u09DF\u09F1]") ; consonant > ("B" . "[\u09AC\u09AF-\u09B0\u09F0]") ; BA, YA, RA > ("R" . "[\u09B0\u09F0]") ; RA > ("n" . "\u09BC") ; NUKTA > ("v" . "[\u09BE-\u09CC\u09D7\u09E2-\u09E3]") ; vowel sign > ("H" . "\u09CD") ; HALANT > ("T" . "\u09CE") ; KHANDA TA > ("N" . "\u200C") ; ZWNJ > ("J" . "\u200D") ; ZWJ > ("X" . "[\u0980-\u09FF]")))) ; all coverage > etc etc This is unrelated: it specifies which character sequences should be composed and displayed as a single grapheme cluster. > So then I checked why the file was showing as UTF-8 encoded. > > Found this one non-ASCII line: > > (set-language-info-alist > "Kannada" '((charset unicode) > (coding-system mule-utf-8) > (coding-priority mule-utf-8) > (input-method . "kannada-itrans") > (sample-text . "Kannada (ಕನ್ನಡ) ನಮಸ್ಕಾರ") > (documentation . "\ > Kannada language and script is supported in this language > environment.")) > '("Indian")) > > It strikes me that this sample text should be there for the other > languages also but it does not seem to be there You cannot base encoding decisions on the language or script alone, unless that language exists in a single locale. 
Many languages and scripts serve several different locales with several different default encodings. > Just for context if I can understand whats going on, I would like to > help improve this/these docs: > > > (info "(elisp)input methods") > > | How to define input methods is not yet documented in this manual, > but here we > | describe how to use them. Again unrelated. Input methods are about typing characters not directly supported by the user's keyboard. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 10:13 ` Eli Zaretskii @ 2015-09-27 20:21 ` Paul Eggert 2015-09-27 21:04 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Paul Eggert @ 2015-09-27 20:21 UTC (permalink / raw) To: Eli Zaretskii, Rustom Mody; +Cc: emacs-devel Eli Zaretskii wrote: > This is unrelated: it specifies which character sequences should be > composed and displayed as a single grapheme cluster. Yes. It might be reasonable to replace some of those \u instances for readability, e.g.: - ("V" . "[\u0904-\u0914\u0960-\u0961\u0972]") ; independent vowel + ("V" . "[ऄ-औॠ-ॡॲ]") ; independent vowel But replacements would not be such a good idea for some of this code, e.g.: - ("H" . "\u094D") ; HALANT + ("H" . "्") ; HALANT as standalone combining characters are problematic on display, and here: - ("J" . "\u200D") ; ZWJ + ("J" . "") ; ZWJ where one can't easily see a zero width joiner when editing the source file. I expect that whoever wrote that code felt more comfortable sticking with \u escapes uniformly, rather than using \u sometimes and not other times. ^ permalink raw reply [flat|nested] 70+ messages in thread
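To make the trade-off above concrete: in Emacs Lisp source, a `\u` escape and the literal character read as the same string, so the choice only affects how the file looks in an editor, not what the code does. A minimal sketch; the `my-` names are made up for illustration and do not appear in indian.el:

```elisp
;; Hypothetical names, for illustration only.
;; A \uXXXX escape in a string literal denotes the character itself,
;; so both constants below hold equal one-character strings.
(defconst my-zwj-escaped "\u200D")           ; ZERO WIDTH JOINER, visible in source
(defconst my-zwj-from-code (string #x200D))  ; same character built from its codepoint

(string= my-zwj-escaped my-zwj-from-code)    ; non-nil
(length my-zwj-escaped)                      ; 1

;; A character class written with escapes matches the literal character.
(string-match-p "[\u0904-\u0914]" (string #x0905)) ; DEVANAGARI LETTER A is in range
```

With the escape form, an invisible character such as ZWJ stays visible and searchable in the source, which is the readability concern raised for the "J" entry.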
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 20:21 ` Paul Eggert @ 2015-09-27 21:04 ` Eli Zaretskii 0 siblings, 0 replies; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 21:04 UTC (permalink / raw) To: Paul Eggert; +Cc: rustompmody, emacs-devel > Cc: emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sun, 27 Sep 2015 13:21:51 -0700 > > Eli Zaretskii wrote: > > This is unrelated: it specifies which character sequences should be > > composed and displayed as a single grapheme cluster. > > Yes. It might be reasonable to replace some of those \u instances for > readability, e.g.: > > - ("V" . "[\u0904-\u0914\u0960-\u0961\u0972]") ; independent vowel > + ("V" . "[ऄ-औॠ-ॡॲ]") ; independent vowel I'm not so sure this is a good idea: since most of us don't read Indic scripts, leaving the codepoints there makes it easier to compare these patterns with various relevant publications and standards on the Internet. If we make them characters instead, most of us will have to use "C-x =" to see the codepoints anyway. > But replacements would not be such a good idea for some of this code, e.g.: > > - ("H" . "\u094D") ; HALANT > + ("H" . "्") ; HALANT > > as standalone combining characters are problematic on display, and here: > > - ("J" . "\u200D") ; ZWJ > + ("J" . "") ; ZWJ > > where one can't easily see a zero width joiner when editing the > source file. Indeed. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 7:27 ` Eli Zaretskii 2015-09-27 7:42 ` David Kastrup @ 2015-09-27 8:22 ` Paul Eggert 2015-09-27 8:55 ` Eli Zaretskii 2015-09-27 9:56 ` Andreas Schwab 1 sibling, 2 replies; 70+ messages in thread From: Paul Eggert @ 2015-09-27 8:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, monnier, emacs-devel Eli Zaretskii wrote: > I've also looked at the *.po files in the latest releases of GNU Make, > Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of > such files still use non-UTF-8 encodings. Yes, and those files are a pain to look at with Emacs now, since it typically misguesses their encodings. Presumably Emacs should be looking at .po files' charset= decorations. What's likely happening with those files is that they were originally created long ago in an 8-bit locale, and nobody has bothered to update their encodings since then. Many of the files haven't been changed in ages (about half of them have revision dates before 2010), and of course the older files will prefer legacy encodings. These older files are not a particularly good match for text that people edit today. > while I agree with you that UTF-8 encoded files are the majority > among non-ASCII files (and Emacs development aligns itself with that > fact very well), the non-UTF-8 minority, even in the Posix world, is > still significant enough, and we cannot possibly ignore it. Naturally we cannot ignore it. All I'm suggesting is that we change the default behavior so that it's more UTF-8 friendly, since that's the way the world is going. The old Emacs behavior should still be available, for people who need it. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:22 ` Paul Eggert @ 2015-09-27 8:55 ` Eli Zaretskii 2015-09-27 9:56 ` Andreas Schwab 1 sibling, 0 replies; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 8:55 UTC (permalink / raw) To: Paul Eggert; +Cc: dak, monnier, emacs-devel > Cc: dak@gnu.org, monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sun, 27 Sep 2015 01:22:48 -0700 > > Eli Zaretskii wrote: > > I've also looked at the *.po files in the latest releases of GNU Make, > > Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of > > such files still use non-UTF-8 encodings. > > Yes, and those files are a pain to look at with Emacs now, since it typically > misguesses their encodings. Presumably Emacs should be looking at .po files' > charset= decorations. You need to install the po-mode. But anyway, that's not the issue at hand. I just used those files as indicators of preferences of some locales. > > while I agree with you that UTF-8 encoded files are the majority > > among non-ASCII files (and Emacs development aligns itself with that > > fact very well), the non-UTF-8 minority, even in the Posix world, is > > still significant enough, and we cannot possibly ignore it. > > Naturally we cannot ignore it. All I'm suggesting is that we change the default > behavior so that it's more UTF-8 friendly, since that's the way the world is > going. The old Emacs behavior should still be available, for people who need it. You use "default" here in a sense that is different from what the Mule stuff does. Since Emacs attempts to support i18n, not just l10n, it cannot ask users to modify their defaults whenever they meet a file that's decoded incorrectly. Emacs uses the defaults in this area as the last resort, when no other information is available in the file itself or its accompanying meta-data. 
That default is already as friendly to UTF-8 as possible: UTF-8 is used in any locale where that's the default. Going further, i.e. preferring UTF-8 in locales whose preferences are different, will simply bring back the old bugs and misfeatures of Emacs 20 and 21 which we worked so hard to eradicate. IMO, the _only_ sane way forward is to introduce more reliable ways of detecting the encoding, whether by using some new kinds of meta-data or by more extensive analysis of the text itself. (The latter solution will probably have difficulties with decoding sub-process output, but it could be very efficient with disk files and large bodies of text made available to Emacs at once.) IOW, I don't think we will be able to change our locale-derived defaults any time soon. What we can do is minimize the probability of having to fall back on those defaults. But this requires that Someone™ volunteers to revamp our detect_coding_* implementations in that direction. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 8:22 ` Paul Eggert 2015-09-27 8:55 ` Eli Zaretskii @ 2015-09-27 9:56 ` Andreas Schwab 2015-09-27 10:04 ` David Kastrup 1 sibling, 1 reply; 70+ messages in thread From: Andreas Schwab @ 2015-09-27 9:56 UTC (permalink / raw) To: Paul Eggert; +Cc: Eli Zaretskii, dak, monnier, emacs-devel Paul Eggert <eggert@cs.ucla.edu> writes: > Yes, and those files are a pain to look at with Emacs now, since it > typically misguesses their encodings. Presumably Emacs should be looking > at .po files' charset= decorations. It does already if you use the po-mode distributed with gettext. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 70+ messages in thread
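For readers unfamiliar with the `charset=` decoration: a PO file's header entry carries a Content-Type line that names the file's encoding. The sketch below only shows the idea of reading that line. `my-po-charset` is a hypothetical helper, not the code in po.el or in gettext's po-mode, and the sample header is invented:

```elisp
;; Hypothetical helper: pull the charset name out of PO header text.
(defun my-po-charset (header)
  "Return the charset named in HEADER's Content-Type line, or nil."
  (let ((case-fold-search t))
    (when (string-match "charset=\\([-_A-Za-z0-9]+\\)" header)
      (match-string 1 header))))

;; Invented sample header line, as it might appear near the top of a .po file.
(my-po-charset "\"Content-Type: text/plain; charset=ISO-8859-1\\n\"")
;; => "ISO-8859-1"
```

A real implementation would still have to map the name to an Emacs coding system (and check it with something like `coding-system-p`) before using it to decode the file.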
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 9:56 ` Andreas Schwab @ 2015-09-27 10:04 ` David Kastrup 2015-09-27 10:16 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: David Kastrup @ 2015-09-27 10:04 UTC (permalink / raw) To: Andreas Schwab; +Cc: Eli Zaretskii, Paul Eggert, monnier, emacs-devel Andreas Schwab <schwab@linux-m68k.org> writes: > Paul Eggert <eggert@cs.ucla.edu> writes: > >> Yes, and those files are a pain to look at with Emacs now, since it >> typically misguesses their encodings. Presumably Emacs should be looking >> at .po files' charset= decorations. > > It does already if you use the po-mode distributed with gettext. gettext being the standard GNU i18n mechanism, wouldn't it make sense to keep the latest version distributed with Emacs rather than requiring users to manually install them? -- David Kastrup ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 10:04 ` David Kastrup @ 2015-09-27 10:16 ` Eli Zaretskii 2015-09-27 10:36 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 10:16 UTC (permalink / raw) To: David Kastrup; +Cc: eggert, schwab, monnier, emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: Paul Eggert <eggert@cs.ucla.edu>, Eli Zaretskii <eliz@gnu.org>, monnier@iro.umontreal.ca, emacs-devel@gnu.org > Date: Sun, 27 Sep 2015 12:04:45 +0200 > > Andreas Schwab <schwab@linux-m68k.org> writes: > > > Paul Eggert <eggert@cs.ucla.edu> writes: > > > >> Yes, and those files are a pain to look at with Emacs now, since it > >> typically misguesses their encodings. Presumably Emacs should be looking > >> at .po files' charset= decorations. > > > > It does already if you use the po-mode distributed with gettext. > > gettext being the standard GNU i18n mechanism, wouldn't it make sense to > keep the latest version distributed with Emacs rather than requiring > users to manually install them? We discussed that at some point in the past. I don't remember why we decided not to do that, but a search in the archives might tell. Maybe those reasons are no longer relevant. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 10:16 ` Eli Zaretskii @ 2015-09-27 10:36 ` Eli Zaretskii 2015-09-27 10:59 ` Eli Zaretskii 0 siblings, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 10:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, dak, schwab, monnier, eggert > Date: Sun, 27 Sep 2015 13:16:18 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: eggert@cs.ucla.edu, schwab@linux-m68k.org, monnier@iro.umontreal.ca, > emacs-devel@gnu.org > > > From: David Kastrup <dak@gnu.org> > > Cc: Paul Eggert <eggert@cs.ucla.edu>, Eli Zaretskii <eliz@gnu.org>, monnier@iro.umontreal.ca, emacs-devel@gnu.org > > Date: Sun, 27 Sep 2015 12:04:45 +0200 > > > > Andreas Schwab <schwab@linux-m68k.org> writes: > > > > > Paul Eggert <eggert@cs.ucla.edu> writes: > > > > > >> Yes, and those files are a pain to look at with Emacs now, since it > > >> typically misguesses their encodings. Presumably Emacs should be looking > > >> at .po files' charset= decorations. > > > > > > It does already if you use the po-mode distributed with gettext. > > > > gettext being the standard GNU i18n mechanism, wouldn't it make sense to > > keep the latest version distributed with Emacs rather than requiring > > users to manually install them? > > We discussed that at some point in the past. I don't remember why we > decided not to do that, but a search in the archives might tell. > Maybe those reasons are no longer relevant. I've misremembered. The discussion is here: http://lists.gnu.org/archive/html/emacs-devel/2002-03/msg00167.html and, more importantly, its result is already in Emacs: file-coding-system-alist is a variable defined in ‘C source code’. Its value is shown below. [...] Value: (("\\.dz\\'" no-conversion . no-conversion) ("\\.txz\\'" no-conversion . no-conversion) ("\\.xz\\'" no-conversion . no-conversion) ("\\.lzma\\'" no-conversion . no-conversion) ("\\.lz\\'" no-conversion . 
no-conversion) ("\\.g?z\\'" no-conversion . no-conversion) ("\\.\\(?:tgz\\|svgz\\|sifz\\)\\'" no-conversion . no-conversion) ("\\.tbz2?\\'" no-conversion . no-conversion) ("\\.bz2\\'" no-conversion . no-conversion) ("\\.Z\\'" no-conversion . no-conversion) ("\\.elc\\'" . utf-8-emacs) ("\\.el\\'" . prefer-utf-8) ("\\.utf\\(-8\\)?\\'" . utf-8) ("\\.xml\\'" . xml-find-file-coding-system) ("\\(\\`\\|/\\)loaddefs.el\\'" raw-text . raw-text-unix) ("\\.tar\\'" no-conversion . no-conversion) ("\\.po[tx]?\\'\\|\\.po\\." . po-find-file-coding-system) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ("\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'" . latexenc-find-file-coding-system) ("" undecided)) And the bundled po.el already defines po-find-file-coding-system. So it sounds like we simply have a bug here. But once again, the handling of *.po files is not the issue here. The issue is whether we can ignore the possibility of non-UTF-8 encodings in locales whose codeset is not UTF-8. ^ permalink raw reply [flat|nested] 70+ messages in thread
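As a rough illustration of how the alist quoted above is keyed: each entry pairs a file-name regexp with either a coding system or a function that picks one, and the first matching regexp wins. The helper below is a hypothetical sketch that just performs that first-match lookup. Emacs itself goes through `find-operation-coding-system`, which additionally calls function values such as `po-find-file-coding-system`:

```elisp
;; Hypothetical helper: show which `file-coding-system-alist' entry a
;; file name hits.  `assoc-default' with `string-match' tries each regexp
;; key in order against the name and returns the value of the first match.
(defun my-coding-entry-for (filename)
  "Return the `file-coding-system-alist' value whose regexp matches FILENAME."
  (assoc-default filename file-coding-system-alist #'string-match))

(my-coding-entry-for "foo.el")   ; `prefer-utf-8' with the value quoted above
(my-coding-entry-for "de.po")    ; `po-find-file-coding-system' with that value
```

The results depend on the alist in the running Emacs; the comments assume the default value shown in the message, without user customizations.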
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 10:36 ` Eli Zaretskii @ 2015-09-27 10:59 ` Eli Zaretskii 2015-09-27 20:05 ` Paul Eggert 0 siblings, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-27 10:59 UTC (permalink / raw) To: eggert; +Cc: dak, schwab, monnier, emacs-devel > Date: Sun, 27 Sep 2015 13:36:08 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: emacs-devel@gnu.org, dak@gnu.org, schwab@linux-m68k.org, > monnier@iro.umontreal.ca, eggert@cs.ucla.edu > > ("\\.po[tx]?\\'\\|\\.po\\." . po-find-file-coding-system) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > ("\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'" . latexenc-find-file-coding-system) > ("" undecided)) > > And the bundled po.el already defines po-find-file-coding-system. > > So it sounds like we simply have a bug here. Ehm.. what bug? AFAICS, the encoding is correctly detected and used when I visit *.po files, no matter what is their encoding. So I'm not sure why Paul said: >> Yes, and those files are a pain to look at with Emacs now, since it >> typically misguesses their encodings. Presumably Emacs should be looking >> at .po files' charset= decorations. as I see no such problems. Maybe Paul has some customizations that somehow disable po.el's detection? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 10:59 ` Eli Zaretskii @ 2015-09-27 20:05 ` Paul Eggert 0 siblings, 0 replies; 70+ messages in thread From: Paul Eggert @ 2015-09-27 20:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, schwab, monnier, emacs-devel Eli Zaretskii wrote: > Maybe Paul has some customizations that > somehow disable po.el's detection? Yes, sorry, false alarm; I had put a hack a while ago into my .emacs file temporarily for testing coding systems, and forgot that it was there. When I removed that hack, most of the problem went away. po-mode still has a coding-system problem with ASCII files (of all things!). I just now filed a bug report for it (Bug#21574). Surely this is low priority, as I expect hardly anybody uses ASCII .po files nowadays. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-26 16:01 ` Paul Eggert 2015-09-26 16:09 ` David Kastrup @ 2015-09-26 17:25 ` Eli Zaretskii 2015-09-26 18:51 ` Paul Eggert 2015-09-27 0:12 ` stephen 2 siblings, 1 reply; 70+ messages in thread From: Eli Zaretskii @ 2015-09-26 17:25 UTC (permalink / raw) To: Paul Eggert; +Cc: monnier, emacs-devel > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Paul Eggert <eggert@cs.ucla.edu> > Date: Sat, 26 Sep 2015 09:01:04 -0700 > > Eli Zaretskii wrote: > > So you are, in effect, saying that it is incorrect to derive the > > default encodings from the locale's codeset? > > Yes, for Emacs developers. And come to think of it, for most Emacs users. > Nowadays in my experience most non-ASCII text files use UTF-8, regardless of > locale. Are you sure your experience isn't biased by the fact you mostly work in UTF-8 locales? > The old days of having to guess encoding from the locale are passing > away. This is partly due to UTF-8 being the encoding of choice for > HTML and XML, where UTF-8 overtook the older 8-bit encodings in 2008 > and now is by far the dominant encoding. We already DTRT with XML files, and should be doing TRT with any file format that includes the specification of the encoding in it. The problem, IMO, is not only with disk files. It is also with email messages, output from processes, etc. E.g., I routinely get Latin-1 encoded email from people whose platform is GNU/Linux. IOW, non-UTF encodings are far from being dead yet. Using UTF-8 by default is certainly wrong on MS-Windows. > One way to accommodate the new reality would be to change Emacs so that by > default the system locale does not affect Emacs's guess of a file's encoding if > the file's initial sample is valid UTF-8. Users could set a variable to > re-enable the old behavior. 
The problem with this line of thought is the "initial sample" part: how far into the file should we look, and how far is far enough? E.g., tips.texi has its first non-ASCII character at character position 25353. We've been there before, and found this not reliable enough. Anyway, doesn't "(prefer-coding-system 'utf-8)" already do what you want us to offer? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-26 17:25 ` Eli Zaretskii @ 2015-09-26 18:51 ` Paul Eggert 0 siblings, 0 replies; 70+ messages in thread From: Paul Eggert @ 2015-09-26 18:51 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Eli Zaretskii wrote: > Anyway, doesn't "(prefer-coding-system 'utf-8)" already do what you > want us to offer? If that works, then let's make it the default, at least on non-MS-Windows platforms. I normally work in a UTF-8 locale, so I assume it'd be a no-op for me, but perhaps it would help for others. ^ permalink raw reply [flat|nested] 70+ messages in thread
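For reference, the setting under discussion is a single form that a user can put in an init file today. A minimal sketch of what it looks like and how to inspect its effect; whether it should become the default is exactly what the thread is debating:

```elisp
;; In an init file: put UTF-8 first in the priority list that Emacs
;; consults when it has to guess a coding system.
(prefer-coding-system 'utf-8)

;; Inspect the resulting order; UTF-8 should now be at the front.
(car (coding-system-priority-list))
```

Note that `prefer-coding-system` also changes defaults beyond detection, such as the coding system for newly created buffers and for subprocess I/O, so it is a broader switch than "try UTF-8 first when reading".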
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-26 16:01 ` Paul Eggert 2015-09-26 16:09 ` David Kastrup 2015-09-26 17:25 ` Eli Zaretskii @ 2015-09-27 0:12 ` stephen 2015-09-27 4:44 ` Paul Eggert 2 siblings, 1 reply; 70+ messages in thread From: stephen @ 2015-09-27 0:12 UTC (permalink / raw) To: emacs-devel >>>>> Paul Eggert writes: > Eli Zaretskii wrote: >> So you are, in effect, saying that it is incorrect to derive the >> default encodings from the locale's codeset? > Yes, for Emacs developers. I think this makes sense. IIUC Emacs already uses characters outside of the Unicode repertoire, so it shouldn't be too hard to replicate any Emacs capabilities that require non-Unicode characters or charsets *inside* Emacs by using such characters. Assuming there are any; I suspect even HELLO doesn't actually need them. There's no "gaiji" problem of how to tell Emacs what to do with those characters; the developer who introduces them into Emacs is responsible for adding them to Emacs's non-Unicode repertoire. > And come to think of it, for most Emacs users. I hope not, because that would imply that Emacs users in China, Japan, probably Korea, and Taiwan are becoming a decreasing rather than increasing fraction of Emacs users. > Nowadays in my experience most non-ASCII text files use UTF-8, > regardless of locale. Toto, I don't think we're in Kansas any more. > The old days of having to guess encoding from the locale are > passing away. This is partly due to UTF-8 being the encoding of > choice for HTML and XML, where UTF-8 overtook the older 8-bit > encodings in 2008 and now is by far the dominant encoding. On the commercial internet, yes, but not for government and academic sites in Japan and China. > One way to accommodate the new reality would be to Recognize that it's probably due to insufficient experience? 
> change Emacs so that by default the system locale does not affect > Emacs's guess of a file's encoding if the file's initial sample is > valid UTF-8. "Not affect" is probably a bad idea. Giving UTF-8 too strong preference on Windows is a bad idea, because there are a lot of Windows coding systems that use UTF-8 trailing bytes to represent characters; it's occasionally possible to run into UTF-8-conforming files that are intended to be something else. This isn't true for ISO-8859 coding systems. > Users could set a variable to re-enable the old behavior. If we > did this, we wouldn't have the error-prone process if sprinkling > 'coding: utf-8' cookies all over the place. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 0:12 ` stephen @ 2015-09-27 4:44 ` Paul Eggert 2015-09-27 6:20 ` stephen 0 siblings, 1 reply; 70+ messages in thread From: Paul Eggert @ 2015-09-27 4:44 UTC (permalink / raw) To: stephen, emacs-devel stephen@xemacs.org wrote: > This is partly due to UTF-8 being the encoding of > > choice for HTML and XML, where UTF-8 overtook the older 8-bit > > encodings in 2008 and now is by far the dominant encoding. > > On the commercial internet, yes, but not for government and academic > sites in Japan and China. I think your information is out of date. Yes, ten years ago there was a lot of non-UTF-8 out there, but nowadays they've largely moved on to UTF-8. For fun I just now visited a few of the top government and academic websites in Japan: http://www.japan.go.jp/ http://www.mofa.go.jp/ http://nettv.gov-online.go.jp/ http://www.e-kokusei.go.jp/ https://www.env.go.jp/ http://www.u-tokyo.ac.jp/ http://www.kyoto-u.ac.jp/ http://www.osaka-u.ac.jp/ http://www.keio.ac.jp/ I configured my browser to say that I preferred Japanese text. All ten web sites gave me UTF-8. Feel free to canvass China, but I daresay you'll find the same. Of course one can still find a few web sites using other encodings, but like it or not, UTF-8 dominates now. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 4:44 ` Paul Eggert @ 2015-09-27 6:20 ` stephen 2015-09-27 8:34 ` Paul Eggert 0 siblings, 1 reply; 70+ messages in thread From: stephen @ 2015-09-27 6:20 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel Paul Eggert writes: > I think your information is out of date. Rather, I think that yours is superficial. Really, you should listen to those of us who live and work outside of the ASCII hemisphere. I live and teach in Japan (a stone's throw from ETL, as it happens), and most of the students I supervise are Chinese. I regularly need to access Chinese and Japanese government and corporate data, and retrieve preprints and data (and sometimes code) from the personal pages of other scholars. Mojibake in the HTML pages is frequent, in both Firefox and Chrome (of course it's almost always easy to guess the actual coded character set in use, but it is mojibake). A frequent cause is webservers configured to send "Content-Type: text/html; charset=utf-8" but the page is encoded in something else. > Yes, ten years ago there was a lot of non-UTF-8 out there, but > nowadays they've largely moved on to UTF-8. "Beauty is only skin-deep." The *top* pages, and some whole sites, have moved on, because having beautiful (if mostly useless) top pages is a matter of "face", so they buy new ones from companies with fancy up-to-date web design software every couple of years. Perhaps most recently authored pages are UTF-8. But the data sets themselves are typically flat files, either CSV or plaintext. The explanatory pages, even if in HTML, often haven't been revised in decades. Such useful content is typically in a national standard coded character set rather than Unicode. And Emacs is hardly limited to the web. In practice, almost all mail I receive from Chinese (even when it is in English or Japanese) is labelled GB2312, GBK, or GB18030. 
The great majority of Japanese mail is either Shift JIS or ISO 2022 JP (sometimes with "OEM characters" that even today aren't in Unicode because they're not in JIS). > Of course one can still find a few web sites using other encodings, > but like it or not, UTF-8 dominates now. What's not to like about UTF-8?! I *wish* non-UTF-8 was a matter of information archaeology and Buddhist scholarship! I'm sad to say, it is not: GB variants, Big5, and JIS variants are the *majority* of the non-ASCII data I handle every day in my Emacs. (It's not the "great majority" only because about 30% of the non-ASCII text I handle in Emacs is authored by me, in UTF-8, of course.) Regards, ^ permalink raw reply [flat|nested] 70+ messages in thread
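The mislabeled-page problem described above, and the earlier point in this thread about Windows coding systems reusing UTF-8 trailing bytes, both come down to the same fact: a byte sequence can be well-formed in more than one coding system. A small sketch under that assumption; the byte pair and the `my-` name are chosen purely for illustration:

```elisp
;; The two bytes #xC5 #x93, written with octal escapes so the literal
;; is a unibyte (raw byte) string.
(defconst my-sample-bytes "\305\223")   ; hypothetical name

;; Decoded as UTF-8 they are one character, U+0153 (œ).
(decode-coding-string my-sample-bytes 'utf-8)

;; Decoded as windows-1252 they are two characters, U+00C5 U+201C (Å“).
(decode-coding-string my-sample-bytes 'windows-1252)
```

Neither decoding signals an error, so validity alone cannot tell the reader which text the author meant; that takes a label, or a guess informed by context.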
* Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files 2015-09-27 6:20 ` stephen @ 2015-09-27 8:34 ` Paul Eggert 0 siblings, 0 replies; 70+ messages in thread From: Paul Eggert @ 2015-09-27 8:34 UTC (permalink / raw) To: stephen; +Cc: emacs-devel stephen@xemacs.org wrote: > Perhaps most > recently authored pages are UTF-8. But the data sets themselves are > typically flat files, either CSV or plaintext. The explanatory pages, > even if in HTML, often haven't been revised in decades. Yes, that's pretty much my experience. In Japan older stuff is mostly Shift-JIS, EUC, or maybe ISO-2022-JP. New stuff is mostly UTF-8. People using old email software send old encodings because that's what they've been doing for decades. Normally it works, because the email envelope tells you the encoding. But sometimes people screw up and you get mojibake. But this situation is not an argument for having the locale determine encoding when visiting random imported files that lack envelopes. For such files, it often doesn't work to set LC_ALL=ja_JP.ujis and expect Emacs to get things right. (This is one of things that Eli has noted multiple times, and he's right.) Of course if one is working in a conservative Japanese government ministry that standardized on Shift-JIS back in 1992 and hasn't changed since then, then things are different, and Emacs should support such users. But typical Emacs users are not in this situation, and the Emacs default should cater to the more-typical case today. To narrow things down a bit I briefly looked for .jp websites that talk about Emacs. Google reported the following first page's worth of hits (I list year of composition, encoding, and URL). Again, the new stuff is mostly UTF-8, and the old stuff is a mishmash, so it's another data point suggesting that defaulting to UTF-8 would not be such a bad thing for editing today's text. 
2002 Shift-JIS http://www.rsch.tuis.ac.jp/~ohmi/literacy/emacs/quick.html 2008 ISO-2022-JP http://www.wakayama-u.ac.jp/~takehiko/webprg/03.html 2015 EUC-JP http://d.hatena.ne.jp/tarao/20150221/1424518030 2015 UTF-8 http://uguisu.skr.jp/Windows/emacs.html 2015 UTF-8 http://www.amazon.co.jp/Emacs%E5%AE%9F%E8%B7%B5%E5%85%A5%E9%96%80-%EF%BD%9E%E6%80%9D%E8%80%83%E3%82%92%E7%9B%B4%E6%84%9F%E7%9A%84%E3%81%AB%E3%82%B3%E3%83%BC%E3%83%89%E5%8C%96%E3%81%97%E3%80%81%E9%96%8B%E7%99%BA%E3%82%92%E5%8A%A0%E9%80%9F%E3%81%99%E3%82%8B-WEB-DB-PRESS-plus/dp/4774150029 2015 UTF-8 http://www.sigasi.jp/better-emacs-vhdl-mode 2006 Shift-JIS http://www.math.kobe-u.ac.jp/icms2006/icms2006-video/slides/grayson/share/doc/Macaulay2/Macaulay2/html/_teaching_spemacs_sphow_spto_spfind_sp__M2.html 2015 UTF-8 https://osdn.jp/projects/gnupack/ ^ permalink raw reply [flat|nested] 70+ messages in thread