unofficial mirror of emacs-devel@gnu.org 
* request to revert the chnage of revno 112925
@ 2013-06-19 11:54 Kenichi Handa
  2013-06-19 12:59 ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Kenichi Handa @ 2013-06-19 11:54 UTC (permalink / raw)
  To: emacs-devel

I'd like to revert the following change (revno 112925):

2013-06-11  Stefan Monnier  <monnier@iro.umontreal.ca>

	* international/mule-conf.el (file-coding-system-alist): Use utf-8 as
	default for Elisp files.

With my recent changes to tune ASCII and UTF-8 file
reading, the speed of reading a UTF-8 file is almost the
same in the following cases:

* the file has a coding: utf-8 tag
* find-operation-coding-system returns utf-8 for the file
  (the current case of *.el files)
* the file has no coding tag and utf-8 is the most preferred
  coding system
* the file has no coding tag and utf-8 is the most preferred
  coding system among 8-bit encodings (which means that
  7-bit coding systems such as iso-2022-jp may be preferred
  over it)

So, the above change does not improve the performance that
much.

In addition, as iso-2022-jp and iso-2022-7bit have been the
most reliably detected coding systems in any environment,
there are many packages that use them for *.el files (at
least in Japan).  Now many of them don't work.

In some sense, the above change is a regression because it
disables Emacs' facility to automatically decode ISO 2022
based 7-bit encodings, and we should notify users about such
a change in advance, for instance, by having the
byte-compiler show warnings for non-UTF-8 and no-coding-tag
*.el files for a while (perhaps while Emacs' version is 24.*).
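
To illustrate the breakage, here is a minimal sketch in Python (not
Elisp, and not how Emacs implements decoding).  It relies only on the
fact that ISO-2022-JP is a 7-bit encoding: all of its bytes are below
0x80, so decoding them as UTF-8 raises no error and simply yields the
escape sequences as text.  The sample string is an arbitrary assumption.

```python
# Sketch: why forcing utf-8 silently mis-decodes an ISO-2022-JP file.
sample = "日本語"                    # arbitrary sample text (assumption)
raw = sample.encode("iso2022_jp")    # 7-bit bytes containing ESC sequences

# Every byte is 7-bit, so the data is also byte-wise valid UTF-8.
is_seven_bit = all(b < 0x80 for b in raw)

# Forcing UTF-8 succeeds without any error but does not recover the text.
forced = raw.decode("utf-8")

# Decoding with the matching coding system round-trips correctly.
detected = raw.decode("iso2022_jp")

print(is_seven_bit, forced == sample, detected == sample)
# prints: True False True
```

Because no decoding error occurs, nothing signals the mismatch unless
the ISO 2022 escape sequences themselves are examined.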

---
Kenichi Handa
handa@gnu.org





* Re: request to revert the chnage of revno 112925
  2013-06-19 11:54 request to revert the chnage of revno 112925 Kenichi Handa
@ 2013-06-19 12:59 ` Stefan Monnier
  2013-06-19 15:35   ` Kenichi Handa
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2013-06-19 12:59 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> So, the above change does not improve the performance that much.

The change was not done for performance reasons.

> In addition, as iso-2022-jp and iso-2022-7bit have been the
> most reliably detected coding systems in any environment,
> there are many packages that use them for *.el files (at
> least in Japan).  Now many of them don't work.

Then maybe we should use a new coding-system which falls back on
iso-2022 if some incorrect utf-8 byte sequence is found?

> In some sense, the above change is a regression because it
> disables Emacs' facility to automatically decode ISO 2022
> based 7-bit encodings, and we should notify users about such
> a change in advance, for instance, by having the
> byte-compiler show warnings for non-UTF-8 and no-coding-tag
> *.el files for a while (perhaps while Emacs' version is 24.*).

I knew that the change was "risky".  Admittedly, part of the motivation
was to see how much breakage we'd bump into.

But the core of what I want: make it so that utf-8 Elisp files are
always recognized correctly, even in the absence of a coding: tag, and
regardless of the user's locale.
The way I implemented it broke recognition of iso-2022, but if there's
some other way that doesn't break it, that's even better.


        Stefan




* Re: request to revert the chnage of revno 112925
  2013-06-19 12:59 ` Stefan Monnier
@ 2013-06-19 15:35   ` Kenichi Handa
  2013-06-19 16:11     ` Paul Eggert
  2013-06-19 16:54     ` Stefan Monnier
  0 siblings, 2 replies; 15+ messages in thread
From: Kenichi Handa @ 2013-06-19 15:35 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

In article <jwv1u7yl0x6.fsf-monnier+emacs@gnu.org>, Stefan Monnier
<monnier@iro.umontreal.ca> writes:

> But the core of what I want: make it so that utf-8 Elisp files are
> always recognized correctly, even in the absence of a coding: tag, and
> regardless of the user's locale.
> The way I implemented it broke recognition of iso-2022, but if there's
> some other way that doesn't break it, that's even better.

I'd like to find a better solution, but first please
clarify the requirements.

* What to do with an ASCII file?  Previously, find-file for
  such a file resulted in an undecided-xxx
  buffer-file-coding-system.  Now it's utf-8-xxx.

* What to do with an invalid UTF-8 file?  Previously,
  find-file detected a proper coding-system for such a file.
  Now utf-8 is forced and any invalid UTF-8 byte sequences
  are decoded as raw bytes.

* What to do with null byte detection?  Previously, if a
  *.el file contained a null byte and
  inhibit-null-byte-detection was nil (the default), it was
  detected as a binary file.  Now utf-8 is forced regardless
  of inhibit-null-byte-detection.
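
The last two cases can be sketched in Python under loudly stated
assumptions: errors="surrogateescape" stands in for Emacs' raw-byte
decoding (both keep undecodable bytes round-trippable), and the
function names and the inhibit_null_byte_detection parameter are
hypothetical mirrors of the variables named above, not Emacs code.

```python
def detect_old(data: bytes, inhibit_null_byte_detection: bool = False) -> str:
    """Old behaviour as described above: a NUL byte marks the file as binary."""
    if not inhibit_null_byte_detection and b"\x00" in data:
        return "binary"
    return "undecided"  # ordinary coding-system detection would continue here


def decode_forced_utf8(data: bytes) -> str:
    """New behaviour as described above: utf-8 is forced, invalid bytes kept raw."""
    # Assumption: surrogateescape is only an analogue of Emacs' raw bytes.
    return data.decode("utf-8", errors="surrogateescape")


def encode_back(text: str) -> bytes:
    """Write the text back out, restoring any raw bytes unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


latin1_bytes = b"caf\xe9 ok"                # 0xE9 alone is not valid UTF-8
text = decode_forced_utf8(latin1_bytes)     # no error, the byte is carried along
round_tripped = encode_back(text) == latin1_bytes
```

Under forced utf-8 the file still round-trips byte for byte, but the
non-ASCII text is not displayed as the author intended.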

---
Kenichi Handa
handa@gnu.org




* Re: request to revert the chnage of revno 112925
  2013-06-19 15:35   ` Kenichi Handa
@ 2013-06-19 16:11     ` Paul Eggert
  2013-06-19 20:49       ` Stefan Monnier
  2013-06-19 16:54     ` Stefan Monnier
  1 sibling, 1 reply; 15+ messages in thread
From: Paul Eggert @ 2013-06-19 16:11 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Stefan Monnier, emacs-devel

On 06/19/2013 08:35 AM, Kenichi Handa wrote:
> I'd like to find a better solution, but first please
> clarify the requirements.

I assume these requirements should be the same for Elisp
files as for other files that Emacs is asked to read;
if not, could this be clarified?

> * What to do with an ASCII file?  Previously, find-file for
>   such a file resulted in an undecided-xxx
>   buffer-file-coding-system.  Now it's utf-8-xxx.

To help think this through, could you please explain
the practical consequences of this change?
If I edit a file that's undecided-xxx, and insert
a character that can be encoded either as UTF-8 or
as ISO-2022-JP say, the buffer becomes utf-8-xxx,
right?  So in that scenario there is not much practical
difference.  What scenarios entail significant differences
depending on whether the file is undecided-xxx or utf-8-xxx?

> * What to do with an invalid UTF-8 file?  Previously,
>   find-file detected a proper coding-system for such a file.
>   Now utf-8 is forced and any invalid UTF-8 byte sequences
>   are decoded as raw bytes.

Surely this should be fixed: the file should be decoded
properly, as before.

> * What to do with null byte detection?  Previously, if a
>   *.el file contained a null byte and
>   inhibit-null-byte-detection was nil (the default), it was
>   detected as a binary file.  Now utf-8 is forced regardless
>   of inhibit-null-byte-detection.

I suggest going back to the old behavior (that's the normal
behavior for random files that Emacs edits, right?).
Elisp files normally don't contain null bytes; such files are not
considered to be text files in the POSIXish world.




* Re: request to revert the chnage of revno 112925
  2013-06-19 15:35   ` Kenichi Handa
  2013-06-19 16:11     ` Paul Eggert
@ 2013-06-19 16:54     ` Stefan Monnier
  2013-06-22 12:36       ` Kenichi Handa
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2013-06-19 16:54 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

>> But the core of what I want: make it so that utf-8 Elisp files are
>> always recognized correctly, even in the absence of a coding: tag, and
>> regardless of the user's locale.
>> The way I implemented it broke recognition of iso-2022, but if there's
>> some other way that doesn't break it, that's even better.
> I'd like to find a better solution, but first please
> clarify the requirements.
> * What to do with an ASCII file?  Previously, find-file for
>   such a file resulted in an undecided-xxx
>   buffer-file-coding-system.  Now it's utf-8-xxx.

I like the utf-8 better, but either is OK.

> * What to do with an invalid UTF-8 file?  Previously,
>   find-file detected a proper coding-system for such a file.
>   Now utf-8 is forced and any invalid UTF-8 byte sequences
>   are decoded as raw bytes.

Ideally: emit a warning, and then try to find a more appropriate coding
system (e.g. iso-2022).

> * What to do with null byte detection?  Previously, if a
>   *.el file contained a null byte and
>   inhibit-null-byte-detection was nil (the default), it was
>   detected as a binary file.  Now utf-8 is forced regardless
>   of inhibit-null-byte-detection.

I like the utf-8 better, but I don't know of any concrete case where it
makes a significant difference, so either way is OK.


        Stefan




* Re: request to revert the chnage of revno 112925
  2013-06-19 16:11     ` Paul Eggert
@ 2013-06-19 20:49       ` Stefan Monnier
  2013-06-19 21:15         ` Paul Eggert
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2013-06-19 20:49 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Kenichi Handa, emacs-devel

>> I'd like to find a better solution, but first please
>> clarify the requirements.
> I assume these requirements should be the same for Elisp
> files as for other files that Emacs is asked to read;

No.  We're talking here about files which can/will be shared by
thousands of people who have very little in common in terms of the
kind of locale/coding-system they use.  So it's important for the
coding-system to be decided independently from the locale.

This is not specific to Elisp, of course, it's true of most programming
languages, and indeed most of those used to specify that the code had to
be written in "pure ascii" for the code part and "anything compatible
with ascii" for the comments.  But nowadays, most programming languages
are shifting towards allowing non-ascii in the code part and this is
usually done by specifying an encoding such as utf-8.

IOW I consider it a bug to have Elisp files that use a non-utf-8
encoding without an explicit coding: cookie.

> If I edit a file that's undecided-xxx, and insert
> a character that can be encoded either as UTF-8 or
> as ISO-2022-JP say, the buffer becomes utf-8-xxx,
> right?

That depends on the locale.  Which is why I prefer the use of utf-8 for
ascii-only files.

>> * What to do with an invalid UTF-8 file?  Previously,
>> find-file detected a proper coding-system for such a file.
>> Now utf-8 is forced and any invalid UTF-8 byte sequences
>> are decoded as raw bytes.
> Surely this should be fixed: the file should be decoded
> properly, as before.

Yes, tho only as a temporary measure to give people time to fix their files.

> I suggest going back to the old behavior (that's the normal
> behavior for random files that Emacs edits, right?).

These are not random files.

> Elisp files normally don't contain null bytes;

Most don't, indeed.  But there's no reason why they shouldn't contain
a nul byte, e.g. embedded in a string.

> such files are not considered to be text files in the POSIXish world.

The POSIX world doesn't care too much about labeling files as
text-vs-binary except when it's really useful (e.g. to try and avoid
spewing crap in the output of grep).  Disallowing nul bytes in Elisp
files doesn't serve any such purpose, AFAICT, so I think the natural
POSIX behavior here would be to allow nul bytes.


        Stefan




* Re: request to revert the chnage of revno 112925
  2013-06-19 20:49       ` Stefan Monnier
@ 2013-06-19 21:15         ` Paul Eggert
  2013-06-20  2:09           ` Stefan Monnier
  2013-06-21  5:25           ` Achim Gratz
  0 siblings, 2 replies; 15+ messages in thread
From: Paul Eggert @ 2013-06-19 21:15 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Kenichi Handa, emacs-devel

On 06/19/13 13:49, Stefan Monnier wrote:

> This is not specific to Elisp, of course, it's true of most programming
> languages

Yes, that sounds right.  Should we make this change for all
programming-language files then?  .c, .h, Makefile, etc....

> The POSIX world doesn't care too much about labeling files as
> text-vs-binary except when it's really useful (e.g. to try and avoid
> spewing crap in the output of grep).

True, but in practice this means one should avoid putting NUL bytes in
such files.  grep uses a heuristic that if a file contains a NUL byte,
it's considered to be a binary file, and by default grep won't output
the matching lines for that file.  POSIX allows this behavior, and it's
common among many GNU and/or POSIX tools, which means it's typically
not a good idea to put NUL bytes in source files.
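
A rough Python sketch of the heuristic just described, for
illustration only: the function name is hypothetical, and real grep
implementations use more elaborate tests than a plain search for a
NUL byte.

```python
import re


def grep_like(pattern: bytes, data: bytes, name: str = "file") -> list[str]:
    """Return matching lines, or a one-line notice if the data looks binary."""
    matching = [line for line in data.split(b"\n") if re.search(pattern, line)]
    if not matching:
        return []
    # Heuristic from the text above: any NUL byte means "binary", so the
    # matching lines themselves are suppressed.
    if b"\x00" in data:
        return [f"Binary file {name} matches"]
    return [line.decode("utf-8", errors="replace") for line in matching]


text_result = grep_like(rb"defun", b"(defun f ())\n(setq x 1)\n")
binary_result = grep_like(rb"defun", b'(defun f () "\x00")\n', name="f.el")
```

A single NUL embedded in a string is enough to hide every matching
line of that file from the default output.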

Emacs of course can treat a NUL character just like any other
character.  But the issue of UTF-8 versus other encodings is largely
independent of what Emacs does with NUL characters.  It may be better
to leave the treatment of NUL characters alone when making the UTF-8
change, if only to do changes one at a time.

(Can you tell that I use grep a lot?  Sometimes I think it's my
favorite software tool....)





* Re: request to revert the chnage of revno 112925
  2013-06-19 21:15         ` Paul Eggert
@ 2013-06-20  2:09           ` Stefan Monnier
  2013-06-21  5:25           ` Achim Gratz
  1 sibling, 0 replies; 15+ messages in thread
From: Stefan Monnier @ 2013-06-20  2:09 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Kenichi Handa, emacs-devel

>> This is not specific to Elisp, of course, it's true of most programming
>> languages
> Yes, that sounds right.  Should we make this change for all
> programming-language files then?  .c, .h, Makefile, etc....

Only for those languages that say so in the definition/standard/spec.

> True, but in practice this means one should avoid putting NUL bytes in
> such files.  grep uses a heuristic that if a file contains a NUL byte,
> it's considered to be a binary file, and by default grep won't output
> the matching lines for that file.  POSIX allows this behavior, and it's
> common among many GNU and/or POSIX tools, which means it's typically
> not a good idea to put NUL bytes in source files.

Agreed, which is why it's very rare for Elisp files to have NUL bytes.
But that's no reason to treat an Elisp file with a NUL byte as being
encoded in binary instead of utf-8 (which is the question under
discussion here).


        Stefan




* Re: request to revert the chnage of revno 112925
  2013-06-19 21:15         ` Paul Eggert
  2013-06-20  2:09           ` Stefan Monnier
@ 2013-06-21  5:25           ` Achim Gratz
  2013-06-21  6:48             ` Stephen J. Turnbull
  1 sibling, 1 reply; 15+ messages in thread
From: Achim Gratz @ 2013-06-21  5:25 UTC (permalink / raw)
  To: emacs-devel

Paul Eggert writes:
> On 06/19/13 13:49, Stefan Monnier wrote:
>
>> This is not specific to Elisp, of course, it's true of most programming
>> languages
>
> Yes, that sounds right.  Should we make this change for all
> programming-language files then?  .c, .h, Makefile, etc....

Only when you know for certain they will be processed by UTF-8-safe
tools and backwards compatibility won't be an issue.  As an example, for
Perl code this depends on which version of Perl is going to see it.  I've
been having fun lately converting old Perl scripts to do the correct thing
with UTF-8…


Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Wavetables for the Waldorf Blofeld:
http://Synth.Stromeko.net/Downloads.html#BlofeldUserWavetables





* Re: request to revert the chnage of revno 112925
  2013-06-21  5:25           ` Achim Gratz
@ 2013-06-21  6:48             ` Stephen J. Turnbull
  0 siblings, 0 replies; 15+ messages in thread
From: Stephen J. Turnbull @ 2013-06-21  6:48 UTC (permalink / raw)
  To: emacs-devel

Achim Gratz writes:
 > Paul Eggert writes:
 > > On 06/19/13 13:49, Stefan Monnier wrote:
 > >
 > >> This is not specific to Elisp, of course, it's true of most programming
 > >> languages
 > >
 > > Yes, that sounds right.  Should we make this change for all
 > > programming-language files then?  .c, .h, Makefile, etc....
 > 
 > Only when you know for certain they will be processed by UTF-8-safe
 > tools and backwards compatibility won't be an issue.

I would say only if the language definition specifies it.  E.g., Python
specifies something else (UTF-8 if the UTF-8 signature is present else
ASCII; see PEP 263).  And most sites will have tools that assume the
current locale encoding.
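
As an illustration of a language-defined rule, Python 3's standard
library exposes its own source-encoding detection as
tokenize.detect_encoding.  Note that Python 3 defaults to UTF-8 when
no signature or coding cookie is present, which differs from the ASCII
default described above for PEP 263.  The wrapper name below is a
hypothetical convenience.

```python
import io
import tokenize


def source_encoding(source: bytes) -> str:
    """Return the encoding Python 3 would assume for these source bytes."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


no_cookie = source_encoding(b"x = 1\n")
with_signature = source_encoding(b"\xef\xbb\xbfx = 1\n")
with_cookie = source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
```

The decision depends only on the file's own contents (signature or
coding cookie), never on the locale of whoever opens it.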

If Emacs wants a global setting, it should be customizable and have a
special, non-encoding value (eg, 'locale-encoding) that means "use the
current locale's encoding".  In that case, based on experience with
XEmacs and Python, I recommend 'locale-encoding, but Emacs could
specify 'utf-8 if the risk of annoying lots of users is acceptable.
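
A small Python sketch of such a setting, assuming a sentinel value
that is resolved at use time; the names LOCALE_ENCODING and
resolve_default_encoding are hypothetical and do not correspond to any
Emacs or XEmacs API.

```python
import locale

# Hypothetical sentinel: a non-encoding value meaning "ask the locale".
LOCALE_ENCODING = "locale-encoding"


def resolve_default_encoding(setting: str = LOCALE_ENCODING) -> str:
    """Map the customizable setting to a concrete encoding name."""
    if setting == LOCALE_ENCODING:
        # Deferred lookup, so the answer follows the current locale.
        return locale.getpreferredencoding(False)
    return setting


fixed = resolve_default_encoding("utf-8")
from_locale = resolve_default_encoding()
```

Keeping the sentinel distinct from any real encoding name lets the
default be switched to a fixed value such as utf-8 without changing
callers.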





* Re: request to revert the chnage of revno 112925
  2013-06-19 16:54     ` Stefan Monnier
@ 2013-06-22 12:36       ` Kenichi Handa
  2013-06-22 12:46         ` Eli Zaretskii
  2013-06-22 15:26         ` Stefan Monnier
  0 siblings, 2 replies; 15+ messages in thread
From: Kenichi Handa @ 2013-06-22 12:36 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

In article <jwvehbyavvj.fsf-monnier+emacs@gnu.org>, Stefan Monnier
<monnier@iro.umontreal.ca> writes:

> > * What to do with an ASCII file?
> I like the utf-8 better, but either is OK.

> > * What to do with an invalid UTF-8 file?
> Ideally: emit a warning, and then try to find a more appropriate
> coding system (e.g. iso-2022).

> > * What to do with null byte detection?
> I like the utf-8 better, but I don't know of any concrete case where
> it makes a significant difference, so either way is OK.

Then, how about implementing a coding system slightly
different from `undecided'.  The differences are:

(1) On reading, if a file contains 8-bit bytes and they are
    all valid UTF-8 sequences, detect the source as utf-8
    regardless of the current coding priorities.

(2) On writing, if a buffer contains non-ASCII characters,
    encode the buffer contents by utf-8.

How to name it?  Just `undecided-utf-8'?
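
The two rules can be sketched in Python as follows.  This is only an
illustration of the proposal's logic, not the Emacs implementation:
the function names are hypothetical, "undecided" stands for "hand over
to the ordinary detection", and the result for an ASCII-only buffer on
writing is an assumption, since rule (2) only covers non-ASCII
characters.

```python
def detect_on_read(data: bytes) -> str:
    """Rule (1): 8-bit bytes that all form valid UTF-8 mean utf-8."""
    if all(b < 0x80 for b in data):
        # No 8-bit bytes: plain ASCII and 7-bit ISO 2022 files stay with
        # the ordinary detection.
        return "undecided"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "undecided"  # invalid UTF-8: fall back to ordinary detection
    return "utf-8"          # decided without consulting coding priorities


def choose_on_write(text: str) -> str:
    """Rule (2): a buffer with non-ASCII characters is encoded in utf-8."""
    if any(ord(ch) > 0x7F for ch in text):
        return "utf-8"
    return "undecided"      # assumption: ASCII-only buffers are left alone


sample = "日本語"  # arbitrary sample text (assumption)
results = {
    "utf-8": detect_on_read(sample.encode("utf-8")),
    "iso-2022-jp": detect_on_read(sample.encode("iso2022_jp")),
    "euc-jp": detect_on_read(sample.encode("euc_jp")),
}
```

With this rule an ISO-2022-JP file contains no 8-bit bytes, so it is
still handed to the ordinary detection instead of being forced to
utf-8.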

---
Kenichi Handa
handa@gnu.org




* Re: request to revert the chnage of revno 112925
  2013-06-22 12:36       ` Kenichi Handa
@ 2013-06-22 12:46         ` Eli Zaretskii
  2013-06-22 12:54           ` Juanma Barranquero
  2013-06-22 15:26         ` Stefan Monnier
  1 sibling, 1 reply; 15+ messages in thread
From: Eli Zaretskii @ 2013-06-22 12:46 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: monnier, emacs-devel

> From: Kenichi Handa <handa@gnu.org>
> Date: Sat, 22 Jun 2013 08:36:24 -0400
> Cc: emacs-devel@gnu.org
> 
> (1) On reading, if a file contains 8-bit bytes and they are
>     all valid UTF-8 sequences, detect the source as utf-8
>     regardless of the current coding priorities.
> 
> (2) On writing, if a buffer contains non-ASCII characters,
>     encode the buffer contents by utf-8.
> 
> How to name it?  Just `undecided-utf-8'?

'force-utf-8'?




* Re: request to revert the chnage of revno 112925
  2013-06-22 12:46         ` Eli Zaretskii
@ 2013-06-22 12:54           ` Juanma Barranquero
  0 siblings, 0 replies; 15+ messages in thread
From: Juanma Barranquero @ 2013-06-22 12:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Kenichi Handa, Stefan Monnier, Emacs developers

On Sat, Jun 22, 2013 at 2:46 PM, Eli Zaretskii <eliz@gnu.org> wrote:

> 'force-utf-8'?

More like 'prefer-utf-8', as it does not really force it either on
reading or on writing...

  J




* Re: request to revert the chnage of revno 112925
  2013-06-22 12:36       ` Kenichi Handa
  2013-06-22 12:46         ` Eli Zaretskii
@ 2013-06-22 15:26         ` Stefan Monnier
  2013-06-29  3:50           ` request to revert the change " Kenichi Handa
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2013-06-22 15:26 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> Then, how about implementing a coding system slightly
> different from `undecided'.  The differences are:
> (1) On reading, if a file contains 8-bit bytes and they are
>     all valid UTF-8 sequences, detect the source as utf-8
>     regardless of the current coding priorities.
> (2) On writing, if a buffer contains non-ASCII characters,
>     encode the buffer contents by utf-8.
> How to name it?

Sounds good.  I think Juanma has the better name.


        Stefan




* Re: request to revert the change of revno 112925
  2013-06-22 15:26         ` Stefan Monnier
@ 2013-06-29  3:50           ` Kenichi Handa
  0 siblings, 0 replies; 15+ messages in thread
From: Kenichi Handa @ 2013-06-29  3:50 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

In article <jwv1u7u9nhh.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > Then, how about implementing a coding system slightly
> > different from `undecided'.  The differences are:
> > (1) On reading, if a file contains 8-bit bytes and they are
> >     all valid UTF-8 sequences, detect the source as utf-8
> >     regardless of the current coding priorities.
> > (2) On writing, if a buffer contains non-ASCII characters,
> >     encode the buffer contents by utf-8.
> > How to name it?

> Sounds good.  I think Juanma has the better name.

I've just committed changes for `prefer-utf-8' coding
system.

---
Kenichi Handa
handa@gnu.org




Thread overview: 15+ messages
2013-06-19 11:54 request to revert the chnage of revno 112925 Kenichi Handa
2013-06-19 12:59 ` Stefan Monnier
2013-06-19 15:35   ` Kenichi Handa
2013-06-19 16:11     ` Paul Eggert
2013-06-19 20:49       ` Stefan Monnier
2013-06-19 21:15         ` Paul Eggert
2013-06-20  2:09           ` Stefan Monnier
2013-06-21  5:25           ` Achim Gratz
2013-06-21  6:48             ` Stephen J. Turnbull
2013-06-19 16:54     ` Stefan Monnier
2013-06-22 12:36       ` Kenichi Handa
2013-06-22 12:46         ` Eli Zaretskii
2013-06-22 12:54           ` Juanma Barranquero
2013-06-22 15:26         ` Stefan Monnier
2013-06-29  3:50           ` request to revert the change " Kenichi Handa

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
