unofficial mirror of emacs-devel@gnu.org 
* [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
@ 2023-01-29 17:58 Tom Gillespie
  2023-01-29 18:14 ` Andreas Schwab
  2023-01-29 18:29 ` Eli Zaretskii
  0 siblings, 2 replies; 11+ messages in thread
From: Tom Gillespie @ 2023-01-29 17:58 UTC (permalink / raw)
  To: Emacs developers

[-- Attachment #1: Type: text/plain, Size: 580 bytes --]

Following a wild adventure yesterday tracking down
an insanity-inducing bug in emacs-zmq which wound
up being related to the change to add a byte order
mark, here is a patch to add a NEWS entry so that
others will be warned that the change has happened.
Tom

PS is it a bug that "\uFEFF" reads as a symbol instead
of whitespace in utf-8-unix encoding?

Relevant links.
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60750
https://github.com/nnicandro/emacs-zmq/pull/43
https://github.com/search?p=3&q=utf-8-auto+-async.el&type=Code
https://github.com/search?q=utf-8-auto&type=Code

[-- Attachment #2: 0001-etc-NEWS-Announce-addition-of-BOM-to-utf-8-auto.patch --]
[-- Type: text/x-patch, Size: 712 bytes --]

From d62a7b12d5247c78c8c980d94012ba2f8db45221 Mon Sep 17 00:00:00 2001
From: Tom Gillespie <tgbugs@gmail.com>
Date: Sun, 29 Jan 2023 12:32:27 -0500
Subject: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto

---
 etc/NEWS | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index fb211f9b7d0..828bfb795fc 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -563,6 +563,9 @@ The variable 'font-lock-support-mode' is occasionally useful for
 debugging purposes.  It is now a regular variable (instead of a user
 option) and can be set to nil to disable Just-in-time Lock mode.
 
+** The 'utf-8-auto' coding-system now includes a byte order mark
+
++++
 \f
 * Changes in Emacs 29.1
 
-- 
2.39.1



* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-29 17:58 [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto Tom Gillespie
@ 2023-01-29 18:14 ` Andreas Schwab
  2023-01-29 18:29 ` Eli Zaretskii
  1 sibling, 0 replies; 11+ messages in thread
From: Andreas Schwab @ 2023-01-29 18:14 UTC (permalink / raw)
  To: Tom Gillespie; +Cc: Emacs developers

On Jan 29 2023, Tom Gillespie wrote:

> PS is it a bug that "\uFEFF" reads as a symbol instead
> of whitespace in utf-8-unix encoding?

\uFEFF is the same as uFEFF, just with a redundant quote.  \u has no
special meaning in the read syntax of a symbol.
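
To illustrate the point, here is a minimal sketch that can be evaluated
in a scratch buffer.  The inputs are illustrative strings, not code
taken from emacs-zmq.  It shows that the reader treats a backslash
before "u" in a symbol as a plain quote, while the same escape inside
a string literal denotes the single character U+FEFF.

```elisp
;; The text \uFEFF read as a symbol: the backslash only quotes the "u".
(symbol-name (read "\\uFEFF"))          ; => "uFEFF"
(eq (read "\\uFEFF") (read "uFEFF"))    ; => t

;; Inside a string literal, \uFEFF is the single character U+FEFF.
(length "\uFEFF")                       ; => 1
(= (aref "\uFEFF" 0) #xFEFF)            ; => t
```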

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-29 17:58 [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto Tom Gillespie
  2023-01-29 18:14 ` Andreas Schwab
@ 2023-01-29 18:29 ` Eli Zaretskii
  2023-01-29 19:11   ` Tom Gillespie
  1 sibling, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2023-01-29 18:29 UTC (permalink / raw)
  To: Tom Gillespie; +Cc: emacs-devel

> From: Tom Gillespie <tgbugs@gmail.com>
> Date: Sun, 29 Jan 2023 12:58:38 -0500
> 
> --- a/etc/NEWS
> +++ b/etc/NEWS
> @@ -563,6 +563,9 @@ The variable 'font-lock-support-mode' is occasionally useful for
>  debugging purposes.  It is now a regular variable (instead of a user
>  option) and can be set to nil to disable Just-in-time Lock mode.
>  
> +** The 'utf-8-auto' coding-system now includes a byte order mark

This is inaccurate: the change is only on encoding, and saying that a
coding-system "includes" a BOM is confusing English, IMO.

More importantly, it was a bugfix.  utf-8-auto was previously behaving
contrary to the documentation:

  ‘:bom’

  This attribute specifies whether the coding system uses a "byte order
  mark".  VALUE must be nil, t, or a cons cell of coding systems whose
  ‘:coding-type’ is ‘utf-16’ or ‘utf-8’.
  [...]
  If the value is a cons cell, on decoding, check the first two bytes.
  If they are 0xFE 0xFF, use the car part coding system of the value.
  If they are 0xFF 0xFE, use the cdr part coding system of the value.
  Otherwise, treat them as bytes for a normal character.  On encoding,
  produce BOM bytes according to the value of ‘:endian’.

Note the last sentence.

We don't announce bugfixes in NEWS, mainly because doing so would make
an already large file many times larger.
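
One way to see what a coding system emits on encoding is to inspect
the bytes returned by 'encode-coding-string'.  The sketch below assumes
the standard 'utf-8-unix' and 'utf-8-with-signature' coding systems;
the helper name 'my/utf8-bom-p' is hypothetical.  The last form is
left without an expected value because, per this thread, the result
for 'utf-8-auto' depends on whether the running Emacs includes the
fix for bug#60750.

```elisp
(defun my/utf8-bom-p (bytes)
  "Return non-nil if the unibyte string BYTES starts with the UTF-8 BOM."
  (string-prefix-p "\357\273\277" bytes))   ; the bytes EF BB BF

(my/utf8-bom-p (encode-coding-string "hi" 'utf-8-unix))            ; => nil
(my/utf8-bom-p (encode-coding-string "hi" 'utf-8-with-signature))  ; => t

;; Version-dependent (see bug#60750): non-nil once the fix is present.
(my/utf8-bom-p (encode-coding-string "hi" 'utf-8-auto))
```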




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-29 18:29 ` Eli Zaretskii
@ 2023-01-29 19:11   ` Tom Gillespie
  2023-01-29 19:38     ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Tom Gillespie @ 2023-01-29 19:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> > +** The 'utf-8-auto' coding-system now includes a byte order mark
>
> This is inaccurate: the change is only on encoding, and saying that a
> coding-system "includes" a BOM is confusing English, IMO.

Hrm. I agree. Would it be better to say something like the following?

"Encoding 'utf-8-auto' now correctly produces a byte order mark"

> More importantly, it was a bugfix.  utf-8-auto was previously behaving
> contrary to the documentation:
> We don't announce bugfixes in NEWS, mainly because doing so would make
> an already large file many times larger.

I understand that this is technically a bugfix, but it is also a major
change in the actual behavior that could catch users by surprise
and that is very difficult to detect and debug. Is it reasonable to
use NEWS to try to mitigate the potential blast radius in such cases?




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-29 19:11   ` Tom Gillespie
@ 2023-01-29 19:38     ` Eli Zaretskii
  2023-01-29 19:56       ` Tom Gillespie
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2023-01-29 19:38 UTC (permalink / raw)
  To: Tom Gillespie; +Cc: emacs-devel

> From: Tom Gillespie <tgbugs@gmail.com>
> Date: Sun, 29 Jan 2023 14:11:13 -0500
> Cc: emacs-devel@gnu.org
> 
> > > +** The 'utf-8-auto' coding-system now includes a byte order mark
> >
> > This is inaccurate: the change is only on encoding, and saying that a
> > coding-system "includes" a BOM is confusing English, IMO.
> 
> Hrm. I agree. Would it be better to say something like the following?
> 
> "Encoding 'utf-8-auto' now correctly produces a byte order mark"

 Encoding with 'utf-8-auto' now correctly produces a byte order mark.

> > More importantly, it was a bugfix.  utf-8-auto was previously behaving
> > contrary to the documentation:
> > We don't announce bugfixes in NEWS, mainly because doing so would make
> > an already large file many times larger.
> 
> I understand that this is technically a bugfix, but it is also a major
> change in the actual behavior that could catch users by surprise
> and that is very difficult to detect and debug. Is it reasonable to
> use NEWS to try to mitigate the potential blast radius in such cases?

Maybe (you assume that people really read all the small print in
NEWS?).  But first, could you explain why on earth are you using
utf-8-auto _on_encoding_?  It basically makes no sense at all.

All the people who did that with whom I talked until now did it
because they thought the "auto" part was about the EOL format (CR-LF
vs Newline).  Is that so in your case as well?
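
For reference, the EOL format is selected by the '-unix', '-dos', or
'-mac' suffix of the coding system name, not by the word "auto".  A
minimal sketch, using an illustrative two-line string:

```elisp
;; The -unix variant leaves newlines alone on encoding.
(encode-coding-string "a\nb" 'utf-8-unix)   ; => "a\nb"
;; The -dos variant writes each newline as CR-LF.
(encode-coding-string "a\nb" 'utf-8-dos)    ; => "a\r\nb"
;; coding-system-eol-type reports the convention: 0 is LF, 1 is CR-LF.
(coding-system-eol-type 'utf-8-unix)        ; => 0
(coding-system-eol-type 'utf-8-dos)         ; => 1
```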




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-29 19:38     ` Eli Zaretskii
@ 2023-01-29 19:56       ` Tom Gillespie
  2023-01-30 14:16         ` Eli Zaretskii
  2023-02-02 10:36         ` Eli Zaretskii
  0 siblings, 2 replies; 11+ messages in thread
From: Tom Gillespie @ 2023-01-29 19:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>  Encoding with 'utf-8-auto' now correctly produces a byte order mark.

Much better.

> Maybe (you assume that people really read all the small print in
> NEWS?).  But first, could you explain why on earth are you using
> utf-8-auto _on_encoding_?  It basically makes no sense at all.

Hah, no, I don't think many people do, but maybe the maintainers
of some of the more widely used packages might?

I have no idea why they are using it on encoding. Having played
with it, it produces absolutely insane results like multiple calls
prepending multiple BOMs when the default coding system is
not itself set to utf-8-auto (or something like that).

Maybe an opportunity to add a line to the message that says
"As a reminder, there are next to no cases where utf-8-auto
should be used with 'encode-coding-' functions." or similar?

> All the people who did that with whom I talked until now did it
> because they thought the "auto" part was about the EOL format (CR-LF
> vs Newline).  Is that so in your case as well?

I personally have never touched utf-8-auto, but I'm cleaning
up existing bugs that have impacted me.

If I had to guess, this issue is probably the result of people
copying what is done in async.el where there is a comment
that reads:

  ;; FIXME: Why use `utf-8-auto' instead of `utf-8-unix'?  This is
  ;; a communication channel over which we have complete control,
  ;; so we get to choose exactly which encoding and EOL we use, isn't it?

https://github.com/jwiegley/emacs-async/blob/270c3d0bd99386dd9a8538990401993a6a3cb1bc/async.el#L201-L203

Which suggests that your account of the confusion is exactly the issue.

However, there is also a comment about it somehow mitigating issues
with strings that have EOFs in them?? Is this even true?

  ;; Just in case the string we're sending might contain EOF
  (encode-coding-region (point-min) (point-max) 'utf-8-auto)
https://github.com/jwiegley/emacs-async/blob/270c3d0bd99386dd9a8538990401993a6a3cb1bc/async.el#L222-L223




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-29 19:56       ` Tom Gillespie
@ 2023-01-30 14:16         ` Eli Zaretskii
  2023-01-30 15:06           ` Stefan Monnier
  2023-02-02 10:36         ` Eli Zaretskii
  1 sibling, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2023-01-30 14:16 UTC (permalink / raw)
  To: Tom Gillespie, Stefan Monnier; +Cc: emacs-devel

> From: Tom Gillespie <tgbugs@gmail.com>
> Date: Sun, 29 Jan 2023 14:56:11 -0500
> Cc: emacs-devel@gnu.org
> 
> >  Encoding with 'utf-8-auto' now correctly produces a byte order mark.
> 
> Much better.
> 
> > Maybe (you assume that people really read all the small print in
> > NEWS?).  But first, could you explain why on earth are you using
> > utf-8-auto _on_encoding_?  It basically makes no sense at all.
> 
> Hah, no, I don't think many people do, but maybe the maintainers
> of some of the more widely used packages might?

I'll dwell on this.

> If I had to guess, this issue is probably the result of people
> copying what is done in async.el where there is a comment
> that reads:
> 
>   ;; FIXME: Why use `utf-8-auto' instead of `utf-8-unix'?  This is
>   ;; a communication channel over which we have complete control,
>   ;; so we get to choose exactly which encoding and EOL we use, isn't it?
> 
> https://github.com/jwiegley/emacs-async/blob/270c3d0bd99386dd9a8538990401993a6a3cb1bc/async.el#L201-L203
> 
> Which suggests that your account of the confusion is exactly the issue.
> 
> However, there is also a comment about it somehow mitigating issues
> with strings that have EOFs in them?? Is this even true?
> 
>   ;; Just in case the string we're sending might contain EOF
>   (encode-coding-region (point-min) (point-max) 'utf-8-auto)
> https://github.com/jwiegley/emacs-async/blob/270c3d0bd99386dd9a8538990401993a6a3cb1bc/async.el#L222-L223

I think both of these are mistakes of the kind I described.  I filed
an issue with emacs-async, but someone should probably fix the one we
have in ELPA.  Stefan?




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-30 14:16         ` Eli Zaretskii
@ 2023-01-30 15:06           ` Stefan Monnier
  2023-01-30 17:12             ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Monnier @ 2023-01-30 15:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Tom Gillespie, emacs-devel

> I think both of these are mistakes of the kind I described.  I filed
> an issue with emacs-async, but someone should probably fix the one we
> have in ELPA.  Stefan?

It should use `emacs-internal` or `utf-8-emacs-unix` or some such very
precisely defined format since both ends are under its control, indeed.
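
A minimal sketch of what that could look like, assuming both ends
agree on 'utf-8-emacs-unix'.  The sample text is illustrative and this
is not the actual async.el code.  It shows that the encoded payload is
a unibyte string with no BOM and that decoding it restores the
original text.

```elisp
(let* ((text  "caf\u00e9\n(message \"ok\")")
       (bytes (encode-coding-string text 'utf-8-emacs-unix))
       (back  (decode-coding-string bytes 'utf-8-emacs-unix)))
  (list (multibyte-string-p bytes)                ; nil: raw bytes to send
        (string-prefix-p "\357\273\277" bytes)    ; nil: no BOM prepended
        (string= text back)))                     ; t: lossless round trip
;; => (nil nil t)
```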

As for "the one we have in ELPA" it's auto-synchronized with the one
upstream, so we'll get the fix when it gets installed there (and
installing it locally would break the auto-sync'ing, causing a lot of pain).


        Stefan





* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-30 15:06           ` Stefan Monnier
@ 2023-01-30 17:12             ` Eli Zaretskii
  0 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2023-01-30 17:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: tgbugs, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Tom Gillespie <tgbugs@gmail.com>,  emacs-devel@gnu.org
> Date: Mon, 30 Jan 2023 10:06:21 -0500
> 
> > I think both of these are mistakes of the kind I described.  I filed
> > an issue with emacs-async, but someone should probably fix the one we
> > have in ELPA.  Stefan?
> 
> It should use `emacs-internal` or `utf-8-emacs-unix` or some such very
> precisely defined format since both ends are under its control, indeed.

Probably.

> As for "the one we have in ELPA" it's auto-synchronized with the one
> upstream, so we'll get the fix when it gets installed there (and
> installing it locally would break the auto-sync'ing, causing a lot of pain).

Assuming it is maintained actively enough, yes.




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-01-29 19:56       ` Tom Gillespie
  2023-01-30 14:16         ` Eli Zaretskii
@ 2023-02-02 10:36         ` Eli Zaretskii
  2023-02-02 17:56           ` Tom Gillespie
  1 sibling, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2023-02-02 10:36 UTC (permalink / raw)
  To: Tom Gillespie; +Cc: emacs-devel

> From: Tom Gillespie <tgbugs@gmail.com>
> Date: Sun, 29 Jan 2023 14:56:11 -0500
> Cc: emacs-devel@gnu.org
> 
> >  Encoding with 'utf-8-auto' now correctly produces a byte order mark.
> 
> Much better.
> 
> > Maybe (you assume that people really read all the small print in
> > NEWS?).  But first, could you explain why on earth are you using
> > utf-8-auto _on_encoding_?  It basically makes no sense at all.
> 
> Hah, no, I don't think many people do, but maybe the maintainers
> of some of the more widely used packages might?

I added something to NEWS about this.




* Re: [PATCH] * etc/NEWS: Announce addition of BOM to utf-8-auto
  2023-02-02 10:36         ` Eli Zaretskii
@ 2023-02-02 17:56           ` Tom Gillespie
  0 siblings, 0 replies; 11+ messages in thread
From: Tom Gillespie @ 2023-02-02 17:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> I added something to NEWS about this.

Perfect. Just what is needed to go around to various projects
and make it simple to explain why the changes need to be made,
or hopefully for them to notice on their own.
Thanks!
Tom


