* bug#60750: 29.0.60; encode-coding-char fails for utf-8-auto coding system
From: Robert Pluim @ 2023-01-12  9:08 UTC
  To: 60750


src/emacs -Q
M-x toggle-debug-on-error
M-: (setq buffer-file-coding-system 'utf-8-auto)
C-b
C-u C-x =

=>
Debugger entered--Lisp error: (args-out-of-range "))" 3 1)
  encode-coding-char(41 utf-8-auto ascii)
  describe-char(189)
  what-cursor-position((4))

This is because utf-8-auto has a non-nil :bom property:

(define-coding-system 'utf-8-auto
  "UTF-8 (auto-detect signature (BOM))"
  :coding-type 'utf-8
  :mnemonic ?U
  :charset-list '(unicode)
  :bom '(utf-8-with-signature . utf-8))

and `encode-coding-char' does this:

        ;; We also need to exclude the leading 2 or 3 bytes if they
        ;; come from a BOM.
        (setq i0
              (if bom-p
                  (cond
                   ((eq (coding-system-type coding-system) 'utf-8)
                    3)
                   ((eq (coding-system-type coding-system) 'utf-16)
                    2)
                   (t 0))
                0))
        (substring enc2 i0 i2)))))
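
The index arithmetic in that excerpt can be reproduced on plain
strings.  The following is a minimal sketch, not the actual Emacs
code: `my/trim-encoded-char' is a hypothetical helper, and ENC1 and
ENC2 stand for the encodings of the one-character and two-character
strings that `encode-coding-char' computes before reaching the excerpt.
The second call assumes the encoder emitted no BOM while 3 leading
bytes are skipped anyway, which reproduces the error in the backtrace.

```elisp
;; Hypothetical helper modelling the tail/BOM trimming done by
;; `encode-coding-char'; it is not part of Emacs.
(defun my/trim-encoded-char (enc1 enc2 i0)
  "Return the bytes of one character, given ENC1, ENC2 and BOM length I0.
ENC1 encodes a one-character string, ENC2 the same character twice."
  (let ((i1 (length enc1))
        (i2 (length enc2)))
    ;; Drop the trailing bytes that both encodings have in common.
    (while (and (> i1 0)
                (= (aref enc1 (1- i1)) (aref enc2 (1- i2))))
      (setq i1 (1- i1)
            i2 (1- i2)))
    ;; Skip I0 leading bytes, which are assumed to be a BOM.
    (substring enc2 i0 i2)))

;; A BOM (EF BB BF) is really present, so skipping 3 bytes works:
(my/trim-encoded-char "\357\273\277)" "\357\273\277))" 3)
;; => ")"

;; No BOM in the encoded bytes, but 3 bytes are skipped anyway:
(condition-case err
    (my/trim-encoded-char ")" "))" 3)
  (args-out-of-range err))
;; => (args-out-of-range "))" 3 1)
```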

Iʼm not sure if this needs fixing, but it was surprising, and the
docstring of `define-coding-system' didnʼt make it clear to me whether
a BOM should have been produced here or not. (Iʼm willing to be told
that buffer-file-coding-system shouldnʼt be 'utf-8-auto, but I never
set that explicitly as far as I know 😀)

Thanks

Robert

In GNU Emacs 29.0.60 (build 14, x86_64-pc-linux-gnu, GTK+ Version
 3.24.24, cairo version 1.16.0) of 2023-01-12 built on rltb
Repository revision: f4f30ff4c44dcfdf780f1981aa541af713f2805f
Repository branch: emacs-29
System Description: Debian GNU/Linux 11 (bullseye)

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG
JSON LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD LIBXML2 M17N_FLT MODULES NOTIFY
INOTIFY PDUMPER PNG RSVG SECCOMP SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS WEBP X11 XDBE XIM XINPUT2 XPM GTK3 ZLIB





From: Eli Zaretskii @ 2023-01-12 12:32 UTC
  To: Robert Pluim; +Cc: 60750

> From: Robert Pluim <rpluim@gmail.com>
> Date: Thu, 12 Jan 2023 10:08:31 +0100
> 
> 
> src/emacs -Q
> M-x toggle-debug-on-error
> M-: (setq buffer-file-coding-system 'utf-8-auto)
> C-b
> C-u C-x =
> 
> =>
> Debugger entered--Lisp error: (args-out-of-range "))" 3 1)
>   encode-coding-char(41 utf-8-auto ascii)
>   describe-char(189)
>   what-cursor-position((4))
> 
> This is because utf-8-auto has a non-nil :bom property:
> 
> (define-coding-system 'utf-8-auto
>   "UTF-8 (auto-detect signature (BOM))"
>   :coding-type 'utf-8
>   :mnemonic ?U
>   :charset-list '(unicode)
>   :bom '(utf-8-with-signature . utf-8))

Right.  This is a very old bug in encoding with coding systems of the
utf-8 family whose :bom property is a cons cell.  The fix is simple,
but I wonder what it will break out there.  So:

> Iʼm not sure if this needs fixing, but it was surprising, and the
> docstring of `define-coding-system' didnʼt make it clear to me whether
> a BOM should have been produced here or not.

Actually, the doc string is clear:

  If the value is a cons cell, on decoding, check the first two bytes.
  If they are 0xFE 0xFF, use the car part coding system of the value.
  If they are 0xFF 0xFE, use the cdr part coding system of the value.
  Otherwise, treat them as bytes for a normal character.  On encoding,
  produce BOM bytes according to the value of ‘:endian’.

Note the last sentence: it should unconditionally produce the BOM on
encoding.  Which is what we do in your scenario.
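
One way to see what a coding system actually emits is to encode a
short string and inspect the leading bytes.  This is a minimal sketch
that assumes only the standard functions `coding-system-get' and
`encode-coding-string'; `my/encodes-with-utf-8-bom-p' is a hypothetical
helper name.  Its result for utf-8-auto depends on whether the Emacs in
use includes the fix discussed in this thread, so no result is shown
for that call.

```elisp
;; Hypothetical helper; it is not part of Emacs.
(defun my/encodes-with-utf-8-bom-p (coding-system)
  "Return non-nil if CODING-SYSTEM emits the UTF-8 BOM when encoding."
  (let ((bytes (append (encode-coding-string ")" coding-system) nil)))
    ;; The UTF-8 signature is the byte sequence EF BB BF.
    (and (>= (length bytes) 3)
         (equal (list (nth 0 bytes) (nth 1 bytes) (nth 2 bytes))
                '(#xEF #xBB #xBF)))))

(coding-system-get 'utf-8-auto :bom)
;; => (utf-8-with-signature . utf-8), per the definition quoted above

(my/encodes-with-utf-8-bom-p 'utf-8-with-signature)  ; => t
(my/encodes-with-utf-8-bom-p 'utf-8)                 ; => nil
(my/encodes-with-utf-8-bom-p 'utf-8-auto)            ; depends on the Emacs version
```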

> (Iʼm willing to be told that buffer-file-coding-system shouldnʼt be
> 'utf-8-auto, but I never set that explicitly as far as I know 😀)

Who sets utf-8-auto?  Where did you originally bump into this?
This is an obscure coding-system, and the fix to make it work as
documented will produce an incompatible change in behavior.  So before
I decide whether to make the change and on what branch, I'd like to
know how in the world did you encounter this.

Thanks.





From: Robert Pluim @ 2023-01-12 13:44 UTC
  To: Eli Zaretskii; +Cc: 60750

>>>>> On Thu, 12 Jan 2023 14:32:52 +0200, Eli Zaretskii <eliz@gnu.org> said:

    Eli> Actually, the doc string is clear:

    Eli>   If the value is a cons cell, on decoding, check the first two bytes.
    Eli>   If they are 0xFE 0xFF, use the car part coding system of the value.
    Eli>   If they are 0xFF 0xFE, use the cdr part coding system of the value.
    Eli>   Otherwise, treat them as bytes for a normal character.  On encoding,
    Eli>   produce BOM bytes according to the value of ‘:endian’.

    Eli> Note the last sentence: it should unconditionally produce the BOM on
    Eli> encoding.  Which is what we do in your scenario.

Ah, I misread that as "depending on the value of ':endian'"

One minor nit, the description for ':endian' says:

    `:endian'

    VALUE must be `big' or `little' specifying big-endian and
    little-endian respectively.  The default value is `big'.

    This attribute is meaningful only when `:coding-type' is `utf-16'.

That last sentence seems untrue, as ':endian' is meaningful for
'utf-8-auto'

    >> (Iʼm willing to be told that buffer-file-coding-system shouldnʼt be
    >> 'utf-8-auto, but I never set that explicitly as far as I know 😀)

    Eli> Who sets utf-8-auto?  Where did you originally bump into this?
    Eli> This is an obscure coding-system, and the fix to make it work as
    Eli> documented will produce an incompatible change in behavior.  So before
    Eli> I decide whether to make the change and on what branch, I'd like to
    Eli> know how in the world did you encounter this.

Itʼs entirely my own fault:

The file where I noticed this is shared between a GNU/Linux and a
macOS machine, which means I foolishly added the following a year ago,
even though itʼs unnecessary (perhaps I was thinking I was going to be
sharing it with a Windows machine?):

    ;; -*- lexical-binding: t; coding: utf-8-auto; -*-

I think that means we can leave the code as it is.

Robert

From: Eli Zaretskii @ 2023-01-12 14:04 UTC
  To: Robert Pluim; +Cc: 60750

> From: Robert Pluim <rpluim@gmail.com>
> Cc: 60750@debbugs.gnu.org
> Date: Thu, 12 Jan 2023 14:44:29 +0100
> 
> One minor nit, the description for ':endian' says:
> 
>     `:endian'
> 
>     VALUE must be `big' or `little' specifying big-endian and
>     little-endian respectively.  The default value is `big'.
> 
>     This attribute is meaningful only when `:coding-type' is `utf-16'.
> 
> That last sentence seems untrue, as ':endian' is meaningful for
> 'utf-8-auto'

That depends on what you mean by "meaningful".  What it wants to say
is that it's meaningless to change the value of this property for any
coding-system other than UTF-16.

>     Eli> Who sets utf-8-auto?  Where did you originally bump into this?
>     Eli> This is an obscure coding-system, and the fix to make it work as
>     Eli> documented will produce an incompatible change in behavior.  So before
>     Eli> I decide whether to make the change and on what branch, I'd like to
>     Eli> know how in the world did you encounter this.
> 
> Itʼs entirely my own fault:
> 
> The file where I noticed this is shared between a GNU/Linux and a
> macOS machine, which means I foolishly added the following a year ago,
> even though itʼs unnecessary (perhaps I was thinking I was going to be
> sharing it with a Windows machine?):
> 
>     ;; -*- lexical-binding: t; coding: utf-8-auto; -*-

So you thought the "-auto" part was about the EOL format?

> I think that means we can leave the code as it is.

??? "As it is" means this coding-system behaves contrary to
documentation: it should produce BOM on encoding.  Leaving it as is
doesn't sound TRT, so I'd like to have this fixed.  From your
description, it sounds like you bumped into this by mistake, and I see
only one other use of it -- in the test suite.  So I'm inclined to
install the fix on the emacs-29 release branch.





From: Robert Pluim @ 2023-01-12 14:28 UTC
  To: Eli Zaretskii; +Cc: 60750

>>>>> On Thu, 12 Jan 2023 16:04:07 +0200, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Cc: 60750@debbugs.gnu.org
    >> Date: Thu, 12 Jan 2023 14:44:29 +0100
    >> 
    >> One minor nit, the description for ':endian' says:
    >> 
    >> `:endian'
    >> 
    >> VALUE must be `big' or `little' specifying big-endian and
    >> little-endian respectively.  The default value is `big'.
    >> 
    >> This attribute is meaningful only when `:coding-type' is `utf-16'.
    >> 
    >> That last sentence seems untrue, as ':endian' is meaningful for
    >> 'utf-8-auto'

    Eli> That depends on what you mean by "meaningful".  What it wants to say
    Eli> is that it's meaningless to change the value of this property for any
    Eli> coding-system other than UTF-16.

OK

    Eli> Who sets utf-8-auto?  Where did you originally bump into this?
    Eli> This is an obscure coding-system, and the fix to make it work as
    Eli> documented will produce an incompatible change in behavior.  So before
    Eli> I decide whether to make the change and on what branch, I'd like to
    Eli> know how in the world did you encounter this.
    >> 
    >> Itʼs entirely my own fault:
    >> 
    >> The file where I noticed this is shared between a GNU/Linux and a
    >> macOS machine, which means I foolishly added the following a year ago,
    >> even though itʼs unnecessary (perhaps I was thinking I was going to be
    >> sharing it with a Windows machine?):
    >> 
    >> ;; -*- lexical-binding: t; coding: utf-8-auto; -*-

    Eli> So you thought the "-auto" part was about the EOL format?

yes. Iʼm having a reading incomprehension day, obviously (just like a
year ago when I made the change originally).
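
For the record, line-ending conversion is selected by the -unix, -dos
and -mac variants of a coding system, not by the "-auto" suffix.  A
minimal sketch using only the standard functions
`coding-system-eol-type' and `encode-coding-string':

```elisp
;; 0 means LF, 1 means CRLF, 2 means CR.  A vector means the EOL
;; format is not fixed and is detected when decoding.
(coding-system-eol-type 'utf-8-unix)            ; => 0
(coding-system-eol-type 'utf-8-dos)             ; => 1
(coding-system-eol-type 'utf-8)                 ; => [utf-8-unix utf-8-dos utf-8-mac]
(vectorp (coding-system-eol-type 'utf-8-auto))  ; => t

;; The -dos variant writes CRLF line endings:
(encode-coding-string "a\nb" 'utf-8-dos)        ; => "a\r\nb"
```

So a file shared with a Windows machine would call for a cookie such
as "coding: utf-8-dos" rather than "coding: utf-8-auto".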

    >> I think that means we can leave the code as it is.

    Eli> ??? "As it is" means this coding-system behaves contrary to
    Eli> documentation: it should produce BOM on encoding.  Leaving it as is
    Eli> doesn't sound TRT, so I'd like to have this fixed.  From your
    Eli> description, it sounds like you bumped into this by mistake, and I see
    Eli> only one other use of it -- in the test suite.  So I'm inclined to
    Eli> install the fix on the emacs-29 release branch.

Oh, I thought you were proposing *not* to fix it at all, since itʼs
such an obscure coding system. I have no opinion on where a fix should
go: Iʼm not going to be using that coding system again.

Robert

From: Eli Zaretskii @ 2023-01-12 14:39 UTC
  To: Robert Pluim; +Cc: 60750-done

> From: Robert Pluim <rpluim@gmail.com>
> Cc: 60750@debbugs.gnu.org
> Date: Thu, 12 Jan 2023 15:28:49 +0100
> 
>     >> I think that means we can leave the code as it is.
> 
>     Eli> ??? "As it is" means this coding-system behaves contrary to
>     Eli> documentation: it should produce BOM on encoding.  Leaving it as is
>     Eli> doesn't sound TRT, so I'd like to have this fixed.  From your
>     Eli> description, it sounds like you bumped into this by mistake, and I see
>     Eli> only one other use of it -- in the test suite.  So I'm inclined to
>     Eli> install the fix on the emacs-29 release branch.
> 
> Oh, I thought you were proposing *not* to fix it at all, since itʼs
> such an obscure coding system. I have no opinion on where a fix should
> go: Iʼm not going to be using that coding system again.

OK.  So I've installed the fix on the emacs-29 branch, and I'm boldly
closing this bug.




