unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#69718: 29.2; grep japanese-iso-8bit-unix string fails with "grep: illegal byte sequence"
@ 2024-03-10 23:45 Akira Shirai
  2024-03-11 12:50 ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Akira Shirai @ 2024-03-10 23:45 UTC (permalink / raw)
  To: 69718; +Cc: 白井彰

1. emacs -Q on macOS
2. evaluate (set-language-environment "Japanese")
3. visit the directory where SKK-JISYO.L exists (e.g. ~/emacs-29.2/leim/SKK-DIC/)
4. type C-x RET c to run the universal-coding-system-argument command, and specify japanese-iso-8bit-unix as the coding system
5. type M-x grep to run grep, and specify "grep --color=auto -nH --null -e この辞書は SKK-JISYO.L" as the command-args
  => grep fails with "grep: illegal byte sequence"

On Emacs 29.1, grep runs successfully with the same procedure.
If the command-args is "LANG=C grep --color=auto -nH --null -e この辞書は SKK-JISYO.L", grep runs successfully.



In GNU Emacs 29.2 (build 1, x86_64-apple-darwin23.3.0, NS appkit-2487.40
 Version 14.3 (Build 23D56)) of 2024-01-25 built on F9A6231BCF26.local
Windowing system distributor 'Apple', version 10.3.2487
System Description:  macOS 14.3.1

Configured using:
 'configure --with-ns --without-x --without-compress-install
 --with-gnutls=no'

Configured features:
ACL LIBXML2 MODULES NOTIFY KQUEUE NS PDUMPER SQLITE3 THREADS
TOOLKIT_SCROLL_BARS ZLIB

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Fundamental

Minor modes in effect:
  shell-dirtrack-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  buffer-read-only: t
  line-number-mode: t
  indent-tabs-mode: t
  transient-mark-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message mailcap yank-media puny rfc822
mml mml-sec password-cache epa derived epg rfc6068 epg-config gnus-util
mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils
mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr
mail-utils face-remap help-mode shell pcomplete thingatpt files-x grep
compile text-property-search comint ansi-osc ansi-color ring dired-aux
dired dired-loaddefs japan-util time-date subr-x cl-loaddefs cl-lib rmc
iso-transl tooltip cconv eldoc paren electric uniquify ediff-hook
vc-hooks lisp-float-type elisp-mode mwheel term/ns-win ns-win
ucs-normalize mule-util term/common-win tool-bar dnd fontset image
regexp-opt fringe tabulated-list replace newcomment text-mode lisp-mode
prog-mode register page tab-bar menu-bar rfn-eshadow isearch easymenu
timer select scroll-bar mouse jit-lock font-lock syntax font-core
term/tty-colors frame minibuffer nadvice seq simple cl-generic
indonesian philippine cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek
romanian slovak czech european ethiopic indian cyrillic chinese
composite emoji-zwj charscript charprop case-table epa-hook
jka-cmpr-hook help abbrev obarray oclosure cl-preloaded button loaddefs
theme-loaddefs faces cus-face macroexp files window text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget keymap
hashtable-print-readable backquote threads kqueue cocoa ns multi-tty
make-network-process emacs)

Memory information:
((conses 16 58569 6787)
 (symbols 48 6316 0)
 (strings 32 18079 1823)
 (string-bytes 1 543082)
 (vectors 16 14244)
 (vector-slots 8 280500 7109)
 (floats 8 26 33)
 (intervals 56 785 0)
 (buffers 976 16))







* bug#69718: 29.2; grep japanese-iso-8bit-unix string fails with "grep: illegal byte sequence"
  2024-03-10 23:45 bug#69718: 29.2; grep japanese-iso-8bit-unix string fails with "grep: illegal byte sequence" Akira Shirai
@ 2024-03-11 12:50 ` Eli Zaretskii
  2024-03-11 13:15   ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2024-03-11 12:50 UTC (permalink / raw)
  To: Akira Shirai; +Cc: 69718

> Cc: 白井彰 <okshirai@gmail.com>
> From: Akira Shirai <okshirai@gmail.com>
> Date: Mon, 11 Mar 2024 08:45:33 +0900
> 
> 1. emacs -Q on macOS
> 2. evaluate (set-language-environment "Japanese")
> 3. visit the directory where SKK-JISYO.L exists (e.g. ~/emacs-29.2/leim/SKK-DIC/)
> 4. type C-x RET c to run the universal-coding-system-argument command, and specify japanese-iso-8bit-unix as the coding system
> 5. type M-x grep to run grep, and specify "grep --color=auto -nH --null -e この辞書は SKK-JISYO.L" as the command-args
>   => grep fails with "grep: illegal byte sequence"
> 
> On Emacs 29.1, grep runs successfully with the same procedure.
> If the command-args is "LANG=C grep --color=auto -nH --null -e この辞書は SKK-JISYO.L", grep runs successfully.

I cannot reproduce this, but I'm not on macOS.

We made a change in msterm.m between Emacs 29.1 and Emacs 29.2, which
might be responsible for this: we now set the Emacs locale
differently.  But I'm not sure what you see means there's a bug in
Emacs, because it could well be a bug in Grep that you have on macOS;
for example, this page:

  https://stackoverflow.com/questions/19242275/re-error-illegal-byte-sequence-on-mac-os-x

clearly hints that this might be the case, and that setting LANG=C is
indeed the right solution for this.






* bug#69718: 29.2; grep japanese-iso-8bit-unix string fails with "grep: illegal byte sequence"
  2024-03-11 12:50 ` Eli Zaretskii
@ 2024-03-11 13:15   ` Eli Zaretskii
  2024-03-12 15:42     ` Akira Shirai
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2024-03-11 13:15 UTC (permalink / raw)
  To: okshirai; +Cc: 69718

> Cc: 69718@debbugs.gnu.org
> Date: Mon, 11 Mar 2024 14:50:31 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> 
> We made a change in msterm.m between Emacs 29.1 and Emacs 29.2, which
                      ^^^^^^^^
Sorry, that was supposed to be nsterm.m.






* bug#69718: 29.2; grep japanese-iso-8bit-unix string fails with "grep: illegal byte sequence"
  2024-03-11 13:15   ` Eli Zaretskii
@ 2024-03-12 15:42     ` Akira Shirai
  2024-03-12 19:39       ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Akira Shirai @ 2024-03-12 15:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: okshirai, 69718

In japanese-iso-8bit-unix (= eucJP), この辞書は is encoded as 0xa4b3 0xa4ce 0xbcad 0xbdf1 0xa4cf,
and 0xa4b3 0xa4ce 0xbcad 0xbdf1 0xa4cf is an illegal byte sequence in UTF-8.
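
The byte values above can be checked independently of grep. The following is a minimal sketch, assuming a POSIX sh with printf, od, tr, and an iconv that accepts the name UTF-8; the function names are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# EUC-JP encoding of この辞書は, written as octal escapes so that this
# script stays pure ASCII (hex: a4b3 a4ce bcad bdf1 a4cf).
eucjp_pattern() {
    printf '\244\263\244\316\274\255\275\361\244\317'
}

# Hex dump of stdin as one continuous lowercase string.
hex_bytes() {
    od -An -v -tx1 | tr -d ' \n'
}

# Succeeds only if stdin is well-formed UTF-8: iconv exits non-zero
# when it meets a byte sequence that is invalid in the source encoding.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

Running `eucjp_pattern | hex_bytes` prints the ten bytes listed above, and `eucjp_pattern | is_valid_utf8` returns a failure status, because 0xa4 cannot start a UTF-8 sequence.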

In UTF-8 mode, macOS grep signals the "grep: illegal byte sequence" error for this byte sequence,
but with LANG=C, or with LANG not specified, macOS grep accepts it.
| % cd ~/emacs-29.2/leim/SKK-DIC
| 
| % LANG=en_US.UTF-8 grep --color=auto -nH --null -e `echo この辞書は | iconv -f utf-8 -t eucJP` SKK-JISYO.L | iconv -f eucJP -t utf-8
| grep: illegal byte sequence
| 
| % LANG=C           grep --color=auto -nH --null -e `echo この辞書は | iconv -f utf-8 -t eucJP` SKK-JISYO.L | iconv -f eucJP -t utf-8
| SKK-JISYO.L\035:;; この辞書は、SKK 原作者の佐藤雅彦先生が、第 1 版作成のために東北大学
|  
| % LANG=            grep --color=auto -nH --null -e `echo この辞書は | iconv -f utf-8 -t eucJP` SKK-JISYO.L | iconv -f eucJP -t utf-8
| SKK-JISYO.L\035:;; この辞書は、SKK 原作者の佐藤雅彦先生が、第 1 版作成のために東北大学

emacs-29.1 executes /usr/bin/grep without LANG,
but emacs-29.2 seems to execute /usr/bin/grep with LANG=en_US.UTF-8.

I wonder whether /usr/bin/grep should be invoked in a non-UTF-8 mode, because Emacs might pass a non-UTF-8 byte sequence to /usr/bin/grep.
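
As a rough illustration of what "non UTF-8 mode" could look like at the shell level, here is a small sketch. It assumes a POSIX sh; the wrapper name grep_raw_bytes is hypothetical, and the use of LC_ALL rather than LANG is an assumption of the sketch (the tests in this thread only used LANG=C and an empty LANG). LC_ALL is used because it takes precedence over LANG and the other LC_* variables.

```shell
#!/bin/sh
# Run grep with every locale category forced to "C", so the pattern and
# the file contents are compared as raw bytes and are not validated
# against a multibyte encoding such as UTF-8.
grep_raw_bytes() {
    LC_ALL=C grep "$@"
}
```

For example, `grep_raw_bytes -n -e "$pattern" SKK-JISYO.L`, where `$pattern` holds the EUC-JP bytes of the search string, would search for those bytes whatever LANG the caller inherited.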








* bug#69718: 29.2; grep japanese-iso-8bit-unix string fails with "grep: illegal byte sequence"
  2024-03-12 15:42     ` Akira Shirai
@ 2024-03-12 19:39       ` Eli Zaretskii
  2024-03-13 14:07         ` Akira Shirai
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2024-03-12 19:39 UTC (permalink / raw)
  To: Akira Shirai; +Cc: okshirai, 69718

> From: Akira Shirai <okshirai@gmail.com>
> Date: Wed, 13 Mar 2024 00:42:22 +0900
> Cc: 69718@debbugs.gnu.org,
>  okshirai@joy.ocn.ne.jp
> 
> emacs-29.1 executes /usr/bin/grep without LANG,
> but emacs-29.2 seems to execute /usr/bin/grep with LANG=en_US.UTF-8.

The fact that LANG could cause this is IMO a bug in macOS's Grep.

There's no problem for Emacs to put LANG into the environment, but
Grep can be invoked on several very different files, with no single
LANG that fits all of them.  Grep should not use LANG at all.
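
The point about files in different encodings can be illustrated with a small sketch. It assumes a POSIX sh with mktemp and a grep that compares raw bytes under LC_ALL=C; all function and file names are hypothetical. The same text, この辞書は, is written once as EUC-JP and once as UTF-8, and a byte pattern in one encoding matches only the file that uses that encoding.

```shell
#!/bin/sh
# The same text in two encodings (octal escapes keep this script ASCII).
eucjp_text() { printf '\244\263\244\316\274\255\275\361\244\317'; }
utf8_text()  { printf '\343\201\223\343\201\256\350\276\236\346\233\270\343\201\257'; }

# Write one sample file per encoding into directory $1.
write_samples() {
    printf ';; %s\n' "$(eucjp_text)" > "$1/dict.eucjp"
    printf ';; %s\n' "$(utf8_text)"  > "$1/dict.utf8"
}

# Print the number of lines in file $2 that contain the bytes of $1.
# grep exits with status 1 when nothing matches; that case is treated
# as success here, while real errors (status 2 or more) still fail.
count_byte_matches() {
    LC_ALL=C grep -c -e "$1" -- "$2" || [ "$?" -eq 1 ]
}
```

After `write_samples "$dir"`, the call `count_byte_matches "$(eucjp_text)" "$dir/dict.eucjp"` reports one matching line, while the same pattern against `$dir/dict.utf8` reports none, so a single invocation over both files cannot be served by one pattern encoding or one LANG.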

> I wonder /usr/bin/grep should be invoked in non UTF-8 mode, because emacs might pass non UTF-8 byte sequence to /usr/bin/grep.

Illegal byte sequence is not limited to UTF-8.  There really is no
good solution for this, except in Grep itself.  Which is why I don't
think this is an Emacs bug.






* bug#69718: 29.2; grep japanese-iso-8bit-unix string fails with "grep: illegal byte sequence"
  2024-03-12 19:39       ` Eli Zaretskii
@ 2024-03-13 14:07         ` Akira Shirai
  0 siblings, 0 replies; 6+ messages in thread
From: Akira Shirai @ 2024-03-13 14:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 白井彰, 69718

When run as "LANG=locale grep [-e pattern] [file ...]", macOS grep checks that the byte sequence of the pattern is consistent with the locale,
but does not check that the byte sequence of the file contents is consistent with the locale.
(Please see *1 below)

This consistency check is disabled if LANG=C is specified or LANG is not specified.

emacs-29.1 executes /usr/bin/grep without LANG (*2),
but emacs-29.2 seems to execute /usr/bin/grep with LANG=en_US.UTF-8 (*3).

I wonder whether in some cases Emacs should invoke subprocesses with LANG=C or with LANG not specified,
and this grep issue might be one such case.

----------------------------------------------------------------------
*1 When run as "LANG=locale grep [-e pattern] [file ...]", macOS grep checks that the byte sequence of the pattern is consistent with the locale,
but does not check that the byte sequence of the file contents is consistent with the locale.

% cd ~/emacs-29.2/leim/SKK-DIC

/*
 * locale (= $LANG): en_US.UTF-8
 * pattern (= $aaa): UTF-8
 * file:             eucJP
 *   => Although the encodings of the locale and the file are inconsistent, grep runs successfully, with no hits
 */
% export aaa=この辞書は; export bbb=`echo $aaa | iconv -f utf-8 -t eucJP`; LANG=en_US.UTF-8 grep --color=auto -nH --null -e $aaa SKK-JISYO.L | iconv -f eucJP -t utf-8

/*
 * locale (= $LANG): en_US.UTF-8
 * pattern (= $bbb): eucJP
 * file:             eucJP
 *   => Because the encodings of the locale and the pattern are inconsistent, grep fails with "illegal byte sequence"
 */
% export aaa=この辞書は; export bbb=`echo $aaa | iconv -f utf-8 -t eucJP`; LANG=en_US.UTF-8 grep --color=auto -nH --null -e $bbb SKK-JISYO.L | iconv -f eucJP -t utf-8
grep: illegal byte sequence

/*
 * locale (= $LANG): C
 * pattern (= $bbb): eucJP
 * file:             eucJP
 *   => Because the locale is C, grep does not check the consistency between the locale and the byte sequence of the pattern, and runs successfully with one hit
 */
% export aaa=この辞書は; export bbb=`echo $aaa | iconv -f utf-8 -t eucJP`; LANG=C grep --color=auto -nH --null -e $bbb SKK-JISYO.L | iconv -f eucJP -t utf-8
SKK-JISYO.L\035:;; この辞書は、SKK 原作者の佐藤雅彦先生が、第 1 版作成のために東北大学

/*
 * locale (= $LANG): unspecified
 * pattern (= $bbb): eucJP
 * file:             eucJP
 *   => Because the locale is not specified, grep does not check the consistency between the locale and the byte sequence of the pattern, and runs successfully with one hit
 */
% export aaa=この辞書は; export bbb=`echo $aaa | iconv -f utf-8 -t eucJP`; LANG= grep --color=auto -nH --null -e $bbb SKK-JISYO.L | iconv -f eucJP -t utf-8
SKK-JISYO.L\035:;; この辞書は、SKK 原作者の佐藤雅彦先生が、第 1 版作成のために東北大学

----------------------------------------------------------------------
*2 emacs-29.1 executes /usr/bin/grep without LANG

/*
 * emacs-29.1
 * grep executes successfully with one hit
 * printenv shows that the subprocess is invoked with LANG unspecified
 */

-*- mode: grep; default-directory: "~/emacs-29.1/leim/SKK-DIC/" -*-
Grep started at Wed Mar 13 22:21:06

grep --color=auto -nH --null -e この辞書は SKK-JISYO.L; printenv
SKK-JISYO.L:35:;; この辞書は、SKK 原作者の佐藤雅彦先生が、第 1 版作成のために東北大学
GREP_COLOR=01;31
SHELL=/bin/bash
TERM=emacs-grep
TMPDIR=/var/folders/4l/q0w9w6j914q2n_v7qyysbrhh0000gn/T/
USER=shiraiakira
COMMAND_MODE=unix2003
GREP_COLORS=mt=01;31:fn=:ln=:bn=:se=:sl=:cx=:ne
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.yfsSXeLGOX/Listeners
__CF_USER_TEXT_ENCODING=0x1F5:0x1:0xE
PAGER=
PATH=/usr/bin:/bin:/usr/sbin:/sbin
LaunchInstanceID=8A9BF0A3-3EF0-4A5E-B1A2-E3E4FE729A53
__CFBundleIdentifier=org.gnu.Emacs
PWD=/Users/shiraiakira/emacs-29.1/leim/SKK-DIC
XPC_FLAGS=0x0
XPC_SERVICE_NAME=0
SHLVL=1
HOME=/Users/shiraiakira
LOGNAME=shiraiakira
DISPLAY=F9A6231BCF26.local
INSIDE_EMACS=29.1,compile
SECURITYSESSIONID=186a2
_=/usr/bin/printenv

Grep finished with 1 match found at Wed Mar 13 22:21:06

----------------------------------------------------------------------
*3 emacs-29.2 seems to execute /usr/bin/grep with LANG=en_US.UTF-8

/*
 * emacs-29.2
 * grep fails with illegal byte sequence
 * printenv shows that the subprocess is invoked with LANG=en_US.UTF-8
 */

-*- mode: grep; default-directory: "~/emacs-29.2/leim/SKK-DIC/" -*-
Grep started at Wed Mar 13 22:19:25

grep --color=auto -nH --null -e この辞書は SKK-JISYO.L; printenv
grep: illegal byte sequence
GREP_COLOR=01;31
SHELL=/bin/bash
TERM=emacs-grep
TMPDIR=/var/folders/4l/q0w9w6j914q2n_v7qyysbrhh0000gn/T/
USER=shiraiakira
COMMAND_MODE=unix2003
GREP_COLORS=mt=01;31:fn=:ln=:bn=:se=:sl=:cx=:ne
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.yfsSXeLGOX/Listeners
__CF_USER_TEXT_ENCODING=0x1F5:0x1:0xE
PAGER=
PATH=/usr/bin:/bin:/usr/sbin:/sbin
LaunchInstanceID=8A9BF0A3-3EF0-4A5E-B1A2-E3E4FE729A53
__CFBundleIdentifier=org.gnu.Emacs
PWD=/Users/shiraiakira/emacs-29.2/leim/SKK-DIC
LANG=en_US.UTF-8
XPC_FLAGS=0x0
XPC_SERVICE_NAME=0
SHLVL=1
HOME=/Users/shiraiakira
LOGNAME=shiraiakira
DISPLAY=F9A6231BCF26.local
INSIDE_EMACS=29.2,compile
SECURITYSESSIONID=186a2
_=/usr/bin/printenv

Grep finished with matches found at Wed Mar 13 22:19:25








