From: awrhygty@outlook.com
To: Eli Zaretskii <eliz@gnu.org>
Cc: 65305@debbugs.gnu.org
Subject: bug#65305: 29.1; archive-mode can not handle subfile names encoded with utf-8
Date: Wed, 16 Aug 2023 12:47:14 +0900
Message-ID: <TYZPR01MB39205F2F052C03AFB1C2F78EC315A@TYZPR01MB3920.apcprd01.prod.exchangelabs.com>
In-Reply-To: <83bkf89x7o.fsf@gnu.org> (Eli Zaretskii's message of "Tue, 15 Aug 2023 10:50:01 -0400")
Eli Zaretskii <eliz@gnu.org> writes:
>> From: awrhygty@outlook.com
>> Cc: 65305@debbugs.gnu.org
>> Date: Tue, 15 Aug 2023 22:53:01 +0900
>>
>> Eli Zaretskii <eliz@gnu.org> writes:
>>
>> > Is there any way of distinguishing these Python-created ZIP archives
>> > from ZIP archives created by other Windows programs?
>> >
>> > Emacs by default assumes that file names in a ZIP archive created by a
>> > Windows program are encoded in the console codepage, and it enforces
>> > using that encoding for file names when the "creator" of the ZIP
>> > archive indicates the archive was created by Windows programs such as
>> > InfoZip's zip.exe and the File Explorer. In my testing, zip archives
>> > created by Python as above record the "creator" as number 0 (zero),
>> > which is identical to what InfoZip does. So, unless someone explains
>> > how to distinguish these zip archives from those created by InfoZip, I
>> > don't see how can Emacs know whether to use the InfoZip heuristics or
>> > the Python heuristics. Without the InfoZip/File Explorer heuristics
>> > we have in arc-mode.el today, Emacs on Windows would be completely
>> > unable to support non-ASCII file names in ZIP archives.
>>
>> There is a bit flag indicating that the subfile name is encoded with
>> utf-8. Bytes 6-7 in local file header or bytes 8-9 in central directory
>> header are general purpose bit flag. And bit 11 of the flag represents
>> file encoding flag(1 for utf-8 encoding).
>
> Thanks, please try the patch below. If it gives good results, I will
> install it.
>
>> I guess unzip.exe does not support utf-8 encoded subfile name.
>> Writing batch file with utf-8 encoding:
>> c:\Emacs\emacs-29.1\bin\unzip.exe test.zip 一.txt
>> and run with chcp 932, 荳\200.txt is extracted.
>> With chcp 65001, extraction failed.
>>
>> Writing batch file with cp932 encoding:(same as above)
>> c:\Emacs\emacs-29.1\bin\unzip.exe test.zip 一.txt
>> and run with chcp 65001, 荳\200.txt is extracted.
>> With chcp 932, extraction failed.
>> This is not an ideal behavior, but extraction to STDOUT may work.
>>
>> To the contrary, 7z.exe extracts 一.txt correctly.
>> If batch file is encoded with utf-8, it works with chcp 65001.
>> If batch file is encoded with cp932, it works with chcp 932.
>
> Like I said: support for UTF-8 encoded file names on Windows is
> sporadic and incomplete. It will remain so until Windows file-related
> APIs support UTF-8 encoded file names.
>
> diff --git a/lisp/arc-mode.el b/lisp/arc-mode.el
> index 5e696c0..05a71fb 100644
> --- a/lisp/arc-mode.el
> +++ b/lisp/arc-mode.el
> @@ -1990,6 +1990,7 @@ archive-zip-summarize
> (setq p (+ p (point-min)))
> (while (string= "PK\001\002" (buffer-substring p (+ p 4)))
> (let* ((creator (get-byte (+ p 5)))
> + (gpflags (archive-l-e (+ p 8) 2))
> ;; (method (archive-l-e (+ p 10) 2))
> (modtime (archive-l-e (+ p 12) 2))
> (moddate (archive-l-e (+ p 14) 2))
> @@ -2001,7 +2002,12 @@ archive-zip-summarize
> (efnname (let ((str (buffer-substring (+ p 46) (+ p 46 fnlen))))
> (decode-coding-string
> str
> - (or (if (and w32-fname-encoding
> + ;; Bit 11 of general purpose bit flags (bytes
> + ;; 8-9) of Central Directory: 1 means UTF-8
> + ;; encoded file names.
> + (or (if (/= 0 (logand gpflags #x0800))
> + 'utf-8-unix)
> + (if (and w32-fname-encoding
> (memq creator
> ;; This should be just 10 and
> ;; 14, but InfoZip uses 0 and

The patch works for listing entries, and the contents can be extracted
with 7z.exe. Extraction with unzip.exe does not work well.
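
To double-check the listing outside Emacs, the same flag test can be reproduced in Python. This is only a minimal sketch, not what arc-mode.el does internally: the helper names (central_directory_flags, decode_names, make_test_zip) are made up for illustration, cp932 is assumed as the fallback encoding, and the parser assumes a plain archive with no ZIP64 records and no archive comment.

```python
import io
import struct
import zipfile

UTF8_FLAG = 0x0800  # bit 11 of the general purpose bit flag


def central_directory_flags(data: bytes):
    """Yield (raw_name_bytes, gpflags) for each central directory entry.

    Assumption: no ZIP64 and no archive comment, so the end of central
    directory record is exactly the last 22 bytes of the archive.
    """
    eocd = len(data) - 22
    if data[eocd:eocd + 4] != b"PK\x05\x06":
        raise ValueError("end of central directory record not found")
    (count,) = struct.unpack_from("<H", data, eocd + 10)
    (p,) = struct.unpack_from("<L", data, eocd + 16)
    for _ in range(count):
        if data[p:p + 4] != b"PK\x01\x02":
            raise ValueError("bad central directory header")
        # Bytes 8-9 of the central directory header: general purpose flags.
        (gpflags,) = struct.unpack_from("<H", data, p + 8)
        # Bytes 28-33: file name, extra field and comment lengths.
        fnlen, exlen, cmlen = struct.unpack_from("<HHH", data, p + 28)
        yield data[p + 46:p + 46 + fnlen], gpflags
        p += 46 + fnlen + exlen + cmlen


def decode_names(data: bytes, fallback: str = "cp932"):
    """Decode entry names, preferring UTF-8 when bit 11 is set."""
    names = []
    for raw, gpflags in central_directory_flags(data):
        coding = "utf-8" if gpflags & UTF8_FLAG else fallback
        names.append(raw.decode(coding))
    return names


def make_test_zip() -> bytes:
    """Build a small in-memory archive with one non-ASCII entry name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("ascii.txt", "a")
        zf.writestr("一.txt", "b")
    return buf.getvalue()
```

Calling decode_names(make_test_zip()) decodes the non-ASCII name as UTF-8 because Python's zipfile sets bit 11 for it, while the ASCII-only entry falls through to the fallback encoding.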

I tried the settings below, but rewriting entries does not work.
(These are the default values of the archive-zip-* variables when
archive-7z-program is set and zip.exe/unzip.exe do not exist.)

(setq archive-7z-program "c:/Program Files/7-Zip/7z.exe"
      archive-zip-extract '("c:/Program Files/7-Zip/7z.exe" "x" "-so")
      archive-zip-expunge '("c:/Program Files/7-Zip/7z.exe" "d")
      archive-zip-update '("c:/Program Files/7-Zip/7z.exe" "u")
      archive-zip-update-case archive-zip-update)

This is because the update command needs the "-si" option followed by
the entry name. The two must be passed as a single argument, like
(format "-si%s" name).
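
To illustrate the single-argument point outside Emacs, here is a small Python sketch. The function names are hypothetical, and it assumes 7z's "-si{name}" switch, which reads the entry contents from standard input. It does not reflect how arc-mode.el actually builds its command line.

```python
import subprocess


def seven_zip_update_argv(program: str, archive: str, name: str) -> list:
    """Build the argument vector for updating one entry from stdin.

    "-si" and the entry name are joined into ONE argument; passing them
    as two separate arguments is the failure described above.
    """
    return [program, "u", archive, "-si" + name]


def update_entry(program: str, archive: str, name: str, data: bytes):
    """Run the update, feeding the new entry contents on stdin.

    Not exercised here: it needs a real 7z executable and archive.
    """
    argv = seven_zip_update_argv(program, archive, name)
    return subprocess.run(argv, input=data, check=True)
```

For example, seven_zip_update_argv("7z", "test.zip", "一.txt") yields a four-element list whose last element is the joined "-si一.txt".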