unofficial mirror of bug-gnu-emacs@gnu.org 
From: Eli Zaretskii <eliz@gnu.org>
To: awrhygty@outlook.com
Cc: 65305@debbugs.gnu.org
Subject: bug#65305: 29.1; archive-mode can not handle subfile names encoded with utf-8
Date: Tue, 15 Aug 2023 17:50:03 +0300
Message-ID: <83bkf89x7o.fsf@gnu.org>
In-Reply-To: <TYZPR01MB3920333EB23EAE1FE2D13298C314A@TYZPR01MB3920.apcprd01.prod.exchangelabs.com> (awrhygty@outlook.com)

> From: awrhygty@outlook.com
> Cc: 65305@debbugs.gnu.org
> Date: Tue, 15 Aug 2023 22:53:01 +0900
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Is there any way of distinguishing these Python-created ZIP archives
> > from ZIP archives created by other Windows programs?
> >
> > Emacs by default assumes that file names in a ZIP archive created by a
> > Windows program are encoded in the console codepage, and it enforces
> > using that encoding for file names when the "creator" of the ZIP
> > archive indicates the archive was created by Windows programs such as
> > InfoZip's zip.exe and the File Explorer.  In my testing, zip archives
> > created by Python as above record the "creator" as number 0 (zero),
> > which is identical to what InfoZip does.  So, unless someone explains
> > how to distinguish these zip archives from those created by InfoZip, I
> > don't see how Emacs can know whether to use the InfoZip heuristics or
> > the Python heuristics.  Without the InfoZip/File Explorer heuristics
> > we have in arc-mode.el today, Emacs on Windows would be completely
> > unable to support non-ASCII file names in ZIP archives.
> 
> There is a bit flag indicating that the subfile name is encoded in
> UTF-8.  Bytes 6-7 of the local file header, or bytes 8-9 of the central
> directory header, hold the general purpose bit flag, and bit 11 of that
> flag is the file name encoding flag (1 means UTF-8).

Thanks, please try the patch below.  If it gives good results, I will
install it.
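
For reference, the flag described above can also be inspected outside
Emacs.  What follows is only a minimal Python sketch, not part of the
patch.  It assumes that Python's standard zipfile module sets bit 11
when a member name is not pure ASCII, and the helper name
central_directory_flags is made up for illustration.

```python
import io
import struct
import zipfile

UTF8_NAME_FLAG = 0x0800  # bit 11 of the general purpose bit flag


def central_directory_flags(data):
    """Yield (raw_name_bytes, gp_flags) for each central directory entry.

    Hypothetical helper: it simply scans for the "PK\\x01\\x02" signature,
    which is good enough for a small test archive but is not a robust parser.
    """
    pos = data.find(b"PK\x01\x02")
    while pos != -1:
        # Bytes 8-9 of the entry: general purpose bit flag (little-endian).
        (gp_flags,) = struct.unpack_from("<H", data, pos + 8)
        # Bytes 28-29: file name length; the name itself starts at byte 46.
        (name_len,) = struct.unpack_from("<H", data, pos + 28)
        yield data[pos + 46 : pos + 46 + name_len], gp_flags
        pos = data.find(b"PK\x01\x02", pos + 46 + name_len)


# Build a small archive in memory with one non-ASCII and one ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("一.txt", "hello")
    zf.writestr("plain.txt", "hello")

entries = list(central_directory_flags(buf.getvalue()))
for raw_name, gp_flags in entries:
    # Print the raw name bytes and whether the UTF-8 flag is set.
    print(raw_name, bool(gp_flags & UTF8_NAME_FLAG))
```

The offsets used here (8 for the flags, 28 for the name length, 46 for
the name) are the same ones the central directory loop in arc-mode.el
reads relative to p.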

> I guess unzip.exe does not support UTF-8 encoded subfile names.
> With a batch file written in UTF-8 encoding:
>   c:\Emacs\emacs-29.1\bin\unzip.exe test.zip 一.txt
> and run under chcp 932, 荳\200.txt is extracted.
> Under chcp 65001, extraction fails.
> 
> With a batch file written in cp932 encoding (same contents as above):
>   c:\Emacs\emacs-29.1\bin\unzip.exe test.zip 一.txt
> and run under chcp 65001, 荳\200.txt is extracted.
> Under chcp 932, extraction fails.
> This is not ideal behavior, but extraction to STDOUT may work.
> 
> By contrast, 7z.exe extracts 一.txt correctly.
> If the batch file is encoded in UTF-8, it works with chcp 65001.
> If the batch file is encoded in cp932, it works with chcp 932.

Like I said: support for UTF-8 encoded file names on Windows is
sporadic and incomplete.  It will remain so until Windows file-related
APIs support UTF-8 encoded file names.
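
The 荳\200.txt name reported above is consistent with the UTF-8 bytes
of 一 being interpreted as cp932.  The following minimal Python sketch
illustrates that reading; it is an assumption about what happens inside
unzip.exe, not something checked against its sources.

```python
# UTF-8 encoding of the member name: b'\xe4\xb8\x80.txt'
utf8_name = "一.txt".encode("utf-8")

# Misreading those bytes as cp932: 0xE4 0xB8 form one double-byte
# character, and the third byte 0x80 (octal \200) is left over.
misread = utf8_name.decode("cp932", errors="replace")
print(ascii(misread))

# Reading the same bytes as UTF-8 gives the intended name back.
roundtrip = utf8_name.decode("utf-8")
print(roundtrip == "一.txt")
```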

diff --git a/lisp/arc-mode.el b/lisp/arc-mode.el
index 5e696c0..05a71fb 100644
--- a/lisp/arc-mode.el
+++ b/lisp/arc-mode.el
@@ -1990,6 +1990,7 @@ archive-zip-summarize
     (setq p (+ p (point-min)))
     (while (string= "PK\001\002" (buffer-substring p (+ p 4)))
       (let* ((creator (get-byte (+ p 5)))
+             (gpflags (archive-l-e (+ p 8) 2))
 	     ;; (method  (archive-l-e (+ p 10) 2))
              (modtime (archive-l-e (+ p 12) 2))
              (moddate (archive-l-e (+ p 14) 2))
@@ -2001,7 +2002,12 @@ archive-zip-summarize
              (efnname (let ((str (buffer-substring (+ p 46) (+ p 46 fnlen))))
 			(decode-coding-string
 			 str
-                         (or (if (and w32-fname-encoding
+                         ;; Bit 11 of general purpose bit flags (bytes
+                         ;; 8-9) of Central Directory: 1 means UTF-8
+                         ;; encoded file names.
+                         (or (if (/= 0 (logand gpflags #x0800))
+                                 'utf-8-unix)
+                             (if (and w32-fname-encoding
                                       (memq creator
                                             ;; This should be just 10 and
                                             ;; 14, but InfoZip uses 0 and
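
The order of checks introduced by the patch can be summarized outside
Emacs Lisp as well.  This is a rough Python sketch under stated
assumptions: the function name, the parameters, and the fallback codec
are hypothetical, and the set of creator codes passed in the usage lines
is illustrative only, not the full list used in arc-mode.el.

```python
UTF8_NAME_FLAG = 0x0800  # bit 11 of the general purpose bit flag


def choose_name_coding(gp_flags, creator, console_coding,
                       console_creators, fallback):
    """Pick a codec name for decoding a member file name.

    Mirrors the order of checks in the patch: the UTF-8 flag wins,
    then the creator-based console codepage heuristic applies, and
    otherwise a caller-supplied fallback is used (an assumption here).
    """
    if gp_flags & UTF8_NAME_FLAG:
        return "utf-8"
    if console_coding and creator in console_creators:
        return console_coding
    return fallback


# Illustrative values only: creator codes 0, 10 and 14 are the ones
# mentioned in the thread; "cp932" stands in for the console codepage.
creators = {0, 10, 14}
print(choose_name_coding(0x0800, 0, "cp932", creators, "latin-1"))
print(choose_name_coding(0x0000, 0, "cp932", creators, "latin-1"))
print(choose_name_coding(0x0000, 3, "cp932", creators, "latin-1"))
```

Checking the flag first is what lets a Python-created archive with
creator 0 be told apart from an InfoZip one, which also records 0.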





