From: David Kastrup
Newsgroups: gmane.lisp.guile.user
Subject: Re: guile can't find a chinese named file
Date: Mon, 30 Jan 2017 19:32:14 +0100
Message-ID: <878tpsqtzl.fsf@fencepost.gnu.org>
In-Reply-To: <87fuk0ctve.fsf@elektro.pacujo.net> (Marko Rauhamaa's message of "Mon, 30 Jan 2017 19:58:29 +0200")
To: Marko Rauhamaa
Cc: guile-user@gnu.org

Marko Rauhamaa writes:

> David Kastrup :
>
>> But at any rate, this cannot easily be fixed since Guile uses libraries
>> for encoding/decoding that cannot deal reproducibly with improper byte
>> patterns.
>
> Guile's mistake was to move to Unicode strings in the operating system
> interface.

Emacs uses a UTF-8 based encoding internally: basically, valid UTF-8 is
represented as itself, there are a number of code points beyond the
actual limit of UTF-8 that are used for non-Unicode character sets, and
single bytes not properly belonging to the read encoding are represented
with 0x00...0x7f, 0xc0 0x80 ... 0xc0 0xbf and 0xc1 0x80 ... 0xc1 0xbf
(the latter two ranges are "overlong" encodings of 0x00...0x7f and
consequently also not valid utf-8).  The result is that random binary
files read as "utf-8" grow by less than 50% in the internal
representation (0x00-0x7f gets represented as itself, and 0x80-0xff gets
encoded with two bytes only when not being a part of a valid utf-8
sequence).  The internal representation has several guarantees for
processing.
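
As a rough illustration of the byte-level scheme described above, here
is a minimal Python sketch.  It is not Emacs's actual code: the names
to_internal and from_internal are made up for this example, validity is
delegated to Python's strict utf-8 decoder (which may differ from Emacs
in corner cases), and the extra code points for non-Unicode character
sets are not modelled at all.

```python
def _is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes as strict UTF-8 (no overlongs, no surrogates)."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_internal(data: bytes) -> bytes:
    """Hypothetical helper: keep valid UTF-8 as itself and escape every other
    byte >= 0x80 as a two-byte overlong sequence 0xC0/0xC1 + trail byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # ASCII bytes are represented as themselves.
            out.append(b)
            i += 1
            continue
        # Sequence length implied by the lead byte; 0 means "not a valid lead".
        if 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        chunk = data[i:i + n]
        if n and len(chunk) == n and _is_valid_utf8(chunk):
            # A complete, valid UTF-8 sequence is copied unchanged.
            out += chunk
            i += n
        else:
            # Stray byte 0x80..0xBF -> 0xC0 xx, stray byte 0xC0..0xFF -> 0xC1 xx.
            out.append(0xC0 | ((b >> 6) & 1))
            out.append(0x80 | (b & 0x3F))
            i += 1
    return bytes(out)


def from_internal(internal: bytes) -> bytes:
    """Hypothetical inverse: turn 0xC0/0xC1 escapes back into single raw bytes."""
    out = bytearray()
    i = 0
    while i < len(internal):
        b = internal[i]
        if b in (0xC0, 0xC1):
            # 0xC0 and 0xC1 never start valid UTF-8, so they only mark escapes.
            out.append(0x80 | ((b & 1) << 6) | (internal[i + 1] & 0x3F))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


if __name__ == "__main__":
    sample = b"caf\xc3\xa9 \xe9 \xff\x80"   # valid UTF-8 mixed with stray bytes
    internal = to_internal(sample)
    print(internal)
    print(from_internal(internal) == sample)
```

Because 0xc0 and 0xc1 can never be lead bytes of valid utf-8, the
escapes in this sketch cannot collide with real text, so the conversion
back to bytes is unambiguous.
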
And when reencoding to utf-8 as output encoding, the input gets
reconstructed perfectly even when it wasn't actually utf-8 to start
with.

Emacs does not use "Unicode strings in the operating system interface"
but rather has a number of explicit encodings:

file-name-coding-system is a variable defined in ‘C source code’.
Its value is nil

Documentation:
Coding system for encoding file names.
If it is nil, ‘default-file-name-coding-system’ (which see) is used.

On MS-Windows, the value of this variable is largely ignored if
‘w32-unicode-filenames’ (which see) is non-nil.  Emacs on Windows
behaves as if file names were encoded in ‘utf-8’.

[back]

Coding system for saving this buffer:
  U -- utf-8-emacs-unix (alias: emacs-internal)

Default coding system (for new files):
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
  nil

Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)
  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)

Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. iso-2022-7bit
  3. iso-latin-1 (alias: iso-8859-1 latin-1)
  4. iso-2022-7bit-lock (alias: iso-2022-int-1)
  5. iso-2022-8bit-ss2
  6. emacs-mule
  7. raw-text
  8. iso-2022-jp (alias: junet)
  9. in-is13194-devanagari (alias: devanagari)
  10. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
  11. utf-8-auto
  12. utf-8-with-signature
  13. utf-16
  14. utf-16be-with-signature (alias: utf-16-be)
  15. utf-16le-with-signature (alias: utf-16-le)
  16. utf-16be
  17. utf-16le
  18. japanese-shift-jis (alias: shift_jis sjis)
  19. chinese-big5 (alias: big5 cn-big5 cp950)
  20. undecided

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.

Particular coding systems specified for certain file names:

  OPERATION    TARGET PATTERN                          CODING SYSTEM(s)
  ---------    --------------                          ----------------
  File I/O     "\\.dz\\'"                              (no-conversion . no-conversion)
               "\\.txz\\'"                             (no-conversion . no-conversion)
               "\\.xz\\'"                              (no-conversion . no-conversion)
               "\\.lzma\\'"                            (no-conversion . no-conversion)
               "\\.lz\\'"                              (no-conversion . no-conversion)
               "\\.g?z\\'"                             (no-conversion . no-conversion)
               "\\.\\(?:tgz\\|svgz\\|sifz\\)\\'"       (no-conversion . no-conversion)
               "\\.tbz2?\\'"                           (no-conversion . no-conversion)
               "\\.bz2\\'"                             (no-conversion . no-conversion)
               "\\.Z\\'"                               (no-conversion . no-conversion)
               "\\.elc\\'"                             utf-8-emacs
               "\\.el\\'"                              prefer-utf-8
               "\\.utf\\(-8\\)?\\'"                    utf-8
               "\\.xml\\'"                             xml-find-file-coding-system
               "\\(\\`\\|/\\)loaddefs.el\\'"           (raw-text . raw-text-unix)
               "\\.tar\\'"                             (no-conversion . no-conversion)
               "\\.po[tx]?\\'\\|\\.po\\."              po-find-file-coding-system
               "\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"     latexenc-find-file-coding-system
               ""                                      (undecided)
  Process I/O  nothing specified
  Network I/O  nothing specified

[back]

So in short: this is a rather complex domain.  And Elisp, as a
text-manipulating platform, has a whole lot of tools and bells and
whistles to deal with it well enough that you usually won't even notice.
It took a number of years to arrive there and caused the last large
migration to XEmacs.

-- 
David Kastrup