From: David Kastrup
Newsgroups: gmane.lisp.guile.user
Subject: Re: guile can't find a chinese named file
Date: Fri, 17 Feb 2017 10:04:29 +0100
Message-ID: <87fujdnq76.fsf@fencepost.gnu.org>
In-Reply-To: <871suxtdav.fsf@elektro.pacujo.net> (Marko Rauhamaa's message of "Fri, 17 Feb 2017 10:46:32 +0200")
To: Marko Rauhamaa
Cc: guile-user@gnu.org

Marko Rauhamaa writes:

> Eli Zaretskii :
>>> From: Marko Rauhamaa
>>> Python uses the surrogate hole in the middle of the Unicode range to
>>> represent such stray bytes, but only when naming files.
>>
>> IMO, it makes no sense to limit this to file names, because (a) you
>> don't always know on all levels of the code which string is a file
>> name or a part thereof; and (b) because situations where non-ASCII
>> bytes cannot be properly decoded into Unicode happen with text that is
>> not file names, and users still expect Emacs to silently produce the
>> same byte stream on round-trip operations, e.g., when copying text
>> from one file to another.
>
> Python just barfs:
>
>    $ python3 -c "import sys; print(sys.stdin.read(30))" <<<$'\xdd'
>    Traceback (most recent call last):
>      File "<string>", line 1, in <module>
>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position \
>    0: invalid continuation byte
>
> The situation is a bit difficult to recover from.

You can load an executable into an Emacs buffer and do a
search-and-replace on UTF-8 strings, then save again.  Assuming that
the replacement string has the same length as the original and that
the string does not appear as part of symbols for the linker, the
executable will likely work fine afterwards.

I don't think that XEmacs (another Emacs implementation that migrated
a lot more leisurely to multibyte encodings) would stand up to the
same sort of abuse.  And probably quite a few text editors would throw
in the towel as well.

But once you view Emacs as a text processing platform, it's a
reasonable conclusion that failure is not a good option.

For a general-purpose programming language like Python or Guile, I
should think it at least as important that strings can represent
input accurately without having to digress outside of string
processing and use stuff like byte arrays.

-- 
David Kastrup
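
The round-trip property discussed above can be illustrated in Python itself. What follows is only a minimal sketch, not anything from the thread: it assumes Python 3 and uses the standard `surrogateescape` error handler (PEP 383), which is the "surrogate hole" mechanism mentioned in the quoted text. The variable names are made up for the example. The handler has to be requested explicitly here; the strict default is what produces the traceback quoted above.

```python
# A byte string that is not valid UTF-8: 0xdd opens a two-byte sequence
# but is followed by a newline instead of a continuation byte.
raw = b"\xdd\n"

# Strict decoding (the default) fails, as in the quoted traceback.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With "surrogateescape", each undecodable byte 0xNN becomes the lone
# surrogate U+DCNN instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok)          # False
print(repr(text))         # '\udcdd\n'
print(round_trip == raw)  # True
```

The resulting string still behaves like an ordinary string for searching and slicing, but it cannot be encoded with the strict handler, so code that writes it back out has to opt in to `surrogateescape` as well.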