From: Mark H Weaver
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#33044: Guile misbehaves in the "ja_JP.sjis" locale
Date: Mon, 15 Oct 2018 21:57:02 -0400
Message-ID: <87y3ayodqp.fsf_-_@netris.org>
References: <469f2345-5e76-1fc5-1105-f1d508611140@suse.de> <8a6a308f-a981-fd46-93d5-c2d2870f4eb4@suse.de>
In-Reply-To: <8a6a308f-a981-fd46-93d5-c2d2870f4eb4@suse.de> (Tom de Vries's message of "Mon, 15 Oct 2018 20:59:03 +0200")
To: Tom de Vries
Cc: 33044@debbugs.gnu.org

retitle 33044 Guile misbehaves in the "ja_JP.sjis" locale
thanks

Hi Tom,

Thanks for the report, analysis and patch.
I agree with your analysis, and the patch looks good.

However, there's also a much deeper problem here.  You found and fixed
one occurrence of Guile assuming that the locale encoding is
ASCII-compatible.  In fact, this assumption is widespread in Guile, and
I would guess that it's widespread throughout the POSIX world.

I admit that before I saw your message, I believed that it was
legitimate to assume that the locale encoding was ASCII-compatible.  Now
I'm unsure, although I'll note that according to the 'localedef' utility
from GNU libc, this locale is "not ISO C compliant".  It printed the
following message when I asked it to generate the "ja_JP.sjis" locale:

  [warning] character map `SHIFT_JIS' is not ASCII compatible, locale not ISO C compliant [--no-warnings=ascii]

Shift_JIS is _mostly_ ASCII-compatible, except that code points 0x5C and
0x7E, which represent backslash (\) and tilde (~) in ASCII, are mapped
to the Yen sign (¥) and overline (‾) in Shift_JIS.  Backslash (\) and
tilde (~) are multibyte characters in Shift_JIS.

One common problem is that Guile often uses 'scm_from_locale_string' to
create Scheme strings from ASCII-only C string literals.  These should
all be changed to use either 'scm_from_latin1_string' or
'scm_from_utf8_string'.  I prefer the latter because modern C compilers
typically use UTF-8 as the default execution character set, i.e. the
character set used to encode string and character constants, regardless
of the locale settings.  GCC uses UTF-8 by default unless
-fexec-charset=CHARSET is given at compile time.  I'd prefer to promote
writing code that works for arbitrary string literals, so that code
needn't be adjusted if non-ASCII characters are later added.

A related set of problems is that Guile often applies
'scm_from_locale_string' to char* arguments passed in from the user, or
produced by third-party libraries.  These issues are more difficult to
address.

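To make the string-literal problem concrete, here is a minimal,
self-contained C sketch.  It is an illustration only and is not taken
from Guile's source: the function names are hypothetical, the byte
mapping is just the one described above (0x5C to the Yen sign, 0x7E to
the overline), and it assumes the compiler encodes literals in an
ASCII-compatible execution character set.  It counts the bytes of a C
literal that would decode to a different character if the literal were
interpreted in Shift_JIS rather than ASCII, which is what happens when
such a literal is passed through a locale-decoding function in the
"ja_JP.sjis" locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one byte below 0x80 as ASCII. */
uint32_t decode_ascii(unsigned char b)
{
    assert(b < 0x80);
    return b;
}

/* Hypothetical helper: decode one byte below 0x80 using the Shift_JIS
   mapping described in the message above.  Only 0x5C and 0x7E differ
   from ASCII. */
uint32_t decode_sjis_single_byte(unsigned char b)
{
    assert(b < 0x80);
    if (b == 0x5C)
        return 0x00A5;  /* YEN SIGN instead of backslash */
    if (b == 0x7E)
        return 0x203E;  /* OVERLINE instead of tilde */
    return b;
}

/* Count the bytes of an ASCII-only literal whose meaning changes when
   the literal is decoded as Shift_JIS instead of ASCII. */
size_t count_mismatches(const char *literal)
{
    size_t n = 0;
    assert(literal != NULL);
    for (const unsigned char *p = (const unsigned char *)literal; *p; p++) {
        if (*p < 0x80 && decode_sjis_single_byte(*p) != decode_ascii(*p))
            n++;
    }
    return n;
}
```

A literal such as "(display \"a\\nb\")" contains one backslash, so one
byte changes meaning under the Shift_JIS interpretation, while
"(+ 1 2)" is unaffected.  Decoding literals as UTF-8 (or Latin-1)
avoids the dependence on the locale altogether.
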
We provide several C APIs that accept C strings without specifying what
encoding is expected.  If the string ultimately derives from a C string
constant, we probably want UTF-8, whereas if the string came from I/O,
or program arguments, then we probably want the locale encoding.

For example, consider 'scm_c_eval_string'.  This has been a public API
function since 2002, but we did not specify the encoding of its C string
argument until 2011.  We chose the locale encoding in this case, which I
think is reasonable, but I also expect that code exists in the wild that
passes a C string literal to 'scm_c_eval_string'.

Until now, problems like this have been mostly harmless, since the C
string literals are typically ASCII-only.  However, if we wish to
support non-ASCII-compatible encodings such as Shift_JIS, we can no
longer consider these problems harmless.  For example, programs which
pass C string literals to 'scm_c_eval_string' will fail when using the
"ja_JP.sjis" locale, if any tildes or backslashes are present.
Backslashes are fairly common in Scheme code.

There's various other code scattered in Guile that assumes ASCII
characters can be searched for, and sometimes replaced with other ASCII
characters.  For example, several functions in load.c, including
'search_path' and 'load_thunk_from_path', scan through file names in the
locale encoding, scanning the bytes looking for particular ASCII codes
such as '.', '/', and '\'.

On MinGW, 'scm_i_mirror_backslashes' in load.c converts backslashes into
forward slashes byte-wise, assuming ASCII-compatibility, and this
transformation is applied to file names in several places.

While looking into this, I also discovered that Guile's S-expression
reader, i.e. the 'read' procedure, assumes an ASCII-compatible port
encoding, despite the fact that it is meant to support arbitrary
encodings such as UTF-16 and UTF-32.  I just filed a related bug to
track this problem.

These are some of the problems that I'm currently aware of.

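The hazard of byte-wise scanning can be sketched in a few lines of
standalone C.  This is an illustration under stated assumptions, not
Guile's 'scm_i_mirror_backslashes': the function names are hypothetical,
the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the usual ones for
two-byte Shift_JIS characters, and a lone 0x5C byte is treated as the
path separator, as on Windows code page 932.  Because the second byte of
a two-byte Shift_JIS character can itself be 0x5C, a byte-wise
replacement can corrupt a character, whereas a scan that skips trail
bytes does not.

```c
#include <assert.h>
#include <stddef.h>

/* Lead bytes of two-byte Shift_JIS characters: 0x81-0x9F, 0xE0-0xFC. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Naive version: replace every 0x5C byte with '/', assuming that each
   0x5C byte is a complete character. */
void mirror_bytewise(char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] == 0x5C)
            s[i] = '/';
    }
}

/* Lead-byte-aware version: when a lead byte is seen, skip it and its
   trail byte, so only stand-alone 0x5C bytes are replaced. */
void mirror_sjis_aware(char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)s[i];
        if (sjis_is_lead_byte(b)) {
            i++;        /* step over the trail byte as well */
            continue;
        }
        if (b == 0x5C)
            s[i] = '/';
    }
}
```

For instance, the two-byte sequence 0x95 0x5C is a single Shift_JIS
character.  Applied to the bytes 0x95 0x5C 0x5C 'x', the naive version
rewrites both 0x5C bytes and destroys the character, while the
lead-byte-aware version rewrites only the third byte.
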
I expect that this bug report will remain open for a while.

To begin, I've started working on a patch to change many occurrences of
'scm_from_locale_string' to 'scm_from_utf8_string', in cases where the C
string clearly originates from a C string literal.

Thanks again for the detailed bug report and analysis.

     Regards,
       Mark