From: Tom de Vries
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#33044: Guile misbehaves in the "ja_JP.sjis" locale
Date: Wed, 17 Oct 2018 01:27:33 +0200
Message-ID: <25e93980-834d-cf91-cabf-77c2bb9a31c5@suse.de>
References: <469f2345-5e76-1fc5-1105-f1d508611140@suse.de> <8a6a308f-a981-fd46-93d5-c2d2870f4eb4@suse.de> <87y3ayodqp.fsf_-_@netris.org>
In-Reply-To: <87y3ayodqp.fsf_-_@netris.org>
To: Mark H Weaver
Cc: 33044@debbugs.gnu.org

On 10/16/18 3:57 AM, Mark H Weaver wrote:
> retitle 33044 Guile
> misbehaves in the "ja_JP.sjis" locale
> thanks
>
> Hi Tom,
>
> Thanks for the report, analysis and patch. I agree with your analysis,
> and the patch looks good.
>

If so, can the patch be committed?

I'm running into this problem in the context of gdb, which fails like this:
...
$ LC_CTYPE=ja_JP.sjis gdb
Segmentation fault (core dumped)
...

So, gdb (which has a dependency on libguile) aborts because of guile
initialization, without gdb actually using the guile functionality, and
the patch fixes this.

> However, there's also a much deeper problem here. You found and fixed
> one occurrence of Guile assuming that the locale encoding is
> ASCII-compatible. In fact, this assumption is widespread in Guile, and I
> would guess that it's widespread throughout the POSIX world.
>
> I admit that before I saw your message, I believed that it was
> legitimate to assume that the locale encoding was ASCII-compatible. Now
> I'm unsure, although I'll note that according to the 'localedef' utility
> from GNU libc, this locale is "not ISO C compliant". It printed the
> following message when I asked it to generate the "ja_JP.sjis" locale:
>
> [warning] character map `SHIFT_JIS' is not ASCII compatible, locale not ISO C compliant [--no-warnings=ascii]
>
> Shift_JIS is _mostly_ ASCII-compatible, except that code points 0x5C and
> 0x7E, which represent backslash (\) and tilde (~) in ASCII, are mapped
> to the Yen sign (¥) and overline (‾) in Shift_JIS. Backslash (\) and
> tilde (~) are multibyte characters in Shift_JIS.
>
> One common problem is that Guile often uses 'scm_from_locale_string' to
> create Scheme strings from ASCII-only C string literals. These should
> all be changed to use either 'scm_from_latin1_string' or
> 'scm_from_utf8_string'. I prefer the latter because modern C compilers
> typically use UTF-8 as the default execution character set, i.e. the
> character set used to encode string and character constants, regardless
> of the locale settings.
> GCC uses UTF-8 by default unless
> -fexec-charset=CHARSET is given at compile time. I'd prefer to promote
> writing code that works for arbitrary string literals, so that code
> needn't be adjusted if non-ASCII characters are later added.
>
> A related set of problems is that Guile often applies
> 'scm_from_locale_string' to char* arguments passed in from the user, or
> produced by third-party libraries. These issues are more difficult to
> address. We provide several C APIs that accept C strings without
> specifying what encoding is expected. If the string ultimately derives
> from a C string constant, we probably want UTF-8, whereas if the string
> came from I/O, or program arguments, then we probably want the locale
> encoding.
>
> For example, consider 'scm_c_eval_string'. This has been a public API
> function since 2002, but we did not specify the encoding of its C string
> argument until 2011. We chose the locale encoding in this case, which I
> think is reasonable, but I also expect that code exists in the wild that
> passes a C string literal to 'scm_c_eval_string'.
>
> Until now, problems like this have been mostly harmless, since the C
> string literals are typically ASCII-only. However, if we wish to
> support non-ASCII-compatible encodings such as Shift_JIS, we can no
> longer consider these problems harmless. For example, programs which
> pass C string literals to 'scm_c_eval_string' will fail when using the
> "ja_JP.sjis" locale, if any tildes or backslashes are present.
> Backslashes are fairly common in Scheme code.
>
> There's various other code scattered in Guile that assumes ASCII
> characters can be searched for, and sometimes replaced with other ASCII
> characters. For example, several functions in load.c, including
> 'search_path' and 'load_thunk_from_path', scan through file names in the
> locale encoding, scanning the bytes looking for particular ASCII codes
> such as '.', '/', and '\'.
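
A minimal sketch of one way byte-wise scanning can go wrong on Shift_JIS file names, offered only as an illustration and not as code from load.c. In Shift_JIS the katakana character "ソ" is encoded as the two bytes 0x83 0x5C, so its second byte has the same value as ASCII backslash. The function names below are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are an assumption that covers Shift_JIS together with common vendor extensions.

```c
#include <stddef.h>
#include <string.h>

/* Assumed lead-byte ranges for Shift_JIS and common extensions.
   The byte following a lead byte is a trail byte, whatever its value. */
static int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Naive scan: offset of the first 0x5C byte, or -1 if there is none.
   This treats every byte as an independent character. */
static long find_byte_5c_naive(const unsigned char *s, size_t len)
{
    const unsigned char *p = memchr(s, 0x5C, len);
    return p ? (long)(p - s) : -1;
}

/* Encoding-aware scan: steps over two-byte characters so that a 0x5C
   trail byte is never reported; only a single-byte 0x5C matches. */
static long find_byte_5c_sjis(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        if (sjis_is_lead_byte(s[i])) {
            i += 2;             /* skip lead byte and its trail byte */
            continue;
        }
        if (s[i] == 0x5C)
            return (long)i;
        i++;
    }
    return -1;
}
```

For the six bytes 0x83 0x5C 0x2E 0x73 0x63 0x6D (the name "ソ.scm" in Shift_JIS), the naive scan reports a match at offset 1, in the middle of a character, while the encoding-aware scan reports no match. Whether a single-byte 0x5C should then count as a separator at all is a separate question, since the locale maps it to the Yen sign.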
>
> On MinGW, 'scm_i_mirror_backslashes' in load.c converts backslashes into
> forward slashes byte-wise, assuming ASCII-compatibility, and this
> transformation is applied to file names in several places.
>
> While looking into this, I also discovered that Guile's S-expression
> reader, i.e. the 'read' procedure, assumes an ASCII-compatible port
> encoding, despite the fact that it is meant to support arbitrary
> encodings such as UTF-16 and UTF-32. I just filed a related bug
> to track this problem.
>
> These are some of the problems that I'm currently aware of. I expect
> that this bug report will remain open for a while.
>
> To begin, I've started working on a patch to change many occurrences of
> 'scm_from_locale_string' to 'scm_from_utf8_string', in cases where the C
> string clearly originates from a C string literal.
>

Thanks for the elaboration here, that's helpful for me.

Thanks,
- Tom