unofficial mirror of bug-guile@gnu.org 
From: Tom de Vries <tdevries@suse.de>
To: Mark H Weaver <mhw@netris.org>
Cc: 33044@debbugs.gnu.org
Subject: bug#33044: Guile misbehaves in the "ja_JP.sjis" locale
Date: Wed, 17 Oct 2018 01:27:33 +0200
Message-ID: <25e93980-834d-cf91-cabf-77c2bb9a31c5@suse.de>
In-Reply-To: <87y3ayodqp.fsf_-_@netris.org>

On 10/16/18 3:57 AM, Mark H Weaver wrote:
> retitle 33044 Guile misbehaves in the "ja_JP.sjis" locale
> thanks
> 
> Hi Tom,
> 
> Thanks for the report, analysis and patch.  I agree with your analysis,
> and the patch looks good.
> 

If so, can the patch be committed?

I'm running into this problem in the context of gdb, which fails like this:
...
$ LC_CTYPE=ja_JP.sjis gdb
Segmentation fault (core dumped)
...

So gdb (which depends on libguile) crashes during Guile initialization,
even though gdb never actually uses any Guile functionality, and the
patch fixes this.

> However, there's also a much deeper problem here.  You found and fixed
> one occurrence of Guile assuming that the locale encoding is ASCII-
> compatible.  In fact, this assumption is widespread in Guile, and I
> would guess that it's widespread throughout the POSIX world.
> 
> I admit that before I saw your message, I believed that it was
> legitimate to assume that the locale encoding was ASCII-compatible.  Now
> I'm unsure, although I'll note that according to the 'localedef' utility
> from GNU libc, this locale is "not ISO C compliant".  It printed the
> following message when I asked it to generate the "ja_JP.sjis" locale:
> 
>   [warning] character map `SHIFT_JIS' is not ASCII compatible, locale not ISO C compliant [--no-warnings=ascii]
> 
> Shift_JIS is _mostly_ ASCII-compatible, except that code points 0x5C and
> 0x7E, which represent backslash (\) and tilde (~) in ASCII, are mapped
> to the Yen sign (¥) and overline (‾) in Shift_JIS.  Backslash (\) and
> tilde (~) are multibyte characters in Shift_JIS.
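
To make the mapping described above concrete, here is a minimal C sketch
of how the single-byte codes below 0x80 decode under this Shift_JIS
character map. The function name is hypothetical, and the two special
cases are taken only from the description above (0x5C is the Yen sign,
U+00A5; 0x7E is the overline, U+203E). It is not code from Guile or
glibc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode one single-byte code (0x00..0x7F) the way
   the Shift_JIS character map described above does.  Every byte maps to
   the same code point as in ASCII, except 0x5C and 0x7E.  */
static uint32_t
sjis_low_byte_to_codepoint (unsigned char b)
{
  assert (b < 0x80);            /* only the single-byte low range */
  switch (b)
    {
    case 0x5C:
      return 0x00A5;            /* YEN SIGN instead of backslash */
    case 0x7E:
      return 0x203E;            /* OVERLINE instead of tilde */
    default:
      return b;                 /* identical to ASCII */
    }
}
```

A C string literal such as "\\" is stored as the byte 0x5C regardless of
the locale, so decoding it with the locale encoding yields a Yen sign
here rather than a backslash.
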
> 
> One common problem is that Guile often uses 'scm_from_locale_string' to
> create Scheme strings from ASCII-only C string literals.  These should
> all be changed to use either 'scm_from_latin1_string' or
> 'scm_from_utf8_string'.  I prefer the latter because modern C compilers
> typically use UTF-8 as the default execution character set, i.e. the
> character set used to encode string and character constants, regardless
> of the locale settings.  GCC uses UTF-8 by default unless
> -fexec-charset=CHARSET is given at compile time.  I'd prefer to promote
> writing code that works for arbitrary string literals, so that code
> needn't be adjusted if non-ASCII characters are later added.
> 
> A related set of problems is that Guile often applies
> 'scm_from_locale_string' to char* arguments passed in from the user, or
> produced by third-party libraries.  These issues are more difficult to
> address.  We provide several C APIs that accept C strings without
> specifying what encoding is expected.  If the string ultimately derives
> from a C string constant, we probably want UTF-8, whereas if the string
> came from I/O, or program arguments, then we probably want the locale
> encoding.
> 
> For example, consider 'scm_c_eval_string'.  This has been a public API
> function since 2002, but we did not specify the encoding of its C string
> argument until 2011.  We chose the locale encoding in this case, which I
> think is reasonable, but I also expect that code exists in the wild that
> passes a C string literal to 'scm_c_eval_string'.
> 
> Until now, problems like this have been mostly harmless, since the C
> string literals are typically ASCII-only.  However, if we wish to
> support non-ASCII-compatible encodings such as Shift_JIS, we can no
> longer consider these problems harmless.  For example, programs which
> pass C string literals to 'scm_c_eval_string' will fail when using the
> "ja_JP.sjis" locale, if any tildes or backslashes are present.
> Backslashes are fairly common in Scheme code.
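
As a rough illustration of which literals are affected, the following C
sketch checks whether a string would decode identically as ASCII and
under the Shift_JIS mapping described earlier. The function name is
hypothetical, and the check assumes that 0x5C and 0x7E are the only
bytes below 0x80 that differ, as stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check: return true if S is ASCII-only and contains
   neither byte 0x5C (backslash) nor byte 0x7E (tilde), so that decoding
   it in the "ja_JP.sjis" locale gives the same characters as ASCII.  */
static bool
literal_is_sjis_safe (const char *s)
{
  assert (s != NULL);
  for (; *s != '\0'; s++)
    {
      unsigned char b = (unsigned char) *s;
      if (b >= 0x80 || b == 0x5C || b == 0x7E)
        return false;
    }
  return true;
}
```

Under this check, "(+ 1 2)" is unaffected, while anything with a
character literal such as #\a or a string escape such as \n is not.
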
> 
> There's various other code scattered in Guile that assumes ASCII
> characters can be searched for, and sometimes replaced with other ASCII
> characters.  For example, several functions in load.c, including
> 'search_path' and 'load_thunk_from_path', scan through file names in
> the locale encoding, scanning the bytes looking for particular ASCII
> codes such as '.', '/', and '\'.
> 
> On MinGW, 'scm_i_mirror_backslashes' in load.c converts backslashes into
> forward slashes byte-wise, assuming ASCII-compatibility, and this
> transformation is applied to file names in several places.
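
The hazard with byte-wise scanning can be sketched in plain C. This is
not the code from load.c; both function names are hypothetical. The
sketch assumes the usual Shift_JIS lead-byte ranges (0x81..0x9F and
0xE0..0xFC) and treats a lone byte 0x5C as the path separator, as
Windows code page 932 does. In Shift_JIS the katakana character SO is
encoded as 0x83 0x5C, so its second byte looks like a backslash to a
byte-wise scan.

```c
#include <assert.h>
#include <stddef.h>

/* True if B starts a two-byte Shift_JIS character.  */
static int
sjis_is_lead_byte (unsigned char b)
{
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Byte-wise version: replaces every 0x5C byte, including one that is
   the second byte of a two-byte character.  */
static void
mirror_backslashes_bytewise (char *path)
{
  assert (path != NULL);
  for (; *path != '\0'; path++)
    if ((unsigned char) *path == 0x5C)
      *path = '/';
}

/* Lead-byte-aware version: skips both bytes of a two-byte character,
   so only a stand-alone 0x5C is replaced.  */
static void
mirror_backslashes_sjis (char *path)
{
  assert (path != NULL);
  while (*path != '\0')
    {
      unsigned char b = (unsigned char) *path;
      if (sjis_is_lead_byte (b) && path[1] != '\0')
        {
          path += 2;            /* skip lead byte and trail byte */
          continue;
        }
      if (b == 0x5C)
        *path = '/';
      path++;
    }
}
```

Applied to the bytes "dir\" followed by 0x83 0x5C, the byte-wise version
rewrites the trail byte and corrupts the character, while the
lead-byte-aware version changes only the separator.
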
> 
> While looking into this, I also discovered that Guile's S-expression
> reader, i.e. the 'read' procedure, assumes an ASCII-compatible port
> encoding, despite the fact that it is meant to support arbitrary
> encodings such as UTF-16 and UTF-32.  I just filed a related bug
> <https://bug.gnu.org/33057> to track this problem.
> 
> These are some of the problems that I'm currently aware of.  I expect
> that this bug report will remain open for a while.
> 
> To begin, I've started working on a patch to change many occurrences of
> 'scm_from_locale_string' to 'scm_from_utf8_string', in cases where the C
> string clearly originates from a C string literal.
> 

Thanks for the elaboration here, that's helpful for me.

Thanks,
- Tom