unofficial mirror of bug-guile@gnu.org 
 help / color / mirror / Atom feed
From: Mark H Weaver <mhw@netris.org>
To: Tom de Vries <tdevries@suse.de>
Cc: 33044@debbugs.gnu.org
Subject: bug#33044: Guile misbehaves in the "ja_JP.sjis" locale
Date: Mon, 15 Oct 2018 21:57:02 -0400	[thread overview]
Message-ID: <87y3ayodqp.fsf_-_@netris.org> (raw)
In-Reply-To: <8a6a308f-a981-fd46-93d5-c2d2870f4eb4@suse.de> (Tom de Vries's message of "Mon, 15 Oct 2018 20:59:03 +0200")

retitle 33044 Guile misbehaves in the "ja_JP.sjis" locale
thanks

Hi Tom,

Thanks for the report, analysis and patch.  I agree with your analysis,
and the patch looks good.

However, there's also a much deeper problem here.  You found and fixed
one occurrence of Guile assuming that the locale encoding is ASCII-
compatible.  In fact, this assumption is widespread in Guile, and I
would guess that it's widespread throughout the POSIX world.

I admit that before I saw your message, I believed that it was
legitimate to assume that the locale encoding was ASCII-compatible.  Now
I'm unsure, although I'll note that according to the 'localedef' utility
from GNU libc, this locale is "not ISO C compliant".  It printed the
following message when I asked it to generate the "ja_JP.sjis" locale:

  [warning] character map `SHIFT_JIS' is not ASCII compatible, locale not ISO C compliant [--no-warnings=ascii]

Shift_JIS is _mostly_ ASCII-compatible, except that code points 0x5C and
0x7E, which represent backslash (\) and tilde (~) in ASCII, are mapped
to the Yen sign (¥) and overline (‾) in Shift_JIS.  Backslash (\) and
tilde (~) are multibyte characters in Shift_JIS.

One common problem is that Guile often uses 'scm_from_locale_string' to
create Scheme strings from ASCII-only C string literals.  These should
all be changed to use either 'scm_from_latin1_string' or
'scm_from_utf8_string'.  I prefer the latter because modern C compilers
typically use UTF-8 as the default execution character set, i.e. the
character set used to encode string and character constants, regardless
of the locale settings.  GCC uses UTF-8 by default unless
-fexec-charset=CHARSET is given at compile time.  I'd prefer to promote
writing code that works for arbitrary string literals, so that code
needn't be adjusted if non-ASCII characters are later added.

A related set of problems is that Guile often applies
'scm_from_locale_string' to char* arguments passed in from the user, or
produced by third-party libraries.  These issues are more difficult to
address.  We provide several C APIs that accept C strings without
specifying what encoding is expected.  If the string ultimately derives
from a C string constant, we probably want UTF-8, whereas if the string
came from I/O, or program arguments, then we probably want the locale
encoding.

For example, consider 'scm_c_eval_string'.  This has been a public API
function since 2002, but we did not specify the encoding of its C string
argument until 2011.  We chose the locale encoding in this case, which I
think is reasonable, but I also expect that code exists in the wild that
passes a C string literal to 'scm_c_eval_string'.

Until now, problems like this have been mostly harmless, since the C
string literals are typically ASCII-only.  However, if we wish to
support non-ASCII-compatible encodings such as Shift_JIS, we can no
longer consider these problems harmless.  For example, programs which
pass C string literals to 'scm_c_eval_string' will fail when using the
"ja_JP.sjis" locale, if any tildes or backslashes are present.
Backslashes are fairly common in Scheme code.

There's various other code scattered in Guile that assumes ASCII
characters can searched for, and sometimes replaced with other ASCII
characters.  For example, several functions in load.c, including
'search_path', 'load_thunk_from_path' scan through file names in the
locale encoding, scanning the bytes looking for particular ASCII codes
such as '.', '/', and '\'.

On MingW, 'scm_i_mirror_backslashes' in load.c converts backslashes into
forward slashes byte-wise, assuming ASCII-compatibility, and this
transformation is applied to file names in several places.

While looking into this, I also discovered that Guile's S-expression
reader, i.e. the 'read' procedure, assumes an ASCII-compatible port
encoding, despite the fact that it is meant to support arbitrary
encodings such as UTF-16 and UTF-32.  I just filed a related bug
<https://bug.gnu.org/33057> to track this probem.

These are some of the problems that I'm currently aware of.  I expect
that this bug report will remain open for a while.

To begin, I've started working on a patch to change many occurrences of
'scm_from_locale_string' to 'scm_from_utf8_string', in cases where the C
string clearly originates from a C string literal.

Thanks again for the detailed bug report and analysis.

    Regards,
      Mark





  reply	other threads:[~2018-10-16  1:57 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-15  8:44 bug#33044: Invalid read access of chars of wide string in scm_seed_to_random_state Tom de Vries
2018-10-15 14:20 ` bug#33044: Reproduced using guile binary Tom de Vries
2018-10-21 16:24   ` Tom de Vries
2018-10-15 18:59 ` bug#33044: Analysis and proposed patch Tom de Vries
2018-10-16  1:57   ` Mark H Weaver [this message]
2018-10-16  5:13     ` bug#33044: Guile misbehaves in the "ja_JP.sjis" locale Mark H Weaver
2018-10-16 12:52       ` John Cowan
2018-10-16 23:38       ` Tom de Vries
2018-10-17  7:00       ` Tom de Vries
2018-10-16 23:27     ` Tom de Vries
2018-10-18  1:56       ` Mark H Weaver
2018-10-18 10:26         ` Tom de Vries
2018-10-20  2:24         ` Mark H Weaver

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87y3ayodqp.fsf_-_@netris.org \
    --to=mhw@netris.org \
    --cc=33044@debbugs.gnu.org \
    --cc=tdevries@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).