unofficial mirror of bug-guile@gnu.org 
* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
@ 2017-01-08 18:16 Linas Vepstas
  2017-01-09 22:03 ` Andy Wingo
  0 siblings, 1 reply; 5+ messages in thread
From: Linas Vepstas @ 2017-01-08 18:16 UTC (permalink / raw)
  To: 25397

There appears to be a regression in guile-2.2 with utf8 handling
in the scm_puts(), scm_lfwrite() and scm_c_put_string() functions.

In guile-2.0, one could give these utf8-encoded strings, and these
would display just fine.  In 2.2 they get mangled.

The source of the mangling seems to be an assumption that these
three functions are being given latin1 strings, which they then attempt
to convert to utf8, thus wrecking the encoding.  See, e.g.,
libguile/ports.c line 3526.
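
To make the suspected mechanism concrete, here is a minimal sketch in plain C. It is not Guile's code; `latin1_to_utf8` is a hypothetical helper written for illustration. It re-encodes a byte string on the assumption that every byte is a Latin-1 code point, which is harmless for ASCII but turns each byte of a multi-byte UTF-8 sequence into its own two-byte sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only (not part of libguile).
   Re-encode a NUL-terminated byte string as if every byte were a
   Latin-1 code point.  Bytes below 0x80 pass through unchanged; bytes
   0x80..0xFF each become two UTF-8 bytes.  Returns the number of bytes
   written (excluding the terminating NUL), or 0 with an empty string
   if dst is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);
    size_t out = 0;
    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        size_t need = (*p < 0x80) ? 1 : 2;
        if (out + need >= dst_size) {   /* keep room for the NUL */
            dst[0] = '\0';
            return 0;
        }
        if (*p < 0x80) {
            dst[out++] = (char)*p;
        } else {
            dst[out++] = (char)(0xC0 | (*p >> 6));
            dst[out++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    dst[out] = '\0';
    return out;
}
```

Given the UTF-8 bytes C3 A5 (the letter "å"), this produces C3 83 C2 A5, which a UTF-8 terminal shows as "Ã¥". If input is already UTF-8, a conversion of this kind would garble every non-ASCII character in that way.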

Presumably this change was intentional, but I don't understand
why; guile-2.0 seems utf-8 clean, correctly handling utf-8 in
essentially all cases.  Why would one want to go back to the
bad old days of latin1 and iso-8859-1 for guile 2.2?

I could submit a patch for this, but would it be wanted?

The test case is straightforward:

printf("duuude port-encoding is=%s\n",
   scm_to_utf8_string(scm_port_encoding(scm_current_output_port ())));
scm_puts ("係 拉 丁 字 母", scm_current_output_port ());

which works in guile-2.0 but is garbled in 2.2.






* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
  2017-01-08 18:16 bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string Linas Vepstas
@ 2017-01-09 22:03 ` Andy Wingo
  2017-01-10  3:34   ` Linas Vepstas
  0 siblings, 1 reply; 5+ messages in thread
From: Andy Wingo @ 2017-01-09 22:03 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: 25397

On Sun 08 Jan 2017 19:16, Linas Vepstas <linasvepstas@gmail.com> writes:

> There appears to be a regression in guile-2.2 with utf8 handling
> in the scm_puts(), scm_lfwrite() and scm_c_put_string() functions.
>
> In guile-2.0, one could give these utf8-encoded strings, and these
> would display just fine.  In 2.2 they get mangled.

Could it be this from NEWS:

  ** Better locale support in Guile scripts

  When Guile is invoked directly, either from the command line or via a
  hash-bang line (e.g. "#!/usr/bin/guile"), it now installs the current
  locale via a call to `(setlocale LC_ALL "")'.  For users with a unicode
  locale, this makes all ports unicode-capable by default, without the
  need to call `setlocale' in your program.  This behavior may be
  controlled via the GUILE_INSTALL_LOCALE environment variable; see the
  manual for more.
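
For reference, the C-level counterpart of the `(setlocale LC_ALL "")` call in the NEWS entry can be sketched as follows. This uses only the standard `setlocale` and POSIX `nl_langinfo` calls; `install_env_locale` is a hypothetical name, and the fallback to the "C" locale is an assumption of this sketch rather than something Guile is known to do.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Install the locale named by the environment (LANG, LC_ALL, ...) and
   return the name of its character encoding, e.g. "UTF-8".  If the
   environment's locale cannot be installed, fall back to the "C"
   locale (an assumption of this sketch). */
const char *install_env_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        setlocale(LC_ALL, "C");
    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return codeset;
}
```

A program could print the returned codeset at startup to see which encoding the installed locale implies before looking at how its strings are interpreted.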






* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
  2017-01-09 22:03 ` Andy Wingo
@ 2017-01-10  3:34   ` Linas Vepstas
  2017-03-01 15:45     ` Andy Wingo
  0 siblings, 1 reply; 5+ messages in thread
From: Linas Vepstas @ 2017-01-10  3:34 UTC (permalink / raw)
  To: Andy Wingo; +Cc: 25397

This short C program illustrates the issue.  The locale, the output port,
etc. are UTF-8.  The bad results are no surprise: the code currently in git
for scm_puts etc. explicitly ignores the locale setting and always assumes
latin1; it's hard-coded in there.

--linas

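#include <libguile.h>
#include <stdlib.h>

/* Install the user's locale from within Guile. */
void *wrap_eval(void* p)
{
   char *wtf = "(setlocale LC_ALL \"\")";
   SCM eval_str = scm_from_utf8_string(wtf);
   scm_eval_string(eval_str);

   return NULL;
}

/* Print the UTF-8 string p twice: via scm_puts() and via scm_display(). */
void *wrap_puts(void* p)
{
   char *wtf = p;

   SCM port = scm_current_output_port ();

   scm_puts("the port-encoding is=", port);
   /* scm_to_utf8_string() returns malloc'ed memory; free it after use. */
   char *enc = scm_to_utf8_string(scm_port_encoding(port));
   scm_puts(enc, port);
   free(enc);

   scm_puts("\nThe string to display is =", port);
   scm_puts (wtf, port);      /* raw UTF-8 bytes: garbled in 2.2 */

   scm_puts("\nWas expecting to see this=", port);
   SCM str = scm_from_utf8_string(wtf);
   scm_display(str, port);    /* decoded explicitly as UTF-8 */
   scm_puts("\n\n", port);

   return NULL;
}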

int main(int argc, char* argv[])
{
   scm_with_guile(wrap_eval, 0x0);

   char * wtf = "Ćićolina";
   scm_with_guile(wrap_puts, wtf);

   wtf = "Thủ Dầu Một";
   scm_with_guile(wrap_puts, wtf);

   wtf = "Småland";
   scm_with_guile(wrap_puts, wtf);

   wtf = "Hòa Phú Phú Tân";
   scm_with_guile(wrap_puts, wtf);

   wtf = "係 拉 丁 字 母";
   scm_with_guile(wrap_puts, wtf);
}

The output is always this:

the port-encoding is=UTF-8
The string to display is =Ćićolina
Was expecting to see this=Ćićolina

the port-encoding is=UTF-8
The string to display is =Thủ Dầu Một
Was expecting to see this=Thủ Dầu Một

the port-encoding is=UTF-8
The string to display is =Småland
Was expecting to see this=Småland

the port-encoding is=UTF-8
The string to display is =Hòa Phú Phú Tân
Was expecting to see this=Hòa Phú Phú Tân

the port-encoding is=UTF-8
The string to display is =ä¿‚ æ‹‰ ä¸ å­— æ¯
Was expecting to see this=係 拉 丁 字 母


What's cool is that all this stuff works in email!

--linas







* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
  2017-01-10  3:34   ` Linas Vepstas
@ 2017-03-01 15:45     ` Andy Wingo
  2017-03-01 20:18       ` Linas Vepstas
  0 siblings, 1 reply; 5+ messages in thread
From: Andy Wingo @ 2017-03-01 15:45 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: 25397

On Tue 10 Jan 2017 04:34, Linas Vepstas <linasvepstas@gmail.com> writes:

> void *wrap_puts(void* p)
> {
>    char *wtf = p;
>
>    SCM port = scm_current_output_port ();
>
>    scm_puts("the port-encoding is=", port);
>    scm_puts(scm_to_utf8_string(scm_port_encoding(port)), port);
>
>    scm_puts("\nThe string to display is =", port);
>    scm_puts (wtf, port);
>
>    scm_puts("\nWas expecting to see this=", port);
>    SCM str = scm_from_utf8_string(wtf);
>    scm_display(str, port);
>    scm_puts("\n\n", port);
>
>    return NULL;
> }

So, there are a few questions here.  scm_puts and scm_lfwrite are not
documented, so we need to do basic science on them to see what they are
supposed to do.

Firstly, is scm_puts() a textual interface or a binary interface?
I.e. does it write a sequence of characters or a sequence of bytes?

If I look at uses of scm_puts in Guile sources, it seems clear that it's
a textual interface.  That is to say, at all points, the intention seems
to be to write characters on a Guile port.  All of the uses are of
strings.  Please do a "git grep" on your source to see if your
perceptions correspond.

Now the question is, what encoding is the argument in?  If the port is
UTF-16, that byte string should be decoded to characters, and that
character sequence encoded to UTF-16.
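
The decode-then-encode step described here can be sketched in plain C. This is only an illustration under the assumption that the argument is UTF-8; it is not how Guile's ports are implemented, and `utf8_decode` and `utf8_to_utf16` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, for illustration only.
   Decode one UTF-8 sequence starting at s into *cp.  Returns the number
   of bytes consumed, or 0 on malformed input.  Overlong forms and
   surrogate code points are rejected; there is no error recovery. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len;
    uint32_t min;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; min = 0x10000; }
    else return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* also stops at NUL */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Transcode NUL-terminated UTF-8 text into UTF-16 code units.
   Returns the number of code units written, or (size_t)-1 on malformed
   input or insufficient space in dst. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL);
    const unsigned char *p = (const unsigned char *)src;
    size_t out = 0;
    while (*p) {
        uint32_t cp;
        size_t n = utf8_decode(p, &cp);
        if (n == 0) return (size_t)-1;
        p += n;
        if (cp < 0x10000) {
            if (out + 1 > dst_len) return (size_t)-1;
            dst[out++] = (uint16_t)cp;
        } else {                                /* surrogate pair */
            if (out + 2 > dst_len) return (size_t)-1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

The point of the two stages is that the input encoding and the port encoding are independent: the bytes are first turned into code points, and only then written out in whatever encoding the port uses.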

All of the scm_puts calls in Guile are of one-byte characters with
codepoints less than 128, so when doing some port refactoring I chose to
interpret the argument as latin1.
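
The reason the latin1 interpretation was safe for those calls is that bytes below 128 denote the same characters in ASCII, ISO-8859-1 and UTF-8. A small check along these lines (a sketch; `is_pure_ascii` is a hypothetical name) tells whether a given argument falls in that safe subset:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   True if every byte of the NUL-terminated string is below 128.  Such a
   string denotes the same characters whether it is read as ASCII,
   ISO-8859-1 (Latin-1) or UTF-8. */
bool is_pure_ascii(const char *s)
{
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        if (*p >= 0x80)
            return false;
    return true;
}
```

Strings that fail this check are the ones whose output depends on which encoding the function assumes for its argument.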

FTR, in Guile 2.0, this was effectively a binary interface.  Guile 2.0's
scm_lfwrite interpreted the incoming bytes as ISO-8859-1 codepoints for
the purposes of updating line and column, but scm_puts and scm_lfwrite
just wrote out the bytes to the port directly, regardless of the
encoding.  That was the wrong thing.
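
To illustrate why treating the bytes as ISO-8859-1 for line and column tracking is off for UTF-8 input, the sketch below compares the two ways of counting. Both function names are hypothetical, and the UTF-8 variant assumes well-formed input and one column per code point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, for illustration only.
   Column advance if each byte counts as one character, which is what
   treating the input as ISO-8859-1 amounts to. */
size_t columns_as_latin1(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n])
        n++;
    return n;
}

/* Column advance if the input is UTF-8: count code points by skipping
   continuation bytes (10xxxxxx).  Assumes well-formed UTF-8 and one
   column per code point (no wide or combining characters). */
size_t columns_as_utf8(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        if ((*p & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the UTF-8 bytes of "Småland" the first function reports 8 and the second 7, so a port that writes the bytes through unchanged but counts them as Latin-1 drifts by one column for every extra byte.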

Are you arguing that the byte string given to scm_puts should be decoded
from UTF-8?  That would be OK.

Andy






* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
  2017-03-01 15:45     ` Andy Wingo
@ 2017-03-01 20:18       ` Linas Vepstas
  0 siblings, 0 replies; 5+ messages in thread
From: Linas Vepstas @ 2017-03-01 20:18 UTC (permalink / raw)
  To: Andy Wingo; +Cc: 25397@debbugs.gnu.org


In the bad old days, not everything was documented.  My use of scm_puts
dates back to guile-1.8.  I only ever send it utf8.  I can change my code,
no problem; I just thought I'd report a regression in case others
are affected.

Linas

