* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
From: Linas Vepstas @ 2017-01-08 18:16 UTC
To: 25397

There appears to be a regression in guile-2.2 with utf8 handling
in the scm_puts() scm_lfwrite() and scm_c_put_string() functions.

In guile-2.0, one could give these utf8-encoded strings, and these
would display just fine. In 2.2 they get mangled.

The source of the mangling seems to be an assumption that these three
are being given latin1 strings, which they then attempt to convert to
utf8, thus wrecking the encoding. See, e.g. libguile/ports.c line 3526

Presumably this change was intentional, but I don't understand why;
guile-2.0 seems utf-8 clean, correctly handling utf-8 in essentially
all cases. Why would one want to go back to the bad old days of latin1
and iso-8859-1 for guile 2.2?

I could submit a patch for this, but would it be wanted?

Test case is straight-forward:

    printf("duuude port-encoding is=%s\n",
           scm_to_utf8_string(scm_port_encoding(scm_current_output_port ())));
    scm_puts ("係 拉 丁 字 母", scm_current_output_port ());

which works in guile-2.0 but is garbled in 2.2
* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
From: Andy Wingo @ 2017-01-09 22:03 UTC
To: Linas Vepstas; +Cc: 25397

On Sun 08 Jan 2017 19:16, Linas Vepstas <linasvepstas@gmail.com> writes:

> There appears to be a regression in guile-2.2 with utf8 handling
> in the scm_puts() scm_lfwrite() and scm_c_put_string() functions.
>
> In guile-2.0, one could give these utf8-encoded strings, and these
> would display just fine. In 2.2 they get mangled.

Could it be this from NEWS:

  ** Better locale support in Guile scripts

  When Guile is invoked directly, either from the command line or via a
  hash-bang line (e.g. "#!/usr/bin/guile"), it now installs the current
  locale via a call to `(setlocale LC_ALL "")'. For users with a unicode
  locale, this makes all ports unicode-capable by default, without the
  need to call `setlocale' in your program. This behavior may be
  controlled via the GUILE_INSTALL_LOCALE environment variable; see the
  manual for more.
* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
From: Linas Vepstas @ 2017-01-10 3:34 UTC
To: Andy Wingo; +Cc: 25397

This short C program illustrates the issue. The locale, the output
port etc. are UTF-8. The bad results are no surprise: the code
currently in git for scm_puts etc. explicitly ignores the locale
setting, always, and always assumes latin1 -- it's hard-coded in there.

--linas

#include <libguile.h>

void *wrap_eval(void* p)
{
  char *wtf = "(setlocale LC_ALL \"\")";
  SCM eval_str = scm_from_utf8_string(wtf);
  scm_eval_string(eval_str);
  return NULL;
}

void *wrap_puts(void* p)
{
  char *wtf = p;

  SCM port = scm_current_output_port ();

  scm_puts("the port-encoding is=", port);
  scm_puts(scm_to_utf8_string(scm_port_encoding(port)), port);

  scm_puts("\nThe string to display is =", port);
  scm_puts (wtf, port);

  scm_puts("\nWas expecting to see this=", port);
  SCM str = scm_from_utf8_string(wtf);
  scm_display(str, port);
  scm_puts("\n\n", port);

  return NULL;
}

int main(int argc, char* argv[])
{
  scm_with_guile(wrap_eval, 0x0);

  char * wtf = "Ćićolina";
  scm_with_guile(wrap_puts, wtf);

  wtf = "Thủ Dầu Một";
  scm_with_guile(wrap_puts, wtf);

  wtf = "Småland";
  scm_with_guile(wrap_puts, wtf);

  wtf = "Hòa Phú Phú Tân";
  scm_with_guile(wrap_puts, wtf);

  wtf = "係 拉 丁 字 母";
  scm_with_guile(wrap_puts, wtf);
}

The output is always this:

the port-encoding is=UTF-8
The string to display is =Ćićolina
Was expecting to see this=Ćićolina

the port-encoding is=UTF-8
The string to display is =Thủ Dầu Má»™t
Was expecting to see this=Thủ Dầu Một

the port-encoding is=UTF-8
The string to display is =SmÃ¥land
Was expecting to see this=Småland

the port-encoding is=UTF-8
The string to display is =Hòa Phú Phú Tân
Was expecting to see this=Hòa Phú Phú Tân

the port-encoding is=UTF-8
Was expecting to see this=係 拉 丁 字 母
æ¯
What's cool is that all this stuff works in email!

--linas
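
The "SmÃ¥land" line in the output above is what one would expect if bytes that are already UTF-8 are treated as latin1 code points and each one is encoded to UTF-8 a second time. The following is a minimal sketch of that double encoding in plain C. It is an illustration only, not the libguile code path; the helper name latin1_to_utf8 is hypothetical and does not exist in Guile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libguile): treat every input byte as a
   Latin-1 code point and write its UTF-8 encoding to out.
   out must have room for 2 * strlen(in) + 1 bytes.
   Returns the number of bytes written, not counting the final NUL. */
static size_t latin1_to_utf8(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    unsigned char *q = (unsigned char *)out;

    for (; *p; p++) {
        if (*p < 0x80) {
            *q++ = *p;                               /* ASCII is unchanged */
        } else {
            *q++ = (unsigned char)(0xC0 | (*p >> 6));    /* lead byte */
            *q++ = (unsigned char)(0x80 | (*p & 0x3F));  /* continuation */
        }
    }
    *q = '\0';
    return (size_t)(q - (unsigned char *)out);
}
```

Feeding it the UTF-8 bytes of "Småland" (53 6D C3 A5 6C 61 6E 64) turns the two-byte sequence C3 A5 into the four bytes C3 83 C2 A5, which a UTF-8 terminal shows as "Ã¥". ASCII-only strings pass through untouched, which fits the observation that only the non-ASCII test strings are affected.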
* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
From: Andy Wingo @ 2017-03-01 15:45 UTC
To: Linas Vepstas; +Cc: 25397

On Tue 10 Jan 2017 04:34, Linas Vepstas <linasvepstas@gmail.com> writes:

> void *wrap_puts(void* p)
> {
>   char *wtf = p;
>
>   SCM port = scm_current_output_port ();
>
>   scm_puts("the port-encoding is=", port);
>   scm_puts(scm_to_utf8_string(scm_port_encoding(port)), port);
>
>   scm_puts("\nThe string to display is =", port);
>   scm_puts (wtf, port);
>
>   scm_puts("\nWas expecting to see this=", port);
>   SCM str = scm_from_utf8_string(wtf);
>   scm_display(str, port);
>   scm_puts("\n\n", port);
>
>   return NULL;
> }

So, there are a few questions here. scm_puts and scm_lfwrite are not
documented, so we need to do basic science on them to see what they are
supposed to do.

Firstly, is scm_puts() a textual interface or a binary interface?
I.e. does it write a sequence of characters or a sequence of bytes?

If I look at uses of scm_puts in Guile sources, it seems clear that it's
a textual interface. That is to say, at all points, the intention seems
to be to write characters on a Guile port. All of the uses are of
strings. Please do a "git grep" on your source to see if your
perceptions correspond.

Now the question is, what encoding is the argument in? If the port is
UTF-16, that byte string should be decoded to characters, and that
character sequence encoded to UTF-16.

All of the scm_puts calls in Guile are of one-byte characters with
codepoints less than 128, so when doing some port refactoring I chose to
interpret the argument as latin1.

FTR, in Guile 2.0, this was effectively a binary interface. Guile 2.0's
scm_lfwrite interpreted the incoming bytes as ISO-8859-1 codepoints for
the purposes of updating line and column, but scm_puts and scm_lfwrite
just wrote out the bytes to the port directly, regardless of the
encoding. That was the wrong thing.

Are you arguing that the byte string given to scm_puts should be decoded
from UTF-8? That would be OK.

Andy
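
To make the "decode to characters, then encode for the port" step concrete, here is a rough sketch in plain C of what it would mean if the argument were taken to be UTF-8 and the port were UTF-16. It does not use libguile and is not how Guile implements ports; the names utf8_decode_one and utf8_to_utf16 are hypothetical, and the decoder assumes well-formed UTF-8 input with no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a libguile function: decode one UTF-8 sequence
   starting at s into *cp and return the number of bytes consumed.
   Assumes well-formed input; continuation bytes are not validated. */
static size_t utf8_decode_one(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18)
        | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6)
        | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Decode a NUL-terminated UTF-8 string to characters and re-encode them as
   UTF-16 code units. out must have room for strlen(in) units.
   Returns the number of code units written; *chars receives the number of
   characters decoded (what a column counter should advance by). */
static size_t utf8_to_utf16(const char *in, uint16_t *out, size_t *chars)
{
    assert(in != NULL && out != NULL && chars != NULL);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    *chars = 0;
    while (*p) {
        uint32_t cp;
        p += utf8_decode_one(p, &cp);
        if (cp >= 0x10000) {
            /* Outside the BMP: emit a surrogate pair. */
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[n++] = (uint16_t)cp;
        }
        (*chars)++;
    }
    return n;
}
```

The character count also shows why treating bytes as ISO-8859-1 code points skews column tracking: "Småland" is 8 bytes in UTF-8 but 7 characters, so a byte-based counter advances the column by one too many.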
* bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string
From: Linas Vepstas @ 2017-03-01 20:18 UTC
To: Andy Wingo; +Cc: 25397@debbugs.gnu.org

In the bad old days, not every thing was documented ...

My use of scm_puts dates back to guile-1.8. I only ever send it utf8.

I can change my code, no problem,... I just thought I'd report a
regression in case .... others are affected.

Linas