bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8

unofficial mirror of bug-guile@gnu.org 

* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
       [not found]                 ` <CAKVAZZJ0v1-snrRQj0CAxFikKDy+Xf_FWxHXV4tVJ1mVrnY8ig@mail.gmail.com>
@ 2019-05-26 20:41                   ` Mark H Weaver
  2019-05-26 20:53                     ` Mark H Weaver
  2019-05-27  0:04                     ` Christopher Lam
  0 siblings, 2 replies; 12+ messages in thread
From: Mark H Weaver @ 2019-05-26 20:41 UTC (permalink / raw)
  To: Christopher Lam; +Cc: 35920

Hi Christopher,

Christopher Lam <christopher.lck@gmail.com> writes:

> Addendum - wish to confirm if guile bug (guile-2.2 on Windows):
> - set locale to non-Anglo so that (setlocale LC_ALL) returns
> "French_France.1252"
> - call (strftime "%B" 4000000) - that's 4x10^6 -- this should return
> "février 1970"
>
> but the following error arises:
> Throw to key `decoding-error' with args `("scm_from_utf8_stringn" "input
> locale conversion error" 0 #vu8(102 233 118 114 105 101 114 32 49 57 55
> 48))'.
>
> Is this a bug?

Yes.  Guile's 'strftime' procedure currently assumes that the underlying
'nstrftime' C function (from Gnulib) will produce output in UTF-8,
although it almost certainly produces output in the locale encoding.
Indeed, the bytevector #vu8(102 233 118 114 105 101 114 32 49 57 55 48)
represents the characters "février 1970" in Windows-1252 encoding.
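
The diagnosis can be checked by hand.  The following is only a minimal
sketch, not Guile code: the helper names is_valid_utf8 and
latin1_to_utf8 are made up for this illustration.  It assumes that
ISO-8859-1 and Windows-1252 agree on the byte in question, which holds
for 0xA0 through 0xFF (the two differ only in 0x80 through 0x9F).  Byte
233 (0xE9) looks like the lead byte of a three-byte UTF-8 sequence, but
the next byte, 118 ('v'), is not a continuation byte, so a strict UTF-8
decoder has to reject the input.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the LEN bytes at S are well-formed UTF-8.  */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;
        unsigned long cp, min;

        if (b < 0x80) { i++; continue; }                 /* ASCII */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                               /* bad lead byte */

        if (extra > len - i - 1) return false;           /* truncated */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false; /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates and out-of-range values.  */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

/* Re-encodes LEN ISO-8859-1 bytes as NUL-terminated UTF-8.  OUT must
   have room for 2 * LEN + 1 bytes.  Returns the number of bytes
   written, not counting the terminator.  */
size_t latin1_to_utf8(const unsigned char *in, size_t len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = (char) in[i];
        } else {
            out[n++] = (char) (0xC0 | (in[i] >> 6));
            out[n++] = (char) (0x80 | (in[i] & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding the twelve reported bytes to is_valid_utf8 yields false, while
latin1_to_utf8 turns them into the thirteen UTF-8 bytes of
"février 1970".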

I'm CC'ing this reply to <bug-guile@gnu.org>, so that a bug ticket will
be created.  In the future, that's the preferred address for sending bug
reports.

Anyway, thanks for letting us know about this.  I'll work on it soon.

      Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-05-26 20:41                   ` bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8 Mark H Weaver
@ 2019-05-26 20:53                     ` Mark H Weaver
  2019-05-26 21:48                       ` Mark H Weaver
  2019-05-27  0:04                     ` Christopher Lam
  1 sibling, 1 reply; 12+ messages in thread
From: Mark H Weaver @ 2019-05-26 20:53 UTC (permalink / raw)
  To: Christopher Lam; +Cc: 35920

There might also be related problems with 'strptime'.  These problems
date back to when Guile was first extended to support non-ASCII strings.
Here's the relevant commit in 2009 that added non-ASCII support to
'strftime' and 'strptime', but did so imperfectly:
587a33556fdef90025c1b7d4d172af649c8ebba8

       Mark

* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-05-26 20:53                     ` Mark H Weaver
@ 2019-05-26 21:48                       ` Mark H Weaver
  2019-06-30 19:51                         ` Ludovic Courtès
  0 siblings, 1 reply; 12+ messages in thread
From: Mark H Weaver @ 2019-05-26 21:48 UTC (permalink / raw)
  To: Christopher Lam; +Cc: 35920

Here's a patch that might fix the problem, but I don't have time to test
it right now.

       Mark


--8<---------------cut here---------------start------------->8---
diff --git a/libguile/stime.c b/libguile/stime.c
index b681d7ee3..9a21b61fe 100644
--- a/libguile/stime.c
+++ b/libguile/stime.c
@@ -662,9 +662,9 @@ SCM_DEFINE (scm_strftime, "strftime", 2, 0, 0,
   SCM_VALIDATE_STRING (1, format);
   bdtime2c (stime, &t, SCM_ARG2, FUNC_NAME);
 
-  /* Convert string to UTF-8 so that non-ASCII characters in the
-     format are passed through unchanged.  */
-  fmt = scm_to_utf8_stringn (format, &len);
+  /* Convert the format string to the locale encoding, as the underlying
+     'strftime' C function expects.  */
+  fmt = scm_to_locale_stringn (format, &len);
 
   /* Ugly hack: strftime can return 0 if its buffer is too small,
      but some valid time strings (e.g. "%p") can sometimes produce
@@ -727,7 +727,7 @@ SCM_DEFINE (scm_strftime, "strftime", 2, 0, 0,
 #endif
     }
 
-  result = scm_from_utf8_string (tbuf + 1);
+  result = scm_from_locale_string (tbuf + 1);
   free (tbuf);
   free (myfmt);
 #if HAVE_STRUCT_TM_TM_ZONE
@@ -754,16 +754,16 @@ SCM_DEFINE (scm_strptime, "strptime", 2, 0, 0,
 {
   struct tm t;
   char *fmt, *str, *rest;
-  size_t used_len;
+  SCM used_len;
   long zoff;
 
   SCM_VALIDATE_STRING (1, format);
   SCM_VALIDATE_STRING (2, string);
 
-  /* Convert strings to UTF-8 so that non-ASCII characters are passed
-     through unchanged.  */
-  fmt = scm_to_utf8_string (format);
-  str = scm_to_utf8_string (string);
+  /* Convert strings to the locale encoding, as the underlying
+     'strptime' C function expects.  */
+  fmt = scm_to_locale_string (format);
+  str = scm_to_locale_string (string);
 
   /* initialize the struct tm */
 #define tm_init(field) t.field = 0
@@ -807,14 +807,14 @@ SCM_DEFINE (scm_strptime, "strptime", 2, 0, 0,
   zoff = 0;
 #endif
 
-  /* Compute the number of UTF-8 characters.  */
-  used_len = u8_strnlen ((scm_t_uint8*) str, rest-str);
+  /* Compute the number of characters parsed.  */
+  used_len = scm_string_length (scm_from_locale_stringn (str, rest-str));
   scm_remember_upto_here_2 (format, string);
   free (str);
   free (fmt);
 
   return scm_cons (filltime (&t, zoff, NULL),
-		   scm_from_signed_integer (used_len));
+                   used_len);
 }
 #undef FUNC_NAME
 #endif /* HAVE_STRPTIME */
--8<---------------cut here---------------end--------------->8---
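
The last hunk matters because 'strptime' reports how much input it
consumed as a character count, while the C function only yields a byte
offset (rest - str).  Once the string is in the locale encoding, that
offset has to be decoded with the locale's rules rather than UTF-8's.
The patch does this through Guile's own string conversion; the sketch
below shows the same idea with the standard mbrlen function instead.
The name locale_char_count is hypothetical, and the result depends on
whatever LC_CTYPE locale the caller has selected with setlocale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counts the characters contained in the first NBYTES bytes of S,
   decoding them according to the current LC_CTYPE locale.  Returns
   (size_t) -1 if the bytes contain an invalid or incomplete
   multibyte sequence.  */
size_t locale_char_count(const char *s, size_t nbytes)
{
    mbstate_t state;
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* initial shift state */
    while (nbytes > 0) {
        size_t r = mbrlen(s, nbytes, &state);
        if (r == (size_t) -1 || r == (size_t) -2)
            return (size_t) -1;         /* invalid or truncated */
        if (r == 0)
            r = 1;                      /* an embedded NUL is one character */
        s += r;
        nbytes -= r;
        count++;
    }
    return count;
}
```

For pure ASCII input the character count equals the byte count in any
locale; the two diverge only when the parsed prefix contains multibyte
characters.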





* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-05-26 20:41                   ` bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8 Mark H Weaver
  2019-05-26 20:53                     ` Mark H Weaver
@ 2019-05-27  0:04                     ` Christopher Lam
  1 sibling, 0 replies; 12+ messages in thread
From: Christopher Lam @ 2019-05-27  0:04 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 35920

Thanks! I'm glad to know this. I have adequate fluency in guile now but
very basic C hence some bugs are very opaque to me.


* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-05-26 21:48                       ` Mark H Weaver
@ 2019-06-30 19:51                         ` Ludovic Courtès
  2019-06-30 21:12                           ` Mark H Weaver
  0 siblings, 1 reply; 12+ messages in thread
From: Ludovic Courtès @ 2019-06-30 19:51 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 35920-done, Christopher Lam

Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> Here's a patch that might fix the problem, but I don't have time to test
> it right now.

It works! :-)  I wrote tests and pushed it as
ab2fd70ef1e36c6532128b73082809ef3c056556.

I forgot to change the commit author to you before pushing, apologies!

Thanks,
Ludo’.





* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-06-30 19:51                         ` Ludovic Courtès
@ 2019-06-30 21:12                           ` Mark H Weaver
  2019-06-30 22:37                             ` John Cowan
  2019-07-02  9:07                             ` Ludovic Courtès
  0 siblings, 2 replies; 12+ messages in thread
From: Mark H Weaver @ 2019-06-30 21:12 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 35920, Christopher Lam

reopen 35920
thanks

Hi Ludovic,

> Mark H Weaver <mhw@netris.org> skribis:
>
>> Here's a patch that might fix the problem, but I don't have time to test
>> it right now.
>
> It works! :-)  I wrote tests and pushed it as
> ab2fd70ef1e36c6532128b73082809ef3c056556.

On my system, I found that my proposed patch caused one of the existing
tests to fail.  The problem is that if the format string includes
characters that are not representable in the current locale encoding, it
will fail.  It seems to me that this could break existing code that
currently works.  User code that uses 'strftime' might never encode the
resulting string in the locale encoding.

I was planning to rewrite the code to scan for the '%' escapes
ourselves, to call 'strftime' for each escape sequence (without
including the surrounding text), and to concatenate the results.
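
A rough sketch of that plan, in plain C rather than Guile's internals,
might look as follows.  The function name format_piecewise and the
128-byte scratch buffer are assumptions made for this illustration.
Only '%' followed by an optional E or O modifier and one conversion
character is recognised; flags and field widths (GNU extensions such as
"%-d") are not handled.  Literal text is copied through byte for byte,
so it never needs to be representable in the locale encoding; in a real
fix, each directive's output would additionally be converted from the
locale encoding before being concatenated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Expands FMT into OUT (capacity CAP bytes), copying literal text
   unchanged and calling strftime() once per directive.  Returns the
   length of the result, or (size_t) -1 if it does not fit or FMT ends
   in a dangling '%'.  */
size_t format_piecewise(char *out, size_t cap, const char *fmt,
                        const struct tm *tm)
{
    size_t used = 0;

    assert(out != NULL && fmt != NULL && tm != NULL);
    if (cap == 0)
        return (size_t) -1;

    while (*fmt != '\0') {
        if (*fmt != '%') {              /* literal byte: pass through */
            if (used + 1 >= cap)
                return (size_t) -1;
            out[used++] = *fmt++;
            continue;
        }

        /* Collect one directive: '%', optional E/O, conversion char.  */
        char spec[4];
        size_t n = 0;
        spec[n++] = *fmt++;
        if (*fmt == 'E' || *fmt == 'O')
            spec[n++] = *fmt++;
        if (*fmt == '\0')
            return (size_t) -1;         /* dangling '%' */
        spec[n++] = *fmt++;
        spec[n] = '\0';

        /* strftime returns 0 both on overflow and for directives that
           legitimately expand to nothing; this sketch treats 0 as
           "empty output".  */
        char piece[128];
        size_t len = strftime(piece, sizeof piece, spec, tm);
        if (used + len >= cap)
            return (size_t) -1;
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

With this split, a format such as "café %B %Y" only ever hands "%B" and
"%Y" to the C library, so the 'é' in the surrounding text cannot cause
an encoding failure.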

> I forgot to change the commit author to you before pushing, apologies!

No worries.  Thanks for working on it.

      Mark

* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-06-30 21:12                           ` Mark H Weaver
@ 2019-06-30 22:37                             ` John Cowan
  2019-06-30 23:06                               ` Mark H Weaver
  2019-07-02  9:07                             ` Ludovic Courtès
  1 sibling, 1 reply; 12+ messages in thread
From: John Cowan @ 2019-06-30 22:37 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 35920, Christopher Lam, Ludovic Courtès

That's a mug's game: I've been there and tried it (not in Scheme). I
recommend writing a strftime in Scheme from scratch.  It's not that hard;
the most annoying thing is getting into the locale files to handle the
locale-sensitive directives (month name, weekday name, AM/PM, and the
ordering of dates).



* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-06-30 22:37                             ` John Cowan
@ 2019-06-30 23:06                               ` Mark H Weaver
  2019-07-01  1:28                                 ` John Cowan
  2019-07-02  8:58                                 ` Ludovic Courtès
  0 siblings, 2 replies; 12+ messages in thread
From: Mark H Weaver @ 2019-06-30 23:06 UTC (permalink / raw)
  To: John Cowan; +Cc: 35920, Christopher Lam, Ludovic Courtès

Hi John,

John Cowan <cowan@ccil.org> writes:

> That's a mug's game: I've been there and tried it (not in Scheme). I
> recommend writing a strftime in Scheme from scratch.  It's not that
> hard; the most annoying thing is getting into the locale files to
> handle the locale-sensitive directives (month name, weekday name,
> AM/PM, and the ordering of dates).

Is there a portable way to find the relevant locale files and interpret
them, on both POSIX and Windows systems?  If so, can you point out the
relevant documentation?

      Thanks,
        Mark





* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-06-30 23:06                               ` Mark H Weaver
@ 2019-07-01  1:28                                 ` John Cowan
  2019-07-02  8:58                                 ` Ludovic Courtès
  1 sibling, 0 replies; 12+ messages in thread
From: John Cowan @ 2019-07-01  1:28 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 35920, Christopher Lam, Ludovic Courtès

On Sun, Jun 30, 2019 at 7:06 PM Mark H Weaver <mhw@netris.org> wrote:

> Is there a portable way to find the relevant locale files and interpret
> them, on both POSIX and Windows systems?  If so, can you point out the
> relevant documentation?
>

Portable in the sense that the information can be obtained on both Posix
and Windows, but not with exactly the same code.

On Posix, you need the nl_langinfo() and nl_langinfo_l() functions from
<langinfo.h>.  These functions are documented at
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/nl_langinfo.html>,
and the constants at
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/langinfo.h.html>.
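
As a small illustration of the Posix side, the sketch below looks up a
month name and the locale's character encoding with nl_langinfo.  The
wrapper names locale_month_name and locale_codeset are made up for this
example; MON_1 through MON_12 and CODESET are the standard <langinfo.h>
constants.  It is Posix-only, and the values returned depend on the
locale the program has selected with setlocale (the "C" locale if it
never calls it).

```c
#include <langinfo.h>
#include <stddef.h>

/* Returns the current locale's full name for MONTH (1..12), or NULL
   if MONTH is out of range.  The returned string may be overwritten by
   later calls to nl_langinfo() or setlocale().  */
const char *locale_month_name(int month)
{
    /* A table is used because Posix does not promise that the MON_*
       constants are consecutive.  */
    static const nl_item items[12] = {
        MON_1, MON_2, MON_3, MON_4,  MON_5,  MON_6,
        MON_7, MON_8, MON_9, MON_10, MON_11, MON_12
    };

    if (month < 1 || month > 12)
        return NULL;
    return nl_langinfo(items[month - 1]);
}

/* Returns the name of the current locale's character encoding.  */
const char *locale_codeset(void)
{
    return nl_langinfo(CODESET);
}
```

The codeset name is what tells a caller how to decode the month name,
which is the piece of information the original 'strftime' wrapper was
missing.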

On Windows, you need to call EnumCalendarInfoExEx if you have dropped
support for Vista and earlier versions, or if not, then follow the links
from the page about it.  The function is documented at
<https://docs.microsoft.com/en-us/windows/desktop/api/Winnls/nf-winnls-enumcalendarinfoexex>,
and the constants that specify particular pieces of information at
<https://docs.microsoft.com/en-us/windows/desktop/Intl/calendar-type-information>.
(I have never used these interfaces myself.)

I hope this is helpful.

John Cowan          http://vrici.lojban.org/~cowan        cowan@ccil.org
Eric Raymond is the Margaret Mead of the Open Source movement.
          --Bruce Perens, a long time ago

* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-06-30 23:06                               ` Mark H Weaver
  2019-07-01  1:28                                 ` John Cowan
@ 2019-07-02  8:58                                 ` Ludovic Courtès
  1 sibling, 0 replies; 12+ messages in thread
From: Ludovic Courtès @ 2019-07-02  8:58 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 35920, Christopher Lam

Hi,

Mark H Weaver <mhw@netris.org> skribis:

> John Cowan <cowan@ccil.org> writes:
>
>> That's a mug's game: I've been there and tried it (not in Scheme). I
>> recommend writing a strftime in Scheme from scratch.  It's not that
>> hard; the most annoying thing is getting into the locale files to
>> handle the locale-sensitive directives (month name, weekday name,
>> AM/PM, and the ordering of dates).
>
> Is there a portable way to find the relevant locale files and interpret
> them, on both POSIX and Windows systems?  If so, can you point out the
> relevant documentation?

The (ice-9 i18n) module provides bindings to nl_langinfo et al.  The
actual data format is specific to the C library, so I think we cannot
portably go deeper than what (ice-9 i18n) does.

Ludo’.





* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-06-30 21:12                           ` Mark H Weaver
  2019-06-30 22:37                             ` John Cowan
@ 2019-07-02  9:07                             ` Ludovic Courtès
  2019-07-02 16:51                               ` John Cowan
  1 sibling, 1 reply; 12+ messages in thread
From: Ludovic Courtès @ 2019-07-02  9:07 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 35920, Christopher Lam

Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

>> Mark H Weaver <mhw@netris.org> skribis:
>>
>>> Here's a patch that might fix the problem, but I don't have time to test
>>> it right now.
>>
>> It works! :-)  I wrote tests and pushed it as
>> ab2fd70ef1e36c6532128b73082809ef3c056556.
>
> On my system, I found that my proposed patch caused one of the existing
> tests to fail.

Which test?  In commit ab2fd70ef1e36c6532128b73082809ef3c056556 I
modified the test that passes \u0100 to run in a UTF-8 locale, on the
grounds that the previous behavior was fragile: “raw bytes” of the input
string would be preserved, but they could be mixed with things like
month names in the current locale encoding.  The result is rather
unpredictable.

> The problem is that if the format string includes characters that are
> not representable in the current locale encoding, it will fail.  It
> seems to me that this could break existing code that currently works.
> User code that uses 'strftime' might never encode the resulting string
> in the locale encoding.

In theory yes, but I cannot think of a scenario where the previous
behavior would be “useful”, because it’s hard to even describe what it
means.

> I was planning to rewrite the code to scan for the '%' escapes
> ourselves, to call 'strftime' for each escape sequence (without
> including the surrounding text), and to concatenate the results.

I think we should deprecate ‘strftime’ and ‘strptime’: (srfi srfi-19)
provides similar functionality, it uses (ice-9 i18n) for the locale
stuff, and it has a better API.

Perhaps something we can do in 3.0?

Thanks,
Ludo’.

* bug#35920: strftime incorrectly assumes that nstrftime will produce UTF-8
  2019-07-02  9:07                             ` Ludovic Courtès
@ 2019-07-02 16:51                               ` John Cowan
  0 siblings, 0 replies; 12+ messages in thread
From: John Cowan @ 2019-07-02 16:51 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 35920, Christopher Lam

On Tue, Jul 2, 2019 at 5:08 AM Ludovic Courtès <ludo@gnu.org> wrote:

> I think we should deprecate ‘strftime’ and ‘strptime’: (srfi srfi-19)
> provides similar functionality, it uses (ice-9 i18n) for the locale
> stuff, and it has a better API.
>

Just a heads-up.  I don't consider SRFI 19 to have a very good API, and I'm
working on a pre-SRFI for dates and times.  There is an outline of it (very
subject to change) at
<https://bitbucket.org/cowan/r7rs-wg1-infra/src/default/TimeAdvancedCowan.md>.
Note that it does not do localization except for timezones, however, so it is
probably not directly relevant.  I'd appreciate review comments at
cowan@ccil.org anyway.  Thanks.

John Cowan          http://vrici.lojban.org/~cowan        cowan@ccil.org
Is a chair finely made tragic or comic? Is the portrait of Mona Lisa
good if I desire to see it? Is the bust of Sir Philip Crampton lyrical,
epical or dramatic?  If a man hacking in fury at a block of wood make
there an image of a cow, is that image a work of art? If not, why not?
                --Stephen Dedalus
