unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
* bug#35785: guix won't download if locale is set to swedish
@ 2019-05-17 20:03 Einar Largenius
  2019-05-18 11:55 ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Einar Largenius @ 2019-05-17 20:03 UTC (permalink / raw)
  To: 35785

Hello.

I just downloaded guix and installed it. In my config I have this line:

    (locale "sv_SE.utf8")

If I run 'guix pull' I get the error:

    guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"

The part in swedish means "file or directory does not exist".

'LANG= guix pull' works without issue.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: guix won't download if locale is set to swedish
  2019-05-17 20:03 bug#35785: guix won't download if locale is set to swedish Einar Largenius
@ 2019-05-18 11:55 ` Ludovic Courtès
  2019-05-19 17:45   ` Einar Largenius
  0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2019-05-18 11:55 UTC (permalink / raw)
  To: Einar Largenius; +Cc: 35785

Hello Einar,

Einar Largenius <einar.largenius@gmail.com> skribis:

> I just downloaded guix and installed it. In my config I have this line:
>
>     (locale "sv_SE.utf8")
>
> If I run 'guix pull' I get the error:
>
>     guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"
>
> The part in swedish means "file or directory does not exist".

Could you paste the complete output of ‘guix pull -v2’ when running
under that locale?

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: guix won't download if locale is set to swedish
  2019-05-18 11:55 ` Ludovic Courtès
@ 2019-05-19 17:45   ` Einar Largenius
  2019-05-20  8:20     ` Ludovic Courtès
  2019-05-20  9:14     ` bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’ Ludovic Courtès
  0 siblings, 2 replies; 13+ messages in thread
From: Einar Largenius @ 2019-05-19 17:45 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 35785


> Could you paste the complete output of ‘guix pull -v2’ when running
> under that locale?

Yes sorry. I have not setup email yet on that system so I need to
manually transcribe any output. This should be the complete output:

    Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
    Building from this channel:
      guix   https://git.savannah.gnu.org/git/guix.git  f5557bd
    guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: guix won't download if locale is set to swedish
  2019-05-19 17:45   ` Einar Largenius
@ 2019-05-20  8:20     ` Ludovic Courtès
  2019-05-20  9:14     ` bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’ Ludovic Courtès
  1 sibling, 0 replies; 13+ messages in thread
From: Ludovic Courtès @ 2019-05-20  8:20 UTC (permalink / raw)
  To: Einar Largenius; +Cc: 35785

Einar Largenius <einar.largenius@gmail.com> skribis:

>> Could you paste the complete output of ‘guix pull -v2’ when running
>> under that locale?
>
> Yes sorry. I have not setup email yet on that system so I need to
> manually transcribe any output. This should be the complete output:
>
>     Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
>     Building from this channel:
>       guix   https://git.savannah.gnu.org/git/guix.git  f5557bd
>     guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi-3.2.1.tar.gz"

I can reproduce it:

--8<---------------cut here---------------start------------->8---
$ export GUIX_LOCPATH=$(guix build glibc-locales)/lib/locale
$ LANGUAGE= LC_ALL=sv_SE.utf8 guix pull -p foo
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
Building from this channel:
  guix      https://git.savannah.gnu.org/git/guix.git	0f469c1
guix pull: error: lstat: Filen eller katalogen finns inte: "ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz"
--8<---------------cut here---------------end--------------->8---

Super weird!

Investigating…

Ludo’.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-05-19 17:45   ` Einar Largenius
  2019-05-20  8:20     ` Ludovic Courtès
@ 2019-05-20  9:14     ` Ludovic Courtès
  2019-05-27 11:05       ` Ricardo Wurmus
  1 sibling, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2019-05-20  9:14 UTC (permalink / raw)
  To: Einar Largenius; +Cc: 35785

Hi!

So the guts of the problem is that Guile’s ‘string->uri’ procedure
behaves incorrectly under that locale:

--8<---------------cut here---------------start------------->8---
$ export GUIX_LOCPATH=$(guix build glibc-locales)/lib/locale
$ LANGUAGE= LC_ALL=sv_SE.utf8 ./pre-inst-env guile
GNU Guile 2.2.4
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> ,use(web uri)
scheme@(guile-user)> (string->uri "ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz")
$1 = #f
--8<---------------cut here---------------end--------------->8---

More specifically, ‘parse-authority’ is failing under that locale,
because of the “w”:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ((@@ (web uri) parse-authority) "//sourceware.org" (const 'fail))
$5 = fail
scheme@(guile-user)> ((@@ (web uri) parse-authority) "//sourcevare.org" (const 'fail))
$6 = #f
$7 = "sourcevare.org"
$8 = #f
--8<---------------cut here---------------end--------------->8---

We can boil it down to this example:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(ice-9 regex)
scheme@(guile-user)> (string-match "[a-z]" "a")
$10 = #("a" (0 . 1))
scheme@(guile-user)> (string-match "[a-z]" "w")
$11 = #f
--8<---------------cut here---------------end--------------->8---

In short, under the sv_SE.utf8 locale of glibc 2.28, “w” is not
considered part of the ‘a-z’ interval.

Indeed, ‘localedata/locales/sv_SE’ in glibc reads this:

  % The letter w is normally not present in the Swedish alphabet. It
  % exists in some names in Swedish and foreign words, but is accounted
  % for as a variant of 'v'.  Words and names with 'w' are in Swedish
  % ordered alphabetically among the words and names with 'v'. If two
  % words or names are only to be distinguished by 'v' or % 'w', 'v' is
  % placed before 'w'.

Using the “lower” regexp class instead of “[a-z]” works:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (string-match "[[:lower:]]" "w")
$12 = #("w" (0 . 1))
--8<---------------cut here---------------end--------------->8---

However, it’s not clear to me whether the “lower” class is supposed to
be the same for all locales or if we’re just lucky:

  http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

Thoughts?

The workaround until we’ve fixed it is to use another locale, though you
can still set “LC_MESSAGES=sv_SE.utf8” or “LANGUAGE=sv”.

Ludo’.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-05-20  9:14     ` bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’ Ludovic Courtès
@ 2019-05-27 11:05       ` Ricardo Wurmus
  2019-05-27 13:39         ` Timothy Sample
  0 siblings, 1 reply; 13+ messages in thread
From: Ricardo Wurmus @ 2019-05-27 11:05 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 35785, Einar Largenius


Ludovic Courtès <ludo@gnu.org> writes:

> Using the “lower” regexp class instead of “[a-z]” works:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
> $12 = #("w" (0 . 1))
> --8<---------------cut here---------------end--------------->8---
>
> However, it’s not clear to me whether the “lower” class is supposed to
> be the same for all locales or if we’re just lucky:
>
>   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>
> Thoughts?

The lower class is much larger than [a-z].  If we only wanted to work
around this particular problem we could explicitly spell out the range,
which would be the same in all locales.  (Obviously, that wouldn’t be
pretty.)

But can’t URI parts contain more than those characters?  To circumvent
the question whether the lower class is locale dependent we could
generate an explicit range from a charset.

--
Ricardo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-05-27 11:05       ` Ricardo Wurmus
@ 2019-05-27 13:39         ` Timothy Sample
  2019-05-28 11:17           ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Timothy Sample @ 2019-05-27 13:39 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: 35785, Einar Largenius

Hello,

Ricardo Wurmus <rekado@elephly.net> writes:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Using the “lower” regexp class instead of “[a-z]” works:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
>> $12 = #("w" (0 . 1))
>> --8<---------------cut here---------------end--------------->8---
>>
>> However, it’s not clear to me whether the “lower” class is supposed to
>> be the same for all locales or if we’re just lucky:
>>
>>   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>>
>> Thoughts?
>
> The lower class is much larger than [a-z].  If we only wanted to work
> around this particular problem we could explicitly spell out the range,
> which would be the same in all locales.  (Obviously, that wouldn’t be
> pretty.)

I think that explicitly spelling out the range is the right thing to do
here.  The POSIX spec says that character ranges work in the POSIX
locale, but “in other locales, a range expression has unspecified
behavior.”

> But can’t URI parts contain more than those characters?

A quick reading of RFC 3986 suggests that the host part of a URI can be
an IP address (version 4 or 6) or a registered name.  It gives the
following rules for registered names:

reg-name      = *( unreserved / pct-encoded / sub-delims )
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
just the ASCII ranges you might expect (except for that “HEXDIG” only
allows uppercase letters).

It looks like Guile is currently a little stricter than this, but pretty
close (if you take the character ranges to mean ASCII ranges).

> To circumvent
> the question whether the lower class is locale dependent we could
> generate an explicit range from a charset.

I think this is the right approach.  Using “[:lower:]” would allow
things outside of the RFC, like ‘é’.  Adding support for
internationalized domain names using Punycode would be cool, but well
outside the scope of this bug.  :)


-- Tim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-05-27 13:39         ` Timothy Sample
@ 2019-05-28 11:17           ` Ludovic Courtès
  2019-06-03  0:39             ` Timothy Sample
  0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2019-05-28 11:17 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 35785, Einar Largenius

Hi Timothy,

Timothy Sample <samplet@ngyro.com> skribis:

> A quick reading of RFC 3986 suggests that the host part of a URI can be
> an IP address (version 4 or 6) or a registered name.  It gives the
> following rules for registered names:
>
> reg-name      = *( unreserved / pct-encoded / sub-delims )
> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> pct-encoded   = "%" HEXDIG HEXDIG
> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>               / "*" / "+" / "," / ";" / "="
>
> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
> just the ASCII ranges you might expect (except for that “HEXDIG” only
> allows uppercase letters).

Do you think you could turn that into a patch for Guile?  I’d happily
apply it.  :-)

It looks like both [[:alnum:]] & co. and ranges would be
locale-dependent, so my understanding is that we’ll have to list all the
characters explicitly, right?

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-05-28 11:17           ` Ludovic Courtès
@ 2019-06-03  0:39             ` Timothy Sample
  2019-06-03 13:01               ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Timothy Sample @ 2019-06-03  0:39 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 35785, Einar Largenius

[-- Attachment #1: Type: text/plain, Size: 2508 bytes --]

Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi Timothy,
>
> Timothy Sample <samplet@ngyro.com> skribis:
>
>> A quick reading of RFC 3986 suggests that the host part of a URI can be
>> an IP address (version 4 or 6) or a registered name.  It gives the
>> following rules for registered names:
>>
>> reg-name      = *( unreserved / pct-encoded / sub-delims )
>> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
>> pct-encoded   = "%" HEXDIG HEXDIG
>> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>>               / "*" / "+" / "," / ";" / "="
>>
>> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
>> just the ASCII ranges you might expect (except for that “HEXDIG” only
>> allows uppercase letters).
>
> Do you think you could turn that into a patch for Guile?  I’d happily
> apply it.  :-)
>
> It looks like both [[:alnum:]] & co. and ranges would be
> locale-dependent, so my understanding is that we’ll have to list all the
> characters explicitly, right?

Here’s a patch for Guile that uses explicit lists of characters in the
‘(web uri)’ module instead of character ranges.  It includes two tests
that are pretty verbose, but seem to do the trick.

I have a bit more background on the problem, mostly coming from a Glibc
bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.

It turns out that it is well-known upstream, and avoiding character
ranges is the recommended approach for know.  Some other GNU tools have
adopted what is being called the “Rational Range Interpretation”
<https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
AIUI, this means they use the underlying encoding numbers for ranges (I
checked the source, but I’m only mostly sure I read it right).  It looks
like the Glibc folks are unsure how to proceed on this (but are maybe
slightly leaning towards the “rational” approach).

It’s all a pretty big mess, really.  I was hoping there would be some
obvious thing that would fix the problem more generally.  Short of
pulling in the Gnulib regex code or writing something in Scheme, it
looks like Guile is stuck where it is now.

I’m unsure if the changes are considered “trivial” from a copyright
perspective.  It’s pretty close, but I think programmers tend to
underestimate here.  I’ve started the FSF copyright assignment process
either way, since is likely not my last Guile patch.  :)


-- Tim


[-- Attachment #2: patch --]
[-- Type: text/x-patch, Size: 5559 bytes --]

From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
From: Timothy Sample <samplet@ngyro.com>
Date: Sun, 2 Jun 2019 14:41:20 -0400
Subject: [PATCH] Make URI handling locale independent.

Fixes <https://bugs.gnu.org/35785>.

* module/web/uri.scm (digits, hex-digits, letters): New variables.
(ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
list each character instead of using character ranges.
* test-suite/tests/web-uri.test: Add corresponding tests.
---
 module/web/uri.scm            | 31 +++++++++++++++++++++----------
 test-suite/tests/web-uri.test | 29 ++++++++++++++++++++++++++---
 2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/module/web/uri.scm b/module/web/uri.scm
index 4c6fa5051..b4b89b9cc 100644
--- a/module/web/uri.scm
+++ b/module/web/uri.scm
@@ -1,6 +1,6 @@
 ;;;; (web uri) --- URI manipulation tools
 ;;;;
-;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014 Free Software Foundation, Inc.
+;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014,2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -175,17 +175,28 @@ for ‘build-uri’ except there is no scheme."
 ;;; Converters.
 ;;;
 
+;; Since character ranges in regular expressions may depend on the
+;; current locale, we use explicit lists of characters instead.  See
+;; <https://bugs.gnu.org/35785> for details.
+(define digits "0123456789")
+(define hex-digits "0123456789ABCDEFabcdef")
+(define letters "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
+
 ;; See RFC 3986 #3.2.2 for comments on percent-encodings, IDNA (RFC
 ;; 3490), and non-ASCII host names.
 ;;
 (define ipv4-regexp
-  (make-regexp "^([0-9.]+)$"))
+  (make-regexp (string-append "^([" digits ".]+)$")))
 (define ipv6-regexp
-  (make-regexp "^([0-9a-fA-F:.]+)$"))
+  (make-regexp (string-append "^([" hex-digits ":.]+)$")))
 (define domain-label-regexp
-  (make-regexp "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters digits "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 (define top-label-regexp
-  (make-regexp "^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 
 (define (valid-host? host)
   (cond
@@ -203,13 +214,13 @@ for ‘build-uri’ except there is no scheme."
             (regexp-exec top-label-regexp host start)))))))
 
 (define userinfo-pat
-  "[a-zA-Z0-9_.!~*'();:&=+$,-]+")
+  (string-append "[" letters digits "_.!~*'();:&=+$,-]+"))
 (define host-pat
-  "[a-zA-Z0-9.-]+")
+  (string-append "[" letters digits ".-]+"))
 (define ipv6-host-pat
-  "[0-9a-fA-F:.]+")
+  (string-append "[" hex-digits ":.]+"))
 (define port-pat
-  "[0-9]*")
+  (string-append "[" digits "]*"))
 (define authority-regexp
   (make-regexp
    (format #f "^//((~a)@)?((~a)|(\\[(~a)\\]))(:(~a))?$"
@@ -246,7 +257,7 @@ for ‘build-uri’ except there is no scheme."
 ;;;   either.
 
 (define scheme-pat
-  "[a-zA-Z][a-zA-Z0-9+.-]*")
+  (string-append "[" letters "][" letters digits "+.-]*"))
 (define authority-pat
   "[^/?#]*")
 (define path-pat
diff --git a/test-suite/tests/web-uri.test b/test-suite/tests/web-uri.test
index 73391898c..ef8e85eba 100644
--- a/test-suite/tests/web-uri.test
+++ b/test-suite/tests/web-uri.test
@@ -1,6 +1,6 @@
 ;;;; web-uri.test --- URI library          -*- mode: scheme; coding: utf-8; -*-
 ;;;;
-;;;; 	Copyright (C) 2010-2012, 2014, 2017 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010-2012, 2014, 2017, 2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -121,7 +121,18 @@
 
   (pass-if-uri-exception "http://foo@"
                          "Expected.*host"
-                         (build-uri 'http #:userinfo "foo")))
+                         (build-uri 'http #:userinfo "foo"))
+
+  (pass-if-uri-exception "http://illégal.com"
+                         "Expected.*host"
+                         (dynamic-wind
+                           (lambda () #t)
+                           (lambda ()
+                             (with-locale "en_US.utf8"
+                               (reload-module (resolve-module '(web uri)))
+                               (build-uri 'http #:host "illégal.com")))
+                           (lambda ()
+                             (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "build-uri-reference"
   (pass-if "//host/etc/foo"
@@ -290,7 +301,19 @@
            #:port 100
            #:path "/"
            #:query "q"
-           #:fragment "bar")))
+           #:fragment "bar"))
+
+  ;; bug #35785
+  (pass-if "http://www.example.com (sv_SE)"
+    (dynamic-wind
+      (lambda () #t)
+      (lambda ()
+        (with-locale "sv_SE.utf8"
+          (reload-module (resolve-module '(web uri)))
+          (uri=? (string->uri "http://www.example.com")
+                 #:scheme 'http #:host "www.example.com" #:path "")))
+      (lambda ()
+        (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "string->uri-reference"
   (pass-if "/foo"
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-06-03  0:39             ` Timothy Sample
@ 2019-06-03 13:01               ` Ludovic Courtès
  2019-06-03 14:24                 ` Timothy Sample
  0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2019-06-03 13:01 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 35785, Einar Largenius

Hi Timothy,

Timothy Sample <samplet@ngyro.com> skribis:

> Here’s a patch for Guile that uses explicit lists of characters in the
> ‘(web uri)’ module instead of character ranges.  It includes two tests
> that are pretty verbose, but seem to do the trick.
>
> I have a bit more background on the problem, mostly coming from a Glibc
> bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.
>
> It turns out that it is well-known upstream, and avoiding character
> ranges is the recommended approach for know.  Some other GNU tools have
> adopted what is being called the “Rational Range Interpretation”
> <https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
> AIUI, this means they use the underlying encoding numbers for ranges (I
> checked the source, but I’m only mostly sure I read it right).  It looks
> like the Glibc folks are unsure how to proceed on this (but are maybe
> slightly leaning towards the “rational” approach).

Great that you gleaned good references on this topic!

> It’s all a pretty big mess, really.  I was hoping there would be some
> obvious thing that would fix the problem more generally.  Short of
> pulling in the Gnulib regex code or writing something in Scheme, it
> looks like Guile is stuck where it is now.

Yeah.  The alternative would be to not use regexps in this context, I
guess.

> I’m unsure if the changes are considered “trivial” from a copyright
> perspective.  It’s pretty close, but I think programmers tend to
> underestimate here.  I’ve started the FSF copyright assignment process
> either way, since is likely not my last Guile patch.  :)

If the process is already underway, I think it’s fine to commit this
patch (I would rather wait if it were longer and/or if we didn’t know
each other already).

> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
> From: Timothy Sample <samplet@ngyro.com>
> Date: Sun, 2 Jun 2019 14:41:20 -0400
> Subject: [PATCH] Make URI handling locale independent.
>
> Fixes <https://bugs.gnu.org/35785>.
>
> * module/web/uri.scm (digits, hex-digits, letters): New variables.
> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
> list each character instead of using character ranges.
> * test-suite/tests/web-uri.test: Add corresponding tests.

[...]

> +  (pass-if "http://www.example.com (sv_SE)"
> +    (dynamic-wind
> +      (lambda () #t)
> +      (lambda ()
> +        (with-locale "sv_SE.utf8"
> +          (reload-module (resolve-module '(web uri)))
> +          (uri=? (string->uri "http://www.example.com")
> +                 #:scheme 'http #:host "www.example.com" #:path "")))

Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
the other test)?

For the sv_SE test, what about taking a host name with a ‘w’, since
that’s the use case that allowed us to uncover this bug?

Apart from that it LGTM, thank you!

Ludo’.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-06-03 13:01               ` Ludovic Courtès
@ 2019-06-03 14:24                 ` Timothy Sample
  2019-06-04  7:42                   ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Timothy Sample @ 2019-06-03 14:24 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 35785, Einar Largenius

Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi Timothy,
>
> Timothy Sample <samplet@ngyro.com> skribis:
>
>> Here’s a patch for Guile that uses explicit lists of characters in the
>> ‘(web uri)’ module instead of character ranges.  It includes two tests
>> that are pretty verbose, but seem to do the trick.
>>
>> I have a bit more background on the problem, mostly coming from a Glibc
>> bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.
>>
>> It turns out that it is well-known upstream, and avoiding character
>> ranges is the recommended approach for know.  Some other GNU tools have
>> adopted what is being called the “Rational Range Interpretation”
>> <https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
>> AIUI, this means they use the underlying encoding numbers for ranges (I
>> checked the source, but I’m only mostly sure I read it right).  It looks
>> like the Glibc folks are unsure how to proceed on this (but are maybe
>> slightly leaning towards the “rational” approach).
>
> Great that you gleaned good references on this topic!
>
>> It’s all a pretty big mess, really.  I was hoping there would be some
>> obvious thing that would fix the problem more generally.  Short of
>> pulling in the Gnulib regex code or writing something in Scheme, it
>> looks like Guile is stuck where it is now.
>
> Yeah.  The alternative would be to not use regexps in this context, I
> guess.

I meant fixing regexes in other contexts, since I’m sure the URI module
is not the only Guile code ever that assumed “[a-z]” would only match
ASCII lowercase letters.

>> I’m unsure if the changes are considered “trivial” from a copyright
>> perspective.  It’s pretty close, but I think programmers tend to
>> underestimate here.  I’ve started the FSF copyright assignment process
>> either way, since is likely not my last Guile patch.  :)
>
> If the process is already underway, I think it’s fine to commit this
> patch (I would rather wait if it were longer and/or if we didn’t know
> each other already).

Sounds good!

>> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
>> From: Timothy Sample <samplet@ngyro.com>
>> Date: Sun, 2 Jun 2019 14:41:20 -0400
>> Subject: [PATCH] Make URI handling locale independent.
>>
>> Fixes <https://bugs.gnu.org/35785>.
>>
>> * module/web/uri.scm (digits, hex-digits, letters): New variables.
>> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
>> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
>> list each character instead of using character ranges.
>> * test-suite/tests/web-uri.test: Add corresponding tests.
>
> [...]
>
>> +  (pass-if "http://www.example.com (sv_SE)"
>> +    (dynamic-wind
>> +      (lambda () #t)
>> +      (lambda ()
>> +        (with-locale "sv_SE.utf8"
>> +          (reload-module (resolve-module '(web uri)))
>> +          (uri=? (string->uri "http://www.example.com")
>> +                 #:scheme 'http #:host "www.example.com" #:path "")))
>
> Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
> the other test)?

I needed to reload the modules like that to make the tests fail without
the patch and pass with it.  My understanding is that the bug happens
at regex compile time, which happens when the module is loaded.  If I
don’t reload the module, the old URI code passes the tests, since the
regexes were compiled with a locale that does not trigger the bug.  It’s
a little wacky, sure, but it was the best idea I could come up with.

> For the sv_SE test, what about taking a host name with a ‘w’, since
> that’s the use case that allowed us to uncover this bug?

I thought I was being clever by using a “www” hostname, but apparently
it’s so normalized as to be invisible!  Feel free to change it to
something more obvious like “w.com” or whatever.


-- Tim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-06-03 14:24                 ` Timothy Sample
@ 2019-06-04  7:42                   ` Ludovic Courtès
  2019-06-04 13:56                     ` Timothy Sample
  0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2019-06-04  7:42 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 35785, Einar Largenius

Hello,

Timothy Sample <samplet@ngyro.com> skribis:

>>> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
>>> From: Timothy Sample <samplet@ngyro.com>
>>> Date: Sun, 2 Jun 2019 14:41:20 -0400
>>> Subject: [PATCH] Make URI handling locale independent.
>>>
>>> Fixes <https://bugs.gnu.org/35785>.
>>>
>>> * module/web/uri.scm (digits, hex-digits, letters): New variables.
>>> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
>>> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
>>> list each character instead of using character ranges.
>>> * test-suite/tests/web-uri.test: Add corresponding tests.
>>
>> [...]
>>
>>> +  (pass-if "http://www.example.com (sv_SE)"
>>> +    (dynamic-wind
>>> +      (lambda () #t)
>>> +      (lambda ()
>>> +        (with-locale "sv_SE.utf8"
>>> +          (reload-module (resolve-module '(web uri)))
>>> +          (uri=? (string->uri "http://www.example.com")
>>> +                 #:scheme 'http #:host "www.example.com" #:path "")))
>>
>> Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
>> the other test)?
>
> I needed to reload the modules like that to make the tests fail without
> the patch and pass with it.  My understanding is that the bug happens
> at regex compile time, which happens when the module is loaded.  If I
> don’t reload the module, the old URI code passes the tests, since the
> regexes were compiled with a locale that does not trigger the bug.  It’s
> a little wacky, sure, but it was the best idea I could come up with.

Oooh, I see.  Could you add a comment to explain this?  Then we’re done.

>> For the sv_SE test, what about taking a host name with a ‘w’, since
>> that’s the use case that allowed us to uncover this bug?
>
> I thought I was being clever by using a “www” hostname, but apparently
> it’s so normalized as to be invisible!  Feel free to change it to
> something more obvious like “w.com” or whatever.

Silly me, I guess I need new glasses.  :-)

Thanks!

Ludo’.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
  2019-06-04  7:42                   ` Ludovic Courtès
@ 2019-06-04 13:56                     ` Timothy Sample
  0 siblings, 0 replies; 13+ messages in thread
From: Timothy Sample @ 2019-06-04 13:56 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 35785, Einar Largenius

[-- Attachment #1: Type: text/plain, Size: 695 bytes --]

Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Timothy Sample <samplet@ngyro.com> skribis:
>
> [...]
>
>> I needed to reload the modules like that to make the tests fail without
>> the patch and pass with it.  My understanding is that the bug happens
>> at regex compile time, which happens when the module is loaded.  If I
>> don’t reload the module, the old URI code passes the tests, since the
>> regexes were compiled with a locale that does not trigger the bug.  It’s
>> a little wacky, sure, but it was the best idea I could come up with.
>
> Oooh, I see.  Could you add a comment to explain this?  Then we’re done.

Here it is!  I hope it is clear.


-- Tim


[-- Attachment #2: patch --]
[-- Type: text/x-patch, Size: 5865 bytes --]

From 9ac8643e5315d4baaddb93ee246ba8db0b3448ab Mon Sep 17 00:00:00 2001
From: Timothy Sample <samplet@ngyro.com>
Date: Sun, 2 Jun 2019 14:41:20 -0400
Subject: [PATCH] Make URI handling locale independent.

Fixes <https://bugs.gnu.org/35785>.

* module/web/uri.scm (digits, hex-digits, letters): New variables.
(ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
list each character instead of using character ranges.
* test-suite/tests/web-uri.test: Add corresponding tests.
---
 module/web/uri.scm            | 31 +++++++++++++++++++++----------
 test-suite/tests/web-uri.test | 33 ++++++++++++++++++++++++++++++---
 2 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/module/web/uri.scm b/module/web/uri.scm
index 4c6fa5051..b4b89b9cc 100644
--- a/module/web/uri.scm
+++ b/module/web/uri.scm
@@ -1,6 +1,6 @@
 ;;;; (web uri) --- URI manipulation tools
 ;;;;
-;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014 Free Software Foundation, Inc.
+;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014,2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -175,17 +175,28 @@ for ‘build-uri’ except there is no scheme."
 ;;; Converters.
 ;;;
 
+;; Since character ranges in regular expressions may depend on the
+;; current locale, we use explicit lists of characters instead.  See
+;; <https://bugs.gnu.org/35785> for details.
+(define digits "0123456789")
+(define hex-digits "0123456789ABCDEFabcdef")
+(define letters "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
+
 ;; See RFC 3986 #3.2.2 for comments on percent-encodings, IDNA (RFC
 ;; 3490), and non-ASCII host names.
 ;;
 (define ipv4-regexp
-  (make-regexp "^([0-9.]+)$"))
+  (make-regexp (string-append "^([" digits ".]+)$")))
 (define ipv6-regexp
-  (make-regexp "^([0-9a-fA-F:.]+)$"))
+  (make-regexp (string-append "^([" hex-digits ":.]+)$")))
 (define domain-label-regexp
-  (make-regexp "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters digits "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 (define top-label-regexp
-  (make-regexp "^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 
 (define (valid-host? host)
   (cond
@@ -203,13 +214,13 @@ for ‘build-uri’ except there is no scheme."
             (regexp-exec top-label-regexp host start)))))))
 
 (define userinfo-pat
-  "[a-zA-Z0-9_.!~*'();:&=+$,-]+")
+  (string-append "[" letters digits "_.!~*'();:&=+$,-]+"))
 (define host-pat
-  "[a-zA-Z0-9.-]+")
+  (string-append "[" letters digits ".-]+"))
 (define ipv6-host-pat
-  "[0-9a-fA-F:.]+")
+  (string-append "[" hex-digits ":.]+"))
 (define port-pat
-  "[0-9]*")
+  (string-append "[" digits "]*"))
 (define authority-regexp
   (make-regexp
    (format #f "^//((~a)@)?((~a)|(\\[(~a)\\]))(:(~a))?$"
@@ -246,7 +257,7 @@ for ‘build-uri’ except there is no scheme."
 ;;;   either.
 
 (define scheme-pat
-  "[a-zA-Z][a-zA-Z0-9+.-]*")
+  (string-append "[" letters "][" letters digits "+.-]*"))
 (define authority-pat
   "[^/?#]*")
 (define path-pat
diff --git a/test-suite/tests/web-uri.test b/test-suite/tests/web-uri.test
index 73391898c..94778acac 100644
--- a/test-suite/tests/web-uri.test
+++ b/test-suite/tests/web-uri.test
@@ -1,6 +1,6 @@
 ;;;; web-uri.test --- URI library          -*- mode: scheme; coding: utf-8; -*-
 ;;;;
-;;;; 	Copyright (C) 2010-2012, 2014, 2017 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010-2012, 2014, 2017, 2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -121,7 +121,21 @@
 
   (pass-if-uri-exception "http://foo@"
                          "Expected.*host"
-                         (build-uri 'http #:userinfo "foo")))
+                         (build-uri 'http #:userinfo "foo"))
+
+  ;; In this test, we need to reload the '(web uri)' module with a
+  ;; different locale.  This is because some locale-dependent things
+  ;; (e.g., compiled regexes) are computed when the module is loaded.
+  (pass-if-uri-exception "http://illégal.com"
+                         "Expected.*host"
+                         (dynamic-wind
+                           (lambda () #t)
+                           (lambda ()
+                             (with-locale "en_US.utf8"
+                               (reload-module (resolve-module '(web uri)))
+                               (build-uri 'http #:host "illégal.com")))
+                           (lambda ()
+                             (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "build-uri-reference"
   (pass-if "//host/etc/foo"
@@ -290,7 +304,20 @@
            #:port 100
            #:path "/"
            #:query "q"
-           #:fragment "bar")))
+           #:fragment "bar"))
+
+  ;; This test reproduces bug #35785.  See the 'illégal' test above for
+  ;; why we reload the module.
+  (pass-if "http://www.example.com (sv_SE)"
+    (dynamic-wind
+      (lambda () #t)
+      (lambda ()
+        (with-locale "sv_SE.utf8"
+          (reload-module (resolve-module '(web uri)))
+          (uri=? (string->uri "http://www.example.com")
+                 #:scheme 'http #:host "www.example.com" #:path "")))
+      (lambda ()
+        (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "string->uri-reference"
   (pass-if "/foo"
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-06-04 13:57 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-05-17 20:03 bug#35785: guix won't download if locale is set to swedish Einar Largenius
2019-05-18 11:55 ` Ludovic Courtès
2019-05-19 17:45   ` Einar Largenius
2019-05-20  8:20     ` Ludovic Courtès
2019-05-20  9:14     ` bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’ Ludovic Courtès
2019-05-27 11:05       ` Ricardo Wurmus
2019-05-27 13:39         ` Timothy Sample
2019-05-28 11:17           ` Ludovic Courtès
2019-06-03  0:39             ` Timothy Sample
2019-06-03 13:01               ` Ludovic Courtès
2019-06-03 14:24                 ` Timothy Sample
2019-06-04  7:42                   ` Ludovic Courtès
2019-06-04 13:56                     ` Timothy Sample

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).