all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Timothy Sample <samplet@ngyro.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 35785@debbugs.gnu.org, Einar Largenius <einar.largenius@gmail.com>
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Sun, 02 Jun 2019 20:39:16 -0400	[thread overview]
Message-ID: <87imtnsdsb.fsf@ngyro.com> (raw)
In-Reply-To: <8736ky3k1w.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Tue, 28 May 2019 13:17:15 +0200")

[-- Attachment #1: Type: text/plain, Size: 2508 bytes --]

Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi Timothy,
>
> Timothy Sample <samplet@ngyro.com> skribis:
>
>> A quick reading of RFC 3986 suggests that the host part of a URI can be
>> an IP address (version 4 or 6) or a registered name.  It gives the
>> following rules for registered names:
>>
>> reg-name      = *( unreserved / pct-encoded / sub-delims )
>> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
>> pct-encoded   = "%" HEXDIG HEXDIG
>> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>>               / "*" / "+" / "," / ";" / "="
>>
>> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
>> just the ASCII ranges you might expect (except for that “HEXDIG” only
>> allows uppercase letters).
>
> Do you think you could turn that into a patch for Guile?  I’d happily
> apply it.  :-)
>
> It looks like both [[:alnum:]] & co. and ranges would be
> locale-dependent, so my understanding is that we’ll have to list all the
> characters explicitly, right?

Here’s a patch for Guile that uses explicit lists of characters in the
‘(web uri)’ module instead of character ranges.  It includes two tests
that are pretty verbose, but seem to do the trick.

I have a bit more background on the problem, mostly coming from a Glibc
bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.

It turns out that it is well-known upstream, and avoiding character
ranges is the recommended approach for know.  Some other GNU tools have
adopted what is being called the “Rational Range Interpretation”
<https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
AIUI, this means they use the underlying encoding numbers for ranges (I
checked the source, but I’m only mostly sure I read it right).  It looks
like the Glibc folks are unsure how to proceed on this (but are maybe
slightly leaning towards the “rational” approach).

It’s all a pretty big mess, really.  I was hoping there would be some
obvious thing that would fix the problem more generally.  Short of
pulling in the Gnulib regex code or writing something in Scheme, it
looks like Guile is stuck where it is now.

I’m unsure if the changes are considered “trivial” from a copyright
perspective.  It’s pretty close, but I think programmers tend to
underestimate here.  I’ve started the FSF copyright assignment process
either way, since is likely not my last Guile patch.  :)


-- Tim


[-- Attachment #2: patch --]
[-- Type: text/x-patch, Size: 5559 bytes --]

From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
From: Timothy Sample <samplet@ngyro.com>
Date: Sun, 2 Jun 2019 14:41:20 -0400
Subject: [PATCH] Make URI handling locale independent.

Fixes <https://bugs.gnu.org/35785>.

* module/web/uri.scm (digits, hex-digits, letters): New variables.
(ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
list each character instead of using character ranges.
* test-suite/tests/web-uri.test: Add corresponding tests.
---
 module/web/uri.scm            | 31 +++++++++++++++++++++----------
 test-suite/tests/web-uri.test | 29 ++++++++++++++++++++++++++---
 2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/module/web/uri.scm b/module/web/uri.scm
index 4c6fa5051..b4b89b9cc 100644
--- a/module/web/uri.scm
+++ b/module/web/uri.scm
@@ -1,6 +1,6 @@
 ;;;; (web uri) --- URI manipulation tools
 ;;;;
-;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014 Free Software Foundation, Inc.
+;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014,2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -175,17 +175,28 @@ for ‘build-uri’ except there is no scheme."
 ;;; Converters.
 ;;;
 
+;; Since character ranges in regular expressions may depend on the
+;; current locale, we use explicit lists of characters instead.  See
+;; <https://bugs.gnu.org/35785> for details.
+(define digits "0123456789")
+(define hex-digits "0123456789ABCDEFabcdef")
+(define letters "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
+
 ;; See RFC 3986 #3.2.2 for comments on percent-encodings, IDNA (RFC
 ;; 3490), and non-ASCII host names.
 ;;
 (define ipv4-regexp
-  (make-regexp "^([0-9.]+)$"))
+  (make-regexp (string-append "^([" digits ".]+)$")))
 (define ipv6-regexp
-  (make-regexp "^([0-9a-fA-F:.]+)$"))
+  (make-regexp (string-append "^([" hex-digits ":.]+)$")))
 (define domain-label-regexp
-  (make-regexp "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters digits "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 (define top-label-regexp
-  (make-regexp "^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 
 (define (valid-host? host)
   (cond
@@ -203,13 +214,13 @@ for ‘build-uri’ except there is no scheme."
             (regexp-exec top-label-regexp host start)))))))
 
 (define userinfo-pat
-  "[a-zA-Z0-9_.!~*'();:&=+$,-]+")
+  (string-append "[" letters digits "_.!~*'();:&=+$,-]+"))
 (define host-pat
-  "[a-zA-Z0-9.-]+")
+  (string-append "[" letters digits ".-]+"))
 (define ipv6-host-pat
-  "[0-9a-fA-F:.]+")
+  (string-append "[" hex-digits ":.]+"))
 (define port-pat
-  "[0-9]*")
+  (string-append "[" digits "]*"))
 (define authority-regexp
   (make-regexp
    (format #f "^//((~a)@)?((~a)|(\\[(~a)\\]))(:(~a))?$"
@@ -246,7 +257,7 @@ for ‘build-uri’ except there is no scheme."
 ;;;   either.
 
 (define scheme-pat
-  "[a-zA-Z][a-zA-Z0-9+.-]*")
+  (string-append "[" letters "][" letters digits "+.-]*"))
 (define authority-pat
   "[^/?#]*")
 (define path-pat
diff --git a/test-suite/tests/web-uri.test b/test-suite/tests/web-uri.test
index 73391898c..ef8e85eba 100644
--- a/test-suite/tests/web-uri.test
+++ b/test-suite/tests/web-uri.test
@@ -1,6 +1,6 @@
 ;;;; web-uri.test --- URI library          -*- mode: scheme; coding: utf-8; -*-
 ;;;;
-;;;; 	Copyright (C) 2010-2012, 2014, 2017 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010-2012, 2014, 2017, 2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -121,7 +121,18 @@
 
   (pass-if-uri-exception "http://foo@"
                          "Expected.*host"
-                         (build-uri 'http #:userinfo "foo")))
+                         (build-uri 'http #:userinfo "foo"))
+
+  (pass-if-uri-exception "http://illégal.com"
+                         "Expected.*host"
+                         (dynamic-wind
+                           (lambda () #t)
+                           (lambda ()
+                             (with-locale "en_US.utf8"
+                               (reload-module (resolve-module '(web uri)))
+                               (build-uri 'http #:host "illégal.com")))
+                           (lambda ()
+                             (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "build-uri-reference"
   (pass-if "//host/etc/foo"
@@ -290,7 +301,19 @@
            #:port 100
            #:path "/"
            #:query "q"
-           #:fragment "bar")))
+           #:fragment "bar"))
+
+  ;; bug #35785
+  (pass-if "http://www.example.com (sv_SE)"
+    (dynamic-wind
+      (lambda () #t)
+      (lambda ()
+        (with-locale "sv_SE.utf8"
+          (reload-module (resolve-module '(web uri)))
+          (uri=? (string->uri "http://www.example.com")
+                 #:scheme 'http #:host "www.example.com" #:path "")))
+      (lambda ()
+        (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "string->uri-reference"
   (pass-if "/foo"
-- 
2.21.0


  reply	other threads:[~2019-06-03  0:40 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-17 20:03 bug#35785: guix won't download if locale is set to swedish Einar Largenius
2019-05-18 11:55 ` Ludovic Courtès
2019-05-19 17:45   ` Einar Largenius
2019-05-20  8:20     ` Ludovic Courtès
2019-05-20  9:14     ` bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’ Ludovic Courtès
2019-05-27 11:05       ` Ricardo Wurmus
2019-05-27 13:39         ` Timothy Sample
2019-05-28 11:17           ` Ludovic Courtès
2019-06-03  0:39             ` Timothy Sample [this message]
2019-06-03 13:01               ` Ludovic Courtès
2019-06-03 14:24                 ` Timothy Sample
2019-06-04  7:42                   ` Ludovic Courtès
2019-06-04 13:56                     ` Timothy Sample

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87imtnsdsb.fsf@ngyro.com \
    --to=samplet@ngyro.com \
    --cc=35785@debbugs.gnu.org \
    --cc=einar.largenius@gmail.com \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.