From mboxrd@z Thu Jan 1 00:00:00 1970
From: Timothy Sample
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Sun, 02 Jun 2019 20:39:16 -0400
Message-ID: <87imtnsdsb.fsf@ngyro.com>
References: <878sv4j1au.fsf@gmail.com> <87d0kgvuxj.fsf@gnu.org> <87tvdqgwyg.fsf@gmail.com> <87blzxwkrn.fsf_-_@gnu.org> <87ftp017k6.fsf@elephly.net> <875zpw6mq0.fsf@ngyro.com> <8736ky3k1w.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Return-path:
Received: from eggs.gnu.org ([209.51.188.92]:54956) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hXb1A-0002ky-28 for bug-guix@gnu.org; Sun, 02 Jun 2019 20:40:05 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hXb18-0008Md-83 for bug-guix@gnu.org; Sun, 02 Jun 2019 20:40:04 -0400
Received: from debbugs.gnu.org ([209.51.188.43]:55864) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1hXb18-0008MO-2W for bug-guix@gnu.org; Sun, 02 Jun 2019 20:40:02 -0400
Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1hXb17-00061l-UI for bug-guix@gnu.org; Sun, 02 Jun 2019 20:40:01 -0400
Sender: "Debbugs-submit"
Resent-Message-ID:
In-Reply-To: <8736ky3k1w.fsf@gnu.org> ("Ludovic Courtès"'s message of "Tue, 28 May 2019 13:17:15 +0200")
List-Id: Bug reports for GNU Guix
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org
Sender: "bug-Guix"
To: Ludovic Courtès
Cc: 35785@debbugs.gnu.org, Einar Largenius

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Hi,

Ludovic Courtès writes:

> Hi Timothy,
>
> Timothy Sample skribis:
>
>> A quick reading of RFC 3986 suggests that the host part of a URI can be
>> an IP
>> address (version 4 or 6) or a registered name. It gives the
>> following rules for registered names:
>>
>>     reg-name    = *( unreserved / pct-encoded / sub-delims )
>>     unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
>>     pct-encoded = "%" HEXDIG HEXDIG
>>     sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>>                 / "*" / "+" / "," / ";" / "="
>>
>> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
>> just the ASCII ranges you might expect (except for that “HEXDIG” only
>> allows uppercase letters).
>
> Do you think you could turn that into a patch for Guile? I’d happily
> apply it. :-)
>
> It looks like both [[:alnum:]] & co. and ranges would be
> locale-dependent, so my understanding is that we’ll have to list all the
> characters explicitly, right?

Here’s a patch for Guile that uses explicit lists of characters in the
‘(web uri)’ module instead of character ranges. It includes two tests
that are pretty verbose, but seem to do the trick.

I have a bit more background on the problem, mostly coming from a Glibc
bug report: . It turns out that it is well-known upstream, and avoiding
character ranges is the recommended approach for now. Some other GNU
tools have adopted what is being called the “Rational Range
Interpretation” . AIUI, this means they use the underlying encoding
numbers for ranges (I checked the source, but I’m only mostly sure I
read it right). It looks like the Glibc folks are unsure how to proceed
on this (but are maybe slightly leaning towards the “rational”
approach).

It’s all a pretty big mess, really. I was hoping there would be some
obvious thing that would fix the problem more generally. Short of
pulling in the Gnulib regex code or writing something in Scheme, it
looks like Guile is stuck where it is now.
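
For illustration, the explicit-list approach can be sketched in a few
lines of Guile. This is a minimal sketch rather than the patch itself:
the names ‘ascii-digits’, ‘ascii-letters’, ‘label-regexp’, and
‘ascii-label?’ are hypothetical and do not exist in ‘(web uri)’. It
assumes only that ‘make-regexp’ compiles POSIX extended regular
expressions by default and that a trailing ‘-’ inside a bracket
expression is taken literally.

```scheme
;; Sketch only: every name defined here is hypothetical and is not
;; part of the (web uri) module.
(define ascii-digits "0123456789")
(define ascii-letters
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

;; A label starts and ends with an ASCII letter or digit and may
;; contain hyphens in between.  Every allowed character is spelled
;; out, so the bracket expressions contain no ranges for the locale's
;; collation order to reinterpret.
(define label-regexp
  (make-regexp
   (string-append "^[" ascii-letters ascii-digits "]"
                  "([" ascii-letters ascii-digits "-]*"
                  "[" ascii-letters ascii-digits "])?$")))

(define (ascii-label? str)
  "Return #t if STR consists only of ASCII letters, digits, and
inner hyphens; return #f otherwise."
  (and (regexp-exec label-regexp str) #t))
```

With these definitions, (ascii-label? "foo-bar") evaluates to #t, while
(ascii-label? "-bad") and (ascii-label? "foo_bar") evaluate to #f. The
bracket expressions get longer, but they no longer contain anything
like ‘a-z’ whose meaning can vary between locales.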

I’m unsure if the changes are considered “trivial” from a copyright
perspective. It’s pretty close, but I think programmers tend to
underestimate here. I’ve started the FSF copyright assignment process
either way, since this is likely not my last Guile patch. :)


-- Tim

--=-=-=
Content-Type: text/x-patch; charset=utf-8
Content-Disposition: attachment; filename=0001-Make-URI-handling-locale-independent.patch
Content-Transfer-Encoding: quoted-printable
Content-Description: patch

>From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
From: Timothy Sample
Date: Sun, 2 Jun 2019 14:41:20 -0400
Subject: [PATCH] Make URI handling locale independent.

Fixes .

* module/web/uri.scm (digits, hex-digits, letters): New variables.
(ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
list each character instead of using character ranges.
* test-suite/tests/web-uri.test: Add corresponding tests.
---
 module/web/uri.scm            | 31 +++++++++++++++++++++----------
 test-suite/tests/web-uri.test | 29 ++++++++++++++++++++++++++---
 2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/module/web/uri.scm b/module/web/uri.scm
index 4c6fa5051..b4b89b9cc 100644
--- a/module/web/uri.scm
+++ b/module/web/uri.scm
@@ -1,6 +1,6 @@
 ;;;; (web uri) --- URI manipulation tools
 ;;;;
-;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014 Free Software Foundation, Inc.
+;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014,2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -175,17 +175,28 @@ for ‘build-uri’ except there is no scheme."
 ;;; Converters.
 ;;;
 
+;; Since character ranges in regular expressions may depend on the
+;; current locale, we use explicit lists of characters instead. See
+;; for details.
+(define digits "0123456789")
+(define hex-digits "0123456789ABCDEFabcdef")
+(define letters "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
+
 ;; See RFC 3986 #3.2.2 for comments on percent-encodings, IDNA (RFC
 ;; 3490), and non-ASCII host names.
 ;;
 (define ipv4-regexp
-  (make-regexp "^([0-9.]+)$"))
+  (make-regexp (string-append "^([" digits ".]+)$")))
 (define ipv6-regexp
-  (make-regexp "^([0-9a-fA-F:.]+)$"))
+  (make-regexp (string-append "^([" hex-digits ":.]+)$")))
 (define domain-label-regexp
-  (make-regexp "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters digits "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 (define top-label-regexp
-  (make-regexp "^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 
 (define (valid-host? host)
   (cond
@@ -203,13 +214,13 @@ for ‘build-uri’ except there is no scheme."
             (regexp-exec top-label-regexp host start)))))))
 
 (define userinfo-pat
-  "[a-zA-Z0-9_.!~*'();:&=+$,-]+")
+  (string-append "[" letters digits "_.!~*'();:&=+$,-]+"))
 (define host-pat
-  "[a-zA-Z0-9.-]+")
+  (string-append "[" letters digits ".-]+"))
 (define ipv6-host-pat
-  "[0-9a-fA-F:.]+")
+  (string-append "[" hex-digits ":.]+"))
 (define port-pat
-  "[0-9]*")
+  (string-append "[" digits "]*"))
 (define authority-regexp
   (make-regexp
    (format #f "^//((~a)@)?((~a)|(\\[(~a)\\]))(:(~a))?$"
@@ -246,7 +257,7 @@ for ‘build-uri’ except there is no scheme."
 ;;; either.
 
 (define scheme-pat
-  "[a-zA-Z][a-zA-Z0-9+.-]*")
+  (string-append "[" letters "][" letters digits "+.-]*"))
 (define authority-pat
   "[^/?#]*")
 (define path-pat
diff --git a/test-suite/tests/web-uri.test b/test-suite/tests/web-uri.test
index 73391898c..ef8e85eba 100644
--- a/test-suite/tests/web-uri.test
+++ b/test-suite/tests/web-uri.test
@@ -1,6 +1,6 @@
 ;;;; web-uri.test --- URI library -*- mode: scheme; coding: utf-8; -*-
 ;;;;
-;;;; Copyright (C) 2010-2012, 2014, 2017 Free Software Foundation, Inc.
+;;;; Copyright (C) 2010-2012, 2014, 2017, 2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -121,7 +121,18 @@
 
   (pass-if-uri-exception "http://foo@"
                          "Expected.*host"
-                         (build-uri 'http #:userinfo "foo")))
+                         (build-uri 'http #:userinfo "foo"))
+
+  (pass-if-uri-exception "http://illégal.com"
+                         "Expected.*host"
+                         (dynamic-wind
+                           (lambda () #t)
+                           (lambda ()
+                             (with-locale "en_US.utf8"
+                               (reload-module (resolve-module '(web uri)))
+                               (build-uri 'http #:host "illégal.com")))
+                           (lambda ()
+                             (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "build-uri-reference"
   (pass-if "//host/etc/foo"
@@ -290,7 +301,19 @@
                            #:port 100
                            #:path "/"
                            #:query "q"
-                           #:fragment "bar")))
+                           #:fragment "bar"))
+
+  ;; bug #35785
+  (pass-if "http://www.example.com (sv_SE)"
+    (dynamic-wind
+      (lambda () #t)
+      (lambda ()
+        (with-locale "sv_SE.utf8"
+          (reload-module (resolve-module '(web uri)))
+          (uri=? (string->uri "http://www.example.com")
+                 #:scheme 'http #:host "www.example.com" #:path "")))
+      (lambda ()
+        (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "string->uri-reference"
   (pass-if "/foo"
-- 
2.21.0

--=-=-=--