From: Timothy Sample
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Tue, 04 Jun 2019 09:56:39 -0400
Message-ID: <87sgsp8ne0.fsf@ngyro.com>
In-Reply-To: <87imtlhk3k.fsf@gnu.org>
To: Ludovic Courtès
Cc: 35785@debbugs.gnu.org, Einar Largenius
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi,

Ludovic Courtès writes:

> Timothy Sample skribis:
>
> [...]
>
>> I needed to reload the modules like that to make the tests fail without
>> the patch and pass with it. My understanding is that the bug happens
>> at regex compile time, which happens when the module is loaded. If I
>> don’t reload the module, the old URI code passes the tests, since the
>> regexes were compiled with a locale that does not trigger the bug. It’s
>> a little wacky, sure, but it was the best idea I could come up with.
>
> Oooh, I see. Could you add a comment to explain this? Then we’re done.

Here it is! I hope it is clear.

-- Tim

--=-=-=
Content-Type: text/x-patch; charset=utf-8
Content-Disposition: attachment; filename=0001-Make-URI-handling-locale-independent.patch
Content-Transfer-Encoding: 8bit
Content-Description: patch

From 9ac8643e5315d4baaddb93ee246ba8db0b3448ab Mon Sep 17 00:00:00 2001
From: Timothy Sample
Date: Sun, 2 Jun 2019 14:41:20 -0400
Subject: [PATCH] Make URI handling locale independent.

Fixes .

* module/web/uri.scm (digits, hex-digits, letters): New variables.
(ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
list each character instead of using character ranges.
* test-suite/tests/web-uri.test: Add corresponding tests.
---
 module/web/uri.scm            | 31 +++++++++++++++++++++----------
 test-suite/tests/web-uri.test | 33 ++++++++++++++++++++++++++++++---
 2 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/module/web/uri.scm b/module/web/uri.scm
index 4c6fa5051..b4b89b9cc 100644
--- a/module/web/uri.scm
+++ b/module/web/uri.scm
@@ -1,6 +1,6 @@
 ;;;; (web uri) --- URI manipulation tools
 ;;;;
-;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014 Free Software Foundation, Inc.
+;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014,2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -175,17 +175,28 @@ for ‘build-uri’ except there is no scheme."
 ;;; Converters.
 ;;;
 
+;; Since character ranges in regular expressions may depend on the
+;; current locale, we use explicit lists of characters instead. See
+;; for details.
+(define digits "0123456789")
+(define hex-digits "0123456789ABCDEFabcdef")
+(define letters "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
+
 ;; See RFC 3986 #3.2.2 for comments on percent-encodings, IDNA (RFC
 ;; 3490), and non-ASCII host names.
 ;;
 (define ipv4-regexp
-  (make-regexp "^([0-9.]+)$"))
+  (make-regexp (string-append "^([" digits ".]+)$")))
 (define ipv6-regexp
-  (make-regexp "^([0-9a-fA-F:.]+)$"))
+  (make-regexp (string-append "^([" hex-digits ":.]+)$")))
 (define domain-label-regexp
-  (make-regexp "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters digits "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 (define top-label-regexp
-  (make-regexp "^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"))
+  (make-regexp
+   (string-append "^[" letters "]"
+                  "([" letters digits "-]*[" letters digits "])?$")))
 
 (define (valid-host? host)
   (cond
@@ -203,13 +214,13 @@ for ‘build-uri’ except there is no scheme."
                  (regexp-exec top-label-regexp host start)))))))
 
 (define userinfo-pat
-  "[a-zA-Z0-9_.!~*'();:&=+$,-]+")
+  (string-append "[" letters digits "_.!~*'();:&=+$,-]+"))
 (define host-pat
-  "[a-zA-Z0-9.-]+")
+  (string-append "[" letters digits ".-]+"))
 (define ipv6-host-pat
-  "[0-9a-fA-F:.]+")
+  (string-append "[" hex-digits ":.]+"))
 (define port-pat
-  "[0-9]*")
+  (string-append "[" digits "]*"))
 (define authority-regexp
   (make-regexp
    (format #f "^//((~a)@)?((~a)|(\\[(~a)\\]))(:(~a))?$"
@@ -246,7 +257,7 @@ for ‘build-uri’ except there is no scheme."
 ;;; either.
 
 (define scheme-pat
-  "[a-zA-Z][a-zA-Z0-9+.-]*")
+  (string-append "[" letters "][" letters digits "+.-]*"))
 (define authority-pat
   "[^/?#]*")
 (define path-pat
diff --git a/test-suite/tests/web-uri.test b/test-suite/tests/web-uri.test
index 73391898c..94778acac 100644
--- a/test-suite/tests/web-uri.test
+++ b/test-suite/tests/web-uri.test
@@ -1,6 +1,6 @@
 ;;;; web-uri.test --- URI library -*- mode: scheme; coding: utf-8; -*-
 ;;;;
-;;;; Copyright (C) 2010-2012, 2014, 2017 Free Software Foundation, Inc.
+;;;; Copyright (C) 2010-2012, 2014, 2017, 2019 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -121,7 +121,21 @@
 
   (pass-if-uri-exception "http://foo@"
                          "Expected.*host"
-                         (build-uri 'http #:userinfo "foo")))
+                         (build-uri 'http #:userinfo "foo"))
+
+  ;; In this test, we need to reload the '(web uri)' module with a
+  ;; different locale. This is because some locale-dependent things
+  ;; (e.g., compiled regexes) are computed when the module is loaded.
+  (pass-if-uri-exception "http://illégal.com"
+                         "Expected.*host"
+                         (dynamic-wind
+                           (lambda () #t)
+                           (lambda ()
+                             (with-locale "en_US.utf8"
+                               (reload-module (resolve-module '(web uri)))
+                               (build-uri 'http #:host "illégal.com")))
+                           (lambda ()
+                             (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "build-uri-reference"
   (pass-if "//host/etc/foo"
@@ -290,7 +304,20 @@
            #:port 100
            #:path "/"
            #:query "q"
-           #:fragment "bar")))
+           #:fragment "bar"))
+
+  ;; This test reproduces bug #35785. See the 'illégal' test above for
+  ;; why we reload the module.
+  (pass-if "http://www.example.com (sv_SE)"
+    (dynamic-wind
+      (lambda () #t)
+      (lambda ()
+        (with-locale "sv_SE.utf8"
+          (reload-module (resolve-module '(web uri)))
+          (uri=? (string->uri "http://www.example.com")
+                 #:scheme 'http #:host "www.example.com" #:path "")))
+      (lambda ()
+        (reload-module (resolve-module '(web uri)))))))
 
 (with-test-prefix "string->uri-reference"
   (pass-if "/foo"
-- 
2.21.0

--=-=-=--