From: Timothy Sample
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Mon, 03 Jun 2019 10:24:40 -0400
Message-ID: <87ef4asq53.fsf@ngyro.com>
In-Reply-To: <871s0ahlfq.fsf@gnu.org> ("Ludovic Courtès"'s message of "Mon, 03 Jun 2019 15:01:45 +0200")
To: Ludovic Courtès
Cc: 35785@debbugs.gnu.org, Einar Largenius

Hi Ludo,

Ludovic Courtès writes:

> Hi Timothy,
>
> Timothy Sample wrote:
>
>> Here’s a patch for Guile that uses explicit lists of characters in the
>> ‘(web uri)’ module instead of character ranges. It includes two tests
>> that are pretty verbose, but seem to do the trick.
>>
>> I have a bit more background on the problem, mostly coming from a Glibc
>> bug report: .
>>
>> It turns out that it is well-known upstream, and avoiding character
>> ranges is the recommended approach for now. Some other GNU tools have
>> adopted what is being called the “Rational Range Interpretation”.
>> AIUI, this means they use the underlying encoding numbers for ranges (I
>> checked the source, but I’m only mostly sure I read it right). It looks
>> like the Glibc folks are unsure how to proceed on this (but are maybe
>> slightly leaning towards the “rational” approach).
>
> Great that you gleaned good references on this topic!
>
>> It’s all a pretty big mess, really. I was hoping there would be some
>> obvious thing that would fix the problem more generally. Short of
>> pulling in the Gnulib regex code or writing something in Scheme, it
>> looks like Guile is stuck where it is now.
>
> Yeah. The alternative would be to not use regexps in this context, I
> guess.

I meant fixing regexes in other contexts, since I’m sure the URI module
is not the only Guile code ever that assumed “[a-z]” would only match
ASCII lowercase letters.

>> I’m unsure if the changes are considered “trivial” from a copyright
>> perspective. It’s pretty close, but I think programmers tend to
>> underestimate here. I’ve started the FSF copyright assignment process
>> either way, since this is likely not my last Guile patch. :)
>
> If the process is already underway, I think it’s fine to commit this
> patch (I would rather wait if it were longer and/or if we didn’t know
> each other already).

Sounds good!
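
Coming back to the character-range issue discussed above: the following
is only a minimal sketch of the “explicit list” idea, not the code from
the patch. The names ‘digits’ and ‘letters’ echo the commit message
quoted below, but their contents here are an assumption, and ‘label-rx’
and ‘ascii-label?’ are hypothetical helpers made up for illustration.
It relies only on Guile’s built-in ‘make-regexp’ and ‘regexp-exec’.

```scheme
;; Sketch only: these definitions are assumptions for illustration and
;; are not copied from module/web/uri.scm.
(define digits "0123456789")
(define letters
  (string-append "abcdefghijklmnopqrstuvwxyz"
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))

;; Build the bracket expression from explicit characters.  Because it
;; contains no range such as "a-z", there is nothing for the locale's
;; collation order to reinterpret when the regexp is compiled.
(define label-rx
  (make-regexp (string-append "^[" letters digits "]+$")))

;; Hypothetical helper: #t if STR consists only of the listed
;; characters, #f otherwise.
(define (ascii-label? str)
  (and (regexp-exec label-rx str) #t))
```

With these definitions, (ascii-label? "www") returns #t and
(ascii-label? "w_w") returns #f, since the underscore is not among the
listed characters.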

>> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
>> From: Timothy Sample
>> Date: Sun, 2 Jun 2019 14:41:20 -0400
>> Subject: [PATCH] Make URI handling locale independent.
>>
>> Fixes .
>>
>> * module/web/uri.scm (digits, hex-digits, letters): New variables.
>> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
>> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
>> list each character instead of using character ranges.
>> * test-suite/tests/web-uri.test: Add corresponding tests.
>
> [...]
>
>> +    (pass-if "http://www.example.com (sv_SE)"
>> +      (dynamic-wind
>> +        (lambda () #t)
>> +        (lambda ()
>> +          (with-locale "sv_SE.utf8"
>> +            (reload-module (resolve-module '(web uri)))
>> +            (uri=? (string->uri "http://www.example.com")
>> +                   #:scheme 'http #:host "www.example.com" #:path "")))
>
> Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
> the other test)?

I needed to reload the modules like that to make the tests fail without
the patch and pass with it. My understanding is that the bug happens at
regex compile time, which happens when the module is loaded. If I don’t
reload the module, the old URI code passes the tests, since the regexes
were compiled with a locale that does not trigger the bug. It’s a
little wacky, sure, but it was the best idea I could come up with.

> For the sv_SE test, what about taking a host name with a ‘w’, since
> that’s the use case that allowed us to uncover this bug?

I thought I was being clever by using a “www” hostname, but apparently
it’s so normalized as to be invisible! Feel free to change it to
something more obvious like “w.com” or whatever.


-- Tim