From mboxrd@z Thu Jan 1 00:00:00 1970
From: Timothy Sample
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Mon, 27 May 2019 09:39:03 -0400
Message-ID: <875zpw6mq0.fsf@ngyro.com>
References: <878sv4j1au.fsf@gmail.com> <87d0kgvuxj.fsf@gnu.org> <87tvdqgwyg.fsf@gmail.com> <87blzxwkrn.fsf_-_@gnu.org> <87ftp017k6.fsf@elephly.net>
In-Reply-To: <878sv4j1au.fsf@gmail.com>
To: Ricardo Wurmus
Cc: 35785@debbugs.gnu.org, Einar Largenius

Hello,

Ricardo Wurmus writes:

> Ludovic Courtès writes:
>
>> Using the “lower” regexp class instead of “[a-z]” works:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
>> $12 = #("w" (0 . 1))
>> --8<---------------cut here---------------end--------------->8---
>>
>> However, it’s not clear to me whether the “lower” class is supposed to
>> be the same for all locales or if we’re just lucky:
>>
>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>>
>> Thoughts?
>
> The lower class is much larger than [a-z]. If we only wanted to work
> around this particular problem we could explicitly spell out the range,
> which would be the same in all locales. (Obviously, that wouldn’t be
> pretty.)

I think that explicitly spelling out the range is the right thing to do
here. The POSIX spec says that character ranges work in the POSIX
locale, but “in other locales, a range expression has unspecified
behavior.”

> But can’t URI parts contain more than those characters?

A quick reading of RFC 3986 suggests that the host part of a URI can be
an IP address (version 4 or 6) or a registered name. It gives the
following rules for registered names:

    reg-name    = *( unreserved / pct-encoded / sub-delims )
    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
    pct-encoded = "%" HEXDIG HEXDIG
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
just the ASCII ranges you might expect (except that “HEXDIG” only
allows uppercase letters). It looks like Guile is currently a little
stricter than this, but pretty close (if you take the character ranges
to mean ASCII ranges).

> To circumvent
> the question whether the lower class is locale dependent we could
> generate an explicit range from a charset.

I think this is the right approach. Using “[:lower:]” would allow
things outside of the RFC, like ‘é’. Adding support for
internationalized domain names using Punycode would be cool, but well
outside the scope of this bug. :)

-- Tim
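
P.S. For illustration, here is a minimal sketch of what “generate an
explicit range from a charset” could look like. It is not taken from
Guile’s (web uri) module: the names ‘ascii-alpha’, ‘ascii-digit’,
‘char-set->bracket-expression’, and ‘alnum-rx’ are made up for this
example. It builds the ASCII letters and digits from code points with
SRFI-14 and lists every member inside a bracket expression, so no range
expression such as “a-z” is handed to the regexp engine.

```scheme
(use-modules (srfi srfi-14))  ; char-set-union, char-set->list

;; ASCII-only ALPHA and DIGIT, built from code points.  The upper
;; bound of 'ucs-range->char-set' is exclusive.
(define ascii-alpha
  (char-set-union (ucs-range->char-set 65 91)     ; A-Z
                  (ucs-range->char-set 97 123)))  ; a-z
(define ascii-digit
  (ucs-range->char-set 48 58))                    ; 0-9

;; Hypothetical helper: list every member of CS inside a bracket
;; expression.  Only meant for sets of letters and digits; characters
;; such as "]", "^", or "-" would need careful placement.
(define (char-set->bracket-expression cs)
  (string-append "[" (list->string (char-set->list cs)) "]"))

;; A regexp that matches one ASCII letter or digit.
(define alnum-rx
  (make-regexp (char-set->bracket-expression
                (char-set-union ascii-alpha ascii-digit))))

(regexp-exec alnum-rx "w")  ; matches, like the "[[:lower:]]" test above
```

Since the bracket expression contains no ranges and no character
classes, the “unspecified behavior” clause in the POSIX spec should not
come into play. The other characters that RFC 3986 allows in a
registered name could be added the same way, provided the helper is
extended to place "-" safely.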