From: Timothy Sample <samplet@ngyro.com>
To: Ricardo Wurmus <rekado@elephly.net>
Cc: 35785@debbugs.gnu.org, Einar Largenius <einar.largenius@gmail.com>
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Mon, 27 May 2019 09:39:03 -0400
Message-ID: <875zpw6mq0.fsf@ngyro.com>
In-Reply-To: <878sv4j1au.fsf@gmail.com>

Hello,

Ricardo Wurmus <rekado@elephly.net> writes:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Using the “lower” regexp class instead of “[a-z]” works:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
>> $12 = #("w" (0 . 1))
>> --8<---------------cut here---------------end--------------->8---
>>
>> However, it’s not clear to me whether the “lower” class is supposed to
>> be the same for all locales or if we’re just lucky:
>>
>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>>
>> Thoughts?
>
> The lower class is much larger than [a-z]. If we only wanted to work
> around this particular problem we could explicitly spell out the range,
> which would be the same in all locales. (Obviously, that wouldn’t be
> pretty.)

I think that explicitly spelling out the range is the right thing to do
here. The POSIX spec says that character ranges work in the POSIX
locale, but “in other locales, a range expression has unspecified
behavior.”

> But can’t URI parts contain more than those characters?

A quick reading of RFC 3986 suggests that the host part of a URI can be
an IP address (version 4 or 6) or a registered name. It gives the
following rules for registered names:

    reg-name    = *( unreserved / pct-encoded / sub-delims )
    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
    pct-encoded = "%" HEXDIG HEXDIG
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
just the ASCII ranges you might expect (“HEXDIG” is spelled out with
uppercase letters only, but ABNF string literals are case-insensitive,
so lowercase hex digits match too).

It looks like Guile is currently a little stricter than this, but pretty
close (if you take the character ranges to mean ASCII ranges).
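
To make the grammar above concrete, here is a minimal sketch of a
‘reg-name’ check that uses explicitly spelled-out ASCII character sets
and no regexps at all. This is not how Guile's URI code is written; the
procedure ‘reg-name?’ and all of the variable names are made up for
illustration.

```scheme
(use-modules (srfi srfi-14))            ; character sets

;; ASCII-only sets, built from explicit strings so that no
;; locale-dependent range or character class is involved.
(define alpha
  (string->char-set
   "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"))
(define digit (string->char-set "0123456789"))
;; Accepting lowercase hex digits is a choice made here; RFC 3986
;; (section 2.1) treats them as equivalent to the uppercase ones.
(define hexdig (string->char-set "0123456789abcdefABCDEF"))
(define unreserved
  (char-set-union alpha digit (string->char-set "-._~")))
(define sub-delims (string->char-set "!$&'()*+,;="))
(define plain (char-set-union unreserved sub-delims))

;; Hypothetical helper: return #t if STR matches the 'reg-name' rule.
(define (reg-name? str)
  (let ((len (string-length str)))
    (let loop ((i 0))
      (cond ((= i len) #t)
            ((char-set-contains? plain (string-ref str i))
             (loop (+ i 1)))
            ;; pct-encoded: "%" followed by two hex digits.
            ((and (char=? (string-ref str i) #\%)
                  (<= (+ i 3) len)
                  (char-set-contains? hexdig (string-ref str (+ i 1)))
                  (char-set-contains? hexdig (string-ref str (+ i 2))))
             (loop (+ i 3)))
            (else #f)))))
```

With these definitions, (reg-name? "www.gnu.org") and
(reg-name? "caf%C3%A9.example") should return #t, while
(reg-name? "a b") should return #f.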

> To circumvent
> the question whether the lower class is locale dependent we could
> generate an explicit range from a charset.

I think this is the right approach. Using “[:lower:]” would allow
things outside of the RFC, like ‘é’. Adding support for
internationalized domain names using Punycode would be cool, but well
outside the scope of this bug. :)
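
As a rough sketch of what generating the explicit range could look
like (an illustration only, not a proposed patch): build the character
set from code points, then expand it into a bracket expression that
lists every character. The names ‘ascii-lower’ and ‘char-set->bracket’
are hypothetical.

```scheme
(use-modules (ice-9 regex)
             (srfi srfi-14))

;; Build the set from code points rather than from a regexp range.
;; The upper bound of 'ucs-range->char-set' is exclusive.
(define ascii-lower
  (ucs-range->char-set (char->integer #\a)
                       (+ 1 (char->integer #\z))))

;; Hypothetical helper: turn a char-set into a bracket expression that
;; lists every member explicitly.  It does no escaping, so it is only
;; suitable for sets without "]", "^", or "-".
(define (char-set->bracket cs)
  (string-append "["
                 (list->string (sort (char-set->list cs) char<?))
                 "]"))

;; Evaluates to "[abcdefghijklmnopqrstuvwxyz]".
(define lower-rx (char-set->bracket ascii-lower))
```

Since the expression contains no range, (string-match lower-rx "w")
should match just like the “[[:lower:]]” example quoted earlier,
without relying on how the locale orders its characters.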

-- Tim