From: Timothy Sample <samplet@ngyro.com>
To: Ricardo Wurmus <rekado@elephly.net>
Cc: 35785@debbugs.gnu.org, Einar Largenius <einar.largenius@gmail.com>
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Mon, 27 May 2019 09:39:03 -0400
Message-ID: <875zpw6mq0.fsf@ngyro.com>
In-Reply-To: <878sv4j1au.fsf@gmail.com>

Hello,

Ricardo Wurmus <rekado@elephly.net> writes:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Using the “lower” regexp class instead of “[a-z]” works:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
>> $12 = #("w" (0 . 1))
>> --8<---------------cut here---------------end--------------->8---
>>
>> However, it’s not clear to me whether the “lower” class is supposed to
>> be the same for all locales or if we’re just lucky:
>>
>>   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>>
>> Thoughts?
>
> The lower class is much larger than [a-z].  If we only wanted to work
> around this particular problem we could explicitly spell out the range,
> which would be the same in all locales.  (Obviously, that wouldn’t be
> pretty.)

I think that explicitly spelling out the range is the right thing to do
here.  The POSIX spec says that character ranges work in the POSIX
locale, but “in other locales, a range expression has unspecified
behavior.”
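
As a minimal sketch of what "spelling out the range" could look like
(the names ‘ascii-lower-class’ and ‘ascii-lower-start?’ are made up for
illustration and are not part of Guile):

```scheme
(use-modules (ice-9 regex))

;; List every letter instead of writing the range "a-z", so the
;; bracket expression contains no range for the locale to reinterpret.
(define ascii-lower-class "[abcdefghijklmnopqrstuvwxyz]")

(define (ascii-lower-start? str)
  "Return #t if STR begins with an ASCII lowercase letter."
  (if (string-match (string-append "^" ascii-lower-class) str)
      #t
      #f))
```

With this, (ascii-lower-start? "w") should hold no matter how the
locale collates ‘w’.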

> But can’t URI parts contain more than those characters?

A quick reading of RFC 3986 suggests that the host part of a URI can be
an IP address (version 4 or 6) or a registered name.  It gives the
following rules for registered names:

reg-name      = *( unreserved / pct-encoded / sub-delims )
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
just the ASCII ranges you might expect.  (“HEXDIG” is written with only
uppercase letters, but ABNF string literals are case-insensitive, so
lowercase hex digits match as well.)

It looks like Guile is currently a little stricter than this, but pretty
close (if you take the character ranges to mean ASCII ranges).
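
To make the reg-name rules concrete, here is a rough sketch of a
checker built from explicit character sets rather than regexp ranges.
This is not how Guile's URI code is written; ‘reg-name?’ and the
char-set names are hypothetical.  It accepts lowercase hex digits in
percent-encodings, on the assumption that ABNF literals are
case-insensitive.

```scheme
;; unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
(define unreserved-chars
  (string->char-set
   (string-append "abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789"
                  "-._~")))

;; sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
;;            / "*" / "+" / "," / ";" / "="
(define sub-delim-chars (string->char-set "!$&'()*+,;="))

;; HEXDIG, with both letter cases accepted (an assumption, see above).
(define hex-chars (string->char-set "0123456789abcdefABCDEF"))

(define reg-name-chars
  (char-set-union unreserved-chars sub-delim-chars))

(define (reg-name? str)
  "Return #t if STR matches the reg-name rule from RFC 3986."
  (let ((len (string-length str)))
    (let loop ((i 0))
      (cond ((= i len) #t)
            ((char-set-contains? reg-name-chars (string-ref str i))
             (loop (+ i 1)))
            ;; pct-encoded = "%" HEXDIG HEXDIG
            ((and (char=? (string-ref str i) #\%)
                  (< (+ i 2) len)
                  (char-set-contains? hex-chars (string-ref str (+ i 1)))
                  (char-set-contains? hex-chars (string-ref str (+ i 2))))
             (loop (+ i 3)))
            (else #f)))))
```

Note that the empty string passes, since the rule is a "*" repetition.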

> To circumvent
> the question whether the lower class is locale dependent we could
> generate an explicit range from a charset.

I think this is the right approach.  Using “[:lower:]” would allow
things outside of the RFC, like ‘é’.  Adding support for
internationalized domain names using Punycode would be cool, but well
outside the scope of this bug.  :)
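
For what it's worth, generating the explicit list from a charset might
look something like the sketch below.  It assumes the SRFI-14 sets
‘char-set:lower-case’ and ‘char-set:ascii’ that Guile provides; the
names ‘ascii-lower’ and ‘char-set->bracket-expression’ are invented
here.

```scheme
(use-modules (ice-9 regex))

;; Lowercase letters restricted to ASCII, i.e. exactly a through z.
(define ascii-lower
  (char-set-intersection char-set:lower-case char-set:ascii))

(define (char-set->bracket-expression cs)
  "Return a POSIX bracket expression listing every character of CS.
This sketch does no escaping, so CS must not contain ']', '^', or '-'."
  (string-append "["
                 (list->string (sort (char-set->list cs) char<?))
                 "]"))

;; The resulting string can be passed to 'string-match' like any
;; other pattern.
(define ascii-lower-pattern
  (char-set->bracket-expression ascii-lower))
```

Because the pattern lists each character, there is no range left for
the locale to interpret, and ‘é’ stays out.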


-- Tim

