unofficial mirror of bug-guile@gnu.org 
From: Jean Abou Samra <jean@abou-samra.fr>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 57507@debbugs.gnu.org
Subject: bug#57507: Regular expression matching depends on locale encoding
Date: Thu, 17 Nov 2022 21:33:42 +0100
Message-ID: <a586554a-c916-6a39-a96d-6cc032e79daa@abou-samra.fr>
In-Reply-To: <87czc939qk.fsf@gnu.org>


[-- Attachment #1.1: Type: text/plain, Size: 1894 bytes --]

On 05/09/2022 at 21:24, Ludovic Courtès wrote:
> Yes, that’d be welcome.  I would not call it a constraint or limitation;
> for example, that ‘w’ is not a letter in Swedish is the kind of thing
> you’d generally want to take into account.  Now, it’d be nice if one
> could easily specify the locale to operate under, with an API similar to
> that of (ice-9 i18n) and its first-class locale objects.



Sorry that it took me forever to send this.



 From c666ca4f72dc0a00d28b8d7ef1221ebfc9741551 Mon Sep 17 00:00:00 2001
From: Jean Abou Samra <jean@abou-samra.fr>
Date: Thu, 17 Nov 2022 21:26:07 +0100
Subject: [PATCH] Doc: clarification on regexes and encodings

* doc/ref/api-regex.texi: clarify that regexp matching supports only
   characters that can be represented in the locale encoding.
---
  doc/ref/api-regex.texi | 6 +++++-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
index b14c2b39c..bd1f4079d 100644
--- a/doc/ref/api-regex.texi
+++ b/doc/ref/api-regex.texi
@@ -57,7 +57,11 @@ locale's encoding, and then passed to the C library's regular expression
  routines (@pxref{Regular Expressions,,, libc, The GNU C Library
  Reference Manual}).  The returned match structures always point to
  characters in the strings, not to individual bytes, even in the case of
-multi-byte encodings.
+multi-byte encodings.  This ensures that the match structures are
+correct when performing matching with characters that have a multi-byte
+representation in the locale encoding.  Note, however, that using
+characters which cannot be represented in the locale encoding can lead
+to surprising results.

  @deffn {Scheme Procedure} string-match pattern str [start]
  Compile the string @var{pattern} into a regular expression and compare
-- 
2.38.1
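
As an illustration of the paragraph added by the patch, here is a minimal Python sketch of the general technique it describes: encode the pattern and the subject string, match on bytes, then translate the byte offsets back into character offsets. This is not Guile's implementation. The helper names `byte_to_char_offset` and `search_via_encoding` are hypothetical, the pattern is treated as a literal string rather than a regular expression, and the replacement of unrepresentable characters with `?` is an assumption of the sketch, not a statement about what Guile does.

```python
import re


def byte_to_char_offset(encoded: bytes, byte_offset: int, encoding: str) -> int:
    """Translate a byte offset in `encoded` into a character offset."""
    # Decoding the prefix and counting characters gives the character index.
    return len(encoded[:byte_offset].decode(encoding))


def search_via_encoding(literal: str, text: str, encoding: str):
    """Search for `literal` in `text` by matching on encoded bytes.

    Returns (start, end) as character offsets, or None if there is no match.
    Characters that the encoding cannot represent are replaced with b"?"
    (an assumption of this sketch).
    """
    raw_literal = literal.encode(encoding, errors="replace")
    raw_text = text.encode(encoding, errors="replace")
    # re.escape keeps the encoded literal from being read as regex syntax.
    m = re.search(re.escape(raw_literal), raw_text)
    if m is None:
        return None
    return (
        byte_to_char_offset(raw_text, m.start(), encoding),
        byte_to_char_offset(raw_text, m.end(), encoding),
    )


# Multi-byte encoding: the match starts at byte 13 but at character 7.
print(search_via_encoding("é", "日本語 café", "utf-8"))  # (7, 8)

# Encoding that cannot represent "é": the literal degrades to "?" and
# matches a question mark that has nothing to do with "é".
print(search_via_encoding("é", "really?", "ascii"))  # (6, 7)
```

The first call shows why offsets have to be reported in characters rather than bytes for a multi-byte encoding. The second shows one way a "surprising result" can arise once a character has no representation in the target encoding.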




