unofficial mirror of bug-guile@gnu.org 
* bug#57507: Regular expression matching depends on locale encoding
From: Jean Abou Samra @ 2022-08-31 16:54 UTC
  To: 57507

Regular expressions do funky things with Unicode if a non-Unicode-aware
locale is set. Yet, they're purely string operations, so I don't think
it's expected that they depend on the locale encoding.



$ LC_ALL=C guile3.0
GNU Guile 3.0.7
Copyright (C) 1995-2021 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (use-modules (ice-9 regex))
scheme@(guile-user)> (match:substring (string-match "\u203f" "\u3091"))
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
In procedure make-regexp: Invalid preceding regular expression

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> ,q
scheme@(guile-user)> (match:substring (string-match "[\u203f]" "\u3091"))
$1 = "\u3091"
scheme@(guile-user)>






* bug#57507: Regular expression matching depends on locale encoding
From: dsmich @ 2022-09-01 19:34 UTC
  To: 'Jean Abou Samra'; +Cc: '57507@debbugs.gnu.org'


Also remember that Guile uses the system C library regex routines. And
is using C strings, not Guile strings.

(sorry for top post, too tired to fight with this web editor)

-Dale
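
A minimal C sketch of the point above, calling the POSIX regex routines directly. It rests on an assumption that this thread does not confirm: that a character with no representation in the locale encoding reaches the C library as a literal `?`. The helper names `compile_status` and `matches` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* A C program starts in the "C" locale, which is the situation
   LC_ALL=C creates for the Guile session in the report. */

/* Compile PATTERN as a POSIX extended regex.  Return regcomp's status
   (0 on success) and print the library's message on failure. */
int compile_status(const char *pattern)
{
    assert(pattern != NULL);
    regex_t re;
    int status = regcomp(&re, pattern, REG_EXTENDED);
    if (status != 0) {
        char msg[128];
        regerror(status, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp(\"%s\"): %s\n", pattern, msg);
        return status;
    }
    regfree(&re);
    return 0;
}

/* Return 1 if PATTERN matches somewhere in SUBJECT, 0 if it does not,
   and -1 if PATTERN fails to compile. */
int matches(const char *pattern, const char *subject)
{
    assert(pattern != NULL && subject != NULL);
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    int found = regexec(&re, subject, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

Under that assumption, `compile_status("?")` stands in for the first `string-match` call in the report (a repetition operator with nothing before it, which the library rejects), and `matches("[?]", "?")` stands in for the second (a bracket expression containing a literal `?`, which matches the substituted subject).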

-----------------------------------------
From: "Jean Abou Samra"
To: 57507@debbugs.gnu.org
Cc:
Sent: Wednesday August 31 2022 12:55:13PM
Subject: bug#57507: Regular expression matching depends on locale encoding

 Regular expressions do funky things with Unicode if a non-Unicode-aware
 locale is set. Yet, they're purely string operations, so I don't think
 it's expected that they depend on the locale encoding.

 $ LC_ALL=C guile3.0
 GNU Guile 3.0.7
 Copyright (C) 1995-2021 Free Software Foundation, Inc.

 Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
 This program is free software, and you are welcome to redistribute it
 under certain conditions; type `,show c' for details.

 Enter `,help' for help.
 scheme@(guile-user)> (use-modules (ice-9 regex))
 scheme@(guile-user)> (match:substring (string-match "\u203f" "\u3091"))
 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
 In procedure make-regexp: Invalid preceding regular expression

 Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
 scheme@(guile-user) [1]> ,q
 scheme@(guile-user)> (match:substring (string-match "[\u203f]" "\u3091"))
 $1 = "\u3091"
 scheme@(guile-user)>



* bug#57507: Regular expression matching depends on locale encoding
From: Ludovic Courtès @ 2022-09-05  7:48 UTC
  To: Jean Abou Samra; +Cc: 57507

Hi Jean,

Jean Abou Samra <jean@abou-samra.fr> wrote:

> Regular expressions do funky things with Unicode if a non-Unicode-aware
> locale is set. Yet, they're purely string operations, so I don't think
> it's expected that they depend on the locale encoding.

This is the expected behavior: first because (ice-9 regex) is
implemented in terms of the libc regex functions, as Dale put it (but that
could be thought of as an implementation detail), and second because things
such as character classes are necessarily locale-dependent (this has
bitten us in the past, for instance with <https://bugs.gnu.org/35785>).

I hope that makes sense.

Thanks,
Ludo’.
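
To illustrate the second point (character classes depend on the locale), here is a small C sketch against the POSIX regex API. It asks whether a single byte matches `[[:alpha:]]` under a named locale. The helper name `byte_is_alpha_in` is hypothetical, and which locales are installed on a given system is an assumption the caller has to check.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the one-byte string made of BYTE matches [[:alpha:]]
   under the locale LOCALE_NAME, 0 if it does not, and -1 if the locale
   is unavailable or the pattern fails to compile.
   Side effect: the process locale is reset to "C" before returning. */
int byte_is_alpha_in(const char *locale_name, unsigned char byte)
{
    assert(locale_name != NULL);
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    regex_t re;
    int result = -1;
    if (regcomp(&re, "^[[:alpha:]]$", REG_EXTENDED | REG_NOSUB) == 0) {
        char subject[2] = { (char)byte, '\0' };
        result = regexec(&re, subject, 0, NULL, 0) == 0;
        regfree(&re);
    }
    setlocale(LC_ALL, "C");
    return result;
}
```

In the "C" locale only ASCII letters count as alphabetic, so byte 0xE9 is rejected there. The same byte is the letter é in ISO-8859-1, so a Latin-1 locale (for example one named `fr_FR.ISO-8859-1`, if such a locale is installed) would be expected to accept it. The pattern is identical in both cases; only the locale differs.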





* bug#57507: Regular expression matching depends on locale encoding
From: Jean Abou Samra @ 2022-09-05 18:39 UTC
  To: Ludovic Courtès; +Cc: 57507

On 05/09/2022 at 09:48, Ludovic Courtès wrote:
> Hi Jean,
>
> Jean Abou Samra <jean@abou-samra.fr> wrote:
>
>> Regular expressions do funky things with Unicode if a non-Unicode-aware
>> locale is set. Yet, they're purely string operations, so I don't think
>> it's expected that they depend on the locale encoding.
> This is the expected behavior: first because (ice-9 regex) is
> implemented in terms of the libc regex functions, as Dale put it (but that
> could be thought of as an implementation detail), and second because things
> such as character classes are necessarily locale-dependent (this has
> bitten us in the past, for instance with <https://bugs.gnu.org/35785>).
>
> I hope that makes sense.



OK, thanks, but in this case, it should be clearly stated as a limitation
in the (ice-9 regex) documentation IMHO. If you don't know what constraints
there are on the implementation, there is no reason to expect this. Would it
help if I submitted a patch for that?






* bug#57507: Regular expression matching depends on locale encoding
From: Ludovic Courtès @ 2022-09-05 19:24 UTC
  To: Jean Abou Samra; +Cc: 57507

Hi,

Jean Abou Samra <jean@abou-samra.fr> wrote:

> On 05/09/2022 at 09:48, Ludovic Courtès wrote:
>> Hi Jean,
>>
>> Jean Abou Samra <jean@abou-samra.fr> wrote:
>>
>>> Regular expressions do funky things with Unicode if a non-Unicode-aware
>>> locale is set. Yet, they're purely string operations, so I don't think
>>> it's expected that they depend on the locale encoding.
>> This is the expected behavior: first because (ice-9 regex) is
>> implemented in terms of the libc regex functions, as Dale put it (but that
>> could be thought of as an implementation detail), and second because things
>> such as character classes are necessarily locale-dependent (this has
>> bitten us in the past, for instance with <https://bugs.gnu.org/35785>).
>>
>> I hope that makes sense.
>
>
>
> OK, thanks, but in this case, it should be clearly stated as a limitation
> in the (ice-9 regex) documentation IMHO. If you don't know what constraints
> there are on the implementation, there is no reason to expect this. Would it
> help if I submitted a patch for that?

Yes, that’d be welcome.  I would not call it a constraint or limitation;
for example, that ‘w’ is not a letter in Swedish is the kind of thing
you’d generally want to take into account.  Now, it’d be nice if one
could easily specify the locale to operate under, with an API similar to
that of (ice-9 i18n) and its first-class locale objects.

Thanks,
Ludo’.





* bug#57507: Regular expression matching depends on locale encoding
From: Jean Abou Samra @ 2022-11-17 20:33 UTC
  To: Ludovic Courtès; +Cc: 57507



On 05/09/2022 at 21:24, Ludovic Courtès wrote:
> Yes, that’d be welcome.  I would not call it a constraint or limitation;
> for example, that ‘w’ is not a letter in Swedish is the kind of thing
> you’d generally want to take into account.  Now, it’d be nice if one
> could easily specify the locale to operate under, with an API similar to
> that of (ice-9 i18n) and its first-class locale objects.



Sorry that it took me forever to send this.



From c666ca4f72dc0a00d28b8d7ef1221ebfc9741551 Mon Sep 17 00:00:00 2001
From: Jean Abou Samra <jean@abou-samra.fr>
Date: Thu, 17 Nov 2022 21:26:07 +0100
Subject: [PATCH] Doc: clarification on regexes and encodings

* doc/ref/api-regex.texi: make it more obviously clear that regexp
   matching supports only characters supported by the locale encoding.
---
  doc/ref/api-regex.texi | 6 +++++-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
index b14c2b39c..bd1f4079d 100644
--- a/doc/ref/api-regex.texi
+++ b/doc/ref/api-regex.texi
@@ -57,7 +57,11 @@ locale's encoding, and then passed to the C library's regular expression
  routines (@pxref{Regular Expressions,,, libc, The GNU C Library
  Reference Manual}).  The returned match structures always point to
  characters in the strings, not to individual bytes, even in the case of
-multi-byte encodings.
+multi-byte encodings.  This ensures that the match structures are
+correct when performing matching with characters that have a multi-byte
+representation in the locale encoding.  Note, however, that using
+characters which cannot be represented in the locale encoding can lead
+to surprising results.

  @deffn {Scheme Procedure} string-match pattern str [start]
  Compile the string @var{pattern} into a regular expression and compare
-- 
2.38.1


