* bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
@ 2015-02-15 15:44 mohammad.mahmoudi
2015-02-15 20:16 ` Andreas Politz
0 siblings, 1 reply; 6+ messages in thread
From: mohammad.mahmoudi @ 2015-02-15 15:44 UTC (permalink / raw)
To: 19878
This is to report that the Syntax class [:alpha:] wrongly matches the
Indian digits ۱۲۳۴۵۶۷۸۹۰ as letters.
In GNU Emacs 24.4.1 (i686-pc-mingw32)
of 2014-10-24 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.1.7601
Configured using:
`configure --prefix=/c/usr'
Important settings:
value of $LANG: ENU
locale-coding-system: cp1256
From: Andreas Politz @ 2015-02-15 20:16 UTC (permalink / raw)
To: mohammad.mahmoudi; +Cc: 19878
I think this is supposed to be:
,----[ (info "(elisp) Char Classes") ]
| `[:alpha:]'
| This matches any letter. (At present, for multibyte characters, it
| matches anything that has word syntax.)
`----
-ap
From: Eli Zaretskii @ 2015-02-17 16:13 UTC (permalink / raw)
To: Andreas Politz; +Cc: mohammad.mahmoudi, 19878
> From: Andreas Politz <politza@hochschule-trier.de>
> Date: Sun, 15 Feb 2015 21:16:13 +0100
> Cc: 19878@debbugs.gnu.org
>
>
> I think this is supposed to be:
>
> ,----[ (info "(elisp) Char Classes") ]
> | `[:alpha:]'
> | This matches any letter. (At present, for multibyte characters, it
> | matches anything that has word syntax.)
> `----
Indeed, which doesn't sound very nice.
Does someone object to the changes below (to be installed on master)?
They make [:alpha:] and [:alnum:] closer to the Unicode
recommendations in UTS #18, although we are still very far from
supporting even Level 1 of conformance. But these two seem like
low-hanging fruit to me.
The modified definitions of these two sets are not 100% compatible
with the old ones for the multibyte characters. However, if it turns
out that some code used these to get word-constituent characters,
those places should simply be changed to use \sw instead.
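To make the incompatibility concrete, here is a minimal, self-contained C sketch of the two rules for non-ASCII characters. It does not use Emacs internals: the `toy_*` table, the `old_alpha_nonascii` and `new_alpha_nonascii` names, and the syntax values assigned to each character are assumptions made for illustration only (the word syntax of ۱ is taken from the report). The new rule here checks just two letter categories, not the full list in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical) for Emacs's syntax table and
   unicode-category-table; only four characters are modelled.  */
enum toy_syntax { TOY_WORD, TOY_PUNCT };
enum toy_category { TOY_LL, TOY_LO, TOY_ND, TOY_PO };

struct toy_char
{
  int code;
  enum toy_syntax syntax;      /* assumed syntax class */
  enum toy_category category;  /* Unicode general category */
};

static const struct toy_char toy_table[] = {
  { 0x00E9, TOY_WORD,  TOY_LL },  /* LATIN SMALL LETTER E WITH ACUTE */
  { 0x0628, TOY_WORD,  TOY_LO },  /* ARABIC LETTER BEH */
  { 0x06F1, TOY_WORD,  TOY_ND },  /* EXTENDED ARABIC-INDIC DIGIT ONE */
  { 0x060C, TOY_PUNCT, TOY_PO },  /* ARABIC COMMA */
};

/* Return the table entry for C, or NULL if C is not modelled.  */
static const struct toy_char *
toy_lookup (int c)
{
  for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
    if (toy_table[i].code == c)
      return &toy_table[i];
  return NULL;
}

/* Old rule for non-ASCII characters: word syntax means "alpha".  */
bool
old_alpha_nonascii (int c)
{
  const struct toy_char *e = toy_lookup (c);
  return e != NULL && e->syntax == TOY_WORD;
}

/* New rule: decide by general category (two letter categories only
   in this toy; the patch checks more).  */
bool
new_alpha_nonascii (int c)
{
  const struct toy_char *e = toy_lookup (c);
  return e != NULL && (e->category == TOY_LL || e->category == TOY_LO);
}
```

Under these assumptions the digit U+06F1 is "alpha" by the old rule but not by the new one, while the letter U+0628 is accepted by both. Code that really wants word constituents is the case where \sw remains the right tool.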
Also, does anyone see a potential problem with making [:digit:] a
superset of the current ASCII-only set, to match UTS #18 as well? The
comment in regex.c says it is "only used for single-byte characters",
but it isn't clear to me whether this is a requirement, i.e. there's
some code in Emacs that relies on that, or just a statement of fact.
Please note that this is my first serious change in regex.c, so I'd
appreciate review from people "in the know". TIA.
--- src/regex.c~0 2015-01-04 10:44:36 +0200
+++ src/regex.c 2015-02-17 17:40:56 +0200
@@ -324,12 +324,12 @@ enum syntaxcode { Swhitespace = 0, Sword
? (((c) >= 'a' && (c) <= 'z') \
|| ((c) >= 'A' && (c) <= 'Z') \
|| ((c) >= '0' && (c) <= '9')) \
- : SYNTAX (c) == Sword)
+ : (alphabeticp (c) || decimalnump (c)))
# define ISALPHA(c) (IS_REAL_ASCII (c) \
? (((c) >= 'a' && (c) <= 'z') \
|| ((c) >= 'A' && (c) <= 'Z')) \
- : SYNTAX (c) == Sword)
+ : alphabeticp (c))
# define ISLOWER(c) lowercasep (c)
@@ -1872,6 +1872,8 @@ struct range_table_work_area
#define BIT_SPACE 0x8
#define BIT_UPPER 0x10
#define BIT_MULTIBYTE 0x20
+#define BIT_ALPHA 0x40
+#define BIT_ALNUM 0x80
\f
/* Set the bit for character C in a list. */
@@ -2072,7 +2074,9 @@ re_wctype_to_bit (re_wctype_t cc)
{
case RECC_NONASCII: case RECC_PRINT: case RECC_GRAPH:
case RECC_MULTIBYTE: return BIT_MULTIBYTE;
- case RECC_ALPHA: case RECC_ALNUM: case RECC_WORD: return BIT_WORD;
+ case RECC_ALPHA: return BIT_ALPHA;
+ case RECC_ALNUM: return BIT_ALNUM;
+ case RECC_WORD: return BIT_WORD;
case RECC_LOWER: return BIT_LOWER;
case RECC_UPPER: return BIT_UPPER;
case RECC_PUNCT: return BIT_PUNCT;
@@ -2930,7 +2934,7 @@ regex_compile (const_re_char *pattern, s
#endif /* emacs */
/* In most cases the matching rule for char classes
only uses the syntax table for multibyte chars,
- so that the content of the syntax-table it is not
+ so that the content of the syntax-table is not
hardcoded in the range_table. SPACE and WORD are
the two exceptions. */
if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
@@ -2945,7 +2949,7 @@ regex_compile (const_re_char *pattern, s
p = class_beg;
SET_LIST_BIT ('[');
- /* Because the `:' may starts the range, we
+ /* Because the `:' may start the range, we
can't simply set bit and repeat the loop.
Instead, just set it to C and handle below. */
c = ':';
@@ -5513,7 +5517,9 @@ re_match_2_internal (struct re_pattern_b
| (class_bits & BIT_PUNCT && ISPUNCT (c))
| (class_bits & BIT_SPACE && ISSPACE (c))
| (class_bits & BIT_UPPER && ISUPPER (c))
- | (class_bits & BIT_WORD && ISWORD (c)))
+ | (class_bits & BIT_WORD && ISWORD (c))
+ | (class_bits & BIT_ALPHA && ISALPHA (c))
+ | (class_bits & BIT_ALNUM && ISALNUM (c)))
not = !not;
else
CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
--- src/character.c~0 2015-01-13 06:48:01 +0200
+++ src/character.c 2015-02-17 17:05:20 +0200
@@ -984,6 +984,50 @@ character is not ASCII nor 8-bit charact
#ifdef emacs
+/* Return 'true' if C is an alphabetic character as defined by its
+   Unicode properties.  */
+bool
+alphabeticp (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18.  There are additional characters that should be
+         here, those designated as Other_uppercase, Other_lowercase,
+         and Other_alphabetic; FIXME.  */
+      return (gen_cat == UNICODE_CATEGORY_Lu
+              || gen_cat == UNICODE_CATEGORY_Ll
+              || gen_cat == UNICODE_CATEGORY_Lt
+              || gen_cat == UNICODE_CATEGORY_Lm
+              || gen_cat == UNICODE_CATEGORY_Lo
+              || gen_cat == UNICODE_CATEGORY_Mn
+              || gen_cat == UNICODE_CATEGORY_Mc
+              || gen_cat == UNICODE_CATEGORY_Me
+              || gen_cat == UNICODE_CATEGORY_Nl) ? true : false;
+    }
+  return false;  /* No integer category is recorded for C.  */
+}
+
+/* Return 'true' if C is a decimal-number character as defined by its
+   Unicode properties.  */
+bool
+decimalnump (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18.  */
+      return (gen_cat == UNICODE_CATEGORY_Nd) ? true : false;
+    }
+  return false;  /* No integer category is recorded for C.  */
+}
+
void
syms_of_character (void)
{
--- src/character.h~0 2015-01-06 10:15:13 +0200
+++ src/character.h 2015-02-17 17:05:33 +0200
@@ -660,6 +660,9 @@
extern Lisp_Object Vchar_unify_table;
extern Lisp_Object string_escape_byte8 (Lisp_Object);
+extern bool alphabeticp (int);
+extern bool decimalnump (int);
+
/* Return a translation table of id number ID. */
#define GET_TRANSLATION_TABLE(id) \
(XCDR (XVECTOR (Vtranslation_table_vector)->contents[(id)]))
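The regex.c part of the patch hinges on giving [:alpha:] and [:alnum:] their own class bits instead of sharing BIT_WORD. The following standalone C sketch shows why that matters at match time. It is not Emacs code: every `sk_*` name is hypothetical, the value of `SK_BIT_WORD` is assumed, and the three predicates know about two code points only.

```c
#include <stdbool.h>

/* The ALPHA and ALNUM values are the ones the patch adds; the value
   of SK_BIT_WORD is an assumption made for this sketch.  */
#define SK_BIT_WORD  0x01
#define SK_BIT_ALPHA 0x40
#define SK_BIT_ALNUM 0x80

enum sk_class { SK_ALPHA, SK_ALNUM, SK_WORD };

/* Before the patch: all three classes collapse into one bit.  */
int
sk_old_class_to_bit (enum sk_class cc)
{
  (void) cc;
  return SK_BIT_WORD;
}

/* After the patch: each class gets its own bit.  */
int
sk_new_class_to_bit (enum sk_class cc)
{
  switch (cc)
    {
    case SK_ALPHA: return SK_BIT_ALPHA;
    case SK_ALNUM: return SK_BIT_ALNUM;
    default:       return SK_BIT_WORD;
    }
}

/* Toy predicates covering two code points only:
   U+0628 ARABIC LETTER BEH and U+06F1 EXTENDED ARABIC-INDIC DIGIT ONE.
   Both are assumed to have word syntax.  */
static bool sk_isalpha (int c) { return c == 0x0628; }
static bool sk_isalnum (int c) { return c == 0x0628 || c == 0x06F1; }
static bool sk_isword (int c)  { return c == 0x0628 || c == 0x06F1; }

/* Match-time test, mirroring the shape of the patched expression in
   re_match_2_internal.  */
bool
sk_class_bits_match (int class_bits, int c)
{
  return ((class_bits & SK_BIT_WORD) && sk_isword (c))
         || ((class_bits & SK_BIT_ALPHA) && sk_isalpha (c))
         || ((class_bits & SK_BIT_ALNUM) && sk_isalnum (c));
}
```

With the old mapping, a pattern compiled from [:alpha:] carries only the word bit, so the matcher cannot tell it apart from [:word:] and accepts the digit U+06F1. With separate bits, the same pattern rejects the digit while [:alnum:] and [:word:] still accept it.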
From: Ivan Shmakov @ 2015-02-17 18:15 UTC (permalink / raw)
To: 19878
>>>>> Eli Zaretskii <eliz@gnu.org> writes:
[…]
> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well?
> The comment in regex.c says it is "only used for single-byte
> characters", but it isn't clear to me whether this is a requirement,
> i. e. there's some code in Emacs that relies on that, or just a
> statement of facts.
Just for a random data point, my own preference was to always
use [0-9] when the intent is to discern a number for a later use
of string-to-number, etc. Frankly, I can’t even readily suggest
any reasonable examples where one’d want to use [:digit:] in the
first place.
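The concern behind preferring [0-9] can be sketched in C: once a digit class matches more than ASCII, whatever converts the matched text to a number has to understand those digits too. The sketch below is only an illustration under that assumption. The `sk_decimal_digit_value` and `sk_parse_decimal` names are hypothetical, and just three digit ranges are handled, where a real implementation would consult the Unicode data for every Nd character.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return the numeric value (0..9) of decimal digit C, or -1 if C is
   not in one of the three digit ranges modelled here.  */
int
sk_decimal_digit_value (int c)
{
  if (c >= '0' && c <= '9')
    return c - '0';            /* ASCII digits */
  if (c >= 0x0660 && c <= 0x0669)
    return c - 0x0660;         /* ARABIC-INDIC DIGIT ZERO..NINE */
  if (c >= 0x06F0 && c <= 0x06F9)
    return c - 0x06F0;         /* EXTENDED ARABIC-INDIC DIGIT ZERO..NINE */
  return -1;
}

/* Parse N code points as a non-negative decimal number.  Return true
   and store the result in *OUT only if every code point is a digit.
   Overflow is not checked in this sketch.  */
bool
sk_parse_decimal (const int *cps, size_t n, long *out)
{
  long value = 0;

  if (n == 0)
    return false;
  for (size_t i = 0; i < n; i++)
    {
      int d = sk_decimal_digit_value (cps[i]);
      if (d < 0)
        return false;
      value = value * 10 + d;
    }
  *out = value;
  return true;
}
```

A parser that only knows the ASCII branch would reject ۱۲۳ even though a Unicode-aware [:digit:] would have matched it, which is the mismatch that using [0-9] avoids.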
[…]
--
FSF associate member #7257 http://boycottsystemd.org/ … 3013 B6A0 230E 334A
From: Eli Zaretskii @ 2015-02-17 18:45 UTC (permalink / raw)
To: Ivan Shmakov; +Cc: 19878
> From: Ivan Shmakov <ivan@siamics.net>
> Date: Tue, 17 Feb 2015 18:15:09 +0000
>
> Frankly, I can’t even readily suggest any reasonable examples
> where one’d want to use [:digit:] in the first place.
Interactive search is one obvious use case, I think.
From: Eli Zaretskii @ 2015-02-28 12:29 UTC (permalink / raw)
To: politza, mohammad.mahmoudi; +Cc: 19878-done
> Date: Tue, 17 Feb 2015 18:13:05 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: mohammad.mahmoudi@gmail.com, 19878@debbugs.gnu.org
>
> > From: Andreas Politz <politza@hochschule-trier.de>
> > Date: Sun, 15 Feb 2015 21:16:13 +0100
> > Cc: 19878@debbugs.gnu.org
> >
> >
> > I think this is supposed to be:
> >
> > ,----[ (info "(elisp) Char Classes") ]
> > | `[:alpha:]'
> > | This matches any letter. (At present, for multibyte characters, it
> > | matches anything that has word syntax.)
> > `----
>
> Indeed, which doesn't sound very nice.
>
> Does someone object to the changes below (to be installed on master)?
> They make [:alpha:] and [:alnum:] closer to the Unicode
> recommendations in UTS #18, although we are still very far from
> supporting even Level 1 of conformance. But these two seem like
> low-hanging fruit to me.
>
> The modified definitions of these two sets are not 100% compatible
> with the old ones for the multibyte characters. However, if it turns
> out that some code used these to get word-constituent characters,
> those places should simply be changed to use \sw instead.
No further comments, so I pushed the changes as commit 1a50945 on the
master branch, and I'm marking this bug closed.
> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well? The
> comment in regex.c says it is "only used for single-byte characters",
> but it isn't clear to me whether this is a requirement, i.e. there's
> some code in Emacs that relies on that, or just a statement of facts.
I'd still like to hear an answer and/or opinions about this. If I
hear no comments, I will look into making a similar change to
[:digit:] soon.