* bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
@ 2015-02-15 15:44 mohammad.mahmoudi
  2015-02-15 20:16 ` Andreas Politz
  0 siblings, 1 reply; 6+ messages in thread
From: mohammad.mahmoudi @ 2015-02-15 15:44 UTC (permalink / raw)
  To: 19878

This is to report that the Syntax class [:alpha:] wrongly matches the
Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter.

In GNU Emacs 24.4.1 (i686-pc-mingw32)
 of 2014-10-24 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.1.7601
Configured using:
 `configure --prefix=/c/usr'

Important settings:
  value of $LANG: ENU
  locale-coding-system: cp1256

^ permalink raw reply	[flat|nested] 6+ messages in thread
* bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
  2015-02-15 15:44 bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter mohammad.mahmoudi
@ 2015-02-15 20:16 ` Andreas Politz
  2015-02-17 16:13   ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Politz @ 2015-02-15 20:16 UTC (permalink / raw)
  To: mohammad.mahmoudi; +Cc: 19878

I think this is supposed to be:

,----[ (info "(elisp) Char Classes") ]
| `[:alpha:]'
|      This matches any letter.  (At present, for multibyte characters, it
|      matches anything that has word syntax.)
`----

-ap
* bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
  2015-02-15 20:16 ` Andreas Politz
@ 2015-02-17 16:13   ` Eli Zaretskii
  2015-02-17 18:15     ` Ivan Shmakov
  2015-02-28 12:29     ` Eli Zaretskii
  0 siblings, 2 replies; 6+ messages in thread
From: Eli Zaretskii @ 2015-02-17 16:13 UTC (permalink / raw)
  To: Andreas Politz; +Cc: mohammad.mahmoudi, 19878

> From: Andreas Politz <politza@hochschule-trier.de>
> Date: Sun, 15 Feb 2015 21:16:13 +0100
> Cc: 19878@debbugs.gnu.org
>
>
> I think this is supposed to be:
>
> ,----[ (info "(elisp) Char Classes") ]
> | `[:alpha:]'
> |      This matches any letter.  (At present, for multibyte characters, it
> |      matches anything that has word syntax.)
> `----

Indeed, which doesn't sound very nice.

Does someone object to the changes below (to be installed on master)?
They make [:alpha:] and [:alnum:] closer to the Unicode
recommendations in UTS #18, although we are still very far from
supporting even Level 1 of conformance.  But these two seem like
low-hanging fruit to me.

The modified definitions of these two sets are not 100% compatible
with the old ones for the multibyte characters.  However, if it turns
out that some code used these to get word-constituent characters,
those places should simply be changed to use \sw instead.

Also, does someone see any potential problem to make [:digit:] be a
superset of the current ASCII-only set, to match UTS #18 as well?  The
comment in regex.c says it is "only used for single-byte characters",
but it isn't clear to me whether this is a requirement, i.e. there's
some code in Emacs that relies on that, or just a statement of facts.

Please note that this is my first serious change in regex.c, so I'd
appreciate review from people "in the know".  TIA.

--- src/regex.c~0	2015-01-04 10:44:36 +0200
+++ src/regex.c	2015-02-17 17:40:56 +0200
@@ -324,12 +324,12 @@ enum syntaxcode { Swhitespace = 0, Sword
                        ? (((c) >= 'a' && (c) <= 'z')    \
                           || ((c) >= 'A' && (c) <= 'Z') \
                           || ((c) >= '0' && (c) <= '9')) \
-                       : SYNTAX (c) == Sword)
+                       : (alphabeticp (c) || decimalnump (c)))

 # define ISALPHA(c) (IS_REAL_ASCII (c)                  \
                        ? (((c) >= 'a' && (c) <= 'z')    \
                           || ((c) >= 'A' && (c) <= 'Z')) \
-                       : SYNTAX (c) == Sword)
+                       : alphabeticp (c))

 # define ISLOWER(c) lowercasep (c)

@@ -1872,6 +1872,8 @@ struct range_table_work_area
 #define BIT_SPACE      0x8
 #define BIT_UPPER      0x10
 #define BIT_MULTIBYTE  0x20
+#define BIT_ALPHA      0x40
+#define BIT_ALNUM      0x80
 \f
 /* Set the bit for character C in a list.  */

@@ -2072,7 +2074,9 @@ re_wctype_to_bit (re_wctype_t cc)
     {
     case RECC_NONASCII: case RECC_PRINT: case RECC_GRAPH:
     case RECC_MULTIBYTE: return BIT_MULTIBYTE;
-    case RECC_ALPHA: case RECC_ALNUM: case RECC_WORD: return BIT_WORD;
+    case RECC_ALPHA: return BIT_ALPHA;
+    case RECC_ALNUM: return BIT_ALNUM;
+    case RECC_WORD: return BIT_WORD;
     case RECC_LOWER: return BIT_LOWER;
     case RECC_UPPER: return BIT_UPPER;
     case RECC_PUNCT: return BIT_PUNCT;
@@ -2930,7 +2934,7 @@ regex_compile (const_re_char *pattern, s
 #endif /* emacs */
                     /* In most cases the matching rule for char classes
                        only uses the syntax table for multibyte chars,
-                       so that the content of the syntax-table it is not
+                       so that the content of the syntax-table is not
                        hardcoded in the range_table.  SPACE and WORD are the
                        two exceptions.  */
                     if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
@@ -2945,7 +2949,7 @@ regex_compile (const_re_char *pattern, s
                     p = class_beg;
                     SET_LIST_BIT ('[');

-                    /* Because the `:' may starts the range, we
+                    /* Because the `:' may start the range, we
                        can't simply set bit and repeat the loop.
                        Instead, just set it to C and handle below.  */
                     c = ':';
@@ -5513,7 +5517,9 @@ re_match_2_internal (struct re_pattern_b
                     | (class_bits & BIT_PUNCT && ISPUNCT (c))
                     | (class_bits & BIT_SPACE && ISSPACE (c))
                     | (class_bits & BIT_UPPER && ISUPPER (c))
-                    | (class_bits & BIT_WORD  && ISWORD (c)))
+                    | (class_bits & BIT_WORD  && ISWORD (c))
+                    | (class_bits & BIT_ALPHA && ISALPHA (c))
+                    | (class_bits & BIT_ALNUM && ISALNUM (c)))
                   not = !not;
               else
                 CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
--- src/character.c~0	2015-01-13 06:48:01 +0200
+++ src/character.c	2015-02-17 17:05:20 +0200
@@ -984,6 +984,48 @@ character is not ASCII nor 8-bit charact

 #ifdef emacs

+/* Return 'true' if C is an alphabetic character as defined by its
+   Unicode properties.  */
+bool
+alphabeticp (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18.  There are additional characters that should be
+         here, those designated as Other_uppercase, Other_lowercase,
+         and Other_alphabetic; FIXME.  */
+      return (gen_cat == UNICODE_CATEGORY_Lu
+              || gen_cat == UNICODE_CATEGORY_Ll
+              || gen_cat == UNICODE_CATEGORY_Lt
+              || gen_cat == UNICODE_CATEGORY_Lm
+              || gen_cat == UNICODE_CATEGORY_Lo
+              || gen_cat == UNICODE_CATEGORY_Mn
+              || gen_cat == UNICODE_CATEGORY_Mc
+              || gen_cat == UNICODE_CATEGORY_Me
+              || gen_cat == UNICODE_CATEGORY_Nl) ? true : false;
+    }
+}
+
+/* Return 'true' if C is a decimal-number character as defined by its
+   Unicode properties.  */
+bool
+decimalnump (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18.  */
+      return (gen_cat == UNICODE_CATEGORY_Nd) ? true : false;
+    }
+}
+
 void
 syms_of_character (void)
 {
--- src/character.h~0	2015-01-06 10:15:13 +0200
+++ src/character.h	2015-02-17 17:05:33 +0200
@@ -660,6 +660,9 @@ extern Lisp_Object Vchar_unify_table;

 extern Lisp_Object string_escape_byte8 (Lisp_Object);

+extern bool alphabeticp (int);
+extern bool decimalnump (int);
+
 /* Return a translation table of id number ID.  */
 #define GET_TRANSLATION_TABLE(id) \
   (XCDR (XVECTOR (Vtranslation_table_vector)->contents[(id)]))
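
The classification rule in the patch above can be tried outside Emacs with a minimal, self-contained sketch. Everything named `sketch_*` or `CAT_*` here is hypothetical and exists only for illustration: the enum stands in for `unicode_category_t`, and the tiny lookup function stands in for `Vunicode_category_table`, covering only a handful of code points. Unlike the patch, the sketch returns false explicitly when the category is unknown.

```c
#include <stdbool.h>

/* Hypothetical stand-in for Emacs's unicode_category_t; only the
   general categories needed for this sketch are listed.  */
typedef enum
{
  CAT_UNKNOWN = 0,
  CAT_Lu, CAT_Ll, CAT_Lt, CAT_Lm, CAT_Lo,   /* letters */
  CAT_Mn, CAT_Mc, CAT_Me,                   /* marks */
  CAT_Nd, CAT_Nl,                           /* numbers */
  CAT_Po                                    /* other punctuation */
} sketch_category;

/* Toy lookup for a few code points.  A real implementation would
   consult the full Unicode character database instead.  */
static sketch_category
sketch_category_of (int c)
{
  if (c >= 'A' && c <= 'Z') return CAT_Lu;
  if (c >= 'a' && c <= 'z') return CAT_Ll;
  if (c >= '0' && c <= '9') return CAT_Nd;
  if (c >= 0x06F0 && c <= 0x06F9) return CAT_Nd; /* digits from the report */
  if (c == 0x0627) return CAT_Lo;                /* ARABIC LETTER ALEF */
  if (c == 0x0301) return CAT_Mn;                /* COMBINING ACUTE ACCENT */
  if (c == '!') return CAT_Po;
  return CAT_UNKNOWN;
}

/* Same category list as the patch's alphabeticp: letters, marks, Nl.  */
static bool
sketch_alphabeticp (int c)
{
  switch (sketch_category_of (c))
    {
    case CAT_Lu: case CAT_Ll: case CAT_Lt: case CAT_Lm: case CAT_Lo:
    case CAT_Mn: case CAT_Mc: case CAT_Me: case CAT_Nl:
      return true;
    default:
      return false;   /* includes CAT_UNKNOWN */
    }
}

/* Same rule as the patch's decimalnump: general category Nd only.  */
static bool
sketch_decimalnump (int c)
{
  return sketch_category_of (c) == CAT_Nd;
}

/* [:alnum:] for non-ASCII characters, as in the patched ISALNUM.  */
static bool
sketch_alnump (int c)
{
  return sketch_alphabeticp (c) || sketch_decimalnump (c);
}
```

Under these assumptions U+06F1 (one of the digits from the bug title) is rejected by the alphabetic test but still accepted by the alphanumeric one, which is the behavior the report asks for.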
* bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
  2015-02-17 16:13   ` Eli Zaretskii
@ 2015-02-17 18:15     ` Ivan Shmakov
  2015-02-17 18:45       ` Eli Zaretskii
  2015-02-28 12:29     ` Eli Zaretskii
  1 sibling, 1 reply; 6+ messages in thread
From: Ivan Shmakov @ 2015-02-17 18:15 UTC (permalink / raw)
  To: 19878

>>>>> Eli Zaretskii <eliz@gnu.org> writes:

[…]

 > Also, does someone see any potential problem to make [:digit:] be a
 > superset of the current ASCII-only set, to match UTS #18 as well?
 > The comment in regex.c says it is "only used for single-byte
 > characters", but it isn't clear to me whether this is a requirement,
 > i. e. there's some code in Emacs that relies on that, or just a
 > statement of facts.

	Just for a random data point, my own preference was to always
	use [0-9] when the intent is to discern a number for a later use
	of number-to-string, etc.

	Frankly, I can’t even readily suggest any reasonable examples
	where one’d want to use [:digit:] in the first place.

[…]

-- 
FSF associate member #7257  http://boycottsystemd.org/  … 3013 B6A0 230E 334A
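
The preference for [0-9] when the matched text is later converted to a number can be illustrated with a small sketch. The helper names `ascii_digit_p` and `parse_ascii_number` are hypothetical and not part of Emacs; the point is only that the usual `c - '0'` arithmetic is valid for ASCII digits and meaningless for other decimal digits such as U+06F4.

```c
#include <stdbool.h>
#include <stddef.h>

/* True only for the ten ASCII digits, i.e. what [0-9] matches.  */
static bool
ascii_digit_p (int c)
{
  return c >= '0' && c <= '9';
}

/* Parse the leading run of ASCII digits in an array of code points.
   Stores the value in *VALUE and returns how many code points were
   consumed.  Overflow is not handled; this is a sketch only.  */
static size_t
parse_ascii_number (const int *cps, size_t len, long *value)
{
  size_t i = 0;
  long v = 0;

  while (i < len && ascii_digit_p (cps[i]))
    {
      /* Valid only because cps[i] is known to be in '0'..'9'.  */
      v = v * 10 + (cps[i] - '0');
      i++;
    }
  *value = v;
  return i;
}
```

A matcher that accepted every decimal digit would hand such a parser characters it cannot convert with this arithmetic, so code written this way would need either an ASCII-only class or a per-character digit-value lookup.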
* bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
  2015-02-17 18:15     ` Ivan Shmakov
@ 2015-02-17 18:45       ` Eli Zaretskii
  0 siblings, 0 replies; 6+ messages in thread
From: Eli Zaretskii @ 2015-02-17 18:45 UTC (permalink / raw)
  To: Ivan Shmakov; +Cc: 19878

> From: Ivan Shmakov <ivan@siamics.net>
> Date: Tue, 17 Feb 2015 18:15:09 +0000
>
> 	Frankly, I can’t even readily suggest any reasonable examples
> 	where one’d want to use [:digit:] in the first place.

Interactive search is one obvious use case, I think.
* bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
  2015-02-17 16:13   ` Eli Zaretskii
  2015-02-17 18:15     ` Ivan Shmakov
@ 2015-02-28 12:29     ` Eli Zaretskii
  1 sibling, 0 replies; 6+ messages in thread
From: Eli Zaretskii @ 2015-02-28 12:29 UTC (permalink / raw)
  To: politza, mohammad.mahmoudi; +Cc: 19878-done

> Date: Tue, 17 Feb 2015 18:13:05 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: mohammad.mahmoudi@gmail.com, 19878@debbugs.gnu.org
>
> > From: Andreas Politz <politza@hochschule-trier.de>
> > Date: Sun, 15 Feb 2015 21:16:13 +0100
> > Cc: 19878@debbugs.gnu.org
> >
> >
> > I think this is supposed to be:
> >
> > ,----[ (info "(elisp) Char Classes") ]
> > | `[:alpha:]'
> > |      This matches any letter.  (At present, for multibyte characters, it
> > |      matches anything that has word syntax.)
> > `----
>
> Indeed, which doesn't sound very nice.
>
> Does someone object to the changes below (to be installed on master)?
> They make [:alpha:] and [:alnum:] closer to the Unicode
> recommendations in UTS #18, although we are still very far from
> supporting even Level 1 of conformance.  But these two seem like
> low-hanging fruit to me.
>
> The modified definitions of these two sets are not 100% compatible
> with the old ones for the multibyte characters.  However, if it turns
> out that some code used these to get word-constituent characters,
> those places should simply be changed to use \sw instead.

No further comments, so I pushed the changes as commit 1a50945 on the
master branch, and I'm marking this bug closed.

> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well?  The
> comment in regex.c says it is "only used for single-byte characters",
> but it isn't clear to me whether this is a requirement, i.e. there's
> some code in Emacs that relies on that, or just a statement of facts.

I'd still like to hear an answer and/or opinions about this.  If I
hear no comments, I will look into making a similar change to
[:digit:] soon.