From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian
	digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Tue, 17 Feb 2015 18:13:05 +0200
Message-ID: <838ufw7bzi.fsf@gnu.org>
References: <alpine.WNT.2.11.1502151913420.2828@anzr>
	<87k2zjj5gy.fsf@hochschule-trier.de>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1424189670 13839 80.91.229.3 (17 Feb 2015 16:14:30 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 17 Feb 2015 16:14:30 +0000 (UTC)
Cc: mohammad.mahmoudi@gmail.com, 19878@debbugs.gnu.org
To: Andreas Politz <politza@hochschule-trier.de>
X-Loop: help-debbugs@gnu.org
Resent-From: Eli Zaretskii <eliz@gnu.org>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Tue, 17 Feb 2015 16:14:01 +0000
Resent-Message-ID: <handler.19878.B19878.142418958820982@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 19878
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
In-reply-to: <87k2zjj5gy.fsf@hochschule-trier.de>
Xref: news.gmane.org gmane.emacs.bugs:99494
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/99494>

> From: Andreas Politz <politza@hochschule-trier.de>
> Date: Sun, 15 Feb 2015 21:16:13 +0100
> Cc: 19878@debbugs.gnu.org
> 
> 
> I think this is supposed to be:
> 
> ,----[ (info "(elisp) Char Classes") ]
> | `[:alpha:]'
> |      This matches any letter.  (At present, for multibyte characters, it
> |      matches anything that has word syntax.)
> `----

Indeed, which doesn't sound very nice.

Does anyone object to the changes below (to be installed on master)?
They make [:alpha:] and [:alnum:] closer to the Unicode
recommendations in UTS #18, although we are still very far from
supporting even Level 1 of conformance.  But these two seem like
low-hanging fruit to me.

The modified definitions of these two sets are not 100% compatible
with the old ones for multibyte characters.  However, if it turns
out that some code used these classes to get word-constituent
characters, those places should simply be changed to use \sw instead.

Also, does anyone see any potential problem with making [:digit:] a
superset of the current ASCII-only set, to match UTS #18 as well?  The
comment in regex.c says it is "only used for single-byte characters",
but it isn't clear to me whether this is a requirement, i.e. there's
some code in Emacs that relies on that, or just a statement of fact.

Please note that this is my first serious change in regex.c, so I'd
appreciate review from people "in the know".  TIA.

--- src/regex.c~0	2015-01-04 10:44:36 +0200
+++ src/regex.c	2015-02-17 17:40:56 +0200
@@ -324,12 +324,12 @@ enum syntaxcode { Swhitespace = 0, Sword
 		    ? (((c) >= 'a' && (c) <= 'z')	\
 		       || ((c) >= 'A' && (c) <= 'Z')	\
 		       || ((c) >= '0' && (c) <= '9'))	\
-		    : SYNTAX (c) == Sword)
+		    : (alphabeticp (c) || decimalnump (c)))
 
 # define ISALPHA(c) (IS_REAL_ASCII (c)			\
 		    ? (((c) >= 'a' && (c) <= 'z')	\
 		       || ((c) >= 'A' && (c) <= 'Z'))	\
-		    : SYNTAX (c) == Sword)
+		    : alphabeticp (c))
 
 # define ISLOWER(c) lowercasep (c)
 
@@ -1872,6 +1872,8 @@ struct range_table_work_area
 #define BIT_SPACE	0x8
 #define BIT_UPPER	0x10
 #define BIT_MULTIBYTE	0x20
+#define BIT_ALPHA	0x40
+#define BIT_ALNUM	0x80
 
 
 /* Set the bit for character C in a list.  */
@@ -2072,7 +2074,9 @@ re_wctype_to_bit (re_wctype_t cc)
     {
     case RECC_NONASCII: case RECC_PRINT: case RECC_GRAPH:
     case RECC_MULTIBYTE: return BIT_MULTIBYTE;
-    case RECC_ALPHA: case RECC_ALNUM: case RECC_WORD: return BIT_WORD;
+    case RECC_ALPHA: return BIT_ALPHA;
+    case RECC_ALNUM: return BIT_ALNUM;
+    case RECC_WORD: return BIT_WORD;
     case RECC_LOWER: return BIT_LOWER;
     case RECC_UPPER: return BIT_UPPER;
     case RECC_PUNCT: return BIT_PUNCT;
@@ -2930,7 +2934,7 @@ regex_compile (const_re_char *pattern, s
 #endif	/* emacs */
 			/* In most cases the matching rule for char classes
 			   only uses the syntax table for multibyte chars,
-			   so that the content of the syntax-table it is not
+			   so that the content of the syntax-table is not
 			   hardcoded in the range_table.  SPACE and WORD are
 			   the two exceptions.  */
 			if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
@@ -2945,7 +2949,7 @@ regex_compile (const_re_char *pattern, s
 			p = class_beg;
 			SET_LIST_BIT ('[');
 
-			/* Because the `:' may starts the range, we
+			/* Because the `:' may start the range, we
 			   can't simply set bit and repeat the loop.
 			   Instead, just set it to C and handle below.  */
 			c = ':';
@@ -5513,7 +5517,9 @@ re_match_2_internal (struct re_pattern_b
 		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
 		    | (class_bits & BIT_SPACE && ISSPACE (c))
 		    | (class_bits & BIT_UPPER && ISUPPER (c))
-		    | (class_bits & BIT_WORD  && ISWORD (c)))
+		    | (class_bits & BIT_WORD  && ISWORD  (c))
+		    | (class_bits & BIT_ALPHA && ISALPHA (c))
+		    | (class_bits & BIT_ALNUM && ISALNUM (c)))
 		  not = !not;
 		else
 		  CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);

--- src/character.c~0	2015-01-13 06:48:01 +0200
+++ src/character.c	2015-02-17 17:05:20 +0200
@@ -984,6 +984,48 @@ character is not ASCII nor 8-bit charact
 
 #ifdef emacs
 
+/* Return 'true' if C is an alphabetic character as defined by its
+   Unicode properties.  */
+bool
+alphabeticp (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (! INTEGERP (category))
+    return false;
+
+  unicode_category_t gen_cat = XINT (category);
+
+  /* See UTS #18.  There are additional characters that should be
+     here, those designated as Other_uppercase, Other_lowercase,
+     and Other_alphabetic; FIXME.  */
+  return (gen_cat == UNICODE_CATEGORY_Lu
+	  || gen_cat == UNICODE_CATEGORY_Ll
+	  || gen_cat == UNICODE_CATEGORY_Lt
+	  || gen_cat == UNICODE_CATEGORY_Lm
+	  || gen_cat == UNICODE_CATEGORY_Lo
+	  || gen_cat == UNICODE_CATEGORY_Mn
+	  || gen_cat == UNICODE_CATEGORY_Mc
+	  || gen_cat == UNICODE_CATEGORY_Me
+	  || gen_cat == UNICODE_CATEGORY_Nl);
+}
+
+/* Return 'true' if C is a decimal-number character as defined by its
+   Unicode properties.  */
+bool
+decimalnump (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (! INTEGERP (category))
+    return false;
+
+  unicode_category_t gen_cat = XINT (category);
+
+  /* See UTS #18.  */
+  return gen_cat == UNICODE_CATEGORY_Nd;
+}
+
 void
 syms_of_character (void)
 {


--- src/character.h~0	2015-01-06 10:15:13 +0200
+++ src/character.h	2015-02-17 17:05:33 +0200
@@ -660,6 +660,9 @@
 extern Lisp_Object Vchar_unify_table;
 extern Lisp_Object string_escape_byte8 (Lisp_Object);
 
+extern bool alphabeticp (int);
+extern bool decimalnump (int);
+
 /* Return a translation table of id number ID.  */
 #define GET_TRANSLATION_TABLE(id) \
   (XCDR (XVECTOR (Vtranslation_table_vector)->contents[(id)]))