unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-05 13:46 UTC
  To: 25366


(string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
=> nil, expected 0

[[:blank:]] should be the same as \h in PCRE.


In GNU Emacs 26.0.50.26 (x86_64-unknown-linux-gnu, GTK+ Version 3.10.8)
 of 2017-01-05 built on unknown
Repository revision: d88cdad2847726438c7d1de9fd2651c4be9243aa
Windowing system distributor 'The X.Org Foundation', version 11.0.11501000
System Description:	Ubuntu 14.04 LTS

Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
Entering debugger...
Back to top level

Configured using:
 'configure --with-modules --enable-checking
 --enable-check-lisp-object-type 'CFLAGS=-ggdb3 -O0''

Configured features:
XPM JPEG TIFF GIF PNG SOUND GSETTINGS NOTIFY GNUTLS FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 MODULES

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message subr-x puny seq byte-opt gv
bytecomp byte-compile cl-extra cconv dired dired-loaddefs format-spec
rfc822 mml mml-sec password-cache epa derived epg epg-config gnus-util
rmail rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils help-mode easymenu cl-loaddefs pcase
cl-lib debug time-date mule-util tooltip eldoc electric uniquify
ediff-hook vc-hooks lisp-float-type mwheel term/x-win x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript case-table epa-hook jka-cmpr-hook help
simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs button
faces cus-face macroexp files text-properties overlay sha1 md5 base64
format env code-pages mule custom widget hashtable-print-readable
backquote inotify dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)

Memory information:
((conses 16 182571 10570)
 (symbols 48 31257 1)
 (miscs 40 340 231)
 (strings 32 71112 6419)
 (string-bytes 1 1678721)
 (vectors 16 14561)
 (vector-slots 8 529555 10250)
 (floats 8 183 150)
 (intervals 56 250 6)
 (buffers 976 13)
 (heap 1024 36602 1391))



* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Eli Zaretskii @ 2017-01-05 15:50 UTC
  To: Philipp Stephani; +Cc: 25366

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Thu, 05 Jan 2017 14:46:01 +0100
> 
> (string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
> => nil, expected 0
> 
> [[:blank:]] should be the same as \h in PCRE.

We are consistent with our documentation, but I agree that it would be
good to extend [:blank:], as proposed here:

  http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties

Patches to that effect are welcome.






* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 15:00 UTC
  To: Eli Zaretskii; +Cc: 25366



Eli Zaretskii <eliz@gnu.org> wrote on Thu, 5 Jan 2017 at 16:50:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Thu, 05 Jan 2017 14:46:01 +0100
> >
> > (string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
> > => nil, expected 0
> >
> > [[:blank:]] should be the same as \h in PCRE.
>
> We are consistent with our documentation, but I agree that it would be
> good to extend [:blank:], as proposed here:
>
>
> http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
>
> Patches to that effect are welcome.
>

Here's a patch.


[-- Attachment #2: 0001-Add-support-for-Unicode-whitespace-in-blank.txt --]
[-- Type: text/plain, Size: 5993 bytes --]

From c8cc92da17f8e33ed886d3411f631347ef1c55ff Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Fri, 6 Jan 2017 15:56:51 +0100
Subject: [PATCH] Add support for Unicode whitespace in [:blank:]

See Bug#25366.

* src/character.c (blankp): New function for checking Unicode
horizontal whitespace.
* src/regex.c (ISBLANK): Use 'blankp' for non-ASCII horizontal
whitespace.
(BIT_BLANK): New bit for range table.
(re_wctype_to_bit, execute_charset): Use it.
* test/lisp/subr-tests.el (subr-tests--string-match-p--blank): Add
unit test for [:blank:] character class.
* doc/lispref/searching.texi (Char Classes): Document new Unicode
behavior for [:blank:].
---
 doc/lispref/searching.texi |  5 ++++-
 etc/NEWS                   |  5 +++++
 src/character.c            | 15 +++++++++++++++
 src/character.h            |  1 +
 src/regex.c                | 12 ++++++++----
 test/lisp/subr-tests.el    | 10 ++++++++++
 6 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index b011d14ee3..38d21216d6 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -553,7 +553,10 @@ Char Classes
 (@pxref{Character Properties}) indicates they are alphabetic
 characters.
 @item [:blank:]
-This matches space and tab only.
+This matches horizontal whitespace, as defined by Unicode Technical
+Standard #18.  In particular, it matches tabs and characters whose
+Unicode @samp{general-category} property (@pxref{Character
+Properties}) indicates they are spacing separators.
 @item [:cntrl:]
 This matches any @acronym{ASCII} control character.
 @item [:digit:]
diff --git a/etc/NEWS b/etc/NEWS
index d91204b21b..9a7aa207bc 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -710,6 +710,11 @@ of curved quotes in format arguments to functions like 'message' and
 now generate less chatter and more-compact diagnostics.  The auxiliary
 function 'check-declare-errmsg' has been removed.
 
++++
+** The regular expression character class [:blank:] now matches
+Unicode horizontal whitespace as defined in
+http://www.unicode.org/reports/tr18/tr18-19.html#blank.
+
 \f
 * Lisp Changes in Emacs 26.1
 
diff --git a/src/character.c b/src/character.c
index b594af040c..74d6410fc7 100644
--- a/src/character.c
+++ b/src/character.c
@@ -1038,6 +1038,21 @@ printablep (int c)
 	    || gen_cat == UNICODE_CATEGORY_Cn)); /* unassigned */
 }
 
+/* Return true if C is a horizontal whitespace character, as defined
+   by http://www.unicode.org/reports/tr18/tr18-19.html#blank.  */
+bool
+blankp (int c)
+{
+  if (c == '\t')
+    return true;
+
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+  if (! INTEGERP (category))
+    return false;
+
+  return XINT (category) == UNICODE_CATEGORY_Zs; /* separator, space */
+}
+
 void
 syms_of_character (void)
 {
diff --git a/src/character.h b/src/character.h
index fc8a0dd74d..62d252e91b 100644
--- a/src/character.h
+++ b/src/character.h
@@ -680,6 +680,7 @@ extern bool alphabeticp (int);
 extern bool alphanumericp (int);
 extern bool graphicp (int);
 extern bool printablep (int);
+extern bool blankp (int);
 
 /* Return a translation table of id number ID.  */
 #define GET_TRANSLATION_TABLE(id) \
diff --git a/src/regex.c b/src/regex.c
index ae3fde80c9..7e70c494f4 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -310,11 +310,12 @@ enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 };
 		     || ((c) >= 'a' && (c) <= 'f')	\
 		     || ((c) >= 'A' && (c) <= 'F'))
 
-/* This is only used for single-byte characters.  */
-# define ISBLANK(c) ((c) == ' ' || (c) == '\t')
-
 /* The rest must handle multibyte characters.  */
 
+# define ISBLANK(c) (IS_REAL_ASCII (c)                  \
+                     ? ((c) == ' ' || (c) == '\t')      \
+                     : blankp (c))
+
 # define ISGRAPH(c) (SINGLE_BYTE_CHAR_P (c)				\
 		     ? (c) > ' ' && !((c) >= 0177 && (c) <= 0240)	\
 		     : graphicp (c))
@@ -1790,6 +1791,7 @@ struct range_table_work_area
 #define BIT_ALNUM	0x80
 #define BIT_GRAPH	0x100
 #define BIT_PRINT	0x200
+#define BIT_BLANK       0x400
 \f
 
 /* Set the bit for character C in a list.  */
@@ -2066,8 +2068,9 @@ re_wctype_to_bit (re_wctype_t cc)
     case RECC_SPACE: return BIT_SPACE;
     case RECC_GRAPH: return BIT_GRAPH;
     case RECC_PRINT: return BIT_PRINT;
+    case RECC_BLANK: return BIT_BLANK;
     case RECC_ASCII: case RECC_DIGIT: case RECC_XDIGIT: case RECC_CNTRL:
-    case RECC_BLANK: case RECC_UNIBYTE: case RECC_ERROR: return 0;
+    case RECC_UNIBYTE: case RECC_ERROR: return 0;
     default:
       abort ();
     }
@@ -4658,6 +4661,7 @@ execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
 	  (class_bits & BIT_ALNUM && ISALNUM (c)) ||
 	  (class_bits & BIT_ALPHA && ISALPHA (c)) ||
 	  (class_bits & BIT_SPACE && ISSPACE (c)) ||
+          (class_bits & BIT_BLANK && ISBLANK (c)) ||
 	  (class_bits & BIT_WORD  && ISWORD  (c)) ||
 	  ((class_bits & BIT_UPPER) &&
 	   (ISUPPER (c) || (corig != c &&
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 3c5dbcdbd7..a3b08e9697 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -271,5 +271,15 @@ subr-test--frames-1
   (let ((frame-lists (subr-test--frames-1 'subr-test--frames-2)))
     (should (equal (car frame-lists) (cdr frame-lists)))))
 
+(ert-deftest subr-tests--string-match-p--blank ()
+  "Test that [:blank:] matches horizontal whitespace, cf. Bug#25366."
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" " ") 0))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\t") 0))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "\n"))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "a"))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\N{HAIR SPACE}") 0))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
+
 (provide 'subr-tests)
 ;;; subr-tests.el ends here
-- 
2.11.0



* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Eli Zaretskii @ 2017-01-06 15:11 UTC
  To: Philipp Stephani; +Cc: 25366

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Fri, 06 Jan 2017 15:00:22 +0000
> Cc: 25366@debbugs.gnu.org
> 
>  http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
> 
>  Patches to that effect are welcome.
> 
> Here's a patch. 

Thanks.  A few minor comments below.

> +/* Return true if C is a horizontal whitespace character, as defined
> +   by http://www.unicode.org/reports/tr18/tr18-19.html#blank.  */
> +bool
> +blankp (int c)
> +{
> +  if (c == '\t')
> +    return true;

Why does this test explicitly only for a TAB?  What about SPC, for
example?

> --- a/doc/lispref/searching.texi
> +++ b/doc/lispref/searching.texi
> @@ -553,7 +553,10 @@ Char Classes
>  (@pxref{Character Properties}) indicates they are alphabetic
>  characters.
>  @item [:blank:]
> -This matches space and tab only.
> +This matches horizontal whitespace, as defined by Unicode Technical
> +Standard #18.  In particular, it matches tabs and characters whose
> +Unicode @samp{general-category} property (@pxref{Character
> +Properties}) indicates they are spacing separators.

Similarly here: I find the lack of reference to a space potentially
confusing.

> +** The regular expression character class [:blank:] now matches
> +Unicode horizontal whitespace as defined in
> +http://www.unicode.org/reports/tr18/tr18-19.html#blank.

The reference to a particular version of UTS#18 might become obsolete
when a new version is released.  So I suggest providing a general
reference to the report and its section, not an exact URL.






* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 19:10 UTC
  To: Eli Zaretskii; +Cc: 25366


Eli Zaretskii <eliz@gnu.org> wrote on Fri, 6 Jan 2017 at 16:11:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Fri, 06 Jan 2017 15:00:22 +0000
> > Cc: 25366@debbugs.gnu.org
> >
> >
> http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
> >
> >  Patches to that effect are welcome.
> >
> > Here's a patch.
>
> Thanks.  A few minor comments below.
>
> > +/* Return true if C is a horizontal whitespace character, as defined
> > +   by http://www.unicode.org/reports/tr18/tr18-19.html#blank.  */
> > +bool
> > +blankp (int c)
> > +{
> > +  if (c == '\t')
> > +    return true;
>
> Why does this test explicitly only for a TAB?  What about SPC, for
> example?
>

Because TAB is the only blank character that doesn't have the general
category Zs.
I've now also included space and added a comment. The risk that the general
category of space will ever change seems very small.
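
The definition discussed here (TAB, plus every character whose general
category is Zs) can be sketched as a standalone C function.  This is only an
illustration, not the code from the patch: the name `sketch_blankp` and the
hardcoded range table are assumptions made for the example, and the table
lists the Zs code points believed to match recent Unicode versions.  The
patch itself looks the category up in Emacs's `Vunicode_category_table`
instead of using a fixed list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points.  */
struct cp_range
{
  uint32_t lo;
  uint32_t hi;
};

/* Code points with general category Zs (Space_Separator).
   Hardcoded for illustration only (an assumption of this sketch);
   real code should consult the Unicode character database.  */
static const struct cp_range zs_ranges[] = {
  { 0x0020, 0x0020 }, /* SPACE */
  { 0x00A0, 0x00A0 }, /* NO-BREAK SPACE */
  { 0x1680, 0x1680 }, /* OGHAM SPACE MARK */
  { 0x2000, 0x200A }, /* EN QUAD .. HAIR SPACE */
  { 0x202F, 0x202F }, /* NARROW NO-BREAK SPACE */
  { 0x205F, 0x205F }, /* MEDIUM MATHEMATICAL SPACE */
  { 0x3000, 0x3000 }, /* IDEOGRAPHIC SPACE */
};

/* Return true if C is horizontal whitespace: TAB, or a character
   whose general category is Zs.  Hypothetical helper, not Emacs API.  */
bool
sketch_blankp (uint32_t c)
{
  /* TAB has general category Cc, so it needs an explicit check.  */
  if (c == '\t')
    return true;

  for (size_t i = 0; i < sizeof zs_ranges / sizeof zs_ranges[0]; i++)
    if (c >= zs_ranges[i].lo && c <= zs_ranges[i].hi)
      return true;

  return false;
}
```

With this sketch, `sketch_blankp (0x200A)` (HAIR SPACE) and
`sketch_blankp (0x3000)` (IDEOGRAPHIC SPACE) are true, while
`sketch_blankp (0x2028)` (LINE SEPARATOR, category Zl) and
`sketch_blankp ('\n')` are false, mirroring the cases in the patch's unit
test.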


>
> > --- a/doc/lispref/searching.texi
> > +++ b/doc/lispref/searching.texi
> > @@ -553,7 +553,10 @@ Char Classes
> >  (@pxref{Character Properties}) indicates they are alphabetic
> >  characters.
> >  @item [:blank:]
> > -This matches space and tab only.
> > +This matches horizontal whitespace, as defined by Unicode Technical
> > +Standard #18.  In particular, it matches tabs and characters whose
> > +Unicode @samp{general-category} property (@pxref{Character
> > +Properties}) indicates they are spacing separators.
>
> Similarly here: I find the lack of reference to a space potentially
> confusing.
>

Added.


>
> > +** The regular expression character class [:blank:] now matches
> > +Unicode horizontal whitespace as defined in
> > +http://www.unicode.org/reports/tr18/tr18-19.html#blank.
>
> The reference to a particular version of UTS#18 might become obsolete
> when a new version is released.  So I suggest to provide a general
> reference to the report and its section, not an exact URL.
>

Done.


* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 19:21 UTC
  To: Eli Zaretskii; +Cc: 25366-done


Philipp Stephani <p.stephani2@gmail.com> wrote on Fri, 6 Jan 2017 at 20:10:

> Eli Zaretskii <eliz@gnu.org> wrote on Fri, 6 Jan 2017 at 16:11:
>
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Fri, 06 Jan 2017 15:00:22 +0000
> > Cc: 25366@debbugs.gnu.org
> >
> >
> http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
> >
> >  Patches to that effect are welcome.
> >
> > Here's a patch.
>
> Thanks.  A few minor comments below.
>
> > +/* Return true if C is a horizontal whitespace character, as defined
> > +   by http://www.unicode.org/reports/tr18/tr18-19.html#blank.  */
> > +bool
> > +blankp (int c)
> > +{
> > +  if (c == '\t')
> > +    return true;
>
> Why does this test explicitly only for a TAB?  What about SPC, for
> example?
>
>
> Because TAB is the only character that is blank, but doesn't have the
> general category Zs.
> I've now also included space and added a comment. The risk that the
> general category of space will ever be changed seems very small.
>
>
>
> > --- a/doc/lispref/searching.texi
> > +++ b/doc/lispref/searching.texi
> > @@ -553,7 +553,10 @@ Char Classes
> >  (@pxref{Character Properties}) indicates they are alphabetic
> >  characters.
> >  @item [:blank:]
> > -This matches space and tab only.
> > +This matches horizontal whitespace, as defined by Unicode Technical
> > +Standard #18.  In particular, it matches tabs and characters whose
> > +Unicode @samp{general-category} property (@pxref{Character
> > +Properties}) indicates they are spacing separators.
>
> Similarly here: I find the lack of reference to a space potentially
> confusing.
>
>
> Added.
>
>
>
> > +** The regular expression character class [:blank:] now matches
> > +Unicode horizontal whitespace as defined in
> > +http://www.unicode.org/reports/tr18/tr18-19.html#blank.
>
> The reference to a particular version of UTS#18 might become obsolete
> when a new version is released.  So I suggest to provide a general
> reference to the report and its section, not an exact URL.
>
>
> Done.
>


Pushed to master as 512e9886be.
