* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-05 13:46 UTC
To: 25366
(string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
=> nil, expected 0
[[:blank:]] should be the same as \h in PCRE.
In GNU Emacs 26.0.50.26 (x86_64-unknown-linux-gnu, GTK+ Version 3.10.8)
of 2017-01-05 built on unknown
Repository revision: d88cdad2847726438c7d1de9fd2651c4be9243aa
Windowing system distributor 'The X.Org Foundation', version 11.0.11501000
System Description: Ubuntu 14.04 LTS
Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
Entering debugger...
Back to top level
Configured using:
'configure --with-modules --enable-checking
--enable-check-lisp-object-type 'CFLAGS=-ggdb3 -O0''
Configured features:
XPM JPEG TIFF GIF PNG SOUND GSETTINGS NOTIFY GNUTLS FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 MODULES
Important settings:
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix
Major mode: Lisp Interaction
Minor modes in effect:
tooltip-mode: t
global-eldoc-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
line-number-mode: t
transient-mark-mode: t
Load-path shadows:
None found.
Features:
(shadow sort mail-extr emacsbug message subr-x puny seq byte-opt gv
bytecomp byte-compile cl-extra cconv dired dired-loaddefs format-spec
rfc822 mml mml-sec password-cache epa derived epg epg-config gnus-util
rmail rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils help-mode easymenu cl-loaddefs pcase
cl-lib debug time-date mule-util tooltip eldoc electric uniquify
ediff-hook vc-hooks lisp-float-type mwheel term/x-win x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript case-table epa-hook jka-cmpr-hook help
simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs button
faces cus-face macroexp files text-properties overlay sha1 md5 base64
format env code-pages mule custom widget hashtable-print-readable
backquote inotify dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)
Memory information:
((conses 16 182571 10570)
(symbols 48 31257 1)
(miscs 40 340 231)
(strings 32 71112 6419)
(string-bytes 1 1678721)
(vectors 16 14561)
(vector-slots 8 529555 10250)
(floats 8 183 150)
(intervals 56 250 6)
(buffers 976 13)
(heap 1024 36602 1391))
* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Eli Zaretskii @ 2017-01-05 15:50 UTC
To: Philipp Stephani; +Cc: 25366
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Thu, 05 Jan 2017 14:46:01 +0100
>
> (string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
> => nil, expected 0
>
> [[:blank:]] should be the same as \h in PCRE.
We are consistent with our documentation, but I agree that it would be
good to extend [:blank:], as proposed here:
http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
Patches to that effect are welcome.
* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 15:00 UTC
To: Eli Zaretskii; +Cc: 25366
Eli Zaretskii <eliz@gnu.org> wrote on Thu, 5 Jan 2017 at 16:50:
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Thu, 05 Jan 2017 14:46:01 +0100
> >
> > (string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
> > => nil, expected 0
> >
> > [[:blank:]] should be the same as \h in PCRE.
>
> We are consistent with our documentation, but I agree that it would be
> good to extend [:blank:], as proposed here:
>
>
> http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
>
> Patches to that effect are welcome.
>
Here's a patch.
[-- Attachment #2: 0001-Add-support-for-Unicode-whitespace-in-blank.txt --]
[-- Type: text/plain, Size: 5993 bytes --]
From c8cc92da17f8e33ed886d3411f631347ef1c55ff Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Fri, 6 Jan 2017 15:56:51 +0100
Subject: [PATCH] Add support for Unicode whitespace in [:blank:]
See Bug#25366.
* src/character.c (blankp): New function for checking Unicode
horizontal whitespace.
* src/regex.c (ISBLANK): Use 'blankp' for non-ASCII horizontal
whitespace.
(BIT_BLANK): New bit for range table.
(re_wctype_to_bit, execute_charset): Use it.
* test/lisp/subr-tests.el (subr-tests--string-match-p--blank): Add
unit test for [:blank:] character class.
* doc/lispref/searching.texi (Char Classes): Document new Unicode
behavior for [:blank:].
---
doc/lispref/searching.texi | 5 ++++-
etc/NEWS | 5 +++++
src/character.c | 15 +++++++++++++++
src/character.h | 1 +
src/regex.c | 12 ++++++++----
test/lisp/subr-tests.el | 10 ++++++++++
6 files changed, 43 insertions(+), 5 deletions(-)
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index b011d14ee3..38d21216d6 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -553,7 +553,10 @@ Char Classes
(@pxref{Character Properties}) indicates they are alphabetic
characters.
@item [:blank:]
-This matches space and tab only.
+This matches horizontal whitespace, as defined by Unicode Technical
+Standard #18. In particular, it matches tabs and characters whose
+Unicode @samp{general-category} property (@pxref{Character
+Properties}) indicates they are spacing separators.
@item [:cntrl:]
This matches any @acronym{ASCII} control character.
@item [:digit:]
diff --git a/etc/NEWS b/etc/NEWS
index d91204b21b..9a7aa207bc 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -710,6 +710,11 @@ of curved quotes in format arguments to functions like 'message' and
now generate less chatter and more-compact diagnostics. The auxiliary
function 'check-declare-errmsg' has been removed.
++++
+** The regular expression character class [:blank:] now matches
+Unicode horizontal whitespace as defined in
+http://www.unicode.org/reports/tr18/tr18-19.html#blank.
+
\f
* Lisp Changes in Emacs 26.1
diff --git a/src/character.c b/src/character.c
index b594af040c..74d6410fc7 100644
--- a/src/character.c
+++ b/src/character.c
@@ -1038,6 +1038,21 @@ printablep (int c)
|| gen_cat == UNICODE_CATEGORY_Cn)); /* unassigned */
}
+/* Return true if C is a horizontal whitespace character, as defined
+   by http://www.unicode.org/reports/tr18/tr18-19.html#blank.  */
+bool
+blankp (int c)
+{
+  if (c == '\t')
+    return true;
+
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+  if (! INTEGERP (category))
+    return false;
+
+  return XINT (category) == UNICODE_CATEGORY_Zs; /* separator, space */
+}
+
void
syms_of_character (void)
{
diff --git a/src/character.h b/src/character.h
index fc8a0dd74d..62d252e91b 100644
--- a/src/character.h
+++ b/src/character.h
@@ -680,6 +680,7 @@ extern bool alphabeticp (int);
extern bool alphanumericp (int);
extern bool graphicp (int);
extern bool printablep (int);
+extern bool blankp (int);
/* Return a translation table of id number ID. */
#define GET_TRANSLATION_TABLE(id) \
diff --git a/src/regex.c b/src/regex.c
index ae3fde80c9..7e70c494f4 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -310,11 +310,12 @@ enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 };
|| ((c) >= 'a' && (c) <= 'f') \
|| ((c) >= 'A' && (c) <= 'F'))
-/* This is only used for single-byte characters. */
-# define ISBLANK(c) ((c) == ' ' || (c) == '\t')
-
/* The rest must handle multibyte characters. */
+# define ISBLANK(c) (IS_REAL_ASCII (c)                  \
+                     ? ((c) == ' ' || (c) == '\t')      \
+                     : blankp (c))
+
# define ISGRAPH(c) (SINGLE_BYTE_CHAR_P (c) \
? (c) > ' ' && !((c) >= 0177 && (c) <= 0240) \
: graphicp (c))
@@ -1790,6 +1791,7 @@ struct range_table_work_area
#define BIT_ALNUM 0x80
#define BIT_GRAPH 0x100
#define BIT_PRINT 0x200
+#define BIT_BLANK 0x400
\f
/* Set the bit for character C in a list. */
@@ -2066,8 +2068,9 @@ re_wctype_to_bit (re_wctype_t cc)
case RECC_SPACE: return BIT_SPACE;
case RECC_GRAPH: return BIT_GRAPH;
case RECC_PRINT: return BIT_PRINT;
+ case RECC_BLANK: return BIT_BLANK;
case RECC_ASCII: case RECC_DIGIT: case RECC_XDIGIT: case RECC_CNTRL:
- case RECC_BLANK: case RECC_UNIBYTE: case RECC_ERROR: return 0;
+ case RECC_UNIBYTE: case RECC_ERROR: return 0;
default:
abort ();
}
@@ -4658,6 +4661,7 @@ execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
(class_bits & BIT_ALNUM && ISALNUM (c)) ||
(class_bits & BIT_ALPHA && ISALPHA (c)) ||
(class_bits & BIT_SPACE && ISSPACE (c)) ||
+ (class_bits & BIT_BLANK && ISBLANK (c)) ||
(class_bits & BIT_WORD && ISWORD (c)) ||
((class_bits & BIT_UPPER) &&
(ISUPPER (c) || (corig != c &&
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 3c5dbcdbd7..a3b08e9697 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -271,5 +271,15 @@ subr-test--frames-1
(let ((frame-lists (subr-test--frames-1 'subr-test--frames-2)))
(should (equal (car frame-lists) (cdr frame-lists)))))
+(ert-deftest subr-tests--string-match-p--blank ()
+  "Test that [:blank:] matches horizontal whitespace, cf. Bug#25366."
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" " ") 0))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\t") 0))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "\n"))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "a"))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\N{HAIR SPACE}") 0))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
+
(provide 'subr-tests)
;;; subr-tests.el ends here
--
2.11.0
* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Eli Zaretskii @ 2017-01-06 15:11 UTC
To: Philipp Stephani; +Cc: 25366
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Fri, 06 Jan 2017 15:00:22 +0000
> Cc: 25366@debbugs.gnu.org
>
> http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
>
> Patches to that effect are welcome.
>
> Here's a patch.
Thanks. A few minor comments below.
> +/* Return true if C is a horizontal whitespace character, as defined
> + by http://www.unicode.org/reports/tr18/tr18-19.html#blank. */
> +bool
> +blankp (int c)
> +{
> + if (c == '\t')
> + return true;
Why does this test explicitly only for a TAB? What about SPC, for
example?
> --- a/doc/lispref/searching.texi
> +++ b/doc/lispref/searching.texi
> @@ -553,7 +553,10 @@ Char Classes
> (@pxref{Character Properties}) indicates they are alphabetic
> characters.
> @item [:blank:]
> -This matches space and tab only.
> +This matches horizontal whitespace, as defined by Unicode Technical
> +Standard #18. In particular, it matches tabs and characters whose
> +Unicode @samp{general-category} property (@pxref{Character
> +Properties}) indicates they are spacing separators.
Similarly here: I find the lack of reference to a space potentially
confusing.
> +** The regular expression character class [:blank:] now matches
> +Unicode horizontal whitespace as defined in
> +http://www.unicode.org/reports/tr18/tr18-19.html#blank.
The reference to a particular version of UTS#18 might become obsolete
when a new version is released. So I suggest providing a general
reference to the report and its section, not an exact URL.
* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 19:10 UTC
To: Eli Zaretskii; +Cc: 25366
Eli Zaretskii <eliz@gnu.org> wrote on Fri, 6 Jan 2017 at 16:11:
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Fri, 06 Jan 2017 15:00:22 +0000
> > Cc: 25366@debbugs.gnu.org
> >
> >
> http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
> >
> > Patches to that effect are welcome.
> >
> > Here's a patch.
>
> Thanks. A few minor comments below.
>
> > +/* Return true if C is a horizontal whitespace character, as defined
> > + by http://www.unicode.org/reports/tr18/tr18-19.html#blank. */
> > +bool
> > +blankp (int c)
> > +{
> > + if (c == '\t')
> > + return true;
>
> Why does this test explicitly only for a TAB? What about SPC, for
> example?
>
Because TAB is the only character that is blank, but doesn't have the
general category Zs.
I've now also included space and added a comment. The risk that the general
category of space will ever be changed seems very small.
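To illustrate the reasoning above, here is a minimal standalone sketch in C.
It is not the code from the patch: the real blankp looks the category up in
Emacs's unicode-category-table, whereas this sketch hard-codes the code
points that, as far as I know, currently have general category Zs. The names
zs_code_points, has_category_zs and blank_sketch_p are hypothetical and
exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: code points with Unicode general category Zs (space
   separator).  A real implementation would query the Unicode
   database instead of hard-coding this list.  */
static const unsigned int zs_code_points[] = {
  0x0020, 0x00A0, 0x1680,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
  0x202F, 0x205F, 0x3000
};

/* Hypothetical stand-in for the category-table lookup.  */
static bool
has_category_zs (unsigned int c)
{
  size_t n = sizeof zs_code_points / sizeof zs_code_points[0];
  for (size_t i = 0; i < n; i++)
    if (zs_code_points[i] == c)
      return true;
  return false;
}

/* Sketch of the blank check: TAB has category Cc, not Zs, so it
   needs an explicit test.  SPC is already Zs; testing it explicitly
   is redundant but makes the intent obvious to the reader.  */
bool
blank_sketch_p (unsigned int c)
{
  if (c == '\t' || c == ' ')
    return true;
  return has_category_zs (c);
}
```

With this sketch, U+200A HAIR SPACE and U+3000 IDEOGRAPHIC SPACE count as
blank, while newline and U+2028 LINE SEPARATOR (category Zl) do not, which
mirrors the cases exercised by the unit test in the patch.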
>
> > --- a/doc/lispref/searching.texi
> > +++ b/doc/lispref/searching.texi
> > @@ -553,7 +553,10 @@ Char Classes
> > (@pxref{Character Properties}) indicates they are alphabetic
> > characters.
> > @item [:blank:]
> > -This matches space and tab only.
> > +This matches horizontal whitespace, as defined by Unicode Technical
> > +Standard #18. In particular, it matches tabs and characters whose
> > +Unicode @samp{general-category} property (@pxref{Character
> > +Properties}) indicates they are spacing separators.
>
> Similarly here: I find the lack of reference to a space potentially
> confusing.
>
Added.
>
> > +** The regular expression character class [:blank:] now matches
> > +Unicode horizontal whitespace as defined in
> > +http://www.unicode.org/reports/tr18/tr18-19.html#blank.
>
> The reference to a particular version of UTS#18 might become obsolete
> when a new version is released. So I suggest providing a general
> reference to the report and its section, not an exact URL.
>
Done.
* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 19:21 UTC
To: Eli Zaretskii; +Cc: 25366-done
Pushed to master as 512e9886be.