* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-05 13:46 UTC
To: 25366

(string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
=> nil, expected 0

[[:blank:]] should be the same as \h in PCRE.

In GNU Emacs 26.0.50.26 (x86_64-unknown-linux-gnu, GTK+ Version 3.10.8)
 of 2017-01-05 built on unknown
Repository revision: d88cdad2847726438c7d1de9fd2651c4be9243aa
Windowing system distributor 'The X.Org Foundation', version 11.0.11501000
System Description: Ubuntu 14.04 LTS

Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
Entering debugger...
Back to top level

Configured using:
 'configure --with-modules --enable-checking
 --enable-check-lisp-object-type 'CFLAGS=-ggdb3 -O0''

Configured features:
XPM JPEG TIFF GIF PNG SOUND GSETTINGS NOTIFY GNUTLS FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 MODULES

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message subr-x puny seq byte-opt gv
bytecomp byte-compile cl-extra cconv dired dired-loaddefs format-spec
rfc822 mml mml-sec password-cache epa derived epg epg-config gnus-util
rmail rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils help-mode easymenu cl-loaddefs pcase
cl-lib debug time-date mule-util tooltip eldoc electric uniquify
ediff-hook vc-hooks lisp-float-type mwheel term/x-win x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript case-table epa-hook jka-cmpr-hook help
simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs button
faces cus-face macroexp files text-properties overlay sha1 md5 base64
format env code-pages mule custom widget hashtable-print-readable
backquote inotify dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)

Memory information:
((conses 16 182571 10570)
 (symbols 48 31257 1)
 (miscs 40 340 231)
 (strings 32 71112 6419)
 (string-bytes 1 1678721)
 (vectors 16 14561)
 (vector-slots 8 529555 10250)
 (floats 8 183 150)
 (intervals 56 250 6)
 (buffers 976 13)
 (heap 1024 36602 1391))

-- 
Google Germany GmbH
Erika-Mann-Straße 33
80636 München

Register court and number: Hamburg, HRB 86891
Registered office: Hamburg
Managing directors: Matthew Scott Sucherman, Paul Terence Manicle

This e-mail is confidential. If you are not the right addressee please
do not forward it, please inform the sender, and please erase this
e-mail including any attachments. Thanks.

* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Eli Zaretskii @ 2017-01-05 15:50 UTC
To: Philipp Stephani; +Cc: 25366

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Thu, 05 Jan 2017 14:46:01 +0100
>
> (string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
> => nil, expected 0
>
> [[:blank:]] should be the same as \h in PCRE.

We are consistent with our documentation, but I agree that it would be
good to extend [:blank:], as proposed here:

  http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties

Patches to that effect are welcome.

* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 15:00 UTC
To: Eli Zaretskii; +Cc: 25366

Eli Zaretskii <eliz@gnu.org> wrote on Thu, 5 Jan 2017 at 16:50:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Thu, 05 Jan 2017 14:46:01 +0100
> >
> > (string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
> > => nil, expected 0
> >
> > [[:blank:]] should be the same as \h in PCRE.
>
> We are consistent with our documentation, but I agree that it would be
> good to extend [:blank:], as proposed here:
>
>   http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
>
> Patches to that effect are welcome.

Here's a patch.

[-- Attachment #2: 0001-Add-support-for-Unicode-whitespace-in-blank.txt --]
[-- Type: text/plain, Size: 5993 bytes --]

From c8cc92da17f8e33ed886d3411f631347ef1c55ff Mon Sep 17 00:00:00 2001
From: Philipp Stephani <phst@google.com>
Date: Fri, 6 Jan 2017 15:56:51 +0100
Subject: [PATCH] Add support for Unicode whitespace in [:blank:]

See Bug#25366.

* src/character.c (blankp): New function for checking Unicode
horizontal whitespace.
* src/regex.c (ISBLANK): Use 'blankp' for non-ASCII horizontal
whitespace.
(BIT_BLANK): New bit for range table.
(re_wctype_to_bit, execute_charset): Use it.
* test/lisp/subr-tests.el (subr-tests--string-match-p--blank): Add
unit test for [:blank:] character class.
* doc/lispref/searching.texi (Char Classes): Document new Unicode
behavior for [:blank:].
---
 doc/lispref/searching.texi |  5 ++++-
 etc/NEWS                   |  5 +++++
 src/character.c            | 15 +++++++++++++++
 src/character.h            |  1 +
 src/regex.c                | 12 ++++++++----
 test/lisp/subr-tests.el    | 10 ++++++++++
 6 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index b011d14ee3..38d21216d6 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -553,7 +553,10 @@ Char Classes
 (@pxref{Character Properties}) indicates they are alphabetic
 characters.
 @item [:blank:]
-This matches space and tab only.
+This matches horizontal whitespace, as defined by Unicode Technical
+Standard #18. In particular, it matches tabs and characters whose
+Unicode @samp{general-category} property (@pxref{Character
+Properties}) indicates they are spacing separators.
 @item [:cntrl:]
 This matches any @acronym{ASCII} control character.
 @item [:digit:]
diff --git a/etc/NEWS b/etc/NEWS
index d91204b21b..9a7aa207bc 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -710,6 +710,11 @@ of curved quotes in format arguments to functions like 'message' and
 now generate less chatter and more-compact diagnostics. The auxiliary
 function 'check-declare-errmsg' has been removed.
 
++++
+** The regular expression character class [:blank:] now matches
+Unicode horizontal whitespace as defined in
+http://www.unicode.org/reports/tr18/tr18-19.html#blank.
+
 \f
 * Lisp Changes in Emacs 26.1
 
diff --git a/src/character.c b/src/character.c
index b594af040c..74d6410fc7 100644
--- a/src/character.c
+++ b/src/character.c
@@ -1038,6 +1038,21 @@ printablep (int c)
            || gen_cat == UNICODE_CATEGORY_Cn)); /* unassigned */
 }
 
+/* Return true if C is a horizontal whitespace character, as defined
+   by http://www.unicode.org/reports/tr18/tr18-19.html#blank. */
+bool
+blankp (int c)
+{
+  if (c == '\t')
+    return true;
+
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+  if (! INTEGERP (category))
+    return false;
+
+  return XINT (category) == UNICODE_CATEGORY_Zs; /* separator, space */
+}
+
 void
 syms_of_character (void)
 {
diff --git a/src/character.h b/src/character.h
index fc8a0dd74d..62d252e91b 100644
--- a/src/character.h
+++ b/src/character.h
@@ -680,6 +680,7 @@ extern bool alphabeticp (int);
 extern bool alphanumericp (int);
 extern bool graphicp (int);
 extern bool printablep (int);
+extern bool blankp (int);
 
 /* Return a translation table of id number ID. */
 #define GET_TRANSLATION_TABLE(id) \
diff --git a/src/regex.c b/src/regex.c
index ae3fde80c9..7e70c494f4 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -310,11 +310,12 @@ enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 };
                     || ((c) >= 'a' && (c) <= 'f') \
                     || ((c) >= 'A' && (c) <= 'F'))
 
-/* This is only used for single-byte characters. */
-# define ISBLANK(c) ((c) == ' ' || (c) == '\t')
-
 /* The rest must handle multibyte characters. */
 
+# define ISBLANK(c) (IS_REAL_ASCII (c) \
+                     ? ((c) == ' ' || (c) == '\t') \
+                     : blankp (c))
+
 # define ISGRAPH(c) (SINGLE_BYTE_CHAR_P (c) \
                      ? (c) > ' ' && !((c) >= 0177 && (c) <= 0240) \
                      : graphicp (c))
@@ -1790,6 +1791,7 @@ struct range_table_work_area
 #define BIT_ALNUM 0x80
 #define BIT_GRAPH 0x100
 #define BIT_PRINT 0x200
+#define BIT_BLANK 0x400
 \f
 
 /* Set the bit for character C in a list. */
@@ -2066,8 +2068,9 @@ re_wctype_to_bit (re_wctype_t cc)
     case RECC_SPACE: return BIT_SPACE;
     case RECC_GRAPH: return BIT_GRAPH;
     case RECC_PRINT: return BIT_PRINT;
+    case RECC_BLANK: return BIT_BLANK;
     case RECC_ASCII: case RECC_DIGIT: case RECC_XDIGIT: case RECC_CNTRL:
-    case RECC_BLANK: case RECC_UNIBYTE: case RECC_ERROR: return 0;
+    case RECC_UNIBYTE: case RECC_ERROR: return 0;
     default:
       abort ();
     }
@@ -4658,6 +4661,7 @@ execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
           (class_bits & BIT_ALNUM && ISALNUM (c)) ||
           (class_bits & BIT_ALPHA && ISALPHA (c)) ||
           (class_bits & BIT_SPACE && ISSPACE (c)) ||
+          (class_bits & BIT_BLANK && ISBLANK (c)) ||
           (class_bits & BIT_WORD && ISWORD (c)) ||
           ((class_bits & BIT_UPPER) &&
            (ISUPPER (c) || (corig != c &&
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 3c5dbcdbd7..a3b08e9697 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -271,5 +271,15 @@ subr-test--frames-1
     (let ((frame-lists (subr-test--frames-1 'subr-test--frames-2)))
       (should (equal (car frame-lists) (cdr frame-lists)))))
 
+(ert-deftest subr-tests--string-match-p--blank ()
+  "Test that [:blank:] matches horizontal whitespace, cf. Bug#25366."
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" " ") 0))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\t") 0))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "\n"))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "a"))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\N{HAIR SPACE}") 0))
+  (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
+  (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
+
 (provide 'subr-tests)
 ;;; subr-tests.el ends here
-- 
2.11.0
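
Note on the regex.c part of the patch: the new BIT_BLANK flag lets a
compiled character set remember that it contains [:blank:], so that a
character outside the ASCII bitmap can be tested with ISBLANK at match
time. The following is only a standalone sketch of that flag-and-predicate
pattern, not code from Emacs. All names prefixed with sketch_ or SKETCH_
are hypothetical, the flag values are arbitrary, and the two predicates
are deliberately tiny placeholders rather than real Unicode lookups.

```c
#include <stdbool.h>

/* Hypothetical class flags; the values are arbitrary and do not
   correspond to the BIT_* constants in regex.c.  */
#define SKETCH_BIT_SPACE 0x1u
#define SKETCH_BIT_BLANK 0x2u

/* Placeholder for a whitespace predicate: ASCII whitespace only.  */
static bool
sketch_isspace (unsigned c)
{
  return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Placeholder for a blank predicate: space, tab, and two sample
   non-ASCII spaces (HAIR SPACE and IDEOGRAPHIC SPACE).  A real
   implementation would consult the Unicode category table.  */
static bool
sketch_isblank (unsigned c)
{
  return c == ' ' || c == '\t' || c == 0x200A || c == 0x3000;
}

/* Return true if C belongs to any class whose flag is set in
   CLASS_BITS.  Each predicate runs only when its flag is present.  */
bool
sketch_class_match (unsigned class_bits, unsigned c)
{
  return ((class_bits & SKETCH_BIT_SPACE) && sketch_isspace (c))
         || ((class_bits & SKETCH_BIT_BLANK) && sketch_isblank (c));
}
```

With this shape, a set compiled from [[:blank:]] carries only the blank
flag, so U+200A matches it while a newline does not; a set without the
flag never calls the blank predicate at all.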

* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Eli Zaretskii @ 2017-01-06 15:11 UTC
To: Philipp Stephani; +Cc: 25366

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Fri, 06 Jan 2017 15:00:22 +0000
> Cc: 25366@debbugs.gnu.org
>
>  http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
>
>  Patches to that effect are welcome.
>
> Here's a patch.

Thanks. A few minor comments below.

> +/* Return true if C is a horizontal whitespace character, as defined
> +   by http://www.unicode.org/reports/tr18/tr18-19.html#blank. */
> +bool
> +blankp (int c)
> +{
> +  if (c == '\t')
> +    return true;

Why does this test explicitly only for a TAB? What about SPC, for
example?

> --- a/doc/lispref/searching.texi
> +++ b/doc/lispref/searching.texi
> @@ -553,7 +553,10 @@ Char Classes
>  (@pxref{Character Properties}) indicates they are alphabetic
>  characters.
>  @item [:blank:]
> -This matches space and tab only.
> +This matches horizontal whitespace, as defined by Unicode Technical
> +Standard #18. In particular, it matches tabs and characters whose
> +Unicode @samp{general-category} property (@pxref{Character
> +Properties}) indicates they are spacing separators.

Similarly here: I find the lack of reference to a space potentially
confusing.

> +** The regular expression character class [:blank:] now matches
> +Unicode horizontal whitespace as defined in
> +http://www.unicode.org/reports/tr18/tr18-19.html#blank.

The reference to a particular version of UTS#18 might become obsolete
when a new version is released. So I suggest to provide a general
reference to the report and its section, not an exact URL.

* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 19:10 UTC
To: Eli Zaretskii; +Cc: 25366

Eli Zaretskii <eliz@gnu.org> wrote on Fri, 6 Jan 2017 at 16:11:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Fri, 06 Jan 2017 15:00:22 +0000
> > Cc: 25366@debbugs.gnu.org
> >
> >  http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
> >
> >  Patches to that effect are welcome.
> >
> > Here's a patch.
>
> Thanks. A few minor comments below.
>
> > +/* Return true if C is a horizontal whitespace character, as defined
> > +   by http://www.unicode.org/reports/tr18/tr18-19.html#blank. */
> > +bool
> > +blankp (int c)
> > +{
> > +  if (c == '\t')
> > +    return true;
>
> Why does this test explicitly only for a TAB? What about SPC, for
> example?

Because TAB is the only character that is blank, but doesn't have the
general category Zs.
I've now also included space and added a comment. The risk that the
general category of space will ever be changed seems very small.

> > --- a/doc/lispref/searching.texi
> > +++ b/doc/lispref/searching.texi
> > @@ -553,7 +553,10 @@ Char Classes
> >  (@pxref{Character Properties}) indicates they are alphabetic
> >  characters.
> >  @item [:blank:]
> > -This matches space and tab only.
> > +This matches horizontal whitespace, as defined by Unicode Technical
> > +Standard #18. In particular, it matches tabs and characters whose
> > +Unicode @samp{general-category} property (@pxref{Character
> > +Properties}) indicates they are spacing separators.
>
> Similarly here: I find the lack of reference to a space potentially
> confusing.

Added.

> > +** The regular expression character class [:blank:] now matches
> > +Unicode horizontal whitespace as defined in
> > +http://www.unicode.org/reports/tr18/tr18-19.html#blank.
>
> The reference to a particular version of UTS#18 might become obsolete
> when a new version is released. So I suggest to provide a general
> reference to the report and its section, not an exact URL.

Done.
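
Note on the rule discussed in this exchange: a character is treated as
blank if it is TAB or has the Unicode general category Zs, and the
revised patch is said to also check space explicitly. Below is a
standalone sketch of that rule, under stated assumptions. It is not the
committed blankp: Emacs looks the category up in its Unicode category
table, whereas this sketch hard-codes the Zs code points as they stand
in recent Unicode versions. The names cp_range, zs_ranges, is_zs_sketch
and is_blank_sketch are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Unicode category table: inclusive
   ranges of code points whose General_Category is Zs.  */
struct cp_range { unsigned lo, hi; };

static const struct cp_range zs_ranges[] = {
  { 0x0020, 0x0020 },  /* SPACE */
  { 0x00A0, 0x00A0 },  /* NO-BREAK SPACE */
  { 0x1680, 0x1680 },  /* OGHAM SPACE MARK */
  { 0x2000, 0x200A },  /* EN QUAD .. HAIR SPACE */
  { 0x202F, 0x202F },  /* NARROW NO-BREAK SPACE */
  { 0x205F, 0x205F },  /* MEDIUM MATHEMATICAL SPACE */
  { 0x3000, 0x3000 },  /* IDEOGRAPHIC SPACE */
};

/* Return true if C falls in one of the hard-coded Zs ranges.  */
static bool
is_zs_sketch (unsigned c)
{
  for (size_t i = 0; i < sizeof zs_ranges / sizeof zs_ranges[0]; i++)
    if (c >= zs_ranges[i].lo && c <= zs_ranges[i].hi)
      return true;
  return false;
}

/* Return true if C is horizontal whitespace under the rule from the
   thread: space and TAB are checked explicitly, everything else is
   decided by the Zs category.  */
bool
is_blank_sketch (unsigned c)
{
  if (c == ' ' || c == '\t')
    return true;
  return is_zs_sketch (c);
}
```

The explicit TAB test is needed because TAB is a control character
rather than a space separator. The explicit space test is redundant
with the table but makes the intent visible, which is the point raised
in the review. LINE SEPARATOR (U+2028) has category Zl, so it is not
blank under this rule.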

* bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
From: Philipp Stephani @ 2017-01-06 19:21 UTC
To: Eli Zaretskii; +Cc: 25366-done

Philipp Stephani <p.stephani2@gmail.com> wrote on Fri, 6 Jan 2017 at 20:10:

> > The reference to a particular version of UTS#18 might become obsolete
> > when a new version is released. So I suggest to provide a general
> > reference to the report and its section, not an exact URL.
>
> Done.

Pushed to master as 512e9886be.