bug#13399: 24.3.50; Word-wrap can't wrap at zero-width space U-200B

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

From: Adam Tack <adam.tack.513@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 13399@debbugs.gnu.org
Subject: bug#13399: 24.3.50; Word-wrap can't wrap at zero-width space U-200B
Date: Sat, 9 Dec 2017 03:50:05 +0000	[thread overview]
Message-ID: <CAA+VxxG4Zpj6W+QzLbSnSx04spdeMV2fiaqmPZWktnqX7cScKA@mail.gmail.com> (raw)
In-Reply-To: <83r2s5ugnd.fsf@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 4013 bytes --]

Thanks for the very fast replies and the suggestions!

> I think this is okay, but maybe the macro could be converted into an
> inline function, and then fetching the character from the various
> objects separated from looking up the char-table for that character?

I've made the conversion — it's now slightly less messy.  Regarding
the separation, I think that the most that can be done is to have the
look-up in a separate function.  Regrettably, trying to first obtain
the character, for example via a set of if-else clauses, and then
looking it up, which would be cleaner, can't really work since the
cases (in particular the first and fourth) are not disjunct.

> Well, since it's a char-table, users will probably want to control
> which characters cause word-wrap.  One idea would be to have a minor
> mode or some such, providing users an ability to include or exclude
> different groups of related whitespace characters as a whole?  This
> could be in follow-up patches, though.

Customisability was the idea. :)  I'm not sure how best to expose it in
a reasonably user-friendly way, though.  For the time being, allowing
control directly via the char-table might suffice.

> We could also look at LineBreak.txt in the Unicode database for
> inspiration and ideas.

The three main customisation options that I'm considering are:

i) Unicode whitespace (U+2000 - U+200B),
ii) vim's breakat characters (default " ^I!@*-+;:,./?"), since
presumably they had given it some thought,
iii) The characters in LineBreak.txt (parsing the file shouldn't be
hard, if there aren't copyright issues).

> But I do think that the default should be only TAB and SPC, as Emacs
> always did, and the rest should be optional, and probably in Lisp, not
> C.

> And also a couple of tests (the ones you used would be a good start).

These would presumably have to be in tests/manual since the position of
the word-wrap depends on too many variables (width of window, font
type, font size)?

> I will send the forms off-list, thanks.

Thanks!

> One other thought: since TAB and SPC are single-byte characters,
> whereas the other "whitespace" characters are not, supporting the
> non-ASCII whitespace will be associated with some performance hit in
> the display engine, because it requires a char-table look up and
> fetching multibyte characters.  So perhaps we should allow the
> word-wrap-chars char-table to be nil (and make that the default), and
> in that case support only TAB and SPC as word-wrap characters.  This
> would let the default configuration work as fast is it does now,
> imposing the performance penalty only on those who want to support
> more whitespace characters.

> WDYT?

That seems sensible.  The old behaviour will now be the default and
look-up using the char-table only enabled with the global minor mode
`word-wrap-char-table-mode' (suggestions for a catchier name very
welcome).  For the time being, its definition is in a new file
`lisp/word-wrap.el'.  Also temporarily, for ease of testing, it allows
wrapping on the unicode whitespace characters.


The current iteration is attached.  Until they've found a proper home,
the slightly updated tests are below.

(require 'word-wrap)

(with-current-buffer (get-buffer-create "*bar*")
  (dotimes (i 1000)
    (insert "1234")) ; U-200B
  (setq word-wrap t)
  (setq whitespace-display-mappings
    '((space-mark 32
              [183]
              [46])
      (space-mark 160
              [164]
              [95])
      (space-mark 8203
              [164]
              [95])
      (newline-mark 10
            [36 10])
      (tab-mark 9
            [187 9]
            [92 9])))
  (whitespace-mode)
  (word-wrap-char-table-mode)
  (display-buffer "*bar*"))

(with-current-buffer (get-buffer-create "*foo*")
  (dotimes (i 1000)
    (insert "1234")) ; U-200B
  (setq word-wrap t)
  (word-wrap-char-table-mode)
  (display-buffer "*foo*"))

[-- Attachment #2: word_wrap_char_table.diff --]
[-- Type: text/plain, Size: 5836 bytes --]

diff --git a/lisp/word-wrap.el b/lisp/word-wrap.el
new file mode 100644
index 0000000..6d59a83
--- /dev/null
+++ b/lisp/word-wrap.el
@@ -0,0 +1,21 @@
+(define-minor-mode word-wrap-char-table-mode
+  "Toggle wrapping using a look-up to word-wrap-chars, globally.
+
+Currently, this allows word wrapping on the characters U+2000 to
+U+200B in addition to the default of space and tap, when
+`word-wrap' is set to t.
+
+(Provisional and unstable.)
+"
+  :global t
+  :lighter "uws "
+  (if word-wrap-char-table-mode
+      (progn (setq word-wrap-chars (make-char-table nil nil))
+             (set-char-table-range word-wrap-chars 9 t)
+             (set-char-table-range word-wrap-chars 32 t)
+             (set-char-table-range word-wrap-chars
+                                   '(8192 . 8203) t))
+    (setq word-wrap-chars nil)))
+
+(provide 'word-wrap)
+
diff --git a/src/character.c b/src/character.c
index c8ffa2b..af89a8b 100644
--- a/src/character.c
+++ b/src/character.c
@@ -1145,4 +1145,10 @@ All Unicode characters have one of the following values (symbol):
 See The Unicode Standard for the meaning of those values.  */);
   /* The correct char-table is setup in characters.el.  */
   Vunicode_category_table = Qnil;
+
+  DEFVAR_LISP ("word-wrap-chars", Vword_wrap_chars,
+	       doc: /* A char-table for characters at which word-wrap occurs.
+Such characters have value t in this table.
+This is set up in ... */);
+  Vword_wrap_chars = Qnil;
 }
diff --git a/src/xdisp.c b/src/xdisp.c
index 7e47c06..4e8b045 100644
--- a/src/xdisp.c
+++ b/src/xdisp.c
@@ -348,20 +348,41 @@ static Lisp_Object list_of_error;
 #endif /* HAVE_WINDOW_SYSTEM */
 
 /* Test if the display element loaded in IT, or the underlying buffer
-   or string character, is a space or a TAB character.  This is used
-   to determine where word wrapping can occur.  */
-
-#define IT_DISPLAYING_WHITESPACE(it)					\
-  ((it->what == IT_CHARACTER && (it->c == ' ' || it->c == '\t'))	\
-   || ((STRINGP (it->string)						\
-	&& (SREF (it->string, IT_STRING_BYTEPOS (*it)) == ' '		\
-	    || SREF (it->string, IT_STRING_BYTEPOS (*it)) == '\t'))	\
-       || (it->s							\
-	   && (it->s[IT_BYTEPOS (*it)] == ' '				\
-	       || it->s[IT_BYTEPOS (*it)] == '\t'))			\
-       || (IT_BYTEPOS (*it) < ZV_BYTE					\
-	   && (*BYTE_POS_ADDR (IT_BYTEPOS (*it)) == ' '			\
-	       || *BYTE_POS_ADDR (IT_BYTEPOS (*it)) == '\t'))))		\
+   or string character, is a space or tab (by default, to avoid the
+   unnecessary performance hit of char-table lookup).  If
+   word-wrap-chars is a char-table, then instead check if the relevant
+   element or character belongs to the char-table.  This is used to
+   determine where word wrapping can occur.  */
+
+static inline bool
+char_is_whitespace_p (int c) {
+  return !NILP (CHAR_TABLE_REF (Vword_wrap_chars, c));
+}
+
+static inline bool
+it_displaying_whitespace (struct it *it) {
+  if (!CHAR_TABLE_P (Vword_wrap_chars)) {
+    return ((it->what == IT_CHARACTER && (it->c == ' ' || it->c == '\t'))
+	    || ((STRINGP (it->string)
+		 && (SREF (it->string, IT_STRING_BYTEPOS (*it)) == ' '
+		     || SREF (it->string, IT_STRING_BYTEPOS (*it)) == '\t'))
+		|| (it->s
+		    && (it->s[IT_BYTEPOS (*it)] == ' '
+			|| it->s[IT_BYTEPOS (*it)] == '\t'))
+		|| (IT_BYTEPOS (*it) < ZV_BYTE
+		    && (*BYTE_POS_ADDR (IT_BYTEPOS (*it)) == ' '
+			|| *BYTE_POS_ADDR (IT_BYTEPOS (*it)) == '\t'))));
+  } else {
+    return ((it->what == IT_CHARACTER && char_is_whitespace_p (it->c))
+	    || (STRINGP (it->string) && char_is_whitespace_p
+		(STRING_CHAR
+		 (SDATA (it->string) + IT_STRING_BYTEPOS (*it))))
+	    || (it->s && char_is_whitespace_p
+		(STRING_CHAR(it->s + IT_BYTEPOS (*it))))
+	    || (IT_BYTEPOS (*it) < ZV_BYTE && char_is_whitespace_p
+		(FETCH_CHAR (IT_BYTEPOS (*it)))));
+  }
+}
 
 /* True means print newline to stdout before next mini-buffer message.  */
 
@@ -8785,7 +8806,7 @@ move_it_in_display_line_to (struct it *it,
 	{
 	  if (it->line_wrap == WORD_WRAP && it->area == TEXT_AREA)
 	    {
-	      if (IT_DISPLAYING_WHITESPACE (it))
+	      if (it_displaying_whitespace (it))
 		may_wrap = true;
 	      else if (may_wrap)
 		{
@@ -8950,7 +8971,7 @@ move_it_in_display_line_to (struct it *it,
 				  SAVE_IT (tem_it, *it, tem_data);
 				  set_iterator_to_next (it, true);
 				  if (get_next_display_element (it)
-				      && IT_DISPLAYING_WHITESPACE (it))
+				      && it_displaying_whitespace (it))
 				    can_wrap = false;
 				  RESTORE_IT (it, &tem_it, tem_data);
 				}
@@ -9041,7 +9062,7 @@ move_it_in_display_line_to (struct it *it,
 			 wrapped in the middle of whitespace.
 			 Therefore, wrap_it _is_ relevant in that
 			 case.  */
-		      && !(moved_forward && IT_DISPLAYING_WHITESPACE (it)))
+		      && !(moved_forward && it_displaying_whitespace (it)))
 		    {
 		      /* If we've found TO_X, go back there, as we now
 			 know the last word fits on this screen line.  */
@@ -21427,7 +21448,7 @@ display_line (struct it *it, int cursor_vpos)
 
 	  if (it->line_wrap == WORD_WRAP && it->area == TEXT_AREA)
 	    {
-	      if (IT_DISPLAYING_WHITESPACE (it))
+	      if (it_displaying_whitespace (it))
 		may_wrap = true;
 	      else if (may_wrap)
 		{
@@ -21571,7 +21592,7 @@ display_line (struct it *it, int cursor_vpos)
 				 was a space or tab AND (ii) the
 				 current character is not.  */
 			      && (!may_wrap
-				  || IT_DISPLAYING_WHITESPACE (it)))
+				  || it_displaying_whitespace (it)))
 			    goto back_to_wrap;
 
 			  /* Record the maximum and minimum buffer
@@ -21605,7 +21626,7 @@ display_line (struct it *it, int cursor_vpos)
 					  was a space or tab AND (ii) the
 					  current character is not.  */
 				       && (!may_wrap
-					   || IT_DISPLAYING_WHITESPACE (it)))
+					   || it_displaying_whitespace (it)))
 				goto back_to_wrap;
 
 			    }

next prev parent reply	other threads:[~2017-12-09  3:50 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-10  8:29 bug#13399: 24.3.50; Word-wrap can't wrap at zero-width space U-200B martin rudalics
2013-01-10 19:15 ` Eli Zaretskii
2013-01-11  8:16   ` martin rudalics
2013-01-11  8:58     ` Eli Zaretskii
2013-01-11 10:29       ` martin rudalics
2013-01-11 10:57         ` Eli Zaretskii
2013-01-11 14:30           ` martin rudalics
2013-01-11 14:49             ` Eli Zaretskii
2013-01-11 15:17               ` martin rudalics
2013-01-11 15:22                 ` Christopher Schmidt
2013-01-11 18:04                   ` martin rudalics
2013-01-11 15:53                 ` Eli Zaretskii
2013-01-11 18:04                   ` martin rudalics
2013-01-11 16:08             ` Stefan Monnier
2013-01-11 18:06               ` martin rudalics
2013-01-11 18:50                 ` Stefan Monnier
2013-01-11 19:29                   ` Eli Zaretskii
2013-01-11 22:47                     ` Stefan Monnier
2013-01-12  8:28                       ` Eli Zaretskii
2013-01-12 13:20                         ` Stefan Monnier
2013-01-12 14:12                           ` Eli Zaretskii
2013-01-12 16:06                             ` Stefan Monnier
2013-02-02 16:48                         ` martin rudalics
2013-02-02 17:52                           ` Eli Zaretskii
2013-02-02 18:20                             ` martin rudalics
2013-02-02 18:36                               ` Eli Zaretskii
2013-02-03  9:44                                 ` martin rudalics
2013-02-03 16:01                                   ` Stefan Monnier
2013-02-03 19:32                                   ` Eli Zaretskii
2013-02-04 17:04                                     ` martin rudalics
2013-02-04 17:57                                       ` Eli Zaretskii
2013-01-11 19:08                 ` Eli Zaretskii
2013-01-12 14:29                   ` martin rudalics
2013-01-12 14:56                     ` Eli Zaretskii
2013-01-12 16:37                       ` martin rudalics
2013-01-12 16:51                         ` Eli Zaretskii
2013-01-12 18:01                           ` martin rudalics
2013-01-12 18:38                             ` Eli Zaretskii
2013-01-14 18:04                               ` martin rudalics
2013-02-03 18:57   ` martin rudalics
2013-02-03 19:45     ` Eli Zaretskii
2017-12-08  1:02 ` Adam Tack
2017-12-08 10:12   ` martin rudalics
2017-12-08 15:38   ` Eli Zaretskii
2017-12-08 20:08     ` Eli Zaretskii
2017-12-09  3:50       ` Adam Tack [this message]
2017-12-12 17:13         ` Eli Zaretskii
2017-12-13  4:00           ` Adam Tack
2017-12-13 16:09             ` Eli Zaretskii
2017-12-17  2:22               ` Adam Tack
2020-09-18 14:55                 ` Lars Ingebrigtsen
2020-09-18 15:39                   ` Eli Zaretskii
2020-09-19 13:15                     ` Lars Ingebrigtsen
2020-09-19 14:36                       ` Eli Zaretskii

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:6d59a83 dfblob:c8ffa2b dfblob:af89a8b dfblob:7e47c06
dfblob:4e8b045 )
 OR (
bs:"bug#13399: 24.3.50; Word-wrap can't wrap at zero-width space U-200B" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAA+VxxG4Zpj6W+QzLbSnSx04spdeMV2fiaqmPZWktnqX7cScKA@mail.gmail.com \
    --to=adam.tack.513@gmail.com \
    --cc=13399@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.