* [PATCH 0/2] Balance bidi control chars @ 2020-08-15 9:30 Teemu Likonen 2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen ` (2 more replies) 0 siblings, 3 replies; 6+ messages in thread From: Teemu Likonen @ 2020-08-15 9:30 UTC (permalink / raw) To: notmuch; +Cc: tomi.ollila These patches continue the ideas written in message: id:87sgcuuzio.fsf@iki.fi The first patch adds an new function which can be used to balance Unicode's bidirectional control characters in its string argument. The seconds patch modifies old "notmuch-sanitize" function so that it calls the new function. Teemu Likonen (2): Emacs: Add a new function for balancing bidi control chars Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize emacs/notmuch-lib.el | 57 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 56 insertions(+), 1 deletion(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars 2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen @ 2020-08-15 9:30 ` Teemu Likonen 2020-08-16 16:28 ` Tomi Ollila 2020-08-15 9:30 ` [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize Teemu Likonen 2020-08-15 9:44 ` [PATCH 0/2] Balance bidi control chars Teemu Likonen 2 siblings, 1 reply; 6+ messages in thread From: Teemu Likonen @ 2020-08-15 9:30 UTC (permalink / raw) To: notmuch; +Cc: tomi.ollila The following Unicode's bidirectional control chars are modal so that they push a new bidirectional rendering mode to a stack: U+202A LEFT-TO-RIGHT EMBEDDING U+202B RIGHT-TO-LEFT EMBEDDING U+202D LEFT-TO-RIGHT OVERRIDE U+202E RIGHT-TO-LEFT OVERRIDE Every mode must be terminated with with character U+202C POP DIRECTIONAL FORMATTING which pops the mode from the stack. The stack is per paragraph. A new text paragraph resets the rendering mode changed by these control characters. This change adds a new function "notmuch-balance-bidi-ctrl-chars" which reads its STRING argument and ensures that all push characters (U+202A, U+202B, U+202D, U+202E) have a pop character pair (U+202C). The function may add more U+202C characters at the end of the returned string, or it may remove some U+202C characters. The returned string is safe in the sense that it won't change the surrounding bidirectional rendering mode. This function should be used when sanitizing arbitrary input. --- emacs/notmuch-lib.el | 54 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el index 118faf1e..e6252c6c 100644 --- a/emacs/notmuch-lib.el +++ b/emacs/notmuch-lib.el @@ -469,6 +469,60 @@ be displayed." "[No Subject]" subject))) + +(defun notmuch-balance-bidi-ctrl-chars (string) + "Balance bidirectional control chars in STRING. + +The following Unicode's bidirectional control chars are modal so +that they push a new bidirectional rendering mode to a stack: +U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING, +U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE. +Every mode must be terminated with with character U+202C POP +DIRECTIONAL FORMATTING which pops the mode from the stack. The +stack is per paragraph. A new text paragraph resets the rendering +mode changed by these control characters. + +This function reads the STRING argument and ensures that all push +characters (U+202A, U+202B, U+202D, U+202E) have a pop character +pair (U+202C). The function may add more U+202C characters at the +end of the returned string, or it may remove some U+202C +characters. The returned string is safe in the sense that it +won't change the surrounding bidirectional rendering mode. This +function should be used when sanitizing arbitrary input." + + (let ((new-string nil) + (stack-count 0)) + + (cl-flet ((push-char-p (c) + ;; U+202A LEFT-TO-RIGHT EMBEDDING + ;; U+202B RIGHT-TO-LEFT EMBEDDING + ;; U+202D LEFT-TO-RIGHT OVERRIDE + ;; U+202E RIGHT-TO-LEFT OVERRIDE + (cl-find c '(?\u202a ?\u202b ?\u202d ?\u202e))) + (pop-char-p (c) + ;; U+202C POP DIRECTIONAL FORMATTING + (eql c ?\u202c))) + + (cl-loop for char across string + do (cond ((push-char-p char) + (cl-incf stack-count) + (push char new-string)) + ((and (pop-char-p char) + (cl-plusp stack-count)) + (cl-decf stack-count) + (push char new-string)) + ((and (pop-char-p char) + (not (cl-plusp stack-count))) + ;; The stack is empty. Ignore this pop character. + ) + (t (push char new-string))))) + + ;; Add possible missing pop characters. + (cl-loop repeat stack-count + do (push ?\x202c new-string)) + + (seq-into (nreverse new-string) 'string))) + (defun notmuch-sanitize (str) "Sanitize control character in STR. -- 2.20.1 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars 2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen @ 2020-08-16 16:28 ` Tomi Ollila 2020-08-16 17:41 ` Teemu Likonen 0 siblings, 1 reply; 6+ messages in thread From: Tomi Ollila @ 2020-08-16 16:28 UTC (permalink / raw) To: Teemu Likonen, notmuch; +Cc: David Edmondson On Sat, Aug 15 2020, Teemu Likonen wrote: > The following Unicode's bidirectional control chars are modal so that > they push a new bidirectional rendering mode to a stack: > > U+202A LEFT-TO-RIGHT EMBEDDING > U+202B RIGHT-TO-LEFT EMBEDDING > U+202D LEFT-TO-RIGHT OVERRIDE > U+202E RIGHT-TO-LEFT OVERRIDE Good stuff -- implementation looks like port of the php code in https://www.iamcal.com/understanding-bidirectional-text to emacs lisp... anyway nice implementation took be a bit of time for me to understand it... thoughts - is it slow to execute it always, pure lisp implementation; (string-match "[\u202a-\u202e]") could be done before that. (if it were executed often could loop with `looking-at` (and then moving point based on match-end) be faster... - *but* adding U+202C's in `notmuch-sanitize` is doing it too early, as some functions truncate the strings afterwards if those are too long (e.g. `notmuch-search-insert-authors`) so those get lost.. - what about https://en.wikipedia.org/wiki/Bidirectional_text#Isolates (was documented more in some page, cannot find it anymore...) (what I noticed when looking `notmuch-search-insert-authors` that it uses `length` to check the length of a string -- but that also counts these bidi mode changing "characters" (as one char). `string-width` would be better there -- and probably in many other places.) (I tried quite a few things, something that could "reset" the stack with e.g. one invisible tab, but no go (or that was filtered as I added it to `notmuch-sanitize` ;), As a final step I did (defun notmuch-sanitize (str) ... - (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str)) + (replace-regexp-in-string + "[\u202A-\u202E\u2066-\u2069]" "" + (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str))) just to test-drop those chars. probably not good enough ;/) Tomi > > Every mode must be terminated with with character U+202C POP > DIRECTIONAL FORMATTING which pops the mode from the stack. The stack > is per paragraph. A new text paragraph resets the rendering mode > changed by these control characters. > > This change adds a new function "notmuch-balance-bidi-ctrl-chars" > which reads its STRING argument and ensures that all push > characters (U+202A, U+202B, U+202D, U+202E) have a pop character > pair (U+202C). The function may add more U+202C characters at the end > of the returned string, or it may remove some U+202C characters. The > returned string is safe in the sense that it won't change the > surrounding bidirectional rendering mode. This function should be used > when sanitizing arbitrary input. > --- > emacs/notmuch-lib.el | 54 ++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 54 insertions(+) > > diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el > index 118faf1e..e6252c6c 100644 > --- a/emacs/notmuch-lib.el > +++ b/emacs/notmuch-lib.el > @@ -469,6 +469,60 @@ be displayed." > "[No Subject]" > subject))) > > + > +(defun notmuch-balance-bidi-ctrl-chars (string) > + "Balance bidirectional control chars in STRING. > + > +The following Unicode's bidirectional control chars are modal so > +that they push a new bidirectional rendering mode to a stack: > +U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING, > +U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE. > +Every mode must be terminated with with character U+202C POP > +DIRECTIONAL FORMATTING which pops the mode from the stack. The > +stack is per paragraph. A new text paragraph resets the rendering > +mode changed by these control characters. > + > +This function reads the STRING argument and ensures that all push > +characters (U+202A, U+202B, U+202D, U+202E) have a pop character > +pair (U+202C). The function may add more U+202C characters at the > +end of the returned string, or it may remove some U+202C > +characters. The returned string is safe in the sense that it > +won't change the surrounding bidirectional rendering mode. This > +function should be used when sanitizing arbitrary input." > + > + (let ((new-string nil) > + (stack-count 0)) > + > + (cl-flet ((push-char-p (c) > + ;; U+202A LEFT-TO-RIGHT EMBEDDING > + ;; U+202B RIGHT-TO-LEFT EMBEDDING > + ;; U+202D LEFT-TO-RIGHT OVERRIDE > + ;; U+202E RIGHT-TO-LEFT OVERRIDE > + (cl-find c '(?\u202a ?\u202b ?\u202d ?\u202e))) > + (pop-char-p (c) > + ;; U+202C POP DIRECTIONAL FORMATTING > + (eql c ?\u202c))) > + > + (cl-loop for char across string > + do (cond ((push-char-p char) > + (cl-incf stack-count) > + (push char new-string)) > + ((and (pop-char-p char) > + (cl-plusp stack-count)) > + (cl-decf stack-count) > + (push char new-string)) > + ((and (pop-char-p char) > + (not (cl-plusp stack-count))) > + ;; The stack is empty. Ignore this pop character. > + ) > + (t (push char new-string))))) > + > + ;; Add possible missing pop characters. > + (cl-loop repeat stack-count > + do (push ?\x202c new-string)) > + > + (seq-into (nreverse new-string) 'string))) > + > (defun notmuch-sanitize (str) > "Sanitize control character in STR. > > -- > 2.20.1 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars 2020-08-16 16:28 ` Tomi Ollila @ 2020-08-16 17:41 ` Teemu Likonen 0 siblings, 0 replies; 6+ messages in thread From: Teemu Likonen @ 2020-08-16 17:41 UTC (permalink / raw) To: Tomi Ollila, notmuch; +Cc: David Edmondson [-- Attachment #1.1: Type: text/plain, Size: 1957 bytes --] * 2020-08-16 19:28:51+03, Tomi Ollila wrote: > Good stuff -- implementation looks like port of the php code in > > https://www.iamcal.com/understanding-bidirectional-text > > to emacs lisp... anyway nice implementation took be a bit of > time for me to understand it... I don't read PHP and didn't try to read that code at all but the idea is simple enough. > thoughts > > - is it slow to execute it always, pure lisp implementation; > (string-match "[\u202a-\u202e]") could be done before that. > (if it were executed often could loop with `looking-at` > (and then moving point based on match-end) be faster... I don't see any speed issues but if we want to optimize I would create a new sanitize function which walks just once across the characters without using regular expressions. But currently I think it's unnecessary micro optimization. > - *but* adding U+202C's in `notmuch-sanitize` is doing it too early, as > some functions truncate the strings afterwards if those are too long > (e.g. `notmuch-search-insert-authors`) so those get lost.. Good point. This would mean that we shouldn't do "bidi ctrl char balancing" in notmuch-sanitize. We should call the new notmuch-balance-bidi-ctrl-chars function in various places before inserting arbitrary strings to buffer and before combining such strings with other strings. > (what I noticed when looking `notmuch-search-insert-authors` that it uses > `length` to check the length of a string -- but that also counts these bidi > mode changing "characters" (as one char). `string-width` would be better > there -- and probably in many other places.) Yes, definitely string-width when truncating is based on width and when using tabular format in buffers. With that function zero-width characters really have no width. -- /// Teemu Likonen - .-.. http://www.iki.fi/tlikonen/ // OpenPGP: 4E1055DC84E9DFF613D78557719D69D324539450 [-- Attachment #1.2: signature.asc --] [-- Type: application/pgp-signature, Size: 251 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize 2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen 2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen @ 2020-08-15 9:30 ` Teemu Likonen 2020-08-15 9:44 ` [PATCH 0/2] Balance bidi control chars Teemu Likonen 2 siblings, 0 replies; 6+ messages in thread From: Teemu Likonen @ 2020-08-15 9:30 UTC (permalink / raw) To: notmuch; +Cc: tomi.ollila --- emacs/notmuch-lib.el | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el index e6252c6c..e0122f7a 100644 --- a/emacs/notmuch-lib.el +++ b/emacs/notmuch-lib.el @@ -527,7 +527,8 @@ function should be used when sanitizing arbitrary input." "Sanitize control character in STR. This includes newlines, tabs, and other funny characters." - (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str)) + (notmuch-balance-bidi-ctrl-chars + (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str))) (defun notmuch-escape-boolean-term (term) "Escape a boolean term for use in a query. -- 2.20.1 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 0/2] Balance bidi control chars 2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen 2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen 2020-08-15 9:30 ` [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize Teemu Likonen @ 2020-08-15 9:44 ` Teemu Likonen 2 siblings, 0 replies; 6+ messages in thread From: Teemu Likonen @ 2020-08-15 9:44 UTC (permalink / raw) To: notmuch [-- Attachment #1.1: Type: text/plain, Size: 414 bytes --] * 2020-08-15 12:30:34+03, Teemu Likonen wrote: > These patches continue the ideas written in message: > > id:87sgcuuzio.fsf@iki.fi Here is a nice and relatively short reference for anyone who is interested in the subject: https://www.iamcal.com/understanding-bidirectional-text -- /// Teemu Likonen - .-.. http://www.iki.fi/tlikonen/ // OpenPGP: 4E1055DC84E9DFF613D78557719D69D324539450 [-- Attachment #1.2: signature.asc --] [-- Type: application/pgp-signature, Size: 251 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-08-16 17:42 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen 2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen 2020-08-16 16:28 ` Tomi Ollila 2020-08-16 17:41 ` Teemu Likonen 2020-08-15 9:30 ` [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize Teemu Likonen 2020-08-15 9:44 ` [PATCH 0/2] Balance bidi control chars Teemu Likonen
Code repositories for project(s) associated with this public inbox https://yhetil.org/notmuch.git/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).