* [PATCH 0/2] Balance bidi control chars
@ 2020-08-15 9:30 Teemu Likonen
2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Teemu Likonen @ 2020-08-15 9:30 UTC (permalink / raw)
To: notmuch; +Cc: tomi.ollila
These patches continue the ideas written in message:
id:87sgcuuzio.fsf@iki.fi
The first patch adds an new function which can be used to balance
Unicode's bidirectional control characters in its string argument. The
seconds patch modifies old "notmuch-sanitize" function so that it
calls the new function.
Teemu Likonen (2):
Emacs: Add a new function for balancing bidi control chars
Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize
emacs/notmuch-lib.el | 57 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 56 insertions(+), 1 deletion(-)
--
2.20.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars
2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen
@ 2020-08-15 9:30 ` Teemu Likonen
2020-08-16 16:28 ` Tomi Ollila
2020-08-15 9:30 ` [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize Teemu Likonen
2020-08-15 9:44 ` [PATCH 0/2] Balance bidi control chars Teemu Likonen
2 siblings, 1 reply; 6+ messages in thread
From: Teemu Likonen @ 2020-08-15 9:30 UTC (permalink / raw)
To: notmuch; +Cc: tomi.ollila
The following Unicode's bidirectional control chars are modal so that
they push a new bidirectional rendering mode to a stack:
U+202A LEFT-TO-RIGHT EMBEDDING
U+202B RIGHT-TO-LEFT EMBEDDING
U+202D LEFT-TO-RIGHT OVERRIDE
U+202E RIGHT-TO-LEFT OVERRIDE
Every mode must be terminated with with character U+202C POP
DIRECTIONAL FORMATTING which pops the mode from the stack. The stack
is per paragraph. A new text paragraph resets the rendering mode
changed by these control characters.
This change adds a new function "notmuch-balance-bidi-ctrl-chars"
which reads its STRING argument and ensures that all push
characters (U+202A, U+202B, U+202D, U+202E) have a pop character
pair (U+202C). The function may add more U+202C characters at the end
of the returned string, or it may remove some U+202C characters. The
returned string is safe in the sense that it won't change the
surrounding bidirectional rendering mode. This function should be used
when sanitizing arbitrary input.
---
emacs/notmuch-lib.el | 54 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el
index 118faf1e..e6252c6c 100644
--- a/emacs/notmuch-lib.el
+++ b/emacs/notmuch-lib.el
@@ -469,6 +469,60 @@ be displayed."
"[No Subject]"
subject)))
+
+(defun notmuch-balance-bidi-ctrl-chars (string)
+ "Balance bidirectional control chars in STRING.
+
+The following Unicode's bidirectional control chars are modal so
+that they push a new bidirectional rendering mode to a stack:
+U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING,
+U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE.
+Every mode must be terminated with with character U+202C POP
+DIRECTIONAL FORMATTING which pops the mode from the stack. The
+stack is per paragraph. A new text paragraph resets the rendering
+mode changed by these control characters.
+
+This function reads the STRING argument and ensures that all push
+characters (U+202A, U+202B, U+202D, U+202E) have a pop character
+pair (U+202C). The function may add more U+202C characters at the
+end of the returned string, or it may remove some U+202C
+characters. The returned string is safe in the sense that it
+won't change the surrounding bidirectional rendering mode. This
+function should be used when sanitizing arbitrary input."
+
+ (let ((new-string nil)
+ (stack-count 0))
+
+ (cl-flet ((push-char-p (c)
+ ;; U+202A LEFT-TO-RIGHT EMBEDDING
+ ;; U+202B RIGHT-TO-LEFT EMBEDDING
+ ;; U+202D LEFT-TO-RIGHT OVERRIDE
+ ;; U+202E RIGHT-TO-LEFT OVERRIDE
+ (cl-find c '(?\u202a ?\u202b ?\u202d ?\u202e)))
+ (pop-char-p (c)
+ ;; U+202C POP DIRECTIONAL FORMATTING
+ (eql c ?\u202c)))
+
+ (cl-loop for char across string
+ do (cond ((push-char-p char)
+ (cl-incf stack-count)
+ (push char new-string))
+ ((and (pop-char-p char)
+ (cl-plusp stack-count))
+ (cl-decf stack-count)
+ (push char new-string))
+ ((and (pop-char-p char)
+ (not (cl-plusp stack-count)))
+ ;; The stack is empty. Ignore this pop character.
+ )
+ (t (push char new-string)))))
+
+ ;; Add possible missing pop characters.
+ (cl-loop repeat stack-count
+ do (push ?\x202c new-string))
+
+ (seq-into (nreverse new-string) 'string)))
+
(defun notmuch-sanitize (str)
"Sanitize control character in STR.
--
2.20.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize
2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen
2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen
@ 2020-08-15 9:30 ` Teemu Likonen
2020-08-15 9:44 ` [PATCH 0/2] Balance bidi control chars Teemu Likonen
2 siblings, 0 replies; 6+ messages in thread
From: Teemu Likonen @ 2020-08-15 9:30 UTC (permalink / raw)
To: notmuch; +Cc: tomi.ollila
---
emacs/notmuch-lib.el | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el
index e6252c6c..e0122f7a 100644
--- a/emacs/notmuch-lib.el
+++ b/emacs/notmuch-lib.el
@@ -527,7 +527,8 @@ function should be used when sanitizing arbitrary input."
"Sanitize control character in STR.
This includes newlines, tabs, and other funny characters."
- (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str))
+ (notmuch-balance-bidi-ctrl-chars
+ (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str)))
(defun notmuch-escape-boolean-term (term)
"Escape a boolean term for use in a query.
--
2.20.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 0/2] Balance bidi control chars
2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen
2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen
2020-08-15 9:30 ` [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize Teemu Likonen
@ 2020-08-15 9:44 ` Teemu Likonen
2 siblings, 0 replies; 6+ messages in thread
From: Teemu Likonen @ 2020-08-15 9:44 UTC (permalink / raw)
To: notmuch
[-- Attachment #1.1: Type: text/plain, Size: 414 bytes --]
* 2020-08-15 12:30:34+03, Teemu Likonen wrote:
> These patches continue the ideas written in message:
>
> id:87sgcuuzio.fsf@iki.fi
Here is a nice and relatively short reference for anyone who is
interested in the subject:
https://www.iamcal.com/understanding-bidirectional-text
--
/// Teemu Likonen - .-.. http://www.iki.fi/tlikonen/
// OpenPGP: 4E1055DC84E9DFF613D78557719D69D324539450
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 251 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars
2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen
@ 2020-08-16 16:28 ` Tomi Ollila
2020-08-16 17:41 ` Teemu Likonen
0 siblings, 1 reply; 6+ messages in thread
From: Tomi Ollila @ 2020-08-16 16:28 UTC (permalink / raw)
To: Teemu Likonen, notmuch; +Cc: David Edmondson
On Sat, Aug 15 2020, Teemu Likonen wrote:
> The following Unicode's bidirectional control chars are modal so that
> they push a new bidirectional rendering mode to a stack:
>
> U+202A LEFT-TO-RIGHT EMBEDDING
> U+202B RIGHT-TO-LEFT EMBEDDING
> U+202D LEFT-TO-RIGHT OVERRIDE
> U+202E RIGHT-TO-LEFT OVERRIDE
Good stuff -- implementation looks like port of the php code in
https://www.iamcal.com/understanding-bidirectional-text
to emacs lisp... anyway nice implementation took be a bit of
time for me to understand it...
thoughts
- is it slow to execute it always, pure lisp implementation;
(string-match "[\u202a-\u202e]") could be done before that.
(if it were executed often could loop with `looking-at`
(and then moving point based on match-end) be faster...
- *but* adding U+202C's in `notmuch-sanitize` is doing it too early, as
some functions truncate the strings afterwards if those are too long
(e.g. `notmuch-search-insert-authors`) so those get lost..
- what about https://en.wikipedia.org/wiki/Bidirectional_text#Isolates
(was documented more in some page, cannot find it anymore...)
(what I noticed when looking `notmuch-search-insert-authors` that it uses
`length` to check the length of a string -- but that also counts these bidi
mode changing "characters" (as one char). `string-width` would be better
there -- and probably in many other places.)
(I tried quite a few things, something that could "reset" the stack with
e.g. one invisible tab, but no go (or that was filtered as I added it
to `notmuch-sanitize` ;), As a final step I did
(defun notmuch-sanitize (str)
...
- (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str))
+ (replace-regexp-in-string
+ "[\u202A-\u202E\u2066-\u2069]" ""
+ (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str)))
just to test-drop those chars. probably not good enough ;/)
Tomi
>
> Every mode must be terminated with with character U+202C POP
> DIRECTIONAL FORMATTING which pops the mode from the stack. The stack
> is per paragraph. A new text paragraph resets the rendering mode
> changed by these control characters.
>
> This change adds a new function "notmuch-balance-bidi-ctrl-chars"
> which reads its STRING argument and ensures that all push
> characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> pair (U+202C). The function may add more U+202C characters at the end
> of the returned string, or it may remove some U+202C characters. The
> returned string is safe in the sense that it won't change the
> surrounding bidirectional rendering mode. This function should be used
> when sanitizing arbitrary input.
> ---
> emacs/notmuch-lib.el | 54 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 54 insertions(+)
>
> diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el
> index 118faf1e..e6252c6c 100644
> --- a/emacs/notmuch-lib.el
> +++ b/emacs/notmuch-lib.el
> @@ -469,6 +469,60 @@ be displayed."
> "[No Subject]"
> subject)))
>
> +
> +(defun notmuch-balance-bidi-ctrl-chars (string)
> + "Balance bidirectional control chars in STRING.
> +
> +The following Unicode's bidirectional control chars are modal so
> +that they push a new bidirectional rendering mode to a stack:
> +U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING,
> +U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE.
> +Every mode must be terminated with with character U+202C POP
> +DIRECTIONAL FORMATTING which pops the mode from the stack. The
> +stack is per paragraph. A new text paragraph resets the rendering
> +mode changed by these control characters.
> +
> +This function reads the STRING argument and ensures that all push
> +characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> +pair (U+202C). The function may add more U+202C characters at the
> +end of the returned string, or it may remove some U+202C
> +characters. The returned string is safe in the sense that it
> +won't change the surrounding bidirectional rendering mode. This
> +function should be used when sanitizing arbitrary input."
> +
> + (let ((new-string nil)
> + (stack-count 0))
> +
> + (cl-flet ((push-char-p (c)
> + ;; U+202A LEFT-TO-RIGHT EMBEDDING
> + ;; U+202B RIGHT-TO-LEFT EMBEDDING
> + ;; U+202D LEFT-TO-RIGHT OVERRIDE
> + ;; U+202E RIGHT-TO-LEFT OVERRIDE
> + (cl-find c '(?\u202a ?\u202b ?\u202d ?\u202e)))
> + (pop-char-p (c)
> + ;; U+202C POP DIRECTIONAL FORMATTING
> + (eql c ?\u202c)))
> +
> + (cl-loop for char across string
> + do (cond ((push-char-p char)
> + (cl-incf stack-count)
> + (push char new-string))
> + ((and (pop-char-p char)
> + (cl-plusp stack-count))
> + (cl-decf stack-count)
> + (push char new-string))
> + ((and (pop-char-p char)
> + (not (cl-plusp stack-count)))
> + ;; The stack is empty. Ignore this pop character.
> + )
> + (t (push char new-string)))))
> +
> + ;; Add possible missing pop characters.
> + (cl-loop repeat stack-count
> + do (push ?\x202c new-string))
> +
> + (seq-into (nreverse new-string) 'string)))
> +
> (defun notmuch-sanitize (str)
> "Sanitize control character in STR.
>
> --
> 2.20.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars
2020-08-16 16:28 ` Tomi Ollila
@ 2020-08-16 17:41 ` Teemu Likonen
0 siblings, 0 replies; 6+ messages in thread
From: Teemu Likonen @ 2020-08-16 17:41 UTC (permalink / raw)
To: Tomi Ollila, notmuch; +Cc: David Edmondson
[-- Attachment #1.1: Type: text/plain, Size: 1957 bytes --]
* 2020-08-16 19:28:51+03, Tomi Ollila wrote:
> Good stuff -- implementation looks like port of the php code in
>
> https://www.iamcal.com/understanding-bidirectional-text
>
> to emacs lisp... anyway nice implementation took be a bit of
> time for me to understand it...
I don't read PHP and didn't try to read that code at all but the idea is
simple enough.
> thoughts
>
> - is it slow to execute it always, pure lisp implementation;
> (string-match "[\u202a-\u202e]") could be done before that.
> (if it were executed often could loop with `looking-at`
> (and then moving point based on match-end) be faster...
I don't see any speed issues but if we want to optimize I would create a
new sanitize function which walks just once across the characters
without using regular expressions. But currently I think it's
unnecessary micro optimization.
> - *but* adding U+202C's in `notmuch-sanitize` is doing it too early, as
> some functions truncate the strings afterwards if those are too long
> (e.g. `notmuch-search-insert-authors`) so those get lost..
Good point. This would mean that we shouldn't do "bidi ctrl char
balancing" in notmuch-sanitize. We should call the new
notmuch-balance-bidi-ctrl-chars function in various places before
inserting arbitrary strings to buffer and before combining such strings
with other strings.
> (what I noticed when looking `notmuch-search-insert-authors` that it uses
> `length` to check the length of a string -- but that also counts these bidi
> mode changing "characters" (as one char). `string-width` would be better
> there -- and probably in many other places.)
Yes, definitely string-width when truncating is based on width and when
using tabular format in buffers. With that function zero-width
characters really have no width.
--
/// Teemu Likonen - .-.. http://www.iki.fi/tlikonen/
// OpenPGP: 4E1055DC84E9DFF613D78557719D69D324539450
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 251 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-08-16 17:42 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-15 9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen
2020-08-15 9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen
2020-08-16 16:28 ` Tomi Ollila
2020-08-16 17:41 ` Teemu Likonen
2020-08-15 9:30 ` [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize Teemu Likonen
2020-08-15 9:44 ` [PATCH 0/2] Balance bidi control chars Teemu Likonen
Code repositories for project(s) associated with this public inbox
https://yhetil.org/notmuch.git/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).