unofficial mirror of notmuch@notmuchmail.org
From: Tomi Ollila <tomi.ollila@iki.fi>
To: Teemu Likonen <tlikonen@iki.fi>, notmuch@notmuchmail.org
Cc: David Edmondson <dme@dme.org>
Subject: Re: [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars
Date: Sun, 16 Aug 2020 19:28:51 +0300
Message-ID: <m24kp2k118.fsf@guru.guru-group.fi>
In-Reply-To: <20200815093036.5930-2-tlikonen@iki.fi>

On Sat, Aug 15 2020, Teemu Likonen wrote:

> The following Unicode's bidirectional control chars are modal so that
> they push a new bidirectional rendering mode to a stack:
>
>     U+202A LEFT-TO-RIGHT EMBEDDING
>     U+202B RIGHT-TO-LEFT EMBEDDING
>     U+202D LEFT-TO-RIGHT OVERRIDE
>     U+202E RIGHT-TO-LEFT OVERRIDE

Good stuff -- the implementation looks like a port of the PHP code in

   https://www.iamcal.com/understanding-bidirectional-text

to Emacs Lisp... anyway, nice implementation; it took a bit of
time for me to understand it...

Thoughts:

- Is it slow to always execute this, given that it is a pure Lisp
  implementation? A (string-match "[\u202a-\u202e]" string) check could
  be done before it. (If it is executed often, could a loop with
  `looking-at`, moving point based on `match-end`, be faster?)

- *But* adding U+202Cs in `notmuch-sanitize` does it too early: some
  functions (e.g. `notmuch-search-insert-authors`) truncate the strings
  afterwards if they are too long, so the added characters get lost.

- What about isolates, https://en.wikipedia.org/wiki/Bidirectional_text#Isolates ?
  (They were documented in more detail on some page that I cannot find
  anymore...)
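
A minimal sketch of the pre-check idea from the first point above, written
as a standalone function. The name `my/bidi-balance` is hypothetical and not
part of the patch. It returns the input unchanged when no U+202A..U+202E
character is present, and otherwise jumps from match to match with
`string-match` instead of visiting every character:

```elisp
;; Hypothetical helper, not from the patch: balance U+202A..U+202E
;; control characters, with a cheap regexp pre-check for the common
;; case where the string contains none of them.
(defun my/bidi-balance (string)
  "Return STRING with bidi push/pop control characters balanced."
  (if (not (string-match-p "[\u202a-\u202e]" string))
      ;; Fast path: nothing to balance, return the argument itself.
      string
    (let ((depth 0)   ; number of currently open push characters
          (pos 0)     ; where the next search starts
          (parts nil))
      (while (string-match "[\u202a-\u202e]" string pos)
        (let* ((beg (match-beginning 0))
               (char (aref string beg)))
          ;; Copy the plain text before the control character.
          (push (substring string pos beg) parts)
          (cond ((/= char ?\u202c)
                 ;; A push character: keep it and remember it is open.
                 (setq depth (1+ depth))
                 (push (string char) parts))
                ((> depth 0)
                 ;; A pop character closing an open mode: keep it.
                 (setq depth (1- depth))
                 (push (string char) parts)))
          ;; A pop character with nothing open is dropped.
          (setq pos (match-end 0))))
      ;; Copy the tail, then close whatever is still open.
      (push (substring string pos) parts)
      (push (make-string depth ?\u202c) parts)
      (apply #'concat (nreverse parts)))))
```

Whether this is actually faster than the loop in the patch would need
measuring; the sketch only shows the shape of the check.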

(What I noticed when looking at `notmuch-search-insert-authors` is that it
 uses `length` to check the length of a string -- but that also counts these
 bidi mode-changing "characters" (as one char each). `string-width` would be
 better there -- and probably in many other places.)
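
The truncation and width concerns above can be sketched as follows, assuming
truncation happens first and any still-open modes are closed afterwards. Both
function names are hypothetical. `truncate-string-to-width` is the stock
Emacs function; it measures in display columns, as `string-width` does,
rather than in characters:

```elisp
;; Hypothetical helpers, not from the patch.
(defun my/unclosed-bidi-count (string)
  "Return how many push characters in STRING lack a closing U+202C."
  (let ((depth 0))
    (dolist (c (append string nil) depth)
      (cond ((memql c '(?\u202a ?\u202b ?\u202d ?\u202e))
             (setq depth (1+ depth)))
            ((and (eql c ?\u202c) (> depth 0))
             (setq depth (1- depth)))))))

(defun my/truncate-then-close (string width)
  "Truncate STRING to WIDTH display columns, then close open bidi modes."
  (let ((short (truncate-string-to-width string width)))
    ;; The pop characters are appended after truncation, so a later
    ;; cut cannot remove them.
    (concat short
            (make-string (my/unclosed-bidi-count short) ?\u202c))))

;; Character count versus display width of the same string:
(let ((s "\u202bab"))
  (list (length s) (string-width s)))
```

The last form returns the character count next to the display width; the
width reported for the control character depends on the Emacs character
width tables, so no value is assumed here.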

(I tried quite a few things, looking for something that could "reset" the
 stack with e.g. one invisible tab, but no go (or it was filtered out, as I
 added it to `notmuch-sanitize` ;). As a final step I did

  (defun notmuch-sanitize (str)
  ...
  -  (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str))
  +  (replace-regexp-in-string
  +   "[\u202A-\u202E\u2066-\u2069]" ""
  +   (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str)))

just to test dropping those chars. Probably not good enough ;/)


Tomi

>
> Every mode must be terminated with character U+202C POP
> DIRECTIONAL FORMATTING which pops the mode from the stack. The stack
> is per paragraph. A new text paragraph resets the rendering mode
> changed by these control characters.
>
> This change adds a new function "notmuch-balance-bidi-ctrl-chars"
> which reads its STRING argument and ensures that all push
> characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> pair (U+202C). The function may add more U+202C characters at the end
> of the returned string, or it may remove some U+202C characters. The
> returned string is safe in the sense that it won't change the
> surrounding bidirectional rendering mode. This function should be used
> when sanitizing arbitrary input.
> ---
>  emacs/notmuch-lib.el | 54 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el
> index 118faf1e..e6252c6c 100644
> --- a/emacs/notmuch-lib.el
> +++ b/emacs/notmuch-lib.el
> @@ -469,6 +469,60 @@ be displayed."
>  	"[No Subject]"
>        subject)))
>  
> +
> +(defun notmuch-balance-bidi-ctrl-chars (string)
> +  "Balance bidirectional control chars in STRING.
> +
> +The following Unicode's bidirectional control chars are modal so
> +that they push a new bidirectional rendering mode to a stack:
> +U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING,
> +U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE.
> +Every mode must be terminated with character U+202C POP
> +DIRECTIONAL FORMATTING which pops the mode from the stack. The
> +stack is per paragraph. A new text paragraph resets the rendering
> +mode changed by these control characters.
> +
> +This function reads the STRING argument and ensures that all push
> +characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> +pair (U+202C). The function may add more U+202C characters at the
> +end of the returned string, or it may remove some U+202C
> +characters. The returned string is safe in the sense that it
> +won't change the surrounding bidirectional rendering mode. This
> +function should be used when sanitizing arbitrary input."
> +
> +  (let ((new-string nil)
> +	(stack-count 0))
> +
> +    (cl-flet ((push-char-p (c)
> +		;; U+202A LEFT-TO-RIGHT EMBEDDING
> +		;; U+202B RIGHT-TO-LEFT EMBEDDING
> +		;; U+202D LEFT-TO-RIGHT OVERRIDE
> +		;; U+202E RIGHT-TO-LEFT OVERRIDE
> +		(cl-find c '(?\u202a ?\u202b ?\u202d ?\u202e)))
> +	      (pop-char-p (c)
> +		;; U+202C POP DIRECTIONAL FORMATTING
> +		(eql c ?\u202c)))
> +
> +      (cl-loop for char across string
> +	       do (cond ((push-char-p char)
> +			 (cl-incf stack-count)
> +			 (push char new-string))
> +			((and (pop-char-p char)
> +			      (cl-plusp stack-count))
> +			 (cl-decf stack-count)
> +			 (push char new-string))
> +			((and (pop-char-p char)
> +			      (not (cl-plusp stack-count)))
> +			 ;; The stack is empty. Ignore this pop character.
> +			 )
> +			(t (push char new-string)))))
> +
> +    ;; Add possible missing pop characters.
> +    (cl-loop repeat stack-count
> +	     do (push ?\x202c new-string))
> +
> +    (seq-into (nreverse new-string) 'string)))
> +
>  (defun notmuch-sanitize (str)
>    "Sanitize control character in STR.
>  
> -- 
> 2.20.1

Thread overview: 6+ messages
2020-08-15  9:30 [PATCH 0/2] Balance bidi control chars Teemu Likonen
2020-08-15  9:30 ` [PATCH 1/2] Emacs: Add a new function for balancing " Teemu Likonen
2020-08-16 16:28   ` Tomi Ollila [this message]
2020-08-16 17:41     ` Teemu Likonen
2020-08-15  9:30 ` [PATCH 2/2] Emacs: Call notmuch-balance-bidi-ctrl-chars in notmuch-sanitize Teemu Likonen
2020-08-15  9:44 ` [PATCH 0/2] Balance bidi control chars Teemu Likonen
