Received: from guru.guru-group.fi (unknown [IPv6:2a02:2380:1:9:5054:ff:feb7:a4bc])
	(using TLSv1.2 with cipher AES256-SHA (256/256 bits))
	(No client certificate requested)
	(Authenticated sender: too)
	by lahtoruutu.iki.fi (Postfix) with ESMTPSA id 2F2ED1B0029D;
	Sun, 16 Aug 2020 19:28:53 +0300 (EEST)
From: Tomi Ollila
To: Teemu Likonen, notmuch@notmuchmail.org
Cc: David Edmondson
Subject: Re: [PATCH 1/2] Emacs: Add a new function for balancing bidi control chars
In-Reply-To: <20200815093036.5930-2-tlikonen@iki.fi>
References: <20200815093036.5930-1-tlikonen@iki.fi> <20200815093036.5930-2-tlikonen@iki.fi>
User-Agent: Notmuch/0.30+160~gd1c5cc6 (https://notmuchmail.org) Emacs/27.1 (x86_64-pc-linux-gnu)
MIME-Version: 1.0
List-Id: "Use and development of the notmuch mail system."
Content-Type: text/plain; charset="us-ascii"

On Sat, Aug 15 2020, Teemu Likonen wrote:

> The following Unicode's bidirectional control chars are modal so that
> they push a new bidirectional rendering mode to a stack:
>
>     U+202A LEFT-TO-RIGHT EMBEDDING
>     U+202B RIGHT-TO-LEFT EMBEDDING
>     U+202D LEFT-TO-RIGHT OVERRIDE
>     U+202E RIGHT-TO-LEFT OVERRIDE

Good stuff -- the implementation looks like a port of the PHP code in
https://www.iamcal.com/understanding-bidirectional-text to Emacs Lisp...

Anyway, nice implementation; it took me a bit of time to understand it...

Thoughts:

- Is it slow to always execute this, being a pure Lisp implementation?
  A (string-match "[\u202a-\u202e]" string) check could be done first,
  so that strings without these characters skip the loop entirely.
  (If it were executed often, could a loop with `looking-at` (and then
  moving point based on `match-end`) be faster?)

- *But* adding U+202C's in `notmuch-sanitize` is doing it too early, as
  some functions truncate the strings afterwards if they are too long
  (e.g. `notmuch-search-insert-authors`), so the added U+202C's get lost.

- What about https://en.wikipedia.org/wiki/Bidirectional_text#Isolates?
  (This was documented in more detail on some page that I cannot find
  anymore...)
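
A minimal sketch of the pre-check idea from the first bullet, under stated assumptions: `my/maybe-balance-bidi` is a hypothetical name, and BALANCE-FN stands in for the balancing function from the patch. It uses `string-match-p` rather than `string-match` only so that the caller's match data is left alone.

```elisp
;; Sketch only: `my/maybe-balance-bidi' is a hypothetical wrapper, not
;; part of notmuch.  BALANCE-FN stands in for the function in the patch.
(defun my/maybe-balance-bidi (string balance-fn)
  "Return STRING unchanged unless it contains a char in U+202A..U+202E.
Otherwise return the result of calling BALANCE-FN on STRING."
  (if (string-match-p "[\u202a-\u202e]" string)
      ;; At least one push or pop character is present: do the slow walk.
      (funcall balance-fn string)
    ;; Common case: nothing to balance, so skip the per-character loop.
    string))
```

Whether this is actually faster for typical header strings would have to be measured; the regexp search runs in C, but the strings involved are short.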

(What I noticed when looking at `notmuch-search-insert-authors` is that
it uses `length` to check the length of a string -- but that also counts
each of these bidi mode-changing "characters" as one char, although they
take no space on screen. `string-width` would be better there -- and
probably in many other places.)

(I tried quite a few things, looking for something that could "reset"
the stack with e.g. one invisible tab, but no go (or that was filtered
out, as I added it to `notmuch-sanitize` ;). As a final step I did

 (defun notmuch-sanitize (str)
 ...
-  (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str))
+  (replace-regexp-in-string
+   "[\u202A-\u202E\u2066-\u2069]" ""
+   (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str)))

just to test dropping those chars. Probably not good enough ;/)

Tomi

>
> Every mode must be terminated with with character U+202C POP
> DIRECTIONAL FORMATTING which pops the mode from the stack. The stack
> is per paragraph. A new text paragraph resets the rendering mode
> changed by these control characters.
>
> This change adds a new function "notmuch-balance-bidi-ctrl-chars"
> which reads its STRING argument and ensures that all push
> characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> pair (U+202C). The function may add more U+202C characters at the end
> of the returned string, or it may remove some U+202C characters. The
> returned string is safe in the sense that it won't change the
> surrounding bidirectional rendering mode. This function should be used
> when sanitizing arbitrary input.
> ---
>  emacs/notmuch-lib.el | 54 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el
> index 118faf1e..e6252c6c 100644
> --- a/emacs/notmuch-lib.el
> +++ b/emacs/notmuch-lib.el
> @@ -469,6 +469,60 @@ be displayed."
>        "[No Subject]"
>      subject)))
>
> +
> +(defun notmuch-balance-bidi-ctrl-chars (string)
> +  "Balance bidirectional control chars in STRING.
> +
> +The following Unicode's bidirectional control chars are modal so
> +that they push a new bidirectional rendering mode to a stack:
> +U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING,
> +U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE.
> +Every mode must be terminated with with character U+202C POP
> +DIRECTIONAL FORMATTING which pops the mode from the stack. The
> +stack is per paragraph. A new text paragraph resets the rendering
> +mode changed by these control characters.
> +
> +This function reads the STRING argument and ensures that all push
> +characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> +pair (U+202C). The function may add more U+202C characters at the
> +end of the returned string, or it may remove some U+202C
> +characters. The returned string is safe in the sense that it
> +won't change the surrounding bidirectional rendering mode. This
> +function should be used when sanitizing arbitrary input."
> +
> +  (let ((new-string nil)
> +        (stack-count 0))
> +
> +    (cl-flet ((push-char-p (c)
> +                ;; U+202A LEFT-TO-RIGHT EMBEDDING
> +                ;; U+202B RIGHT-TO-LEFT EMBEDDING
> +                ;; U+202D LEFT-TO-RIGHT OVERRIDE
> +                ;; U+202E RIGHT-TO-LEFT OVERRIDE
> +                (cl-find c '(?\u202a ?\u202b ?\u202d ?\u202e)))
> +              (pop-char-p (c)
> +                ;; U+202C POP DIRECTIONAL FORMATTING
> +                (eql c ?\u202c)))
> +
> +      (cl-loop for char across string
> +               do (cond ((push-char-p char)
> +                         (cl-incf stack-count)
> +                         (push char new-string))
> +                        ((and (pop-char-p char)
> +                              (cl-plusp stack-count))
> +                         (cl-decf stack-count)
> +                         (push char new-string))
> +                        ((and (pop-char-p char)
> +                              (not (cl-plusp stack-count)))
> +                         ;; The stack is empty. Ignore this pop character.
> +                         )
> +                        (t (push char new-string)))))
> +
> +    ;; Add possible missing pop characters.
> +    (cl-loop repeat stack-count
> +             do (push ?\x202c new-string))
> +
> +    (seq-into (nreverse new-string) 'string)))
> +
>  (defun notmuch-sanitize (str)
>    "Sanitize control character in STR.
>
> --
> 2.20.1
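
To illustrate the truncation concern raised in the reply (pop characters appended during sanitizing are lost if the string is shortened afterwards), here is a rough sketch that truncates first and only then appends the missing U+202C characters. Both function names are hypothetical and not part of notmuch. For simplicity it truncates by character count with `substring`; a real caller might prefer a display-width-based truncation.

```elisp
(require 'cl-lib)

;; Hypothetical helper, not part of notmuch: counts the push characters
;; that are still open at the end of STRING.
(defun my/bidi-unclosed-count (string)
  "Return the number of unclosed U+202A/B/D/E characters in STRING."
  (let ((depth 0))
    (cl-loop for c across string
             do (cond ((memq c '(?\u202a ?\u202b ?\u202d ?\u202e))
                       (cl-incf depth))
                      ;; A pop only closes something if the stack is non-empty.
                      ((and (eq c ?\u202c) (> depth 0))
                       (cl-decf depth))))
    depth))

;; Hypothetical helper: truncate first, close the stack afterwards, so
;; the appended pop characters cannot be cut off.
(defun my/truncate-then-close (string max-chars)
  "Truncate STRING to MAX-CHARS characters, then append missing U+202C."
  (let ((short (if (> (length string) max-chars)
                   (substring string 0 max-chars)
                 string)))
    (concat short
            (make-string (my/bidi-unclosed-count short) ?\u202c))))
```

Unlike the function in the patch, this sketch does not remove surplus U+202C characters; it only shows the ordering of the two steps.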