From: Daniel Brooks
To: Eli Zaretskii
Cc: cpitclaudel@gmail.com, stefan@marxist.se, Stefan Monnier, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Unicode confusables and reordering characters considered harmful, a simple solution
Date: Tue, 02 Nov 2021 14:28:16 -0700
Message-ID: <87o872s0wf.fsf_-_@db48x.net>
In-Reply-To: <83v91aibe7.fsf@gnu.org> (Eli Zaretskii's message of "Tue, 02 Nov 2021 21:51:44 +0200")

Eli Zaretskii writes:

>> From: Stefan Monnier
>> Cc: stefan@marxist.se, cpitclaudel@gmail.com, emacs-devel@gnu.org
>> Date: Tue, 02 Nov 2021 15:47:27 -0400
>>
>> > In most cases, there's no need to make these controls stand out,
>> > because situations where this presents security risks are extremely
>> > rare, to put it mildly, and OTOH having them stand out more by default
>> > will make it harder to read text with completely legitimate uses of
>> > these controls (example: TUTORIAL.he).
>>
>> Fully agreed.  That's the problem: how to define the problematic cases
>> in a precise enough way that it doesn't rule out lots of
>> legitimate cases.
>
> That's what bidi-find-overridden-directionality already does, albeit
> not yet for the specific examples in that paper.  But Someone™ should
> write a minor mode or an optional display feature which uses that
> function to highlight the problematic stretches of text on display,
> using the function's output for finding such stretches of text.

We already have it; it is called whitespace-mode.
It’s not perfect, but this morning I customized mine to make these
characters more obvious:

(custom-set-variables
 '(whitespace-display-mappings
   '((space-mark 32 [183] [46])
     (space-mark 160 [164] [95])
     (newline-mark 10 [36 10])
     (tab-mark 9 [187 9] [92 9])
     (space-mark #x202A [#x21D2]) ; ⇒ LEFT-TO-RIGHT EMBEDDING
     (space-mark #x202B [#x21D0]) ; ⇐ RIGHT-TO-LEFT EMBEDDING
     (space-mark #x202D [#x2192]) ; → LEFT-TO-RIGHT OVERRIDE
     (space-mark #x202E [#x2190]) ; ← RIGHT-TO-LEFT OVERRIDE
     (space-mark #x2066 [#x21E5]) ; ⇥ LEFT-TO-RIGHT ISOLATE
     (space-mark #x2067 [#x21E4]) ; ⇤ RIGHT-TO-LEFT ISOLATE
     (space-mark #x2068 [#x21A7]) ; ↧ FIRST STRONG ISOLATE
     (space-mark #x202C [#x21D1]) ; ⇑ POP DIRECTIONAL FORMATTING
     (space-mark #x2069 [#x2912]) ; ⤒ POP DIRECTIONAL ISOLATE
     )))

I didn’t spend much time thinking about which arrows to pick; these
seemed right to me. They all use 'space-mark as the kind, but I would
like to extend whitespace-mode with a new kind specifically for these
characters, so that I can give them a custom face as well.

Here is some sample Lisp code that I tried it on:

(defun main ()
  (let ((is_admin nil))
    ‮⁦ ; begin admins only⁩⁦(when is_admin
      (print "You are an admin."))
    ‮⁦ ; end admins only⁩(
  )

Syntax highlighting is certainly a big clue that something is odd about
this code, as the conditional is displayed in the comment face. It was,
however, a nice little puzzle to figure out how to get the permutation
of characters that I wanted.

I will note, however, that Elisp, as currently implemented, is probably
immune to this attack. The directional characters are incorrectly¹
treated as identifiers when they are outside of a comment; if you
actually run this you will get a void-variable error, which is very
confusing at first because the variable name is invisible. Great fun.

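When a void-variable error names a symbol that cannot be seen, it can
help to list the bidi controls in the buffer. What follows is only a
minimal sketch and not part of Emacs: the names my-bidi-control-regexp
and my-list-bidi-controls are hypothetical, and the regexp is assumed to
cover just the nine characters mapped above.

```elisp
;; Hypothetical helper names; nothing here ships with Emacs.
(defconst my-bidi-control-regexp
  "[\u202A-\u202E\u2066-\u2069]"
  "Regexp matching the bidi controls U+202A..U+202E and U+2066..U+2069.")

(defun my-list-bidi-controls ()
  "Return a list of (POSITION . NAME) for each bidi control found.
Only the accessible portion of the current buffer is searched."
  (save-excursion
    (goto-char (point-min))
    (let (found)
      (while (re-search-forward my-bidi-control-regexp nil t)
        ;; Point is just after the match, so `char-before' is the
        ;; matched control; look up its Unicode character name.
        (push (cons (match-beginning 0)
                    (get-char-code-property (char-before) 'name))
              found))
      (nreverse found))))
```

Evaluating (my-list-bidi-controls) with M-: in the buffer holding the
sample code gives one entry per control character, each with its buffer
position and its Unicode name.
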
I suggest that we include something along these lines in Emacs, and
turn on whitespace-mode by default in all programming modes. If I recall
correctly, the default configuration of whitespace-mode is fairly
inoffensive. I would recommend keeping it that way, except that we
should make the face for bidi control characters obvious, perhaps with
a red background. By enabling it by default only in programming modes,
we avoid bothering users of prose-oriented modes, where using these
characters is benign. Maybe we could have an override for programming
languages such as Elisp that we think are immune to this attack, but I
don’t really think we need to go that far.

db48x

¹ I say that this is incorrect because Unicode classifies them as
format control characters rather than as letters or numbers. The Elisp
specification, such as it exists, probably doesn’t say anything about
them.
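
The proposal above, an obvious face for bidi controls in programming
modes, can be approximated without extending whitespace-mode. The sketch
below uses font-lock as a stand-in technique; it is not the new
whitespace-mode kind described earlier. The names my-bidi-control and
my-highlight-bidi-controls are hypothetical, and the red background is
simply the suggestion from the text.

```elisp
;; Hypothetical face and function names; an assumption-laden sketch.
(defface my-bidi-control
  '((t :background "red" :foreground "white"))
  "Face for bidi control characters in programming modes.")

(defun my-highlight-bidi-controls ()
  "Highlight bidi control characters in the current buffer via font-lock."
  (font-lock-add-keywords
   nil                                  ; nil means this buffer only
   ;; The final t overrides existing fontification, so a control
   ;; hidden inside a comment still gets the face.
   '(("[\u202A-\u202E\u2066-\u2069]" 0 'my-bidi-control t))))

;; Enable only in programming modes, leaving prose modes untouched.
(add-hook 'prog-mode-hook #'my-highlight-bidi-controls)
```

How visible the face is depends on how the characters are displayed.
With the whitespace-display-mappings shown earlier each control is drawn
as an arrow, which should pick up this face; without a visible glyph the
highlighted area may be very narrow.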