From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Unicode confusables and reordering characters considered harmful, a simple solution
Date: Fri, 05 Nov 2021 16:19:56 +0200
Message-ID: <831r3uelbn.fsf@gnu.org>
To: Stefan Kangas
Cc: db48x@db48x.net, cpitclaudel@gmail.com, yuri.v.khan@gmail.com, monnier@iro.umontreal.ca, emacs-devel@gnu.org

> From: Stefan Kangas
> Date: Fri, 5 Nov 2021 06:08:42 -0700
> Cc: db48x@db48x.net, cpitclaudel@gmail.com, emacs-devel@gnu.org,
>  monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
>
> I didn't study `bidi-find-overridden-directionality' yet, but the
> "Trojan Source" paper writes:
>
> "By banning all directionality-control characters, users with
> legitimate Bidi-override use cases in comments are penalized.
> Therefore, a better defense might be to ban the use of
> _unterminated_ Bidi override characters within string literals and
> comments.
> By ensuring that each override is terminated – that is,
> for example, that every LRI has a matching PDI – it becomes
> impossible to distort legitimate source code outside of string
> literals and comments." (p. 8, their emphasis)
>
> So, IIUC, the problematic cases are "unterminated Bidi override
> characters", and those are the ones worth warning about.  Does that
> sound correct to you?

No.  What they say is simply wrong: such unterminated overrides and
embeddings are perfectly valid.  The Unicode Bidirectional Algorithm
(UBA) mandates (https://unicode.org/reports/tr9/#X8):

  X8. All explicit directional embeddings, overrides and isolates are
  completely terminated at the end of each paragraph. Explicit
  paragraph separators (bidirectional character type B) indicate the
  end of a paragraph. As such, they are not included in any
  embedding, override or isolate. They are simply assigned the
  paragraph embedding level.

And in https://unicode.org/reports/tr9/#Bidirectional_Character_Types
you can see that newline is one of the characters whose bidi type is
B; compare:

  (get-char-code-property ?\n 'bidi-class) => B

So when the UBA says "at the end of each paragraph", it means in
practice at EOL, since all the other paragraph separators are rarely
if ever used in human-readable text.  (And Emacs, of course,
implements that rule.)

The authors of the paper simply don't understand the bidi stuff well
enough to make useful proposals about this.  They should have brought
this up on the Unicode mailing list, where at least the experts (and I
don't mean myself, I mean the people who wrote the UBA) could set them
straight.

I encourage you to read the comments in the implementation I wrote, to
see which cases I consider "suspicious".  The comments need to be read
with the UBA spec in mind, at least its Xn rules.  I will be happy to
explain or clarify if something is unclear there.
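To make the X8 point concrete, here is a minimal Python sketch.  It is
not the Emacs implementation and not a full UBA implementation: the
function names are made up for illustration, and the matching of
PDF/PDI against open controls is a simplified approximation of the Xn
rules.  It only shows that the newline has bidi class B and that any
control still open at the end of a paragraph has no effect on the next
one.

```python
import unicodedata

# Explicit directional formatting characters defined by UAX #9.
EMBEDDINGS_AND_OVERRIDES = {
    "\u202A": "LRE", "\u202B": "RLE",
    "\u202D": "LRO", "\u202E": "RLO",
}
ISOLATES = {"\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI"}
PDF = "\u202C"  # terminates an embedding or an override
PDI = "\u2069"  # terminates an isolate


def split_paragraphs(text):
    """Split TEXT at every character whose bidi class is B."""
    paragraphs, current = [], []
    for ch in text:
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
    paragraphs.append("".join(current))
    return paragraphs


def open_controls_at_end(paragraph):
    """Return the names of controls still open at the end of PARAGRAPH.

    Simplified matching (an assumption of this sketch): PDF closes the
    innermost control only if it is an embedding or override; PDI
    closes the innermost isolate and everything opened inside it.
    """
    isolate_names = set(ISOLATES.values())
    stack = []  # starts empty for every paragraph, which is what X8 implies
    for ch in paragraph:
        if ch in EMBEDDINGS_AND_OVERRIDES:
            stack.append(EMBEDDINGS_AND_OVERRIDES[ch])
        elif ch in ISOLATES:
            stack.append(ISOLATES[ch])
        elif ch == PDF:
            if stack and stack[-1] not in isolate_names:
                stack.pop()
        elif ch == PDI:
            if any(name in isolate_names for name in stack):
                while stack.pop() not in isolate_names:
                    pass
    return stack


if __name__ == "__main__":
    print(unicodedata.bidirectional("\n"))
    for para in split_paragraphs("a \u202Eb\nc"):
        print(repr(para), open_controls_at_end(para))
```

In this sketch the RLO on the first line is reported as open at the
end of that line, yet the second line starts with an empty stack.
That is why "unterminated" by itself is not a useful criterion for
suspicious text.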

This is a complex issue, and discussing it rationally could really
enhance our understanding and handling of these cases.

> > Adding one line is a nuisance.  If it can be avoided, we should avoid
> > it.  Since we are capable of detecting the really suspicious uses of
> > those controls, it is much better to use that, because in that case
> > users will not have to add anything.
>
> I agree that it does sound better to prefer such an approach if
> possible.

Then let's try to implement that.  If there's a need for more
bidi-specific infrastructure, let me know and I will see what I can
do.

Thanks.