From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Thu, 04 Nov 2021 10:21:12 +0200
Message-ID: <83k0hofi13.fsf@gnu.org>
References: <8b09eed8-36dd-61f5-2a8f-8525122df98c@gmail.com>
In-Reply-To: (message from Reini Urban on Thu, 4 Nov 2021 08:50:14 +0100)
To: Reini Urban
Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org

> From: Reini Urban
> Date: Thu, 4 Nov 2021 08:50:14 +0100
> Cc: emacs-devel@gnu.org
>
> int hi = 5;
> int שָׁלוֹם = hi;
> int hello = 10;
> int السّلامعليك = hello;
> myfun(שָׁלוֹם ,السّلامعليكم)
>
> IMO this code is fundamentally valid: we should allow
> programmers to write identifiers in their native tongue.
>
> Sure, nobody wants to forbid unicode identifiers. The rules only ensure that identifiers keep identifiable.
> I converted it to perl (because I dislike java or rust), and ran it through cperl.
> The problem is that from an innocent look or code review you won't see any problem, hence the security
> risk.
> You need to adjust your tools.
>
> But the very first RTL identifier שָׁלוֹם contains already non-identifier characters.

Which of its characters are non-identifier, and why?  That identifier
uses characters of a single script, AFAICT.

> So I cannot tell you if this code doesn't violate any of the 4 unicode mixed script profiles
> (http://www.unicode.org/reports/tr39/#Mixed_Script_Detection 2-5)
> Or if any of the unreadable characters are of the recommended scripts:

Which characters in that fragment are "unreadable" for this purpose?
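
FWIW, the questions above can be checked mechanically. Below is a minimal Python sketch, not part of the original exchange, that lists each code point of the Hebrew identifier with its general category and character name, and then asks Python's own identifier rule (str.isidentifier) whether the string qualifies. Several things here are assumptions: the identifier is taken to be the sequence U+05E9 U+05B8 U+05C1 U+05DC U+05D5 U+05B9 U+05DD; the helper names describe and name_prefixes are hypothetical; and the first word of each character name serves only as a rough stand-in for the Script property, which the standard unicodedata module does not expose. Python's rule is not the cperl rule, and it says nothing about the restriction levels of UTS #39.

```python
import unicodedata

# Assumed code point sequence for the Hebrew identifier quoted above
# (shin, qamats, shin dot, lamed, vav, holam, final mem).
IDENT = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


def name_prefixes(text):
    """Rough stand-in for the Script property: first word of each name."""
    return {unicodedata.name(ch).split()[0] for ch in text}


if __name__ == "__main__":
    # One line per character, so letters (Lo) and points (Mn) are visible.
    for row in describe(IDENT):
        print(*row)
    print("name prefixes:", name_prefixes(IDENT))
    # Python's lexer rule only; other languages and UTS #39 may differ.
    print("valid Python identifier:", IDENT.isidentifier())
```

Under these assumptions every character name begins with HEBREW, and the only general categories present are Lo (letters) and Mn (the points), which is consistent with the single-script reading. Whether a given security profile additionally restricts the points is a separate question that this sketch does not answer.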