From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Reini Urban
Newsgroups: gmane.emacs.devel
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Thu, 4 Nov 2021 08:50:14 +0100
References: <8b09eed8-36dd-61f5-2a8f-8525122df98c@gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="00000000000094dcf105cff1c718"
Cc: emacs-devel@gnu.org
To: Stefan Monnier
List-Id:
"Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.io gmane.emacs.devel:278653 Archived-At: --00000000000094dcf105cff1c718 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Wed, Nov 3, 2021 at 4:43 PM Stefan Monnier wrote: > > No, this summary is awful. > > The issue is that libc, the C standard committee, linux and most others > are > > ignoring the unicode identifier security guidelines. > > Identifiers must be identifiable, but strings should not be touched. > > What do those rules say about code like: > > int hi =3D 5; > int =D7=A9=D6=B8=D7=81=D7=9C=D7=95=D6=B9=D7=9D =3D hi; > int hello =3D 10; > int =D8=A7=D9=84=D8=B3=D9=91=D9=84=D8=A7=D9=85=D8=B9=D9=84=D9=8A=D9= =83 =3D hello; > myfun(=D7=A9=D6=B8=D7=81=D7=9C=D7=95=D6=B9=D7=9D ,=D8=A7=D9=84=D8=B3= =D9=91=D9=84=D8=A7=D9=85=D8=B9=D9=84=D9=8A=D9=83=D9=85) > > IMO this code is fundamentally valid: we should allow > programmers to write identifiers in their native tongue. > Sure, nobody wants to forbid unicode identifiers. The rules only ensure that identifiers keep identifiable. I converted itto perl (because I dislike java or rust), and ran it through cperl. The problem is that from an innocent look or code review you won't see any problem, hence the security risk. You need to adjust your tools. But the very first RTL identifier =D7=A9=D6=B8=D7=81=D7=9C=D7=95=D6=B9=D7= =9D contains already non-identifier characters. 
So I cannot tell you whether this code violates any of the 4 Unicode mixed-script
profiles (http://www.unicode.org/reports/tr39/#Mixed_Script_Detection 2-5),
or whether all of the unreadable characters belong to the recommended scripts
(https://www.unicode.org/reports/tr31/#Table_Recommended_Scripts, so no exotic
or antique scripts).

See http://perl11.github.io/cperl/perldata.html#Identifier-parsing

$hi = 5;
$שָׁלוֹם = $hi;
$hello = 10;
$السّلامعليك = $hello;
myfun($שָׁלוֹם, $السّلامعليك);

=> od -c
0000000   $   h   i       =       5   ;  \n   $ 327 251 326 270 327 201
0000020 327 234 327 225 326 271 327 235       =       $   h   i   ;  \n
0000040   $   h   e   l   l   o       =       1   0   ;  \n   $ 330 247
0000060 331 204 330 263 331 221 331 204 330 247 331 205 330 271 331 204
0000100 331 212 331 203       =       $   h   e   l   l   o   ;  \n   m
0000120   y   f   u   n   (   $ 327 251 326 270 327 201 327 234 327 225
0000140 326 271 327 235   ,       $ 330 247 331 204 330 263 331 221 331
0000160 204 330 247 331 205 330 271 331 204 331 212 331 203   )   ;  \n

> Does the security guidelines require override chars to force the
> `, ` to be in LTR, so as to fix the ordering problem (and would the
> result be more or less clear to someone familiar with those RTL
> scripts ;-0 )?
>
>
>         Stefan
>
>

-- 
Reini Urban


--00000000000094dcf105cff1c718--