From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Reini Urban <reini.urban@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Wed, 3 Nov 2021 16:07:51 +0100
Message-ID: <CAHiT=DHQN34ba5pYvdLy7kWb_02G4SuWmDxkL4P66BhXNX3B5A@mail.gmail.com>
References: <YYE1sEv6yS1bBUcu@odonien.localdomain>
 <8b09eed8-36dd-61f5-2a8f-8525122df98c@gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000c375f805cfe3c635"
To: emacs-devel@gnu.org
In-Reply-To: <8b09eed8-36dd-61f5-2a8f-8525122df98c@gmail.com>
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
Xref: news.gmane.io gmane.emacs.devel:278585
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/278585>

--000000000000c375f805cfe3c635
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Tue, Nov 2, 2021 at 4:08 PM Cl=C3=A9ment Pit-Claudel <cpitclaudel@gmail.=
com>
wrote:

> There is a good summary of the issue and relevant mitigations at
> https://research.swtch.com/trojan (it argues against compiler fixes and
> in favor of IDE enhancements.)
>

No, this summary is awful.
The issue is that libc, the C standard committee, Linux and most others
are ignoring the Unicode identifier security guidelines.
Identifiers must be identifiable, but strings should not be touched.

Identifiers are all names: pathnames, variable names, user names, ... but
not arbitrary strings.
IDEs are just one place to fix it (that's why glib does it), but the core
is more important.

Those who do care, like Java (the compiler), my cperl (the compiler and
runtime, because it is dynamic), Rust (the compiler) and glib (the
library), do follow these guidelines.
All C compilers and most others are insecure. Linux filesystems are
insecure. The old Apple filesystem was secure; the new one is insecure
again.
Also, the libcs cannot deal with denormalized characters at all. grep,
sed and coreutils all have outstanding unorm patches, because libunicode
is too slow: it iterates over the string via callbacks.

In short, you need to normalize each identifier, check for proper
XID_Start/XID_Continue, check your document for mixed scripts (several
combinations are allowed, several disallowed; Han unification did a good
job, but Greek vs Cyrillic is the worst), and forbid bidi changes.
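
A rough sketch of those four checks, in Python only for brevity. The
names check_identifier and rough_script are made up for illustration.
Python's stdlib has no Script property, so the first word of the
character name stands in for it here (an assumption, and only
Latin/Greek/Cyrillic are watched); TR39 defines the real rules.

```python
import unicodedata

# Bidi classes of the explicit embedding/override/isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF",
                 "LRI", "RLI", "FSI", "PDI"}

# Assumption: first word of the character name approximates the script.
WATCHED_SCRIPTS = {"LATIN", "GREEK", "CYRILLIC"}


def rough_script(ch):
    """Return LATIN/GREEK/CYRILLIC for letters of those scripts, else None."""
    if not ch.isalpha():
        return None
    first = unicodedata.name(ch, "").split(" ", 1)[0]
    return first if first in WATCHED_SCRIPTS else None


def check_identifier(ident):
    """Return (NFC-normalized identifier, list of problems found)."""
    problems = []
    # 1. forbid bidi changes
    if any(unicodedata.bidirectional(c) in BIDI_CONTROLS for c in ident):
        problems.append("bidi control")
    # 2. normalize
    norm = unicodedata.normalize("NFC", ident)
    # 3. str.isidentifier() checks XID_Start/XID_Continue ("_" may start)
    if not norm.isidentifier():
        problems.append("not XID_Start XID_Continue*")
    # 4. mixed scripts (toy rule: more than one watched script is flagged)
    scripts = {s for s in map(rough_script, norm) if s}
    if len(scripts) > 1:
        problems.append("mixed scripts: " + ", ".join(sorted(scripts)))
    return norm, problems
```

With this, check_identifier("p\u0430ypal") (Cyrillic U+0430 in place of
the Latin "a") reports mixed scripts, while plain "paypal" passes. A real
implementation would track scripts per document, not per identifier.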

The C standard committee recently complained that making identifiers
secure would require the full Unicode database, which is wrong.
You need the normalization code (one or two tiny tables), the script lists
(tiny), and the XID_Start/Continue lists (small).
Further, you need an API to start a document (to init scripts) with an
optional script param (the language).
Scripts just need a byte, the Start/Cont two bits. Sorted lists are the
best representation. (musl does it unsorted, glibc uses an insecure
table lookup.)
gnulib is really the best place to add these features, even if libunicode
is too slow.
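
To illustrate the sorted-list representation, here is a minimal sketch,
again in Python. The table holds only a handful of ranges and is far from
complete; real data would be generated from the Unicode Scripts.txt
file. The names RANGES, STARTS and script_of are hypothetical.

```python
from bisect import bisect_right

# A script id fits in one byte; 0 means "not listed in this toy table".
LATIN, GREEK, CYRILLIC = 1, 2, 3

# Sorted, non-overlapping (first, last, script) code point ranges.
# Assumption: a tiny excerpt for illustration, not the full script lists.
RANGES = [
    (0x0041, 0x005A, LATIN),
    (0x0061, 0x007A, LATIN),
    (0x00C0, 0x00D6, LATIN),
    (0x0391, 0x03A1, GREEK),
    (0x03B1, 0x03C9, GREEK),
    (0x0410, 0x044F, CYRILLIC),
]
STARTS = [first for first, _, _ in RANGES]


def script_of(cp):
    """Binary-search the sorted ranges for code point cp."""
    i = bisect_right(STARTS, cp) - 1   # last range starting at or before cp
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return 0
```

The lookup is O(log n) over a few hundred ranges, with no callbacks and
no full database; the XID_Start/XID_Continue lists can use the same
layout with two bits per range instead of a script byte.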

I started adding u8id support two years ago to my safeclib and my ctl, but
have been too busy lately. It works fine and fast enough in Rust, Java and
cperl.
I have good support in the wchar_t part of safeclib (wcsnorm, wcsfc, but no
scripts), but not in the u8 part yet. glibc and musl don't care about u8
replacing wchar_t yet.

https://unicode.org/reports/tr36/
https://unicode.org/reports/tr39/
http://perl11.github.io/blog/foldcase.html
--=20
Reini Urban

--000000000000c375f805cfe3c635--