From mboxrd@z Thu Jan 1 00:00:00 1970
From: Reini Urban
Newsgroups: gmane.emacs.devel
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Wed, 3 Nov 2021 16:07:51 +0100
References: <8b09eed8-36dd-61f5-2a8f-8525122df98c@gmail.com>
In-Reply-To: <8b09eed8-36dd-61f5-2a8f-8525122df98c@gmail.com>
To: emacs-devel@gnu.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000c375f805cfe3c635"
List-Id: "Emacs development discussions."
Xref: news.gmane.io gmane.emacs.devel:278585

--000000000000c375f805cfe3c635
Content-Type: text/plain; charset="UTF-8"

On Tue, Nov 2, 2021 at 4:08 PM Clément Pit-Claudel wrote:

> There is a good summary of the issue and relevant mitigations at
> https://research.swtch.com/trojan (it argues against compiler fixes and
> in favor of IDE enhancements.)

No, this summary is awful.
The issue is that libc, the C standard committee, Linux and most others
are ignoring the Unicode identifier security guidelines.
Identifiers must be identifiable, but strings should not be touched.

Identifiers are all names: pathnames, variable names, user names, ...
but not arbitrary strings.

IDEs are just one place to fix it (that's why glib does it), but the
core is more important.

The ones who do care, like Java (the compiler), my cperl (the compiler
and runtime, because it is dynamic), Rust (the compiler) and glib (the
library), do follow these guidelines.
All C compilers and most others are insecure. Linux filesystems are
insecure. The old Apple filesystem was secure, the new one is insecure
again.
Also, the libcs cannot deal with de-normalized characters at all. grep,
sed and coreutils all have outstanding unorm patches, because libunicode
is too slow: it iterates over the string via callbacks.

In short, you need to normalize each identifier, check for proper
XID_Start/XID_Continue, check your document for mixed scripts (several
combinations are allowed, several disallowed; Han unification did a good
job, but Greek vs. Cyrillic is the worst), and forbid bidi changes.

The C standard committee recently complained that making identifiers
secure would require the full Unicode database, which is wrong.
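To make the per-identifier checks listed above concrete, here is a rough
sketch in Python. It is not code from cperl, safeclib or any libc. The
names check_identifier, script_of, SCRIPT_RANGES and BIDI_CONTROLS are
made up for illustration, the script table is a tiny hand-written
stand-in for data that would normally be generated from Scripts.txt, and
the "Latin, Greek and Cyrillic may not mix" rule is a deliberate
simplification of the real mixed-script policy. str.isidentifier() is
used as a stand-in for a dedicated XID_Start/XID_Continue table; it
follows Python's own identifier rule, which also accepts a leading
underscore.

```python
import bisect
import unicodedata

# Illustrative subset only: (first, last, script) ranges sorted by first
# code point. These block-based ranges are an assumption for the sketch,
# not the real Scripts.txt data.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
]
_STARTS = [first for first, _, _ in SCRIPT_RANGES]

# Explicit bidi formatting characters: marks, embeddings, overrides, isolates.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def script_of(ch):
    """Binary-search the sorted range list; None means 'not in the sample table'."""
    cp = ord(ch)
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        first, last, script = SCRIPT_RANGES[i]
        if first <= cp <= last:
            return script
    return None


def check_identifier(name, seen_scripts):
    """Return the NFC-normalized identifier, or raise ValueError.

    seen_scripts is a per-document set that accumulates the scripts seen so far.
    """
    if any(ch in BIDI_CONTROLS for ch in name):
        raise ValueError("bidi control character in identifier")
    norm = unicodedata.normalize("NFC", name)
    # Python's identifier rule is based on XID_Start / XID_Continue.
    if not norm.isidentifier():
        raise ValueError("identifier is not XID_Start followed by XID_Continue")
    scripts = seen_scripts | {s for s in map(script_of, norm) if s is not None}
    # Simplified policy for this sketch: these three scripts may not be mixed.
    if len(scripts & {"Latin", "Greek", "Cyrillic"}) > 1:
        raise ValueError("mixed scripts: " + ", ".join(sorted(scripts)))
    seen_scripts.update(scripts)
    return norm
```

A caller would create one empty set per document and pass it to every
call, so that a Cyrillic look-alike in a later identifier is rejected
once Latin has been seen.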
You need the normalization code (one or two tiny tables), the script
lists (tiny), and the XID_Start/Continue lists (small).
Further, you need an API to start a document (to init the scripts) with
an optional script param (the language).
Scripts just need a byte, the Start/Cont two bits. Sorted lists are the
best representation. (musl does it unsorted, glibc with an insecure
table lookup.)
gnulib is really the best place to add these features, even if
libunicode is too slow.

I started adding u8id support two years ago to my safeclib and my ctl,
but was too busy lately. It works fine and fast enough in Rust, Java and
cperl.
I have good support in the wchar_t part of safeclib (wcsnorm, wcsfc, but
no scripts), but not the u8 part yet. glibc and musl don't care about u8
replacing wchar_t yet.

https://unicode.org/reports/tr36/
https://unicode.org/reports/tr39/
http://perl11.github.io/blog/foldcase.html

-- 
Reini Urban


--000000000000c375f805cfe3c635--