From: Eli Zaretskii
To: Reini Urban
Cc: emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Wed, 03 Nov 2021 19:24:29 +0200
Message-ID: <83pmrhgnjm.fsf@gnu.org>
References: <8b09eed8-36dd-61f5-2a8f-8525122df98c@gmail.com>
In-Reply-To: (message from Reini Urban on Wed, 3 Nov 2021 16:07:51 +0100)

> From: Reini Urban
> Date: Wed, 3 Nov 2021 16:07:51 +0100
>
> The issue is that libc, the C standard committee, linux and most others are ignoring the unicode identifier
> security guidelines.
> Identifiers must be identifiable, but strings should not be touched.
>
> Identifiers are all names, pathnames, variable names, user names, ... but not arbitrary strings.
> IDE's are just one place to fix it (that's why glib does it), but the core is more important.
>
> The ones who do care about, like java (the compiler), my cperl (the compiler and runtime, because it is
> dynamic), rust (the compiler), glib (the library), do follow these guidelines.
> All C compilers and most others are insecure. Linux Filesystems are insecure. The old APPLE Filesystem
> was secure, the new is again insecure.
> Also the libc's cannot deal with de-normalized characters at all. grep, sed, coreutils all have outstanding
> unorm patches, because libunicode is too slow. Because it iterates over the string via callbacks.
>
> In short you need to normalize each identifier, check for proper XID_Start/XID_Continue,
> check your document for mixed scripts (several combinations are allowed, several disallowed,
> HAN unification did a good job, but greek vs cyrillic is the worst), and forbid bidi changes.

I'm not sure I follow: the examples in the original paper which
sparked all this brouhaha didn't touch any identifiers.  All the
identifiers in those examples were perfectly compliant with the
Unicode guidelines, AFAIR.  What the examples did was insert
directional format controls so as to reorder _punctuation_ characters,
in a way that changes the visual appearance and the interpreted
semantics of the code.  All of the format controls were inserted
within whitespace, not inside any identifiers.

So I'm not sure how what you say is relevant to the issue at hand;
could you perhaps explain?
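
To make the distinction concrete, here is a minimal sketch, in Python,
of a check that looks for directional format controls anywhere in a
piece of source text, independently of any identifier validation.  The
function name find_bidi_controls and the sample string are made up for
this illustration; they are not taken from the paper or from any
existing tool.  The set of bidi classes is assumed to cover the
explicit embeddings, overrides and isolates.

```python
import unicodedata

# Bidi classes of the explicit directional format controls:
# embeddings (LRE, RLE), overrides (LRO, RLO), isolates (LRI, RLI, FSI)
# and their terminators (PDF, PDI).
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF",
    "LRI", "RLI", "FSI", "PDI",
}


def find_bidi_controls(text):
    """Return a list of (line, column, character name) tuples, one for
    each directional format control found in TEXT.  Lines and columns
    are 1-based; columns count code points, not display cells."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: every identifier is plain ASCII, and the
    # only suspicious character sits in the whitespace before a comment.
    sample = "a = 1 \u202e# x\nb = 2"
    for lineno, col, name in find_bidi_controls(sample):
        print(f"line {lineno}, column {col}: {name}")
```

In the sample, the identifiers "a" and "b" would pass any
XID_Start/XID_Continue or normalization check, yet the scan still
reports a control character, because it sits in whitespace rather
than inside an identifier.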