From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
Date: Wed, 19 Jan 2022 10:20:07 +0200
Message-ID: <83h7a0t9vs.fsf@gnu.org>
In-Reply-To: (message from Richard Stallman on Tue, 18 Jan 2022 23:15:59 -0500)
Cc: emacs-devel@gnu.org
To: rms@gnu.org

> From: Richard Stallman
> Date: Tue, 18 Jan 2022 23:15:59 -0500
>
> Unicode allows user tracking by means of invisible text marking. Any
> string can be converted into its binary form and then recoded into a
> string of zero-width characters, which can then be invisibly inserted
> into the text. If the text is posted elsewhere, the zero-width
> character string can be extracted and the process reversed to figure
> out the identity of the person who copied it.
>
> which seems to be about a special case of confusables, and it makes me
> wonder whether Emacs does, or could, show users when Unicode confusion
> occurs, or prevent or fix it somehow.

AFAIU, there's no confusion here, "just" injection of hidden
information into plain text.  "Confusion" is when the user is
presented with some text that looks like something else.  Here the
problematic part is not presented at all.

> First, is that issue of invisible characters real?

Yes.
The idea is to use 2 "normal" zero-width characters (ZWJ and ZWNJ,
say) to serve as binary zero and binary one, which then lets you
inject hidden text as combinations of these two.  Of course, the
technique is very inefficient: it needs one such character per bit,
so 8 of them for every byte of hidden text.

> Second, does Emacs do anything now such that these tricks
> won't succeed?

Emacs by default displays ZWJ and ZWNJ characters (and any other
zero-width characters) as thin 1-pixel spaces on GUI frames, and as
simple spaces on TTY frames.  So Emacs users are likely to see these
"hidden" sequences of characters on display.

> If the problem exists in Emacs now, could we prevent it? I see a few
> ways to try. I don't know whether they would work well.
>
> * Indicate the different encodings on the screen somehow.
>
> * Canonicalize such sequences (perhaps when reading text into Emacs),
> so that different encodings of the same text become identical.
>
> * Use a stand-alone canonicalizer program.

I don't think I understand your proposals.  They seem to be based on
some idea that these characters are "encodings" of something, and that
this encoding can be "canonicalized"?  If so, I think this
interpretation is a mistake: there's no encoding going on here.  The
role of these zero-width characters is to help the text-shaping engine
shape the characters around them correctly, according to the rules of
the script of those surrounding characters.  When zero-width
characters are used for the purpose of hiding text, they appear as
sequences of zero-width characters without any reason, and in
particular the characters that surround them are likely to be
whitespace characters, which don't need any joiners to shape them.
The job of a feature that detects this is to discern between these two
use cases, and flag the suspicious one.

In any case, I don't think these solutions could work by examining
single characters.
ZWJ and ZWNJ are important characters in some scripts, so we cannot
mangle them based on considering isolated characters.  We must
consider sequences of such characters when we design a feature that
makes them stand out, because only on that level can we distinguish
between legitimate uses of those characters and suspicious uses.

I think we should introduce a minor mode that detects those sequences
and makes them stand out on display, with or without some warning
message in the echo area.  People who want to be aware of any such
potentially hidden text will turn that on.  We could also turn it on
automatically in email and eww.

Patches are welcome; I believe we already have the infrastructure in
the new textsec.el package.
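
To make the sequence-level idea concrete, here is a minimal sketch in
Python.  It is not Emacs code and not what textsec.el does.  The bit
assignment (ZWNJ = 0, ZWJ = 1), the list of zero-width code points,
the run-length threshold of 3, and the names hide, suspicious_runs and
reveal are all assumptions made for illustration.  It hides a short
string in some text, flags runs of zero-width characters, and leaves a
lone joiner inside a word alone.

```python
import re

# Assumption for this sketch: ZWNJ (U+200C) encodes bit 0, ZWJ (U+200D) bit 1.
ZERO, ONE = "\u200c", "\u200d"

# Zero-width code points to look for; an illustrative list, not exhaustive.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"


def hide(payload: str, cover: str) -> str:
    """Insert payload, one zero-width character per bit, after cover's first word."""
    bits = "".join(f"{byte:08b}" for byte in payload.encode("utf-8"))
    mark = "".join(ONE if bit == "1" else ZERO for bit in bits)
    head, sep, tail = cover.partition(" ")
    return head + mark + sep + tail


def suspicious_runs(text: str, min_run: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) offsets of runs of min_run or more zero-width characters.

    A single joiner between letters is ordinary text shaping; only
    longer runs are reported.  The threshold is a guess, not a standard.
    """
    pattern = re.compile("[%s]{%d,}" % (ZERO_WIDTH, min_run))
    return [match.span() for match in pattern.finditer(text)]


def reveal(text: str) -> str:
    """Decode whatever the ZWNJ/ZWJ characters in text spell out as UTF-8."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")
```

For example, suspicious_runs(hide("id42", "Hello world")) reports one
run of 32 characters right after "Hello", while a Persian word with a
single ZWNJ between its letters produces no report.  A real feature
would also need to look at the surrounding characters, as described
above, rather than rely on run length alone.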