From: Tim Cross
Newsgroups: gmane.emacs.devel
Subject: Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
Date: Thu, 20 Jan 2022 17:35:23 +1100
Message-ID: <87sfti51ii.fsf@gmail.com>
References: <87sftk49ih.fsf@yahoo.com>
User-Agent: mu4e 1.7.6; emacs 28.0.91
To: emacs-devel@gnu.org
Xref: news.gmane.io gmane.emacs.devel:285012

Richard Stallman writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> Explanation to Eli: I understand that these 0-width characters have
> legitimate, useful purposes.  It is good that we support them.
>
> The issue I've raised, which was explained in the text I cited, is
> that _allegedly_ it is possible to use them maliciously, by inserting
> a sequence of them to function as a sort of watermark that users
> normally won't even see.
>
> > You can highlight them like so:
>
> > (set-face-background 'glyphless-char "red")
>
> > I've had that configured ever since
> > https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40
> >
> > If you're not expecting zero-width characters in text in general,
> > I think it's a good setting.
>
> I think I will try that, just in case someone sends me some of those.
> Thanks.
>
> Should we make this the default?  I think it is likely that most Emacs users
> will see only malicious zero-width characters, and not useful ones.
>
> Is there a way we could detect automatically when these zero-width
> characters are being used in a legit way for their intended purpose,
> and in that case, display them as zero-width for real?
>
> That way, they would work right when used properly, and ring an alarm
> (metaphorically) when used in a fishy way.
>
> > Emacs by default displays ZWJ and ZWNJ characters (and any other
> > zero-width characters) as thin 1-pixel spaces on GUI frames, and as
> > simple spaces on TTY frames.  So Emacs users are likely to see these
> > "hidden" sequences of characters on display.
>
> I wonder if we could do something clever to show when there is a
> sequence of multiple different 1-pixel characters?  For instance,
> maybe give different colors to different characters, so that a
> sequence of several shows as a funny spectrum?
>
> This could alert the user that "someone's messing with you here".
>
> There are many possible variants of the details -- I don't know what
> would be best, or what would be easy, but people could try various
> methods.

Just to add some context here which some might find useful. At one
point, I worked for an organisation which had real concerns about
sensitive information being released (mainly to the press) and wanted to
be able to track down the source when it occurred. Essentially, this
technique was used.
All electronic documents, when distributed to the approved list of
recipients, had a unique ID stamp made of zero-width characters. When I
left, the organisation was also experimenting with adding similar
'marks' to emails sent via the organisation's email server.

So this practice is definitely occurring. It is probably more prevalent
in PDF and Word documents, but I guess it could be in plain text email
messages as well.

This technique (and related ones) doesn't need high technical expertise
either. We had a similar problem at a university I worked at, where
students used this technique to defeat the anti-plagiarism software the
uni used. The software used basic text matching, and students started to
defeat it both by using zero-width characters to break patterns and by
using Unicode characters with glyphs that looked like standard
characters, allowing the document to print and look correct while still
breaking pattern matching. Of course, once you are aware this is going
on, you can improve the pattern matching and add checks to detect this
type of activity. Personally, I was always amazed at the lengths people
went to in order to defeat the anti-plagiarism software. It always
seemed it would be easier not to plagiarise and to cite when appropriate.

It is a big challenge to find a way to alert users to this possibly
unwanted 'tagging' while at the same time allowing legitimate use. For
example, in org-mode, it can sometimes be difficult to combine different
markup and other syntax, often because of a corner case which is
difficult to address with a font-lock regexp. Adding a zero-width space
is sometimes sufficient to work around the ambiguity in the regexp. The
point is, anything which makes such use visually noticeable will also
make the technique less useful for addressing this issue.
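
To make the "improve the pattern matching and add checks" step above a
bit more concrete, here is a rough Python sketch of the kind of check I
have in mind. It is only an illustration under stated assumptions, not
what either organisation actually ran: the ZERO_WIDTH set, the tiny
HOMOGLYPHS table, the function names and the min_run threshold are all
made up for this example, and a real tool would need a much fuller
table of confusable characters.

```python
# Illustrative, deliberately incomplete set of zero-width code points.
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
}

# Hypothetical mini-table of look-alike letters (Cyrillic -> Latin).
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def find_zero_width_runs(text, min_run=2):
    """Return (start, length) for each run of >= min_run zero-width chars.

    A lone zero-width character is not reported by default, on the
    assumption that single ones are more likely to be legitimate.
    """
    runs = []
    start = None
    for i, ch in enumerate(text):
        if ch in ZERO_WIDTH:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Handle a run that reaches the end of the text.
    if start is not None and len(text) - start >= min_run:
        runs.append((start, len(text) - start))
    return runs


def normalize_for_matching(text):
    """Drop zero-width chars and map known look-alikes before matching."""
    return "".join(
        HOMOGLYPHS.get(ch, ch) for ch in text if ch not in ZERO_WIDTH
    )


if __name__ == "__main__":
    stamped = "Quarterly\u200b\u200c\u200d report"
    print(find_zero_width_runs(stamped))
    print(normalize_for_matching("pla\u200bgi\u0430rism"))
```

With the default min_run of 2, a single zero-width space of the sort
used as an org-mode workaround is not reported, while a longer run that
could encode an ID is. That threshold is a guess and would certainly
misjudge some text, which is really the same trade-off as the display
question: the more eagerly you flag, the more legitimate uses you flag
too.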