* Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Richard Stallman @ 2022-01-19 4:15 UTC
To: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

There is a thread now about confusables. I read this,

    Unicode allows user tracking by means of invisible text marking.
    Any string can be converted into its binary form and then recoded
    into a string of zero-width characters, which can then be invisibly
    inserted into the text. If the text is posted elsewhere, the
    zero-width character string can be extracted and the process
    reversed to figure out the identity of the person who copied it.

which seems to be about a special case of confusables, and it makes me
wonder whether Emacs does, or could, show users when Unicode confusion
occurs, or prevent or fix it somehow.

First, is that issue of invisible characters real?

Second, does Emacs do anything now such that these tricks won't succeed?

If the problem exists in Emacs now, could we prevent it? I see a few
ways to try. I don't know whether they would work well.

* Indicate the different encodings on the screen somehow.

* Canonicalize such sequences (perhaps when reading text into Emacs),
  so that different encodings of the same text become identical.

* Use a stand-alone canonicalizer program.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

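The scheme in the quoted passage, and the "stand-alone canonicalizer"
idea, can be sketched in a few lines of Python. This is only an
illustration of how such a watermark might work, not a description of
any particular tool. The choice of U+200B for bit 0 and U+200C for
bit 1, and the names `embed`, `extract` and `strip_marks`, are
assumptions made up for the example.

```python
# Hypothetical zero-width watermark: one invisible character per bit.
ZERO = "\u200b"  # ZERO WIDTH SPACE      -> bit 0 (assumed mapping)
ONE = "\u200c"   # ZERO WIDTH NON-JOINER -> bit 1 (assumed mapping)


def embed(text: str, tag: str) -> str:
    """Append TAG to TEXT as an invisible run of zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in tag.encode("utf-8"))
    return text + "".join(ONE if b == "1" else ZERO for b in bits)


def extract(text: str) -> str:
    """Recover the tag by reading the zero-width characters back as bits."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def strip_marks(text: str) -> str:
    """Crude canonicalizer: drop both marker characters everywhere."""
    return "".join(ch for ch in text if ch not in (ZERO, ONE))


marked = embed("Meeting at noon.", "user42")
print(len("Meeting at noon."), len(marked))  # visible length vs. marked length
print(extract(marked))
print(strip_marks(marked) == "Meeting at noon.")
```

Each byte of the tag costs eight invisible characters, so the six-byte
tag above adds 48 of them. Note that `strip_marks` would also delete
any legitimate U+200C in the text, so a canonicalizer this blunt is
only safe where no zero-width characters are expected.
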
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Po Lu @ 2022-01-19 4:47 UTC
To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> If the problem exists in Emacs now, could we prevent it? I see a few
> ways to try. I don't know whether they would work well.

I think the "zero width characters" alluded to are displayed by Emacs
as 1 pixel wide spaces, so when enough of them to be meaningful for
tracking are inserted into a piece of text, they make for a noticeable
blank area when displayed by Emacs.

Thanks.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Phil Sainty @ 2022-01-19 10:05 UTC
To: Po Lu; +Cc: Richard Stallman, emacs-devel

On 2022-01-19 17:47, Po Lu wrote:
> I think the "zero width characters" alluded to are displayed
> by Emacs as 1 pixel wide spaces

You can highlight them like so:

    (set-face-background 'glyphless-char "red")

I've had that configured ever since
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40

If you're not expecting zero-width characters in text in general,
I think it's a good setting.

-Phil

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Eli Zaretskii @ 2022-01-19 11:43 UTC
To: Phil Sainty; +Cc: luangruo, rms, emacs-devel

> Date: Wed, 19 Jan 2022 23:05:51 +1300
> From: Phil Sainty <psainty@orcon.net.nz>
> Cc: Richard Stallman <rms@gnu.org>, emacs-devel@gnu.org
>
> On 2022-01-19 17:47, Po Lu wrote:
> > I think the "zero width characters" alluded to are displayed
> > by Emacs as 1 pixel wide spaces
>
> You can highlight them like so:
>
>     (set-face-background 'glyphless-char "red")
>
> I've had that configured ever since
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40
>
> If you're not expecting zero-width characters in text in general,
> I think it's a good setting.

Users and readers of certain scripts cannot use such a simplistic
solution, which is basically only suitable for plain ASCII text. (And
even there it is slowly becoming inappropriate, what with the growing
popularity of ligatures, let alone Emoji.) Emacs should be able to do
better.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Richard Stallman @ 2022-01-21 4:13 UTC
To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

  > Users and readers of certain scripts cannot use such a simplistic
  > solution, which is basically only suitable for plain ASCII text.

I am no expert on this issue, but I do edit languages such as French
and Spanish which use non-ASCII characters. It seems to work fine. I
never insert zero-width characters, at least not knowingly. Would
they be inserted without my knowing?

If not, I think that some non-ASCII text works fine.

  > (And
  > even there it is slowly becoming inappropriate, what with the growing
  > popularity of ligatures, let alone Emoji.)

Emoji show up on my terminal as diamonds, since it can't display them.
So do the ligatures. Ideally we could display the ligatures as two
letters.

  > Emacs should be able to do
  > better.

It would be very nice to do better. What would we do?

Perhaps we should convert ligatures on file input into digraphs,
and convert digraphs on file output into ligatures when using some
coding system.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Eli Zaretskii @ 2022-01-21 7:49 UTC
To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Thu, 20 Jan 2022 23:13:30 -0500
>
>   > Users and readers of certain scripts cannot use such a simplistic
>   > solution, which is basically only suitable for plain ASCII text.
>
> I am no expert on this issue, but I do edit languages such as French
> and Spanish which use non-ASCII characters. It seems to work fine. I
> never insert zero-width characters, at least not knowingly. Would
> they be inserted without my knowing?
>
> If not, I think that some non-ASCII text works fine.

You are using a very restricted subset of non-ASCII characters, and
only on text-mode terminals, so you may never meet these characters or
see their GUI effects. But we are talking about the Emacs defaults,
not about what is good enough for your personal usage, limited to your
use patterns and display capabilities.

> Emoji show up on my terminal as diamonds, since it can't display them.
> So do the ligatures. Ideally we could display the ligatures as two
> letters.

The way to tell the display engine (any display engine, not just that
of Emacs) not to ligate is to have the ZWNJ character between the
characters that we don't want ligated. That's one of the legitimate
uses of that zero-width character.

>   > Emacs should be able to do
>   > better.
>
> It would be very nice to do better. What would we do?

See textsec.el.

> Perhaps we should convert ligatures on file input into digraphs,
> and convert digraphs on file output into ligatures when using some
> coding system.

People nowadays _do_ want to see ligatures, so disabling them by
default would be a step back.

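Because ZWNJ has legitimate uses like the one described above, a
checker cannot simply treat every zero-width character as hostile.
One possible heuristic, sketched here in Python, is to flag only long
runs of consecutive format characters. This is an assumption made for
illustration and is not a description of what textsec.el does; the
function name `suspicious_runs` and the threshold of four are
hypothetical.

```python
import unicodedata


def suspicious_runs(text: str, min_run: int = 4):
    """Return (start, length) for each run of >= MIN_RUN format characters.

    "Format character" here means Unicode general category Cf, which
    includes ZWNJ (U+200C), ZWJ (U+200D) and ZERO WIDTH SPACE (U+200B).
    """
    runs = []
    start = None
    for i, ch in enumerate(text + "x"):  # sentinel closes a trailing run
        if unicodedata.category(ch) == "Cf":
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = None
    return runs


plain = "shelf\u200cful"                       # one ZWNJ to prevent a ligature
marked = "shelf" + "\u200b\u200c" * 4 + "ful"  # eight invisible characters
print(suspicious_runs(plain))   # []
print(suspicious_runs(marked))  # [(5, 8)]
```

Run length is at best one signal among several: a watermark that
scatters single marks through the text would slip past this check, and
some legitimate sequences (for example emoji flags built from tag
characters, which are also category Cf) could trigger it.
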
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Richard Stallman @ 2022-01-22 4:37 UTC
To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

  > > Emoji show up on my terminal as diamonds, since it can't display them.
  > > So do the ligatures. Ideally we could display the ligatures as two
  > > letters.

  > The way to tell the display engine (any display engine, not just that
  > of Emacs) not to ligate is to have the ZWNJ character between the
  > characters that we don't want ligated. That's one of the legitimate
  > uses of that zero-width character.

I don't think we are talking about the same thing. You're talking
about a way of modifying a particular document saying, "Don't display
a ligature right here."

I'm asking about a feature whereby a user can direct Emacs not to use
ligatures in display on a certain terminal. The idea is, when using a
terminal that can't display ligatures, Emacs should always display
multiple letters instead of a ligature.

  > > Perhaps we should convert ligatures on file input into digraphs,
  > > and convert digraphs on file output into ligatures when using some
  > > coding system.

  > People nowadays _do_ want to see ligatures, so disabling them by
  > default would be a step back.

People may be glad to see ligatures, on terminals that can display
ligatures. I am talking about terminals which can't display
ligatures. I doubt any user wants to see a diamond instead of `fi'.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Eli Zaretskii @ 2022-01-22 6:58 UTC
To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Fri, 21 Jan 2022 23:37:59 -0500
>
> I'm asking about a feature whereby a user can direct Emacs not to use
> ligatures in display on a certain terminal. The idea is, when using a
> terminal that can't display ligatures, Emacs should always display
> multiple letters instead of a ligature.

We don't have a way of determining whether a terminal can display
ligatures. Increasingly, terminal emulators acquire these
capabilities, and we already have a couple that display ligatures and
Emoji sequences. But there are no methods known to us to query the
terminal whether such support exists and/or which ligatures are
supported.

The user can disable auto-composition-mode or customize
composition-function-table to disable some or all of the text-shaping
features.

> I doubt any user wants to see a diamond instead of `fi'.

Is this what really happens for you, on your terminal?

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Richard Stallman @ 2022-01-24 4:33 UTC
To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

  > We don't have a way of determining whether a terminal can display
  > ligatures.

Could we do it via terminfo? We can define any capabilities we like.

  > > I doubt any user wants to see a diamond instead of `fi'.

  > Is this what really happens for you, on your terminal?

Yes. A few days ago I put point on a diamond, typed C-u C-x =,
and was told it was a ligature for `fi'.

I didn't save details of what text I was looking at, but I suspect
it was a web page that a script fetched and emailed to me.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Po Lu @ 2022-01-24 5:06 UTC
To: Richard Stallman; +Cc: Eli Zaretskii, psainty, emacs-devel

Richard Stallman <rms@gnu.org> writes:

>   > We don't have a way of determining whether a terminal can display
>   > ligatures.
>
> Could we do it via terminfo? We can define any capabilities we like.

At the very least, it would require the cooperation of the terminal
emulators, because these days they typically don't declare what they
are (much less whether or not they support ligatures), instead
masquerading as `xterm'.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Richard Stallman @ 2022-01-25 4:17 UTC
To: Po Lu; +Cc: psainty, eliz, emacs-devel

  > At the very least, it would require the cooperation of the terminal
  > emulators, because these days they typically don't declare what they are
  > (much less whether or not they support ligatures), instead masquerading
  > as `xterm'.

I presume any terminal that calls itself that is operating on a
graphics console and can display ligatures when using a suitable font.

My text terminal says TERM=linux.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Po Lu @ 2022-01-25 4:58 UTC
To: Richard Stallman; +Cc: psainty, eliz, emacs-devel

Richard Stallman <rms@gnu.org> writes:

>   > At the very least, it would require the cooperation of the terminal
>   > emulators, because these days they typically don't declare what they are
>   > (much less whether or not they support ligatures), instead masquerading
>   > as `xterm'.
>
> I presume any terminal that calls itself that is operating on a
> graphics console and can display ligatures when using a suitable font.

Not really. For example, xterm itself doesn't support ligatures at
all, but VTE (the GNOME terminal emulator, which confusingly also
reports itself as xterm) does.

Thanks.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Eli Zaretskii @ 2022-01-24 12:14 UTC
To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Sun, 23 Jan 2022 23:33:59 -0500
>
>   > We don't have a way of determining whether a terminal can display
>   > ligatures.
>
> Could we do it via terminfo? We can define any capabilities we like.

I don't think this is feasible. But before we discuss this, I think
we need to clear up a fundamental misunderstanding; see below.

>   > > I doubt any user wants to see a diamond instead of `fi'.
>
>   > Is this what really happens for you, on your terminal?
>
> Yes. A few days ago I put point on a diamond, typed C-u C-x =,
> and was told it was a ligature for `fi'.

How did that ligature get written to the screen? Was it present
literally in some text that Emacs displayed? If not, how did it come
into existence, in the form of a diamond? Emacs doesn't produce such
ligatures on TTY frames.

> I didn't save details of what text I was looking at, but I suspect
> it was a web page that a script fetched and emailed to me.

If that web page included a literal fi ligature, there's little we can
do in Emacs, because we don't produce that character. Of course, one
can set up a display table where ligatures like fi are displayed as
two characters, but that is a separate issue, very far from what we
were discussing in this thread. So let's please leave the literal fi
display alone, because it will take us far away from the original
issue.

The original issue is with sequences of characters that are supposed
to be composed on display, because that's where the zero-width
characters play their role. When several characters are supposed to
be composed on a text-mode display, Emacs simply writes them to the
terminal one after another, and expects the terminal to display them
as a ligature. The only difference between what Emacs does in this
case and what it does when no character composition is expected is
that in the former case Emacs expects the terminal to produce just one
glyph that takes just one column on display. Emacs never actually
writes the ligature's code to the TTY, unless that code is literally
present in the text. So I don't see how querying the terminal about
ligature support will help us in the case we are discussing, nor do I
see how that is relevant.

In any case, ligature support is not just the ability of a terminal;
it also requires certain features from the font used to display text,
and on TTY frames Emacs doesn't know which font is being used where.
Moreover, there's any number of possible ligatures, and which ones are
supported depends on the font, so a question like "are ligatures
supported" has no meaningful answer unless you also specify the font
and the particular ligature.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Richard Stallman @ 2022-01-25 4:16 UTC
To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

  > How did that ligature get written to the screen? Was it present
  > literally in some text that Emacs displayed?

It was in the buffer. That is why I was able to examine it with C-u
C-x =. I suppose it was present in text that I visited in Emacs.

  > So let's please leave the literal fi
  > display alone, because it will take us far away from the original
  > issue.

Thank you for clearly describing these two cases. I did not know that
the ligature case people were discussing was limited to composition of
characters -- that it was different from the case of displaying a
ligature character actually in the buffer. What people said was
sketchy and I had to try to fill in what was not said.

  > When several characters are supposed to
  > be composed on a text-mode display, Emacs simply writes them to the
  > terminal one after another, and expects the terminal to display them
  > as a ligature. The only difference between what Emacs does in this
  > case and what it does when no character composition is expected is
  > that in the former case Emacs expects the terminal to produce just one
  > glyph that takes just one column on display.

In that case, I think it would be good to be able to tell Emacs,
when using a text-only terminal, not to try to compose ligatures.
Not to expect that sequence to display as one column. Is that possible?

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Eli Zaretskii @ 2022-01-25 6:35 UTC
To: rms, Richard Stallman; +Cc: psainty, luangruo, emacs-devel

On January 25, 2022 6:16:29 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote:
>
>   > When several characters are supposed to
>   > be composed on a text-mode display, Emacs simply writes them to the
>   > terminal one after another, and expects the terminal to display them
>   > as a ligature. The only difference between what Emacs does in this
>   > case and what it does when no character composition is expected is
>   > that in the former case Emacs expects the terminal to produce just one
>   > glyph that takes just one column on display.
>
> In that case, I think it would be good to be able to tell Emacs,
> when using a text-only terminal, not to try to compose ligatures.
> Not to expect that sequence to display as one column. Is that possible?
>

This should already work: auto-composition-mode's value can be a
symbol, to allow disabling that mode on incapable terminals. Maybe you
need to customize the value to disable compositions on your console.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
From: Eli Zaretskii @ 2022-01-25 12:12 UTC
To: rms; +Cc: psainty, luangruo, emacs-devel

> Date: Tue, 25 Jan 2022 08:35:42 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
>
> On January 25, 2022 6:16:29 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote:
> >
> >   > When several characters are supposed to
> >   > be composed on a text-mode display, Emacs simply writes them to the
> >   > terminal one after another, and expects the terminal to display them
> >   > as a ligature. The only difference between what Emacs does in this
> >   > case and what it does when no character composition is expected is
> >   > that in the former case Emacs expects the terminal to produce just one
> >   > glyph that takes just one column on display.
> >
> > In that case, I think it would be good to be able to tell Emacs,
> > when using a text-only terminal, not to try to compose ligatures.
> > Not to expect that sequence to display as one column. Is that possible?
> >
>
> This should already work: auto-composition-mode's value can be a
> symbol, to allow disabling that mode on incapable terminals.

Sorry, not a symbol, but a string -- the name of the terminal type as
returned by tty-type.

* New feature: displaying ligature characters in the buffer
From: Richard Stallman @ 2022-01-25 4:16 UTC
To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

When the buffer contains a ligature character, it would be a good
thing for Emacs to determine that the terminal doesn't support
ligatures, and in that case to arrange to display those ligature
characters using two letters. This should happen by default.

Other pre-composed characters could likewise be displayed in two
columns using non-composed characters.

Emacs needs to know which compositions the terminal can display. I
expect it will handle a fairly limited set. So a new TERMINFO field
could specify which characters work, and Emacs could convert that into
a binary array for quick lookup. TERM=linux could have a TERMINFO
field to say that ligatures don't work.

* Re: New feature: displaying ligature characters in the buffer
From: Eli Zaretskii @ 2022-01-25 6:31 UTC
To: rms, Richard Stallman; +Cc: psainty, luangruo, emacs-devel

On January 25, 2022 6:16:33 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote:
>
> When the buffer contains a ligature character, it would be a good thing for
> Emacs to determine that the terminal doesn't support ligatures, and in
> that case to arrange to display those ligature characters using two letters.
> This should happen by default.
>
> Other pre-composed characters could likewise be displayed in two columns
> using non-composed characters.
>
> Emacs needs to know which compositions the terminal can display. I
> expect it will handle a fairly limited set. So a new TERMINFO field
> could specify which characters work, and Emacs could convert that into
> a binary array for quick lookup. TERM=linux could have a TERMINFO
> field to say that ligatures don't work.
>

This is supposed to be working already, up to a point; see
terminal_glyph_code in terminal.c. I'm guessing that the diamond
glyphs you see for some ligatures are the way your terminal "supports"
these characters. Or maybe it lies to Emacs about which characters it
supports, or maybe the code which queries the terminal about supported
characters doesn't work in your case for some other reason.

I don't think I agree that this must work by default. That's
certainly the desire, but the capabilities of the linux terminal and
the way they are reported are a mess, and the use case is quite
marginal nowadays.

Ligatures are no different for this purpose from any other non-ASCII
character that the console cannot display. We have the latin1-display
feature that you can turn on if your console doesn't cope well enough
with non-ASCII characters. And if you want to set up display of ASCII
equivalents for just a small set of characters, you can use the
latin1-display-char function to do that in your .emacs, in a way that
suits the capabilities of your particular type and version of the
linux console.

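The set of ASCII equivalents needed for the Latin ligature characters
is small. As a language-neutral illustration (Python rather than
Emacs Lisp, and not a description of how latin1-disp.el is
implemented), Unicode compatibility decomposition (NFKD) yields the
plain-letter spelling of each ligature character in the range
U+FB00..U+FB06. The names `LIGATURES`, `EXPANSION` and
`expand_ligatures` are hypothetical.

```python
import unicodedata

# The Latin ligature characters in the Alphabetic Presentation Forms block.
LIGATURES = [chr(cp) for cp in range(0xFB00, 0xFB07)]

# Map each ligature to its compatibility decomposition, e.g. U+FB01 -> "fi".
EXPANSION = {ord(ch): unicodedata.normalize("NFKD", ch) for ch in LIGATURES}


def expand_ligatures(text: str) -> str:
    """Replace only the seven ligature characters; leave everything else."""
    return text.translate(EXPANSION)


for ch in LIGATURES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {EXPANSION[ord(ch)]}")

print(expand_ligatures("\ufb01nal o\ufb03ce"))  # prints "final office"
```

Restricting the table to these seven code points, instead of running
NFKD over the whole text, leaves accented letters and other
precomposed characters untouched.
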
* Re: New feature: displaying ligature characters in the buffer
From: Richard Stallman @ 2022-01-27 4:12 UTC
To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

  > I'm guessing that the diamond glyphs you see for some ligatures
  > is the way your terminal "supports" these characters. Or maybe
  > it lies to Emacs about which characters it supports, or maybe
  > the code which queries the terminal about supported characters
  > doesn't work in your case for some other reason.

Those sound possible.
How can I diagnose with GDB what is in fact going on?

* Re: New feature: displaying ligature characters in the buffer
From: Eli Zaretskii @ 2022-01-27 7:58 UTC
To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Wed, 26 Jan 2022 23:12:57 -0500
>
>   > I'm guessing that the diamond glyphs you see for some ligatures
>   > is the way your terminal "supports" these characters. Or maybe
>   > it lies to Emacs about which characters it supports, or maybe
>   > the code which queries the terminal about supported characters
>   > doesn't work in your case for some other reason.
>
> Those sound possible.
> How can I diagnose with GDB what is in fact going on?

The code responsible for that is in terminal.c, functions
terminal_glyph_code and calculate_glyph_code_table. The latter is
called when we first want to find out whether a certain character can
be displayed by the terminal, which probably happens during startup.

I'd begin by establishing whether the ioctl used by
calculate_glyph_code_table succeeds, and if so, whether the terminal
tells us that the ligature codepoints do have glyphs in the terminal's
font. The relevant Unicode codepoints are U+FB00..U+FB06.

Another issue could be with the terminal encoding: terminal_glyph_code
only queries the terminal for supported glyphs if
terminal-coding-system is UTF-8 -- is that what you have? Or maybe
the HAVE_STRUCT_UNIPAIR_UNICODE preprocessor condition doesn't work on
your system, in which case these functions return trivial results.

The Lisp interface to this is internal-char-font (which on TTY frames
calls terminal_glyph_code). Does it return the same non-negative
number for all of the ligature codepoints in the above range? If it
does, then it could be an indication that the terminal displays the
same diamond glyph for all of them, i.e. doesn't really support them.
If internal-char-font returns a negative number, it means the terminal
cannot support those ligatures, and our processing of that is somehow
incorrect or assumes something that doesn't happen; see
char-displayable-p.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-24 12:14 ` Eli Zaretskii 2022-01-25 4:16 ` Richard Stallman 2022-01-25 4:16 ` New feature: displaying ligature characters in the buffer Richard Stallman @ 2022-01-25 11:08 ` Kévin Le Gouguec 2022-01-25 12:38 ` Eli Zaretskii 2 siblings, 1 reply; 104+ messages in thread From: Kévin Le Gouguec @ 2022-01-25 11:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, rms, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> > > I doubt any user wants to see a diamond instead of `fi'. >> >> > Is this what really happens for you, on your terminal? >> >> Yes. A few days ago I put point on a diamond, typed C-u C-x =, >> and was told it was a ligature for `fi'. > > How did that ligature get written to the screen? Was it present > literally in some text that Emacs displayed? If not, how did it come > into existence, in the form of a diamond? Emacs doesn't produce such > ligatures on TTY frames. (Apologies for the noise, but I thought it might be worth checking since even after re-reading Richard's messages I'm not 100% sure everyone is talking about the same thing. Richard, would you happen to be talking about literal U+FB01 or U+FB03 characters (fi or ffi, respectively, named "LATIN SMALL LIGATURE FI/FFI"), rather than the kind of ligature Emacs produces when configured to do so with "fi" and "ffi"? I don't know much about how ligatures are set up in Emacs, but I too am surprised that it would attempt to produce them in a terminal frame. OTOH the aforementioned Unicode characters are indeed displayed as diamonds on my TTY. Again, sorry for the noise if we are indeed talking about bona-fide ligatures) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-25 11:08 ` Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Kévin Le Gouguec @ 2022-01-25 12:38 ` Eli Zaretskii 2022-01-26 3:39 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-25 12:38 UTC (permalink / raw) To: Kévin Le Gouguec; +Cc: psainty, luangruo, rms, emacs-devel > From: Kévin Le Gouguec <kevin.legouguec@gmail.com> > Cc: rms@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com, > emacs-devel@gnu.org > Date: Tue, 25 Jan 2022 12:08:46 +0100 > > Richard, would you happen to be talking about literal U+FB01 or U+FB03 > characters (fi or ffi, respectively, named "LATIN SMALL LIGATURE FI/FFI"), > rather than the kind of ligature Emacs produces when configured to do so > with "fi" and "ffi"? As I explained in that message, Emacs doesn't produce any ligatures when it displays on TTY frames. It expects the terminal to display the ligatures when it receives the characters that should ligate. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-25 12:38 ` Eli Zaretskii @ 2022-01-26 3:39 ` Richard Stallman 2022-01-26 5:38 ` Eli Zaretskii 2022-01-26 8:20 ` Andreas Schwab 0 siblings, 2 replies; 104+ messages in thread From: Richard Stallman @ 2022-01-26 3:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Richard, would you happen to be talking about literal U+FB01 or U+FB03 > > characters (fi or ffi, respectively, named "LATIN SMALL LIGATURE FI/FFI"), > > rather than the kind of ligature Emacs produces when configured to do so > > with "fi" and "ffi"? I don't have that message any more, so I can't check. But the text I quoted above displays with a diamond, and the output of C-u C-x = on it matches my rather vague memories of what C-u C-x = displayed then. I didn't know that there were two different kinds of ligatures in Unicode. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-26 3:39 ` Richard Stallman @ 2022-01-26 5:38 ` Eli Zaretskii 2022-01-28 13:04 ` Richard Stallman 2022-01-26 8:20 ` Andreas Schwab 1 sibling, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-26 5:38 UTC (permalink / raw) To: rms, Richard Stallman; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec On January 26, 2022 5:39:17 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote: > > I didn't know that there were two different kinds of ligatures in Unicode. It is not specific to ligatures. Some character sequences that are supposed to be composed on display have precomposed variants with their own codepoints. This is generally for legacy reasons. The most widely known example is Latin characters with diacritics, such as ç and à. ^ permalink raw reply [flat|nested] 104+ messages in thread
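To illustrate the point about precomposed variants, here is a minimal sketch in Python (not Emacs code; it relies only on the standard `unicodedata` module, and the helper name `codepoints` is made up for this example). It shows that ç and à decompose canonically into a base letter plus a combining mark, while the fi ligature U+FB01 has only a compatibility decomposition.

```python
import unicodedata


def codepoints(s):
    """Return the code points of the string S as 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in s]


# Canonical decomposition (NFD): a precomposed letter becomes
# a base letter followed by a combining diacritic.
c_cedilla_nfd = unicodedata.normalize("NFD", "\u00e7")  # ç
a_grave_nfd = unicodedata.normalize("NFD", "\u00e0")    # à
print(codepoints(c_cedilla_nfd))
print(codepoints(a_grave_nfd))

# The fi ligature (U+FB01) has no canonical decomposition, so NFD
# leaves it alone; only the compatibility form NFKD turns it into
# the two ASCII letters.
fi = "\ufb01"
fi_nfd = unicodedata.normalize("NFD", fi)
fi_nfkd = unicodedata.normalize("NFKD", fi)
print(fi_nfd == fi, fi_nfkd)
```

The distinction matters for any canonicalizer: canonical normalization alone will not replace ligature characters, so a tool that wants plain "fi" has to opt in to the compatibility forms.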
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-26 5:38 ` Eli Zaretskii @ 2022-01-28 13:04 ` Richard Stallman 2022-01-28 13:31 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-01-28 13:04 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > The most widely known example is Latin characters with diacritics, such as ç and à. Since my terminal handles many of those characters, they work ok for me. But there are some it does not support. Many Vietnamese characters, for instance. If this feature is implemented to handle ligatures, it could handle the letters with diacritics too. That would be as easy as populating the table of the sequences they stand for. The terminfo item for `linux' could indicate which characters are ok to display unchanged and which ones need to be displayed as the equivalent digraphs (or trigraphs). All the work is in the basic feature that converts some Unicode codes into sequences for display. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-28 13:04 ` Richard Stallman @ 2022-01-28 13:31 ` Eli Zaretskii 2022-01-30 4:17 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-28 13:31 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Fri, 28 Jan 2022 08:04:49 -0500 > > > The most widely known example is Latin characters with diacritics, such as ç and à. > > Since my terminal handles many of those characters, they work ok for > me. But there are some it does not support. Many Vietnamese > characters, for instance. > > If this feature is implemented to handle ligatures, it could handle > the letters with diacritics too. That would be as easy as populating > the table of the sequences they stand for. IIUC what you mean by "this feature", we already have that in latin1-disp.el. It just isn't automatic, because most terminals and terminal emulators don't have a way of reporting which sequences they are capable of composing. So we leave it to the users to determine whether they need this kind of "ASCII-fied" display. I asked whether adding a command that specifically targets ligatures like "fi" would be useful -- can you answer that? > The terminfo item for `linux' could indicate which characters are ok > to display unchanged and which ones need to be displayed as the > equivalent digraphs (or trigraphs). Given the enormously large number of such sequences, I doubt that terminfo is the right means for determining which sequences are supported. We have a solution for the Linux console, and for the rest we allow the user to customize the value of auto-composition-mode to disable it if the terminal misbehaves with these sequences. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-28 13:31 ` Eli Zaretskii @ 2022-01-30 4:17 ` Richard Stallman 2022-01-30 7:36 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-01-30 4:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Since my terminal handles many of those characters, they work ok for > > me. But there are some it does not support. Many Vietnamese > > characters, for instance. > > > > If this feature is implemented to handle ligatures, it could handle > > the letters with diacritics too. That would be as easy as populating > > the table of the sequences they stand for. > IIUC what you mean by "this feature", we already have that in > latin1-disp.el. It is the same general idea, but (according to the comments at the start) it handles only the characters in the ISO 8859 character sets. It should handle all the Unicode characters that could sensibly be represented as characters to be composed, including ligatures and all Latin and Greek characters with diacritics. Maybe some others can be handled too. I customized the variable to enable that mode but I don't know how to make it actually do anything. Maybe it needs something else to truly enable it. I inserted ẵ (latin small letter a with breve and tilde); it does not do anything special to that. > Given the enormously large number of such sequences, I doubt that > terminfo is the right means for determining which sequences are > supported. We have a solution for the Linux console, We do? What is it? 
> and for the rest > we allow the user to customize the value of auto-composition-mode to > disable it if the terminal misbehaves with these sequences. We are talking about different issues. I am talking about how to display complex characters _in the buffer_. NOT generated by auto-composition. I don't think auto-composition does anything in my Emacs. If I insert an f and an i in the buffer, they display as two characters, f followed by i. Not as a ligature. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-30 4:17 ` Richard Stallman @ 2022-01-30 7:36 ` Eli Zaretskii 2022-01-31 4:02 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-30 7:36 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Sat, 29 Jan 2022 23:17:21 -0500 > > > IIUC what you mean by "this feature", we already have that in > > latin1-disp.el. > > It is the same general idea, but (according to the comments at the > start) it handles only the characters in the ISO 8859 character sets. That comment is obsolete; I've updated it now. There are facilities in that package that display much more than ISO 8859 characters, see latin1-display-ucs-per-lynx. > It should handle all the Unicode characters that could sensibly be > represented as characters to be composed, including ligatures and all > Latin and Greek characters with diacritics. Maybe some others can be > handled too. Ligatures are currently not there, and I think it would make sense to have that as a separate command, as I suggested in another email (which you still didn't respond to). I'm waiting for your response before I decide whether to install such a feature. The question I asked was: Would it be good enough to have a command that will arrange for these ligatures to be displayed as their ASCII equivalents, using the facilities in latin1-disp.el? Such a command could be invoked either manually or from your init file. latin1-disp.el also provides a special face to display such equivalents, so you could have them stand out on display if you want. > I customized the variable to enable that mode but I don't know how to make it actually do > anything. Maybe it needs something else to truly enable it. 
If you customized latin1-display, then it only affects characters that your terminal doesn't support. The code dynamically discovers which characters are those when you activate the feature. See this fragment from the setup function: (defun latin1-display-setup (set &optional _force) "Set up Latin-1 display for characters in the given SET. SET must be a member of `latin1-display-sets'. Normally, check whether a font for SET is available and don't set the display if it is." (cond ((eq set 'latin-2) (latin1-display-identities set) (mapc (lambda (l) (or (char-displayable-p (car l)) <<<<<<<<<<<<<<<<<<<<<<<<<< (apply 'latin1-display-char l))) > I inserted ẵ (latin small letter a with breve and tilde); it does not > do anything special to that. ẵ is not supported by latin1-display, as it is not an ISO 8859 character. You need to turn on a more thorough feature. Try this: M-x latin1-display-ucs-per-lynx RET > > Given the enormously large number of such sequences, I doubt that > > terminfo is the right means for determining which sequences are > > supported. We have a solution for the Linux console, > > We do? What is it? The same code I pointed to in response to your other message (about displaying ligatures as diamonds): terminal_glyph_code and its subroutine calculate_glyph_code_table (in terminal.c). > I don't think auto-composition does anything in my Emacs. If I insert > an f and an i in the buffer, they display as two characters, f followed by i. > Not as a ligature. We haven't yet installed composition rules for ASCII ligatures, because we need first to resolve some basic problems with them (see etc/TODO for the details). I could show you how to install such a composition rule, but I don't think it will do anything on your console, since it doesn't support ligatures. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-30 7:36 ` Eli Zaretskii @ 2022-01-31 4:02 ` Richard Stallman 2022-01-31 13:05 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-01-31 4:02 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Ligatures are currently not there, and I think it would make sense to > have that as a separate command, as I suggested in another email > (which you still didn't respond to). I'm waiting for your response > before I decide whether to install such a feature. The question I > asked was: > Would it be good enough to have a command that will arrange for these > ligatures to be displayed as their ASCII equivalents, using the > facilities in latin1-disp.el? I'm not sure, because I don't know what that would be like in practice. If I could see it actually handle some characters, I would probably see how to answer. > ẵ is not supported by latin1-display, as it is not an ISO 8859 > character. You need to turn on a more thorough feature. Try this: > M-x latin1-display-ucs-per-lynx RET I just gave that command, but it doesn't do anything for the ẵ character. > I could show you how to install such a > composition rule, but I don't think it will do anything on your > console, since it doesn't support ligatures. I don't WANT autocomposition on my Linux terminal. I'm talking about how to display a ligature character that actually appears in the buffer. If latin1-display is the way, it ought to handle ligature characters too. 
-- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-31 4:02 ` Richard Stallman @ 2022-01-31 13:05 ` Eli Zaretskii 2022-02-01 5:06 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-31 13:05 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Sun, 30 Jan 2022 23:02:14 -0500 > > > Would it be good enough to have a command that will arrange for these > > ligatures to be displayed as their ASCII equivalents, using the > > facilities in latin1-disp.el? > > I'm not sure, because I don't know what that would be like in > practice. If I could see it actually handle some characters, I would > probably see how to answer. Please try the patch below. > > ẵ is not supported by latin1-display, as it is not an ISO 8859 > > character. You need to turn on a more thorough feature. Try this: > > > M-x latin1-display-ucs-per-lynx RET > > I just gave that command, but it doesn't do anything for the ẵ > character. The patch below should fix this as well, I hope. > > I could show you how to install such a > > composition rule, but I don't think it will do anything on your > > console, since it doesn't support ligatures. > > I don't WANT autocomposition on my Linux terminal. Well, you keep mentioning auto-composition, so I'm answering your implied questions about that (the above was in response to you saying that typing f and i didn't produce any ligatures on your terminal). Here's the patch I suggest to try:

diff --git a/lisp/international/latin1-disp.el b/lisp/international/latin1-disp.el
index 96a54cc..1f639ed 100644
--- a/lisp/international/latin1-disp.el
+++ b/lisp/international/latin1-disp.el
@@ -764,12 +764,11 @@ latin1-display-ucs-per-lynx
 isn't changed if the display can render Unicode characters."
   (interactive "p")
   (if (> arg 0)
-      (unless (char-displayable-p #x101) ; a with macron
-        ;; It doesn't look as though we have a Unicode font.
-        (let ((latin1-display-format "%s"))
-          (mapc
-           (lambda (l)
-             (apply 'latin1-display-char l))
+      (let ((latin1-display-format "%s"))
+        (mapc
+         (lambda (l)
+           (or (char-displayable-p (car l))
+               (apply 'latin1-display-char l)))
           ;; Table derived by running Lynx on a suitable list of
           ;; characters in a utf-8 file, except for some added by
           ;; hand at the end.
@@ -3183,7 +3182,7 @@ latin1-display-ucs-per-lynx
             (?\、 ",")
             ;; Not from Lynx
             (? "")
-            (?� "?")))))
+            (?� "?"))))
     (aset standard-display-table
           (make-char 'mule-unicode-0100-24ff) nil)
     (aset standard-display-table

^ permalink raw reply related [flat|nested] 104+ messages in thread
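The behavioral change in the patch above can be sketched outside Emacs. The following Python model is only an illustration under stated assumptions: the function names, the three-entry table, and the set of "supported" characters are hypothetical stand-ins (the a(? and fi replacements are the ones reported in this thread, while the "a" entry for U+0101 is invented for the example); `is_displayable` plays the role of `char-displayable-p`. Before the patch, one probe of U+0101 decided whether any replacements were installed; after it, each character is checked individually.

```python
def old_behavior(table, is_displayable):
    """Model of the pre-patch logic: one probe decides for the whole table."""
    if is_displayable("\u0101"):  # a with macron
        return {}
    return dict(table)


def new_behavior(table, is_displayable):
    """Model of the patched logic: replace only what cannot be displayed."""
    return {ch: repl for ch, repl in table if not is_displayable(ch)}


# Hypothetical excerpt of a replacement table (character, ASCII stand-in).
TABLE = [
    ("\u0101", "a"),    # a with macron (stand-in invented for this sketch)
    ("\u1eb5", "a(?"),  # a with breve and tilde
    ("\ufb01", "fi"),   # fi ligature
]

# Hypothetical terminal: it can show U+0101 but not the other two.
supported = {"\u0101"}
old_result = old_behavior(TABLE, supported.__contains__)
new_result = new_behavior(TABLE, supported.__contains__)
print(old_result)
print(new_result)
```

On such a terminal the old logic installs nothing, which would be consistent with the command appearing to do nothing for ẵ, while the per-character check installs replacements for exactly the two characters the terminal lacks.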
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-31 13:05 ` Eli Zaretskii @ 2022-02-01 5:06 ` Richard Stallman 2022-02-01 14:57 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-01 5:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > > > I'm not sure, because I don't know what that would be like in > > practice. If I could see it actually handle some characters, I would > > probably see how to answer. > Please try the patch below. With that patch, and using latin1-display-ucs-per-lynx, it does display that character as text. It uses the string `a)?' -- that doesn't seem to make sense to indicate a macron and a tilde, but at least the underlying mechanism seems to work. Then I inserted the fi ligature and it displays as two letters, f and i. So it seems to do the job, for ligatures. Thanks. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-01 5:06 ` Richard Stallman @ 2022-02-01 14:57 ` Eli Zaretskii 2022-02-02 3:58 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-01 14:57 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Tue, 01 Feb 2022 00:06:21 -0500 > > > Please try the patch below. > > With that patch, and using latin1-display-ucs-per-lynx, it does display > that character as text. It uses the string `a)?' Actually, it shows a(?. > -- that doesn't seem to make sense to indicate a macron and a tilde, > but at least the underlying mechanism seems to work. If you rotate "(?" 90 degrees counter-clockwise, you'll get something resembling the breve and the tilde above it. If you can suggest a better "ASCII art" for this, we could use it. What we have now comes from Lynx, but we aren't wedded to it. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-01 14:57 ` Eli Zaretskii @ 2022-02-02 3:58 ` Richard Stallman 2022-02-02 12:28 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-02 3:58 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > If you rotate "(?" 90 degrees counter-clockwise, you'll get something > resembling the breve and the tilde above it. I'd never have thought of trying that. How about using ã¯? That takes up only two character spaces and it makes sense without turning your head 90 degrees. It could test whether the terminal can display ã and macron, and if not, fall back on some other alternative. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-02 3:58 ` Richard Stallman @ 2022-02-02 12:28 ` Eli Zaretskii 2022-02-03 4:23 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-02 12:28 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Tue, 01 Feb 2022 22:58:49 -0500 > > > If you rotate "(?" 90 degrees counter-clockwise, you'll get something > > resembling the breve and the tilde above it. > > I'd never have thought of trying that. > > How about using ã¯? That doesn't look anything like the original. Moreover, the feature as implemented only uses ASCII characters in the translations. > That takes up only two character spaces and it makes sense without > turning your head 90 degrees. It could test whether the terminal > can display ã and macron, and if not, fall back on some other > alternative. We could have alternatives like that, but it would have to be a better-looking alternative, since replacing one imperfect emulation with another that's not better doesn't sound like an improvement to me. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-02 12:28 ` Eli Zaretskii @ 2022-02-03 4:23 ` Richard Stallman 2022-02-03 7:53 ` Eli Zaretskii ` (2 more replies) 0 siblings, 3 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-03 4:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > How about using ã¯? > That doesn't seem to remind anything like the original. Sorry, somehow I misremembered and thought the character was a with macron and tilde. Was it actually a with breve and tilde? Then the natural visual representations would be ă~ (a with breve, then tilde) and ã˘ (a with tilde, then breve). > Moreover, the > feature as implemented only uses ASCII characters in the translations. Since the linux console does handle many modified letters, a display method that takes advantage of them will be good on linux consoles. At least, on my machine it does that. I don't know how much variation there is or how much this can be configured. I can try to find someone who knows. > We could have alternatives like that, but it would have to be a > better-looking alternative, since replacing one imperfect emulation > with another that's not better doesn't sound like an improvement to > me. It is much better. I would never have guessed what a(? meant, just from seeing it. ă~ I could guess -- it's an a, with a breve and a tilde. Once I know it's a single character, because C-f moves over it, it would have to be a-with-breve-and-tilde. 
-- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 4:23 ` Richard Stallman @ 2022-02-03 7:53 ` Eli Zaretskii 2022-02-03 8:16 ` Yuri Khan 2022-02-04 3:52 ` Richard Stallman 2022-02-03 20:28 ` Tomas Hlavaty 2022-02-04 3:52 ` Richard Stallman 2 siblings, 2 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-02-03 7:53 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, emacs-devel@gnu.org > Date: Wed, 02 Feb 2022 23:23:56 -0500 > > > > How about using ã¯? > > > That doesn't seem to remind anything like the original. > > Sorry, somehow I misremembered and thought the character was a with > macron and tilde. Was it actually a with breve and tilde? Yes. > Then the natural visual representations would be ă~ (a with breve, > then tilde) and ã˘ (a with tilde, then breve), Either one would be fine, but the former is better, I think, since ~ is an ASCII character, and so is universally supported. > > We could have alternatives like that, but it would have to be a > > better-looking alternative, since replacing one imperfect emulation > > with another that's not better doesn't sound like an improvement to > > me. > > It is much better. > > I would never have guessed what a(? meant, just from seeing it. > ă~ I could guess -- it's an a, with a breve and a tilde. > Once I know it's a single character, because C-f moves over it, > it would have to be a-with-breve-and-tilde. Feel free to propose alternatives for such characters, which could be better represented by an accented character followed by the rest of accents expressed as ASCII equivalents (there's not a lot of them, btw). I don't have access to a Linux console to see which ones could be supported. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 7:53 ` Eli Zaretskii @ 2022-02-03 8:16 ` Yuri Khan 2022-02-03 9:26 ` Eli Zaretskii 2022-02-04 3:52 ` Richard Stallman 1 sibling, 1 reply; 104+ messages in thread From: Yuri Khan @ 2022-02-03 8:16 UTC (permalink / raw) To: Eli Zaretskii Cc: Phil Sainty, Po Lu, Kévin Le Gouguec, Richard Stallman, Emacs developers On Thu, 3 Feb 2022 at 15:09, Eli Zaretskii <eliz@gnu.org> wrote: > > Sorry, somehow I misremembered and thought the character was a with > > macron and tilde. Was it actually a with breve and tilde? > > Then the natural visual representations would be ă~ (a with breve, > > then tilde) and ã˘ (a with tilde, then breve), > > Either one would be fine, but the former is better, I think, since ~ > is an ASCII character, and so is universally supported. The ordering of diacritics on the same side of the base character is considered significant in Unicode, so ă~ and ã˘ would be representations of different grapheme clusters — “a with breve and tilde” and “a with tilde and breve”, respectively. The issue of any characters used not being universally available is still valid, of course. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 8:16 ` Yuri Khan @ 2022-02-03 9:26 ` Eli Zaretskii 0 siblings, 0 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-02-03 9:26 UTC (permalink / raw) To: Yuri Khan; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel > From: Yuri Khan <yuri.v.khan@gmail.com> > Date: Thu, 3 Feb 2022 15:16:27 +0700 > Cc: Richard Stallman <rms@gnu.org>, Phil Sainty <psainty@orcon.net.nz>, Po Lu <luangruo@yahoo.com>, > Emacs developers <emacs-devel@gnu.org>, Kévin Le Gouguec <kevin.legouguec@gmail.com> > > On Thu, 3 Feb 2022 at 15:09, Eli Zaretskii <eliz@gnu.org> wrote: > > > > Sorry, somehow I misremembered and thought the character was a with > > > macron and tilde. Was it actually a with breve and tilde? > > > Then the natural visual representations would be ă~ (a with breve, > > > then tilde) and ã˘ (a with tilde, then breve), > > > > Either one would be fine, but the former is better, I think, since ~ > > is an ASCII character, and so is universally supported. > > The ordering of diacritics on the same side of the base character is > considered significant in Unicode, so ă~ and ã˘ would be > representations of different grapheme clusters — “a with breve and > tilde” and “a with tilde and breve”, respectively. I know, but I think for an emulation it could be okay to ignore this subtlety. The real character is always available in "C-u C-x =". ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 7:53 ` Eli Zaretskii 2022-02-03 8:16 ` Yuri Khan @ 2022-02-04 3:52 ` Richard Stallman 2022-02-04 4:56 ` Yuri Khan 2022-02-04 8:10 ` Eli Zaretskii 1 sibling, 2 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-04 3:52 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Then the natural visual representations would be ă~ (a with breve, > > then tilde) and ã˘ (a with tilde, then breve), > Either one would be fine, but the former is better, I think, since ~ > is an ASCII character, and so is universally supported. That's a good point, but Yuri Khan <yuri.v.khan@gmail.com> said: > The ordering of diacritics on the same side of the base character is > considered significant in Unicode, so ă~ and ã˘ would be > representations of different grapheme clusters — “a with breve and > tilde” and “a with tilde and breve”, respectively. Are there really two different Unicode characters like that? C-x 8 RET recognizes LATIN SMALL LETTER A WITH BREVE AND TILDE but it does not recognize LATIN SMALL LETTER A WITH TILDE AND BREVE Is the failure to handle the latter a bug? -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-04 3:52 ` Richard Stallman @ 2022-02-04 4:56 ` Yuri Khan 2022-02-06 4:13 ` Richard Stallman 2022-02-04 8:10 ` Eli Zaretskii 1 sibling, 1 reply; 104+ messages in thread From: Yuri Khan @ 2022-02-04 4:56 UTC (permalink / raw) To: Richard Stallman Cc: Phil Sainty, Po Lu, Eli Zaretskii, Emacs developers, Kévin Le Gouguec On Fri, 4 Feb 2022 at 10:53, Richard Stallman <rms@gnu.org> wrote: > C-x 8 RET recognizes > LATIN SMALL LETTER A WITH BREVE AND TILDE > but it does not recognize > LATIN SMALL LETTER A WITH TILDE AND BREVE > > Is the failure to handle the latter a bug? Not necessarily. Unicode does not assign single-character codes to all possible letter and diacritic combinations. Instead, it has various combining diacritics that apply to the nearest preceding non-combining character. So a with tilde and breve could be encoded as a sequence <Latin small letter A> <combining tilde> <combining breve> or <Latin small letter A with tilde> <combining breve>. ^ permalink raw reply [flat|nested] 104+ messages in thread
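The two encodings described in the message above can be compared directly. The following is a minimal sketch, not code from the thread: it assumes the `ucs-normalize' library bundled with Emacs, and the helper name is hypothetical. Code points are written numerically (U+0303 COMBINING TILDE, U+0306 COMBINING BREVE, U+00E3 ã) so that they survive mail transport.

```elisp
(require 'ucs-normalize)

(defun my-compare-tilde-breve-spellings ()
  "Compare three spellings built from a, COMBINING TILDE and COMBINING BREVE.
Hypothetical helper, for illustration only."
  (let ((fully-decomposed (string ?a #x0303 #x0306))  ; a, tilde, breve
        (partly-composed  (string #x00E3 #x0306))     ; ã, breve
        (other-order      (string ?a #x0306 #x0303))) ; a, breve, tilde
    (list
     ;; The two spellings of "a with tilde and breve" are canonically
     ;; equivalent: they share one NFD form.
     (string= (ucs-normalize-NFD-string fully-decomposed)
              (ucs-normalize-NFD-string partly-composed))
     ;; Swapping the marks gives a different sequence even after
     ;; normalization, because both marks have the same combining class
     ;; and are therefore never reordered.
     (string= (ucs-normalize-NFD-string fully-decomposed)
              (ucs-normalize-NFD-string other-order)))))

(my-compare-tilde-breve-spellings)
;; => (t nil)
```

The second result is the ordering point raised earlier in the thread: "tilde then breve" and "breve then tilde" remain distinct after normalization.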
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-04 4:56 ` Yuri Khan @ 2022-02-06 4:13 ` Richard Stallman 0 siblings, 0 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-06 4:13 UTC (permalink / raw) To: Yuri Khan; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > C-x 8 RET recognizes > > LATIN SMALL LETTER A WITH BREVE AND TILDE > > but it does not recognize > > LATIN SMALL LETTER A WITH TILDE AND BREVE > > > > Is the failure to handle the latter a bug? > Not necessarily. Unicode does not assign single-character codes to all > possible letter and diacritic combinations. Instead, it has various > combining diacritics that apply to the nearest preceding non-combining > character. I suspect we are miscommunicating. I am not talking about compositions using combining diacritics. I'm talking about individual Unicode code points that can appear, as such, in a file. Such as the code point #x1eb5, which is the character `ẵ'. I think Emacs already handles composition, and perhaps does that well enough. (I can't tell, because my Linux console does not display compositions.) -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-04 3:52 ` Richard Stallman 2022-02-04 4:56 ` Yuri Khan @ 2022-02-04 8:10 ` Eli Zaretskii 2022-02-06 4:13 ` Richard Stallman 1 sibling, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-04 8:10 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Thu, 03 Feb 2022 22:52:16 -0500 > > > > Then the natural visual representations would be ă~ (a with breve, > > > then tilde) and ã˘ (a with tilde, then breve), > > > Either one would be fine, but the former is better, I think, since ~ > > is an ASCII character, and so is universally supported. > > That's a good point, but Yuri Khan <yuri.v.khan@gmail.com> said: > > > The ordering of diacritics on the same side of the base character is > > considered significant in Unicode, so ă~ and ã˘ would be > > representations of different grapheme clusters — “a with breve and > > tilde” and “a with tilde and breve”, respectively. That's a tangent, not directly relevant to the issue at hand. If/when we bump into different characters that differ only by the order of the diacriticals, we will resolve each such case one by one. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-04 8:10 ` Eli Zaretskii @ 2022-02-06 4:13 ` Richard Stallman 0 siblings, 0 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-06 4:13 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > > The ordering of diacritics on the same side of the base character is > > > considered significant in Unicode, so ă~ and ã˘ would be > > > representations of different grapheme clusters — “a with breve and > > > tilde” and “a with tilde and breve”, respectively. > That's a tangent, not directly relevant to the issue at hand. If/when > we bump into different characters that differ only by the order of the > diacriticals, we will resolve each such case one by one. Ok with me. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 4:23 ` Richard Stallman 2022-02-03 7:53 ` Eli Zaretskii @ 2022-02-03 20:28 ` Tomas Hlavaty 2022-02-04 7:07 ` Eli Zaretskii 2022-02-05 4:20 ` Richard Stallman 2022-02-04 3:52 ` Richard Stallman 2 siblings, 2 replies; 104+ messages in thread From: Tomas Hlavaty @ 2022-02-03 20:28 UTC (permalink / raw) To: rms, Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel On Wed 02 Feb 2022 at 23:23, Richard Stallman <rms@gnu.org> wrote: > > > How about using ã¯? I see two boxes. > Then the natural visual representations would be ă~ (a with breve, > then tilde) I see one box and tilde. > and ã˘ (a with tilde, then breve), I see two boxes. This is on linux console. I guess it depends on the console font. My console font is ter-i24n. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 20:28 ` Tomas Hlavaty @ 2022-02-04 7:07 ` Eli Zaretskii 2022-02-05 4:20 ` Richard Stallman 1 sibling, 0 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-02-04 7:07 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel > From: Tomas Hlavaty <tom@logand.com> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Thu, 03 Feb 2022 21:28:17 +0100 > > On Wed 02 Feb 2022 at 23:23, Richard Stallman <rms@gnu.org> wrote: > > > > How about using ã¯? > > I see two boxes. > > > Then the natural visual representations would be ă~ (a with breve, > > then tilde) > > I see one box and tilde. The idea, as explained up-thread, is to use only those non-ASCII characters that are supported by the terminal, with a run-time test for each one of them. So if your terminal doesn't support those, they will not be used; latin1-display-ucs-per-lynx will instead use the purely-ASCII emulation. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 20:28 ` Tomas Hlavaty 2022-02-04 7:07 ` Eli Zaretskii @ 2022-02-05 4:20 ` Richard Stallman 2022-02-05 13:55 ` Tomas Hlavaty 1 sibling, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-05 4:20 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > I guess it depends on the console font. I guess it does. > My console font is ter-i24n. How can I tell which font my console uses? The Lisp function `char-displayable-p' seems to report correctly which characters my console can display. Is it correct for yours too? -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-05 4:20 ` Richard Stallman @ 2022-02-05 13:55 ` Tomas Hlavaty 2022-02-05 14:06 ` Eli Zaretskii 2022-02-06 4:16 ` Richard Stallman 0 siblings, 2 replies; 104+ messages in thread From: Tomas Hlavaty @ 2022-02-05 13:55 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel

On Fri 04 Feb 2022 at 23:20, Richard Stallman <rms@gnu.org> wrote:
> How can I tell which font my console uses?

On Debian-based systems:

   dpkg-reconfigure console-setup
   ls /usr/share/consolefonts/
   setfont /usr/share/consolefonts/Lat2-Terminus24x12.psf.gz
   sed -i 's/FONTFACE="Fixed"/FONTFACE="Terminus"/' /etc/default/console-setup
   sed -i 's/FONTSIZE="8x16"/FONTSIZE="28x14"/' /etc/default/console-setup
   setupcon

On NixOS:

   i18n.consoleFont = "Lat2-Terminus16";
   setfont Lat2-Terminus24x12

> The Lisp function `char-displayable-p' seems to report correctly which
> characters my console can display. Is it correct for yours too?

Here are the cases which I see as boxes:

   (char-displayable-p ?ã¯) => (invalid-read-syntax "?")
   (char-displayable-p ?ã) => unicode
   (char-displayable-p ?¯) => unicode
   (char-displayable-p ?ă) => unicode
   (char-displayable-p ?ã˘) => (invalid-read-syntax "?")
   (char-displayable-p ?ã) => unicode
   (char-displayable-p ?˘) => unicode

char-displayable-p returns non-nil even though I see boxes.

^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-05 13:55 ` Tomas Hlavaty @ 2022-02-05 14:06 ` Eli Zaretskii 2022-02-05 14:12 ` Eli Zaretskii ` (2 more replies) 2022-02-06 4:16 ` Richard Stallman 1 sibling, 3 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-02-05 14:06 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel > From: Tomas Hlavaty <tom@logand.com> > Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com, > emacs-devel@gnu.org, kevin.legouguec@gmail.com > Date: Sat, 05 Feb 2022 14:55:19 +0100 > > char-displayable-p returns non-nil even though I see boxes. Is this the Linux console or is this a terminal emulator? If it's a console, does it support the ioctl issued by calculate_glyph_code_table? Also, these calls are incorrect: (char-displayable-p ?ã¯) (char-displayable-p ?ã˘) char-displayable-p accepts a single character, not 2. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-05 14:06 ` Eli Zaretskii @ 2022-02-05 14:12 ` Eli Zaretskii 2022-02-06 1:29 ` Tomas Hlavaty 2022-02-06 1:10 ` Tomas Hlavaty 2022-02-06 4:16 ` Richard Stallman 2 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-05 14:12 UTC (permalink / raw) To: tom; +Cc: psainty, luangruo, emacs-devel, rms, kevin.legouguec > Date: Sat, 05 Feb 2022 16:06:42 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, kevin.legouguec@gmail.com, > rms@gnu.org, emacs-devel@gnu.org > > > char-displayable-p returns non-nil even though I see boxes. > > Is this the Linux console or is this a terminal emulator? > > If it's a console, does it support the ioctl issued by > calculate_glyph_code_table? I guess the answer is NO, because if it did, you'd see t as the return value, not 'unicode'. So then it isn't surprising that you get false positives when your terminal-coding-system is UTF-8: that coding-system can encode any character, and Emacs has no way of knowing which of them actually have glyphs in the console font if the console doesn't support the GIO_UNIMAP ioctl we use to find out which glyphs are actually available. ^ permalink raw reply [flat|nested] 104+ messages in thread
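The distinction drawn above between a return value of t and a charset name such as 'unicode' can be made explicit in Lisp. This is a minimal sketch under stated assumptions: the helper name and the symbols it returns are invented here, and the interpretation of each kind of return value follows the explanation in the message above rather than any documented contract of `char-displayable-p'.

```elisp
(defun my-char-display-evidence (char)
  "Classify what `char-displayable-p' reports for CHAR.
Hypothetical helper for illustration; the returned symbols are made up here."
  (let ((result (char-displayable-p char)))
    (cond ((null result) 'not-displayable)
          ;; ASCII characters are always reported as displayable.
          ((< char 128) 'ascii)
          ;; A plain t for a non-ASCII character: assumed to mean that the
          ;; console's glyph table could be consulted.
          ((eq result t) 'glyph-confirmed)
          ;; A charset name such as `unicode': the terminal coding system
          ;; can encode CHAR, which says nothing about the font.
          ((symbolp result) 'encodable-only)
          ;; Anything else: typically a font object on a graphical frame.
          (t 'font-found))))
```

On a console like the one described above, `(my-char-display-evidence #x1EB5)` would be expected to give `encodable-only', which is a hint that the answer may be a false positive.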
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-05 14:12 ` Eli Zaretskii @ 2022-02-06 1:29 ` Tomas Hlavaty 2022-02-06 8:30 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Tomas Hlavaty @ 2022-02-06 1:29 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, rms, kevin.legouguec On Sat 05 Feb 2022 at 16:12, Eli Zaretskii <eliz@gnu.org> wrote: >> If it's a console, does it support the ioctl issued by >> calculate_glyph_code_table? > > I guess the answer is NO, because if it did, you'd see t as the return > value, not 'unicode'. ok, but is it rare or surprising, that my linux console does not support the discussed ioctl? I did not do anything with the linux console, I am simply using it the way it is, except choosing the font. I found it surprising that Richard did not see the boxes I saw and that he actually could read it. I also found surprising you saying something about runtime test to detect supported characters and choosing replacements accordingly. That would be great if feasible. What is the recommended setting for the linux console, in order to see fewer boxes and more readable characters? > So then it isn't surprising that you get false positives when your > terminal-coding-system is UTF-8: There is no terminal-coding-system variable in my Emacs 27.2. How can it be UTF-8? M-x occur on ~/.emacs shows me these utf-8 settings: current-language-environment "UTF-8" prefer-coding-system 'utf-8 > that coding-system can encode any > character, and Emacs has no way of knowing which of them actually have > glyphs in the console font if the console doesn't support the > GIO_UNIMAP ioctl we use to find out which glyphs are actually > available. Maybe the ioctl is not a sufficient source of information; Emacs would need to query the current console font and then read the available glyphs from the font? ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-06 1:29 ` Tomas Hlavaty @ 2022-02-06 8:30 ` Eli Zaretskii 2022-02-06 10:38 ` Tomas Hlavaty 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-06 8:30 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: psainty, luangruo, emacs-devel, rms, kevin.legouguec > From: Tomas Hlavaty <tom@logand.com> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, rms@gnu.org, > emacs-devel@gnu.org > Date: Sun, 06 Feb 2022 02:29:18 +0100 > > On Sat 05 Feb 2022 at 16:12, Eli Zaretskii <eliz@gnu.org> wrote: > >> If it's a console, does it support the ioctl issued by > >> calculate_glyph_code_table? > > > > I guess the answer is NO, because if it did, you'd see t as the return > > value, not 'unicode'. > > ok, but is it rare or surprising, that my linux console does not support > the discussed ioctl? I don't know, I'm not an expert on that. I thought the Linux console always supports that. Perhaps someone else could chime in. Failing that, how about asking about this on the forum dedicated to your GNU/Linux distribution, or to the specific console you are using? > I also found surprising you saying something about runtime test to > detect supported characters and choosing replacements accordingly. > That would be great if feasible. That test is part of char-displayable-p. > > So then it isn't surprising that you get false positives when your > > terminal-coding-system is UTF-8: > > There is no terminal-coding-system variable in my Emacs 27.2. It's a function, not a variable. > How can it be UTF-8? Most probably because your locale specifies UTF-8 as the codeset. > M-x occur on ~/.emacs shows me these utf-8 settings: > current-language-environment "UTF-8" > prefer-coding-system 'utf-8 These are evidence that your locale indeed specifies UTF-8.
> > that coding-system can encode any > > character, and Emacs has no way of knowing which of them actually have > > glyphs in the console font if the console doesn't support the > > GIO_UNIMAP ioctl we use to find out which glyphs are actually > > available. > > Maybe the ioctl is not a sufficient source of information; Emacs would > need to query the current console font and then read the available glyphs > from the font? We don't know of any way of querying the console except via that ioctl. If someone can suggest a more reliable or more widely supported method, let them please speak up. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-06 8:30 ` Eli Zaretskii @ 2022-02-06 10:38 ` Tomas Hlavaty 2022-02-06 10:44 ` Eli Zaretskii 2022-02-06 10:54 ` Andreas Schwab 0 siblings, 2 replies; 104+ messages in thread From: Tomas Hlavaty @ 2022-02-06 10:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: luangruo, emacs-devel, rms, kevin.legouguec On Sun 06 Feb 2022 at 10:30, Eli Zaretskii <eliz@gnu.org> wrote: >> There is no terminal-coding-system variable in my Emacs 27.2. > > It's a function, not a variable. Thanks, I missed that: (terminal-coding-system) => utf-8-unix >> > So then it isn't surprising that you get false positives when your >> > terminal-coding-system is UTF-8: How can I change terminal-coding-system say to latin2 (to match my console font)? ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-06 10:38 ` Tomas Hlavaty @ 2022-02-06 10:44 ` Eli Zaretskii 2022-02-06 10:54 ` Andreas Schwab 1 sibling, 0 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-02-06 10:44 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: luangruo, emacs-devel, rms, kevin.legouguec > From: Tomas Hlavaty <tom@logand.com> > Cc: luangruo@yahoo.com, kevin.legouguec@gmail.com, rms@gnu.org, > emacs-devel@gnu.org > Date: Sun, 06 Feb 2022 11:38:36 +0100 > > >> > So then it isn't surprising that you get false positives when your > >> > terminal-coding-system is UTF-8: > > How can I change terminal-coding-system say to latin2 (to match my > console font)? Use set-terminal-coding-system. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-06 10:38 ` Tomas Hlavaty 2022-02-06 10:44 ` Eli Zaretskii @ 2022-02-06 10:54 ` Andreas Schwab 1 sibling, 0 replies; 104+ messages in thread From: Andreas Schwab @ 2022-02-06 10:54 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: luangruo, Eli Zaretskii, kevin.legouguec, rms, emacs-devel On Feb 06 2022, Tomas Hlavaty wrote: > How can I change terminal-coding-system say to latin2 (to match my > console font)? The terminal coding system has nothing to do with how the text is displayed, it's how the terminal interprets the bytes it receives. On modern systems, this is always UTF-8. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-05 14:06 ` Eli Zaretskii 2022-02-05 14:12 ` Eli Zaretskii @ 2022-02-06 1:10 ` Tomas Hlavaty 2022-02-06 4:16 ` Richard Stallman 2 siblings, 0 replies; 104+ messages in thread From: Tomas Hlavaty @ 2022-02-06 1:10 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel On Sat 05 Feb 2022 at 16:06, Eli Zaretskii <eliz@gnu.org> wrote: >> char-displayable-p returns non-nil even though I see boxes. > > Is this the Linux console or is this a terminal emulator? linux console > If it's a console, does it support the ioctl issued by > calculate_glyph_code_table? how can i say? > Also, these calls are incorrect: > > (char-displayable-p ?ã¯) > (char-displayable-p ?ã˘) > > char-displayable-p accepts a single character, not 2. I see boxes, not characters. I wasn't sure if 1 box ~ 1 char, always. That's why I sent also results with the two boxes separated, one test for each part. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-05 14:06 ` Eli Zaretskii 2022-02-05 14:12 ` Eli Zaretskii 2022-02-06 1:10 ` Tomas Hlavaty @ 2022-02-06 4:16 ` Richard Stallman 2 siblings, 0 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-06 4:16 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, tom, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Also, these calls are incorrect: > (char-displayable-p ?ã¯) > (char-displayable-p ?ã˘) I put a single Unicode character in the buffer to send the email. I guess it got munged in transmission. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-05 13:55 ` Tomas Hlavaty 2022-02-05 14:06 ` Eli Zaretskii @ 2022-02-06 4:16 ` Richard Stallman 2022-02-06 11:29 ` Tomas Hlavaty 1 sibling, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-06 4:16 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > setfont /usr/share/consolefonts/Lat2-Terminus24x12.psf.gz Would that alter the settings? I don't want to do that! I want to find out how it IS set, not change it. Here's what I have in /etc/default/console-setup # FONT='lat9w-08.psf.gz brl-8x8.psf' # FONT_MAP=/usr/share/consoletrans/lat9u.uni Can you explain what they mean? -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-06 4:16 ` Richard Stallman @ 2022-02-06 11:29 ` Tomas Hlavaty 0 siblings, 0 replies; 104+ messages in thread From: Tomas Hlavaty @ 2022-02-06 11:29 UTC (permalink / raw) To: rms; +Cc: luangruo, eliz, kevin.legouguec, emacs-devel On Sat 05 Feb 2022 at 23:16, Richard Stallman <rms@gnu.org> wrote: > > setfont /usr/share/consolefonts/Lat2-Terminus24x12.psf.gz > > Would that alter the settings? Yes, but only per console and only temporarily (it does not persist across reboots). In case you want to persist it, the sed command I provided might do that on your distro. > I don't want to do that! > I want to find out how it IS set, not change it. I am not aware of a "getfont" thing. It seems to depend on things like the distro. That is why I provided related hints on where to find the configuration and also a distro-independent way of setting it (so that one knows what the font is). If you do not want to change it, you have to look into your distro-specific configuration. ls /usr/share/consolefonts/ might give you a list of fonts to choose from (it does not on my distro). grep FONT /etc/default/console-setup might give you info about the default font configuration (it does not on my distro). > Here's what I have in /etc/default/console-setup > > # FONT='lat9w-08.psf.gz brl-8x8.psf' > # FONT_MAP=/usr/share/consoletrans/lat9u.uni > > Can you explain what they mean? I am not an expert on this, just a user, so I'll try to guess: - There are two fonts used. I did not know it was possible and I do not know how this works. - There is a file describing the supported characters. Maybe this map tells the console which characters to take from which font? iirc the console supports up to 256 characters. - latin9 seems to be the character set in the first font. - The fonts are very small. You still have very good eyes:-) I have one font only and no map file.
My font seems to be for latin2. Interestingly, my native language should be covered by latin2 but still, some accented characters are displayed properly but some as boxes. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-03 4:23 ` Richard Stallman 2022-02-03 7:53 ` Eli Zaretskii 2022-02-03 20:28 ` Tomas Hlavaty @ 2022-02-04 3:52 ` Richard Stallman 2022-02-04 8:03 ` Eli Zaretskii 2 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-04 3:52 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] It would be useful to be able to analyze and construct complex characters -- for instance, to operate on a-with-breve-and-tilde and find out that represents an a with two diacritics. So I propose a function, `diacriticize'. Its arguments are characters, and if they can be graphically combined to make a single character, that's what diacriticize returns. Otherwise, it returns nil. (diacriticize ?a ?~ ?˘) => ?㯠(diacriticize ?a ?Z) => nil It could have an inverse function, criticanalyze, which given the character code for a character that is (in spirit) a composition, would return the characters it consists of: (criticanalyze ?ã˘) => (?a ?~ ?˘) With these functions, latin1-display could figure out automatically which conversions to make. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-04 3:52 ` Richard Stallman @ 2022-02-04 8:03 ` Eli Zaretskii 2022-02-06 4:13 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-04 8:03 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com, > emacs-devel@gnu.org, kevin.legouguec@gmail.com > Date: Thu, 03 Feb 2022 22:52:07 -0500 > > It would be useful to be able to analyze and construct complex > characters -- for instance, to operate on a-with-breve-and-tilde > and find out that represents an a with two diacritics. This already exists, see below. But you seem to have something different in mind: > So I propose a function, `diacriticize'. Its arguments are > characters, and if they can be graphically combined to make a single > character, that's what diacriticize returns. Otherwise, it returns > nil. > > (diacriticize ?a ?~ ?˘) => ?㯠> (diacriticize ?a ?Z) => nil > > It could have an inverse function, criticanalyze, which given the > character code for a character that is (in spirit) a composition, > would return the characters it consists of: > > (criticanalyze ?ã˘) => (?a ?~ ?˘) > > With these functions, latin1-display could figure out automatically > which conversions to make. I don't understand the specification of these functions. How would diacriticize decide/know that ?~ is equivalent to the ?̃ (U+0303 COMBINING TILDE) that is part of ?ã ? We do have infrastructure in place to decompose characters like ã into the base character ?a and the combining diacritic(s): the call (ucs-normalize-NFD-string "ã") returns a string of 2 characters, ?a and ?̃. But how do you propose to make the leap from ?̃ to ?~ ? ^ permalink raw reply [flat|nested] 104+ messages in thread
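The decomposition mentioned in the message above can be inspected by name. A minimal sketch, assuming the bundled `ucs-normalize' library; the helper name is hypothetical, and U+1EB5 is the a-with-breve-and-tilde character discussed earlier in the thread.

```elisp
(require 'ucs-normalize)

(defun my-describe-decomposition (char)
  "Return the Unicode names of the characters in the NFD form of CHAR.
Hypothetical helper, for illustration only."
  (mapcar (lambda (c) (get-char-code-property c 'name))
          (ucs-normalize-NFD-string (string char))))

(my-describe-decomposition #x1EB5)
;; => ("LATIN SMALL LETTER A" "COMBINING BREVE" "COMBINING TILDE")
```

The result contains the combining marks, not the spacing characters ~ and ˘, which is the gap the question above points at.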
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-04 8:03 ` Eli Zaretskii @ 2022-02-06 4:13 ` Richard Stallman 2022-02-06 8:56 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-06 4:13 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > I don't understand the specification of these functions. How would > diacriticize decide/know that ?~ is equivalent to the ?̃ (U+0303 > COMBINING TILDE) that is part of ?ã ? You know more about Unicode than I do, so I'm sure it is true _in some sense_ that "U+0303 (COMBINING TILDE) is part of ?ã". But I have doubts that that particular sense is the one that is pertinent to the job `diacriticize' is meant to do. I think you mean that one can represent the glyph image `ã' in Unicode as a composition using a sequence of `a' and COMBINING TILDE. Please tell me if I am mistaken. The ã in this sentence is not a composition. It is a single Unicode character, which is also in Latin-1. I don't think that COMBINING TILDE is "part of it". COMBINING TILDE can be used to create its glyph image by composition, but as to what is graphically part of that glyph image, I think that is ordinary `~'. the call (ucs-normalize-NFD-string "ã") returns a string of 2 characters, ?a and ?̃.. Interesting. I think it would be easy to implement `diacriticize' with that. But how do you propose to make the leap from ?̃ to ?~ ? (defconst unicode-combining-chars-alist '(... (?~ . ?̃ ) ...)) ... (car (rassq combining-char unicode-combining-chars-alist)) ... Indeed, I think this does the job for `criticanalyze'. 
(defun criticanalyze (char) (let ((composition (ucs-normalize-NFD-string (char-to-string char)))) (mapcar (lambda (c) (or (car (rassq c unicode-combining-chars-alist)) c)) composition))) There is probably an equally simple way to handle `diacriticize'. I proposed those two functions because I thought we had no way for Lisp programs to get info about this. Since we already have one, maybe we don't need those two functions. Popping back to the question of `latin1-disp.el', it could use the `ucs-...' functions directly to figure out what substitutions to make. However, `ucs-normalize-NFD-string' does not know anything about ligatures. Given the fi ligature, it returns the fi ligature. So it can't be the sole method for `latin1-display' to find useful substitutions. We would have to tell it the list of ligatures. It already uses `char-displayable-p' to determine at run time which characters could use display substitutions. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
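The remark that `diacriticize' should be equally simple can be illustrated with a minimal sketch. It is only one possible reading of the proposal, not code from the thread: the `my-' names are hypothetical, the alist holds just two entries for demonstration, and composition is delegated to `ucs-normalize-NFC-string' from the ucs-normalize library that ships with Emacs.

```elisp
(require 'ucs-normalize)

;; Hypothetical two-entry table: (SPACING-CHAR . COMBINING-CHAR).
;; A real table would need an entry for every diacritic of interest.
(defconst my-combining-chars-alist
  '((?~ . #x0303)       ; TILDE -> COMBINING TILDE
    (#x02d8 . #x0306))  ; BREVE -> COMBINING BREVE
  "Sketch of the alist proposed in the thread.")

(defun my-criticanalyze (char)
  "Return CHAR's canonical decomposition as a list of characters.
Combining marks found in `my-combining-chars-alist' are replaced
by their spacing counterparts; other characters are kept as they are."
  (mapcar (lambda (c)
            (or (car (rassq c my-combining-chars-alist)) c))
          (ucs-normalize-NFD-string (char-to-string char))))

(defun my-diacriticize (base &rest diacritics)
  "Compose BASE with the spacing DIACRITICS into one character.
Return nil if a diacritic is unknown or if Unicode has no single
precomposed character for the combination."
  (let* ((marks (mapcar (lambda (d)
                          (cdr (assq d my-combining-chars-alist)))
                        diacritics))
         (composed (and (not (memq nil marks))
                        (ucs-normalize-NFC-string
                         (apply #'string base marks)))))
    (and composed
         (= (length composed) 1)
         (aref composed 0))))

(my-diacriticize ?a ?~)          ; => 227  (U+00E3)
(my-diacriticize ?a #x02d8 ?~)   ; => 7861 (U+1EB5, a with breve and tilde)
(my-diacriticize ?a ?Z)          ; => nil
(my-criticanalyze #x1eb5)        ; => (97 728 126)
```

In this sketch the order of the diacritics matters: breve followed by tilde composes to U+1EB5, while tilde followed by breve has no precomposed form and yields nil. A fuller version would probably need to try the permutations or sort the marks first.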
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-06 4:13 ` Richard Stallman @ 2022-02-06 8:56 ` Eli Zaretskii 2022-02-07 5:11 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-06 8:56 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, emacs-devel@gnu.org > Date: Sat, 05 Feb 2022 23:13:37 -0500 > > > I don't understand the specification of these functions. How would > > diacriticize decide/know that ?~ is equivalent to the ?̃ (U+0303 > > COMBINING TILDE) that is part of ?ã ? > > You know more about Unicode than I do, so I'm sure it is true _in some > sense_ that "U+0303 (COMBINING TILDE) is part of ?ã". > > But I have doubts that that particular sense is the one that is > pertinent to the job `diacriticize' is meant to do. > > I think you mean that one can represent the glyph image `ã' in Unicode > as a composition using a sequence of `a' and COMBINING TILDE. Please > tell me if I am mistaken. You are not mistaken. The character 'ã' can be "decomposed" into 2 characters, 'a' and COMBINING TILDE. This is called "canonical decomposition" in Unicode. > The ã in this sentence is not a composition. It is a single > Unicode character, which is also in Latin-1. I don't think that > COMBINING TILDE is "part of it". It is, in the sense that the original character can be decomposed. > But how do you propose > to make the leap from ?̃ to ?~ ? > > > > (defconst unicode-combining-chars-alist '(... (?~ . ?̃ ) ...)) So you mean we should create a database of ASCII characters that approximate the combining diacriticals? But if so, how is it better than having a database of complete characters and their ASCII equivalents, like we have now in latin1-disp.el? 
Your proposal may make the database smaller (and even that mostly only for Latin characters), but a database of complete characters makes it easier to make sure the results are optimal, because you see the original complete character and the complete equivalent, instead of "composing" them in your head for all the combinations. I think reasonable appearance is more important than memory consumption in this case, and other than that, your proposal just means replacing one database by another, right? > However, `ucs-normalize-NFD-string' does not know anything about > ligatures. Given the fi ligature, it returns the fi ligature. You need a different kind of decomposition for that, called "compatibility decomposition": (ucs-normalize-NFKD-string "fi") => "fi" You can use ucs-normalize-NFKD-string for the job of ucs-normalize-NFD-string as well: (append (ucs-normalize-NFKD-string "ã") nil) => (97 771) (I used 'append' here to make it evident that the result of the decomposition is 2 characters, not one, since the Emacs display will by default combine them into the same glyph as the original non-ASCII character, and an innocent reader could think the decomposition didn't work.) ^ permalink raw reply [flat|nested] 104+ messages in thread
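The difference between the two kinds of decomposition can be made visible with a short sketch. It uses only `ucs-normalize-NFD-string' and `ucs-normalize-NFKD-string'; the constant name `my-fi-ligature' is made up for the example, and the ligature is written as an escape so that the result does not depend on how this message is encoded or displayed.

```elisp
(require 'ucs-normalize)

;; U+FB01 LATIN SMALL LIGATURE FI (hypothetical constant name).
(defconst my-fi-ligature "\ufb01")

;; Canonical decomposition leaves the ligature alone.
(append (ucs-normalize-NFD-string my-fi-ligature) nil)   ; => (64257)

;; Compatibility decomposition splits it into ?f and ?i.
(append (ucs-normalize-NFKD-string my-fi-ligature) nil)  ; => (102 105)

;; For a letter with a diacritic, both forms give the same result.
(append (ucs-normalize-NFD-string "\u00e3") nil)         ; => (97 771)
(append (ucs-normalize-NFKD-string "\u00e3") nil)        ; => (97 771)
```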
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-06 8:56 ` Eli Zaretskii @ 2022-02-07 5:11 ` Richard Stallman 2022-02-07 13:16 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-07 5:11 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > So you mean we should create a database of ASCII characters that > approximate the combining diacriticals? But if so, how is it better > than having a database of complete characters and their ASCII > equivalents, like we have now in latin1-disp.el? I think there are only around 20 diacritics. There must be hundreds of letters-with-diacritics. The method I've proposed can handle everything automatically, given a table about the 20-odd diacritics. That's a great simplification from a table of hundreds of elements, set up by hand. > but a database of complete characters makes it easier to > make sure the results are optimal, because you see the original > complete character and the complete equivalent, I don't follow you here. In particular, what does "complete equivalent" mean? Concretely how would a result be "less than optimal"? Can you illustrate with an example? > I think reasonable appearance is more important than memory > consumption in this case, What makes an appearance more or less reasonable when we're talking about replacing one character with two or three that express _symbolically_ which character it is? I don't get it. > You can use ucs-normalize-NFKD-string for the job of > ucs-normalize-NFD-string as well: > (append (ucs-normalize-NFKD-string "ã") nil) => (97 771) Great! That does most of the job, I think. 
> (I used 'append' here to make it evident that the result of the > decomposition is 2 characters, not one, since the Emacs display will > by default combine them into the same glyph as the original non-ASCII > character, Not on a Linux console, I think. When I have f and i in the buffer, Emacs does not convert them into a ligature. The only time it has to try to deal with a ligature is when there is a Unicode ligature code point in the buffer. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-07 5:11 ` Richard Stallman @ 2022-02-07 13:16 ` Eli Zaretskii 2022-02-08 3:55 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-07 13:16 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Mon, 07 Feb 2022 00:11:28 -0500 > > > So you mean we should create a database of ASCII characters that > > approximate the combining diacriticals? But if so, how is it better > > than having a database of complete characters and their ASCII > > equivalents, like we have now in latin1-disp.el? > > I think there are only around 20 diacritics. You are thinking of some subset, I think. The real number is more like 80, and that's even if we only take the diacritics relevant to Latin characters, and disregard the Cyrillic, Greek, and others. > There must be hundreds of letters-with-diacritics. The method I've > proposed can handle everything automatically, given a table about > the 20-odd diacritics. That's a great simplification from a table > of hundreds of elements, set up by hand. Setting by hand was already done, and we have it in latin1-disp.el so it isn't like we need to weigh 2 jobs one against the other. > > but a database of complete characters makes it easier to > > make sure the results are optimal, because you see the original > > complete character and the complete equivalent, > > I don't follow you here. In particular, what does "complete > equivalent" mean? For example, "o?'" instead of "o" + "?" + "'" (to emulate ?\ṍ). With the former, you see the entire string that will be shown; with the latter, you need to imagine it (and all the other combinations that use one or both of these diacritics). 
Also, characters that have two diacritics are just part of the problem. What would you do with the likes of ?\ǿ (which we currently represent as "o/'")? Its base character, ø, doesn't have a decomposition in Unicode. IOW, your proposal solves only some (small) part of the problem at best, whereas having complete strings in the database is needed anyway for the rest. > > I think reasonable appearance is more important than memory > > consumption in this case, > > What makes an appearance more or less reasonable when we're talking > about replacing one character with two or three that express > _symbolically_ which character it is? I don't get it. The appearance should (a) make sense, and (b) be consistent: for example, U+030C COMBINING CARON should always be represented by the same ASCII equivalent. I don't see how you could fulfill these two conditions without reviewing all the relevant combinations and iteratively fixing whatever needs fixing. > > (I used 'append' here to make it evident that the result of the > > decomposition is 2 characters, not one, since the Emacs display will > > by default combine them into the same glyph as the original non-ASCII > > character, > > Not on a Linux console, I think. When I have f and i in the buffer, > Emacs does not convert them into a ligature. The only time it has to > try to deal with a ligature is when there is a Unicode ligature > code point in the buffer. Once again, on a TTY frame Emacs does NOT produce the ligatures nor combine base characters with the diacritics, it expects the terminal to do that. I've written the above remark because you are not the only one who reads this discussion, and most other people do use GUI displays, where the characters would (potentially confusingly) combine on display. ^ permalink raw reply [flat|nested] 104+ messages in thread
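The point about o-with-stroke can be checked directly. The following is a small sketch using escapes for the characters involved (U+01FF, U+00F8, U+1E4D); it only calls the normalization functions already mentioned in the thread.

```elisp
(require 'ucs-normalize)

;; U+01FF (o with stroke and acute) decomposes only as far as
;; U+00F8 (o with stroke) plus U+0301 COMBINING ACUTE ACCENT.
(append (ucs-normalize-NFD-string "\u01ff") nil)   ; => (248 769)

;; U+00F8 itself has no decomposition, canonical or compatibility,
;; so there is no mechanical path from it to "o/".
(append (ucs-normalize-NFKD-string "\u00f8") nil)  ; => (248)

;; By contrast, U+1E4D (o with tilde and acute) decomposes fully.
(append (ucs-normalize-NFD-string "\u1e4d") nil)   ; => (111 771 769)
```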
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-07 13:16 ` Eli Zaretskii @ 2022-02-08 3:55 ` Richard Stallman 2022-02-08 12:20 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-08 3:55 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > I think there are only around 20 diacritics. > You are thinking of some subset, I think. The real number is more > like 80, I am amazed. Where can I see a list that shows more of them? > That's a great simplification from a table > > of hundreds of elements, set up by hand. > Setting by hand was already done, and we have it in latin1-disp.el so Do you mean, the table that presents a-with-breve-and-tilde as `a)?'? I don't think that works well. > > I don't follow you here. In particular, what does "complete > > equivalent" mean? > For example, "o?'" instead of "o" + "?" + "'" (to emulate ?\ṍ). I don't understand the difference between "o?'" and "o" + "?" + "'". They look like two ways of describing the same sequence of three characters. Though ? would never make me think of tilde unless you told me. With > the former, you see the entire string that will be shown; with the > latter, you need to imagine it I can't follow that, since you're talking about two things that look identical to me. > What would you do with the likes of ?\ǿ (which we currently > represent as "o/'")? Its base character, ø, doesn't have a > decomposition in Unicode. For my terminal, I'd like it to send ø literally since my terminal can display that. `ø'' would be a good way to display it. But on a terminal that can't display ø, `o/'' would be a good choice. > > Not on a Linux console, I think. 
When I have f and i in the buffer, > > Emacs does not convert them into a ligature. The only time it has to > > try to deal with a ligature is when there is a Unicode ligature > > code point in the buffer. > Once again, on a TTY frame Emacs does NOT produce the ligatures nor > combine base characters with the diacritics. You have told me this several times, and I believe you. But how does it relate to the case I am talking about? I don't see a relationship. I was looking at a buffer containing a ligature character. It must have come from a message or file that I looked at in that buffer. I suppose Emacs did not _produce_ it, but it was in my buffer and I had to use C-u C-x = to see what it was. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-08 3:55 ` Richard Stallman @ 2022-02-08 12:20 ` Eli Zaretskii 2022-02-09 4:06 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-08 12:20 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, emacs-devel@gnu.org > Date: Mon, 07 Feb 2022 22:55:54 -0500 > > > > I think there are only around 20 diacritics. > > > You are thinking of some subset, I think. The real number is more > > like 80, > > I am amazed. Where can I see a list that shows more of them? Type "C-x 8 RET COMBINING", press TAB, then filter out of the candidates those which pertain to Cyrillic, Greek, and other specific scripts, leaving just Latin and those which don't belong to specific scripts. > > That's a great simplification from a table > > > of hundreds of elements, set up by hand. > > > Setting by hand was already done, and we have it in latin1-disp.el so > > Do you mean, the table that presents a-with-breve-and-tilde as `a)?'? > I don't think that works well. I think it works as well as it could, but in any case, seeing all the combinations explicitly is needed to provide reasonable results. > > > I don't follow you here. In particular, what does "complete > > > equivalent" mean? > > > For example, "o?'" instead of "o" + "?" + "'" (to emulate ?\ṍ). > > I don't understand the difference between "o?'" and "o" + "?" + "'". Your proposal is to have separate rules to produce the equivalent of each diacritic, so you will never see "o?'", only its components separately; I denoted the latter by "o?'" and "o" + "?" + "'". > > What would you do with the likes of ?\ǿ (which we currently > > represent as "o/'")? Its base character, ø, doesn't have a > > decomposition in Unicode. 
> > For my terminal, I'd like it to send ø literally since my terminal > can display that. `ø'' would be a good way to display it. > But on a terminal that can't display ø, `o/'' would be a good choice. My point is that there isn't a mechanical way of producing "o/" from ø, because Unicode decompositions don't support that. > > > Not on a Linux console, I think. When I have f and i in the buffer, > > > Emacs does not convert them into a ligature. The only time it has to > > > try to deal with a ligature is when there is a Unicode ligature > > > code point in the buffer. > > > Once again, on a TTY frame Emacs does NOT produce the ligatures nor > > combine base characters with the diacritics. > > You have told me this several times, and I believe you. But how does > it relate to the case I am talking about? I don't see a relationship. As I said, that remark was for other people, those who will read my email on GUI displays. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-08 12:20 ` Eli Zaretskii @ 2022-02-09 4:06 ` Richard Stallman 2022-02-09 13:50 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-09 4:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Type "C-x 8 RET COMBINING", press TAB, then filter out of the > candidates those which pertain to Cyrillic, Greek, and other specific > scripts, leaving just Latin and those which don't belong to specific > scripts. Amazing! I see that it is possible to compute automatically the alist I proposed, from these names. A program can look at each of these COMBINING names, delete `COMBINING ', and look up the rest as a character name. No need to make the correspondence alist by hand. However, some of those COMBINING forms have no non-COMBINING counterpart. For instance, there is COMBINING ZIGZAG ABOVE, but no ZIGZAG ABOVE. How do you represent an uncombined zigzag-above in Unicode? Put it after SPACE as a combination? > Your proposal is to have separate rules to produce the equivalent of > each diacritic, so you will never see "o?'", only its components > separately. Yes. On a terminal that can't display that letter, I'd like Emacs to display it as a trigraph of one letter and two diacritics. They should be non-combining diacritics, so that display won't try to combine them. > My point is that there isn't a mechanical way of producing "o/" from > ø, because Unicode decompositions don't support that. It wouldn't be very hard to add a list of extra decompositions that are not known to Unicode itself. > > You have told me this several times, and I believe you. 
But how does > > it relate to the case I am talking about? I don't see a relationship. > As I said, that remark was for other people, those who will read my > email on GUI displays. It is good to know that we don't have a misunderstanding about that point. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
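The name-based lookup described above (delete `COMBINING ' and look up the rest as a character name) could be sketched as follows. The function name `my-spacing-counterpart' is hypothetical, and the sketch assumes Emacs 26 or later, where `char-from-name' is available.

```elisp
(defun my-spacing-counterpart (combining-char)
  "Return the character whose name is COMBINING-CHAR's name
without the \"COMBINING \" prefix, or nil if there is none."
  (let ((name (get-char-code-property combining-char 'name))
        (prefix "COMBINING "))
    (and name
         (string-prefix-p prefix name)
         (char-from-name (substring name (length prefix))))))

(my-spacing-counterpart #x0303)  ; COMBINING TILDE  => 126 (?~)
(my-spacing-counterpart #x0306)  ; COMBINING BREVE  => 728 (U+02D8)
(my-spacing-counterpart #x035b)  ; COMBINING ZIGZAG ABOVE => nil
```

One caveat with applying this blindly to the whole block: names such as COMBINING LATIN SMALL LETTER A (U+0363) would presumably map to the plain letter `a', so a real table would still need some filtering by hand.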
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-09 4:06 ` Richard Stallman @ 2022-02-09 13:50 ` Eli Zaretskii 2022-02-10 3:57 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-09 13:50 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, emacs-devel@gnu.org > Date: Tue, 08 Feb 2022 23:06:26 -0500 > > However, some of those COMBINING forms have no non-COMBINING counterpart. > For instance, there is COMBINING ZIGZAG ABOVE, but no ZIGZAG ABOVE. > > How do you represent an uncombined zigzag-above in Unicode? > Put it after SPACE as a combination? If I understand correctly what you want, you should use U+25CC DOTTED CIRCLE before the combining character, not SPACE. > > My point is that there isn't a mechanical way of producing "o/" from > > ø, because Unicode decompositions don't support that. > > It wouldn't be very hard to add a list of extra decompositions that > are not known to Unicode itself. Sure, but that means we'd need some manually-maintained database anyway. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-09 13:50 ` Eli Zaretskii @ 2022-02-10 3:57 ` Richard Stallman 2022-02-10 6:26 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-10 3:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > How do you represent an uncombined zigzag-above in Unicode? > > Put it after SPACE as a combination? > If I understand correctly what you want, you should use U+25CC DOTTED > CIRCLE before the combining character, not SPACE. Maybe that is right, but I don't understand it. Why is that right? Anyway, if DOTTED CIRCLE + COMBINING ZIGZAG ABOVE is the right way to represent a noncombining ZIGZAG ABOVE, Emacs can use that. > > It wouldn't be very hard to add a list of extra decompositions that > > are not known to Unicode itself. > Sure, but that means we'd need some manually-maintained database > anyway. Maybe so. But these automatic methods will make a good simplification even if it doesn't simplify everything perfectly. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-10 3:57 ` Richard Stallman @ 2022-02-10 6:26 ` Eli Zaretskii 2022-02-12 3:57 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-10 6:26 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Wed, 09 Feb 2022 22:57:51 -0500 > > > If I understand correctly what you want, you should use U+25CC DOTTED > > CIRCLE before the combining character, not SPACE. > > Maybe that is right, but I don't understand it. > Why is that right? The dotted circle is the accepted method of showing stand-alone combining characters. It is used everywhere, and U+25CC exists for that very purpose. > > > It wouldn't be very hard to add a list of extra decompositions that > > > are not known to Unicode itself. > > > Sure, but that means we'd need some manually-maintained database > > anyway. > > Maybe so. But these automatic methods will make a good simplification > even if it doesn't simplify everything perfectly. We disagree about whether this is a significant simplification. Looking at the giant Lynx-derived database in latin1-disp.el, I fail to see how making a small part of it auto-generated would be a win. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-10 6:26 ` Eli Zaretskii @ 2022-02-12 3:57 ` Richard Stallman 2022-02-12 7:36 ` Eli Zaretskii 2022-02-12 20:10 ` Tomas Hlavaty 0 siblings, 2 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-12 3:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > The dotted circle is the accepted method of showing stand-alone > combining characters. Sorry, I didn't know about this. I don't argue against it. Presenting characters that can't be displayed can use DOTTED CIRCLE if the terminal can display that, and the combination of that with the diacritics. Otherwise it should use some other character, such as SPACE. > We disagree about whether this is a significant simplification. > Looking at the giant Lynx-derived database in latin1-disp.el, I fail > to see how making a small part of it auto-generated would be a win. I had not seen that list before. Earlier you said that the a(? translation was made by hand, so I am somewhat confused now. Most of those entries are for characters without diacritics, it seems, and I'm not talking about those. My objection is to some translations of letters with diacritics. Their meanings are not guessable. I want to replace them with sequences people will be able to understand at first sight. If the easiest way to do that is by editing that list, ok. But maybe those characters don't need to be in the list at all.
-- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
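The fallback described in this message (DOTTED CIRCLE when the terminal can show it, otherwise some other character such as SPACE) might look like the sketch below. The function name `my-standalone-diacritic' is hypothetical; `char-displayable-p' is the existing predicate mentioned earlier in the thread.

```elisp
(defun my-standalone-diacritic (combining-char)
  "Return a two-character string showing COMBINING-CHAR on its own.
The carrier is U+25CC DOTTED CIRCLE if the current display can
show it, and SPACE otherwise."
  (let ((carrier (if (char-displayable-p #x25cc) #x25cc ?\s)))
    (string carrier combining-char)))

;; Which carrier is chosen depends on the terminal or font in use.
(my-standalone-diacritic #x035b)   ; U+035B COMBINING ZIGZAG ABOVE
```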
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-12 3:57 ` Richard Stallman @ 2022-02-12 7:36 ` Eli Zaretskii 2022-02-14 4:13 ` Richard Stallman 2022-02-12 20:10 ` Tomas Hlavaty 1 sibling, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-12 7:36 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, emacs-devel@gnu.org > Date: Fri, 11 Feb 2022 22:57:07 -0500 > > > We disagree about whether this is a significant simplification. > > Looking at the giant Lynx-derived database in latin1-disp.el, I fail > > to see how making a small part of it auto-generated would be a win. > > I had not seen that list before. > > Earlier you said that the a(? translation was made by hand, so I am > somewhat confused now. > > Most of those entries are for characters without diacritics, it seems, > and I'm not talking about those. My objection is to some translations > of letters with diacritics. Their meanings are not guessable. I want > to replace them with sequences people will be able to understand at > first sight. Then please suggest replacements you consider to be better, and let's make those replacements. We are not bound by what Lynx does, we just used that as a source. > If the easiest way to do that is by editing that list, ok. Yes, that's what I was trying to say all the time: let's edit that list directly. That way, we get to see all the entries, and can easily judge whether the method of expressing the diacritics is consistent and looks reasonably well. > But maybe those characters don't need to be in the list at all. This was your proposal about generating some of the entries. I think it will be harder to maintain, because the effects of a change in expressing some diacritic are not immediately evident -- you don't see all of the affected characters. 
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-12 7:36 ` Eli Zaretskii @ 2022-02-14 4:13 ` Richard Stallman 2022-02-14 12:07 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-14 4:13 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Then please suggest replacements you consider to be better, and let's > make those replacements. We are not bound by what Lynx does, we just > used that as a source. For each character whose name has the form LATIN (SMALL|CAPITAL) LETTER \1 WITH \2 if the terminal can't display that, it should display \1 (in the appropriate case), followed by the graphical form of \2. Thus, for LATIN CAPITAL LETTER A WITH MACRON it should display as `A' followed by a macron. For each character whose name has the form latin (small|capital) letter \1 with \2 and \3 if the terminal can't display that, but it can display \1 with \2, it should display LATIN (SMALL|CAPITAL) LETTER \1 WITH \2 followed by the graphical form of \3. Otherwise, it should display \1 followed by the graphical form of \2 followed by the graphical form of \3. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-14 4:13 ` Richard Stallman @ 2022-02-14 12:07 ` Eli Zaretskii 2022-02-15 4:33 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-14 12:07 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, emacs-devel@gnu.org > Date: Sun, 13 Feb 2022 23:13:16 -0500 > > > Then please suggest replacements you consider to be better, and let's > > make those replacements. We are not bound by what Lynx does, we just > > used that as a source. > > For each character whose name has the form > > LATIN (SMALL|CAPITAL) LETTER \1 WITH \2 > > if the terminal can't display that, it should display \1 > (in the appropriate case), followed by the graphical form of \2. > > Thus, for > > LATIN CAPITAL LETTER A WITH MACRON > > it should display as `A' followed by a macron. > > For each character whose name has the form > > latin (small|capital) letter \1 with \2 and \3 > > if the terminal can't display that, but it can display \1 with \2, > it should display > > LATIN (SMALL|CAPITAL) LETTER \1 WITH \2 > > followed by the graphical form of \3. > > Otherwise, it should display > > \1 followed by the graphical form of \2 followed by the graphical form of \3. Thanks, but I thought you'd propose replacements for the equivalents of the diacritics (i.e. those "graphical forms"), not an algorithm to combine them. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-14 12:07 ` Eli Zaretskii @ 2022-02-15 4:33 ` Richard Stallman 2022-02-15 13:32 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-15 4:33 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Thanks, but I thought you'd propose replacements for the equivalents > of the diacritics (i.e. those "graphical forms"), not an algorithm to > combine them. Sorry, I don't understand either half of that sentence. I thought you were asking me to present replacements for the display equivalents in the long alist, so I tried to do that in a systematic way. Are you asking for a list of correspondences from COMBINING TILDE to TILDE? -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-15 4:33 ` Richard Stallman @ 2022-02-15 13:32 ` Eli Zaretskii 2022-02-16 4:14 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-15 13:32 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org, > kevin.legouguec@gmail.com > Date: Mon, 14 Feb 2022 23:33:50 -0500 > > > Thanks, but I thought you'd propose replacements for the equivalents > > of the diacritics (i.e. those "graphical forms"), not an algorithm to > > combine them. > > Sorry, I don't understand either half of that sentence. > > I thought you were asking me to present replacements for the > display equivalents in the long alist, so I tried to do that > in a systematic way. > > Are you asking for a list of correspondences from > COMBINING TILDE to TILDE? No, for replacements for the cdr part of the likes of (?\ẵ "a(?") You said that you didn't like (? as the equivalent of the two diacritics ̆̃, so I suggested that you propose alternative equivalents which you'd like better. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-15 13:32 ` Eli Zaretskii @ 2022-02-16 4:14 ` Richard Stallman 2022-02-16 12:10 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-02-16 4:14 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > No, for replacements for the cdr part of the likes of > (?\ẵ "a(?") > You said that you didn't like (? as the equivalent of the two > diacritics ̆̃, so I suggested that you propose alternative equivalents > which you'd like better. That is the question I answered -- in full generality. Instead of sending you a very long list of replacements, I sent simple general rules to handle ALL characters that have the form LETTER + DIACRITIC, and ALL characters that have the form LETTER + DIACRITIC1 + DIACRITIC2. The character you just cited is LATIN SMALL LETTER A WITH BREVE AND TILDE. The rule is For each character whose name has the form latin (small|capital) letter \1 with \2 and \3 if the terminal can't display that, but it can display \1 with \2, it should display LATIN (SMALL|CAPITAL) LETTER \1 WITH \2 followed by the graphical form of \3. Otherwise, it should display \1 followed by the graphical form of \2 followed by the graphical form of \3. Following this rule, Emacs would display ă~, if the terminal can handle that, otherwise a˘~ . Have I said it clearly this time? -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-16 4:14 ` Richard Stallman @ 2022-02-16 12:10 ` Eli Zaretskii 2022-02-19 4:54 ` Richard Stallman 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-02-16 12:10 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec > From: Richard Stallman <rms@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, > kevin.legouguec@gmail.com, emacs-devel@gnu.org > Date: Tue, 15 Feb 2022 23:14:25 -0500 > > > No, for replacements for the cdr part of the likes of > > > (?\ẵ "a(?") > > > You said that you didn't like (? as the equivalent of the two > > diacritics ̆̃, so I suggested that you propose alternative equivalents > > which you'd like better. > > That is the question I answered -- in full generality. > Instead of sending you a very long list of replacements, > I sent simple general rules to handle ALL characters > that have the form LETTER + DIACRITIC, and ALL characters > that have the form LETTER + DIACRITIC1 + DIACRITIC2. > > The character you just cited is LATIN SMALL LETTER A WITH BREVE AND TILDE. > The rule is > > For each character whose name has the form > > latin (small|capital) letter \1 with \2 and \3 > > if the terminal can't display that, but it can display \1 with \2, > it should display > > LATIN (SMALL|CAPITAL) LETTER \1 WITH \2 > > followed by the graphical form of \3. > > Otherwise, it should display > > \1 followed by the graphical form of \2 followed by the graphical form of \3. I understand the general rule, but I hoped you had specific suggestions for those \2 and \3 placeholders. It now sounds like you actually suggest to display the diacritic in its non-combining variety, as I understand from this example: > Following this rule, Emacs would display ă~, if the terminal can handle that, > otherwise a˘~ . 
But in that case, it goes against the spirit of this feature, which expresses non-ASCII characters with equivalent strings composed of ASCII characters. Since ˘ is non-ASCII, chances are that the terminal which cannot display ă will be unable to display ˘ as well. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-16 12:10 ` Eli Zaretskii @ 2022-02-19 4:54 ` Richard Stallman 0 siblings, 0 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-19 4:54 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Following this rule, Emacs would display ă~, if the terminal can handle that, > > otherwise a˘~ . > But in that case, it goes against the spirit of this feature, which > expresses non-ASCII characters with equivalent strings composed of > ASCII characters. Since ˘ is non-ASCII, chances are that the terminal > which cannot display ă will be unable to display ˘ as well. That's a valid point: on a terminal which can't display the BREVE character, this method would not work. There could be an additional fallback method of using `(' instead of BREVE. I can't think of any other ASCII character that would be better than that one. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
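The fallback chain discussed in the messages above (precomposed character, then base letter plus spacing diacritic, then an ASCII stand-in) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the latin1-disp.el implementation: it uses Unicode canonical decompositions instead of parsing character names, the `displayable` predicate stands in for whatever the terminal can show, and the `SPACING` table is hypothetical apart from `(` for BREVE and `~` for TILDE, which come from the thread.

```python
import unicodedata

# Combining mark -> (spacing form, ASCII fallback).
# Only "(" for BREVE and "~" for TILDE appear in the thread;
# the MACRON entry is an illustrative assumption.
SPACING = {
    "\u0306": ("\u02d8", "("),  # COMBINING BREVE -> BREVE, else "("
    "\u0303": ("~", "~"),       # COMBINING TILDE -> ASCII tilde
    "\u0304": ("\u00af", "-"),  # COMBINING MACRON -> MACRON, else "-"
}


def expand(ch, displayable):
    """Return a display string for CH, peeling off one diacritic at a time.

    DISPLAYABLE is a predicate telling whether the terminal can show a
    given character (an assumption standing in for the real terminal check).
    """
    if displayable(ch):
        return ch
    parts = unicodedata.decomposition(ch).split()
    # Handle only canonical "base + one combining mark" decompositions.
    if len(parts) != 2 or parts[0].startswith("<"):
        return ch
    base, mark = (chr(int(p, 16)) for p in parts)
    # Unknown marks fall back to "?" (an arbitrary choice for this sketch).
    spacing, ascii_fallback = SPACING.get(mark, (mark, "?"))
    tail = spacing if displayable(spacing) else ascii_fallback
    return expand(base, displayable) + tail
```

With a terminal that shows only ASCII plus BREVE, U+1EB5 comes out as `a˘~`; with plain Latin-1 it degrades further to `a(~`.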
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-12 3:57 ` Richard Stallman 2022-02-12 7:36 ` Eli Zaretskii @ 2022-02-12 20:10 ` Tomas Hlavaty 2022-02-14 4:14 ` Richard Stallman 1 sibling, 1 reply; 104+ messages in thread From: Tomas Hlavaty @ 2022-02-12 20:10 UTC (permalink / raw) To: rms, Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel On Fri 11 Feb 2022 at 22:57, Richard Stallman <rms@gnu.org> wrote: > Earlier you said that the a(? translation was made by hand, so I am > somewhat confused now. > > Most of those entries are for characters without diacritics, it seems, > and I'm not talking about those. My objection is to some translations > of letters with diacritics. Their meanings are not guessable. I want > to replace them with sequences people will be able to understand at > first sight. In my native language, dropping diacritics for ASCII rather than imitating them with additional characters is usually more readable. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-02-12 20:10 ` Tomas Hlavaty @ 2022-02-14 4:14 ` Richard Stallman 0 siblings, 0 replies; 104+ messages in thread From: Richard Stallman @ 2022-02-14 4:14 UTC (permalink / raw) To: Tomas Hlavaty; +Cc: psainty, luangruo, eliz, emacs-devel, kevin.legouguec [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > In my native language, dropping diacritics for ASCII rather than > imitating them with additional characters is usually more readable. That could be an option. If we implement what I asked for, it would be easy to implement a variant that deletes all diacritics from the expansions. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
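The "delete all diacritics" variant mentioned above is easy to illustrate outside Emacs. A minimal Python sketch, assuming canonical decomposition (NFD) is an acceptable definition of "letter plus diacritics"; the function name is made up for this example, and letters that have no decomposition (such as `ø` or `đ`) pass through unchanged.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

For instance, `strip_diacritics("Hlavatý")` gives `"Hlavaty"`.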
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-26 3:39 ` Richard Stallman 2022-01-26 5:38 ` Eli Zaretskii @ 2022-01-26 8:20 ` Andreas Schwab 2022-01-27 4:13 ` Richard Stallman 1 sibling, 1 reply; 104+ messages in thread From: Andreas Schwab @ 2022-01-26 8:20 UTC (permalink / raw) To: Richard Stallman Cc: psainty, luangruo, Eli Zaretskii, kevin.legouguec, emacs-devel On Jan 25 2022, Richard Stallman wrote: > I didn't know that there were two different kinds of ligatures in Unicode. It is misleading at best to call them ligatures. They are just random Unicode code points that happen to be absent from the font that the terminal uses. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-26 8:20 ` Andreas Schwab @ 2022-01-27 4:13 ` Richard Stallman 2022-01-27 6:39 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-01-27 4:13 UTC (permalink / raw) To: Andreas Schwab; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > It is misleading at best to call them ligatures. They are just random > Unicode code points that happen to be absent from the font that the > terminal uses. I disagree. These Unicode values are not ordinary, not just like any other. They are special because each of them represents a ligature of two ASCII characters, which could just as well be presented as a series of two characters. When it is impossible to display the character's ligature, it would be more useful to display the two ASCII characters than to display an unhelpful diamond. We should try to do what is most helpful, not be quick to give up. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-27 4:13 ` Richard Stallman @ 2022-01-27 6:39 ` Eli Zaretskii 2022-01-27 8:13 ` Kévin Le Gouguec 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-27 6:39 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, kevin.legouguec, schwab, emacs-devel > From: Richard Stallman <rms@gnu.org> > Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com, > emacs-devel@gnu.org, kevin.legouguec@gmail.com > Date: Wed, 26 Jan 2022 23:13:47 -0500 > > I disagree. These Unicode values are not ordinary, not just like any > other. They are special because each of them represents a ligature of > two ASCII characters, which could just as well be presented as a > series of two characters. Note that there are precomposed codepoints for ligatures of 3 ASCII characters as well, for example U+FB03 LATIN SMALL LIGATURE FFI. > When it is impossible to display the character's ligature, it would be > more useful to display the two ASCII characters than to display an > unhelpful diamond. > > We should try to do what is most helpful, not be quick to give up. Would it be good enough to have a command that will arrange for these ligatures to be displayed as their ASCII equivalents, using the facilities in latin1-disp.el? Such a command could be invoked either manually or from your init file. latin1-disp.el also provides a special face to display such equivalents, so you could have them stand out on display if you want. ^ permalink raw reply [flat|nested] 104+ messages in thread
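For the ligature code points discussed above, the Unicode Character Database already records the ASCII sequences as compatibility decompositions, so a replacement table need not be written by hand. A small Python sketch of that idea follows; it is an illustration only and makes no claim about how latin1-disp.el builds its table. Restricting the expansion to characters whose name contains "LIGATURE" is an assumption made here to leave other compatibility characters alone.

```python
import unicodedata


def expand_ligatures(text):
    """Replace precomposed ligatures by their compatibility decomposition."""
    out = []
    for ch in text:
        is_compat = unicodedata.decomposition(ch).startswith("<compat>")
        if is_compat and "LIGATURE" in unicodedata.name(ch, ""):
            # NFKD applies the compatibility mapping, e.g. U+FB03 -> "ffi".
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)
```

This handles the three-letter case as well: `expand_ligatures("o\ufb03cial")` returns `"official"`.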
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-27 6:39 ` Eli Zaretskii @ 2022-01-27 8:13 ` Kévin Le Gouguec 2022-01-27 9:55 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Kévin Le Gouguec @ 2022-01-27 8:13 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, schwab, rms, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Richard Stallman <rms@gnu.org> >> Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com, >> emacs-devel@gnu.org, kevin.legouguec@gmail.com >> Date: Wed, 26 Jan 2022 23:13:47 -0500 >> >> When it is impossible to display the character's ligature, it would be >> more useful to display the two ASCII characters than to display an >> unhelpful diamond. >> >> We should try to do what is most helpful, not be quick to give up. > > Would it be good enough to have a command that will arrange for these > ligatures to be displayed as their ASCII equivalents, using the > facilities in latin1-disp.el? Such a command could be invoked either > manually or from your init file. latin1-disp.el also provides a > special face to display such equivalents, so you could have them stand > out on display if you want. Reading the documentation of the various glyphless-* knobs, I wonder if it would make sense to provide another group for glyphless-char-display-control? 'no-font is not helpful on my TTY, IIUC because terminal-coding-system says "utf-8-unix"?). Maybe 'no-display, meaning (null (char-displayable-p CHAR))? That would at least allow users to tell Emacs to use the 'hex-code method, which would be more immediately informative than the diamond. Though not by a lot. Maybe adding a new method? Something like 'char-name? 
Obviously it'd be ugly to see… > Please refer to the o\N{LATIN SMALL LIGATURE FFI}cial documentation … but (1) it would be more informative (though maybe not less confusing) than "Please refer to the o◆cial documentation", (2) it would also serve as a decent fallback for symbols and emojis, which we see more and more on this list. I'm thinking of situations like <E1mxiYQ-0002ul-5B@fencepost.gnu.org>; not saying we should encourage using symbols over words, but TTY users would probably appreciate this kind of fallback? (I hope at least some of this message makes sense; apologies if not) ^ permalink raw reply [flat|nested] 104+ messages in thread
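The 'char-name idea floated above (a hypothetical display method, not an existing Emacs one) can be mocked up in a few lines of Python to see what the output would look like. The `displayable` predicate is again an assumption standing in for a check such as `char-displayable-p`, and falling back to a hex escape for unnamed characters is a choice made for this sketch.

```python
import unicodedata


def name_fallback(text, displayable):
    """Show undisplayable characters as \\N{NAME}, or \\uXXXX if unnamed."""
    out = []
    for ch in text:
        if displayable(ch):
            out.append(ch)
            continue
        name = unicodedata.name(ch, None)
        if name is not None:
            out.append("\\N{" + name + "}")
        else:
            out.append("\\u%04X" % ord(ch))
    return "".join(out)
```

On an ASCII-only terminal this turns the ligature example into `o\N{LATIN SMALL LIGATURE FFI}cial`, and a die face into `\N{DIE FACE-1}`.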
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-27 8:13 ` Kévin Le Gouguec @ 2022-01-27 9:55 ` Eli Zaretskii 2022-01-27 10:29 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-27 9:55 UTC (permalink / raw) To: Kévin Le Gouguec; +Cc: psainty, luangruo, schwab, rms, emacs-devel > From: Kévin Le Gouguec <kevin.legouguec@gmail.com> > Cc: rms@gnu.org, schwab@linux-m68k.org, psainty@orcon.net.nz, > luangruo@yahoo.com, emacs-devel@gnu.org > Date: Thu, 27 Jan 2022 09:13:37 +0100 > > Reading the documentation of the various glyphless-* knobs, I wonder if > it would make sense to provide another group for > glyphless-char-display-control? 'no-font is not helpful on my TTY, IIUC > because terminal-coding-system says "utf-8-unix"?). > > Maybe 'no-display, meaning (null (char-displayable-p CHAR))? Isn't that what glyphless-char-display-control already does on a TTY for no-font? We just need to set up the table for such characters. But I don't think that displaying the hex code is the best alternative for this particular use case, as displaying the ASCII equivalents is much better. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-27 9:55 ` Eli Zaretskii @ 2022-01-27 10:29 ` Eli Zaretskii 2022-01-27 17:36 ` Kévin Le Gouguec 0 siblings, 1 reply; 104+ messages in thread From: Eli Zaretskii @ 2022-01-27 10:29 UTC (permalink / raw) To: kevin.legouguec; +Cc: psainty, luangruo, schwab, rms, emacs-devel > Date: Thu, 27 Jan 2022 11:55:25 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, schwab@linux-m68k.org, > rms@gnu.org, emacs-devel@gnu.org > > > Reading the documentation of the various glyphless-* knobs, I wonder if > > it would make sense to provide another group for > > glyphless-char-display-control? 'no-font is not helpful on my TTY, IIUC > > because terminal-coding-system says "utf-8-unix"?). > > > > Maybe 'no-display, meaning (null (char-displayable-p CHAR))? > > Isn't that what glyphless-char-display-control already does on a TTY > for no-font? We just need to set up the table for such characters. Or maybe we should install the below? diff --git a/src/term.c b/src/term.c index 4c7a90a..ddf0e8e 100644 --- a/src/term.c +++ b/src/term.c @@ -1632,9 +1632,13 @@ produce_glyphs (struct it *it) } else { - Lisp_Object charset_list = FRAME_TERMINAL (it->f)->charset_list; + struct terminal *t = FRAME_TERMINAL (it->f); + Lisp_Object charset_list = t->charset_list, char_glyph; - if (char_charset (it->char_to_display, charset_list, NULL)) + if (char_charset (it->char_to_display, charset_list, NULL) + && (char_glyph = terminal_glyph_code (t, it->char_to_display), + NILP (char_glyph) + || (FIXNUMP (char_glyph) && XFIXNUM (char_glyph) >= 0))) { it->pixel_width = CHARACTER_WIDTH (it->char_to_display); it->nglyphs = it->pixel_width; ^ permalink raw reply related [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-27 10:29 ` Eli Zaretskii @ 2022-01-27 17:36 ` Kévin Le Gouguec 2022-01-27 18:38 ` Eli Zaretskii 0 siblings, 1 reply; 104+ messages in thread From: Kévin Le Gouguec @ 2022-01-27 17:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, schwab, rms, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> Date: Thu, 27 Jan 2022 11:55:25 +0200 >> From: Eli Zaretskii <eliz@gnu.org> >> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, schwab@linux-m68k.org, >> rms@gnu.org, emacs-devel@gnu.org >> >> > Reading the documentation of the various glyphless-* knobs, I wonder if >> > it would make sense to provide another group for >> > glyphless-char-display-control? 'no-font is not helpful on my TTY, IIUC >> > because terminal-coding-system says "utf-8-unix"?). >> > >> > Maybe 'no-display, meaning (null (char-displayable-p CHAR))? >> >> Isn't that what glyphless-char-display-control already does on a TTY >> for no-font? We just need to set up the table for such characters. > > Or maybe we should install the below? Ah, yes, that does improve the situation quite a bit here (if that matters: Linux 5.16.1 on openSUSE Tumbleweed; 5.10.92 on Debian 11): all the characters that we discussed here and showed up as diamonds (ffi, ⚀) now show up as \uHHHH escape sequences. (I just noticed that there seems to be a TTY glyph for fi on Tumbleweed, but not on Debian 11 🤔) >> But I don't think that displaying the hex code is the best alternative >> for this particular use case, as displaying the ASCII equivalents is >> much better. Right, wholehearted agreement. I only mentioned the hex codes because they seemed (1) more informative than the infamous diamonds, (2) less effort to implement than displaying the ASCII equivalent, (3) also applicable to other characters (e.g. symbols & emojis). With your patch, none of these points remain relevant. Thanks! 
^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-27 17:36 ` Kévin Le Gouguec @ 2022-01-27 18:38 ` Eli Zaretskii 0 siblings, 0 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-01-27 18:38 UTC (permalink / raw) To: Kévin Le Gouguec; +Cc: psainty, luangruo, schwab, rms, emacs-devel > From: Kévin Le Gouguec <kevin.legouguec@gmail.com> > Cc: psainty@orcon.net.nz, luangruo@yahoo.com, schwab@linux-m68k.org, > rms@gnu.org, emacs-devel@gnu.org > Date: Thu, 27 Jan 2022 18:36:37 +0100 > > > Or maybe we should install the below? > > Ah, yes, that does improve the situation quite a bit here (if that > matters: Linux 5.16.1 on openSUSE Tumbleweed; 5.10.92 on Debian 11): all > the characters that we discussed here and showed up as diamonds (ffi, ⚀) > now show up as \uHHHH escape sequences. Thanks, installed. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-19 10:05 ` Phil Sainty 2022-01-19 11:43 ` Eli Zaretskii @ 2022-01-20 3:17 ` Richard Stallman 2022-01-20 4:54 ` Phil Sainty ` (2 more replies) 1 sibling, 3 replies; 104+ messages in thread From: Richard Stallman @ 2022-01-20 3:17 UTC (permalink / raw) To: Phil Sainty; +Cc: luangruo, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Explanation to Eli: I understand that these 0-width characters have legitimate, useful purposes. It is good that we support them. The issue I've raised, which was explained in the text I cited, is that _allegedly_ it is possible to use them maliciously, by inserting a sequence of them to function as a sort of watermark that users normally won't even see. > You can highlight them like so: > (set-face-background 'glyphless-char "red") > I've had that configured ever since > https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40 > If you're not expecting zero-width characters in text in general, > I think it's a good setting. I think I will try that, just in case someone sends me some of those. Thanks. Should we make this the default? I think it is likely that most Emacs users will see only malicious zero-width characters, and not useful ones. Is there a way we could detect automatically when these zero-width characters are being used in a legit way for their intended purpose, and in that case, display them as zero-width for real? That way, they would work right when used properly, and ring an alarm (metaphorically) when used in a fishy way. > Emacs by default displays ZWJ and ZWNJ characters (and any other > zero-width characters) as thin 1-pixel spaces on GUI frames, and as > simple spaces on TTY frames. 
So Emacs users are likely to see these > "hidden" sequences of characters on display. I wonder if we could do something clever to show when there is a sequence of multiple different 1-pixel characters? For instance, maybe give different colors to different characters, so that a sequence of several shows as a funny spectrum? This could alert the user that "someone's messing with you here". There are many possible variants of the details -- I don't know what would be best, or what would be easy, but people could try various methods. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
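One way to make the "sequence of multiple different zero-width characters" heuristic concrete is sketched below in Python. It is only an illustration of the detection idea, not a proposal for how Emacs should implement it: the set of characters in `ZERO_WIDTH` and the "two or more distinct characters in one run" threshold are assumptions, chosen so that an isolated ZWNJ or ZWJ used for its intended purpose is not flagged.

```python
import re

# Zero-width characters considered here (an illustrative, incomplete set):
# ZWSP, ZWNJ, ZWJ, WORD JOINER, ZERO WIDTH NO-BREAK SPACE.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"
RUN = re.compile("[" + ZERO_WIDTH + "]{2,}")


def suspicious_runs(text, min_distinct=2):
    """Return (offset, run) for runs mixing >= MIN_DISTINCT zero-width chars."""
    return [
        (m.start(), m.group())
        for m in RUN.finditer(text)
        if len(set(m.group())) >= min_distinct
    ]
```

A caller could then highlight each reported run, for example with a distinct face per character as suggested above.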
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 3:17 ` Richard Stallman @ 2022-01-20 4:54 ` Phil Sainty 2022-01-20 6:39 ` tomas 2022-01-20 7:57 ` Eli Zaretskii 2022-01-20 6:35 ` Tim Cross 2022-01-20 7:48 ` Eli Zaretskii 2 siblings, 2 replies; 104+ messages in thread From: Phil Sainty @ 2022-01-20 4:54 UTC (permalink / raw) To: emacs-devel I've remembered a question on emacs.stackexchange.com about customizing the glyphless char display, and doing so on a per-mode basis; so just in case this is useful to anyone: https://emacs.stackexchange.com/a/65109 (I don't think that's part of a general solution to this question, but it might be interesting to someone.) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 4:54 ` Phil Sainty @ 2022-01-20 6:39 ` tomas 2022-01-20 17:58 ` [External] : " Drew Adams 2022-01-22 4:37 ` Richard Stallman 2022-01-20 7:57 ` Eli Zaretskii 1 sibling, 2 replies; 104+ messages in thread From: tomas @ 2022-01-20 6:39 UTC (permalink / raw) To: emacs-devel On Thu, Jan 20, 2022 at 05:54:24PM +1300, Phil Sainty wrote: > I've remembered a question on emacs.stackexchange.com about > customizing the glyphless char display, and doing so on a > per-mode basis; so just in case this is useful to anyone: > > https://emacs.stackexchange.com/a/65109 > > (I don't think that's part of a general solution to this > question, but it might be interesting to someone.) Last time a similar discussion came around (that time about direction-change Unicode characters in source code used for malicious purposes) white-space-mode was mentioned as a place where to put visualization of "such things". This time it seems even more appropriate. Cheers -- t ^ permalink raw reply [flat|nested] 104+ messages in thread
* RE: [External] : Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 6:39 ` tomas @ 2022-01-20 17:58 ` Drew Adams 2022-01-22 4:37 ` Richard Stallman 1 sibling, 0 replies; 104+ messages in thread From: Drew Adams @ 2022-01-20 17:58 UTC (permalink / raw) To: tomas@tuxteam.de, emacs-devel@gnu.org > > I've remembered a question on emacs.stackexchange.com about > > customizing the glyphless char display, and doing so on a > > per-mode basis; so just in case this is useful to anyone: > > https://emacs.stackexchange.com/a/65109 > > Last time a similar discussion came around (that time about > direction-change Unicode characters in source code used for malicious > purposes) white-space-mode was mentioned as a place where to put > visualization of "such things". > > This time it seems even more appropriate. Yes, `whitespace-mode` can help, but it's somewhat limited wrt highlighting different characters (or sets or ranges of characters) differently. My library `highlight-chars.el` can help more with this kind of thing. code: https://www.emacswiki.org/emacs/download/highlight-chars.el Description: https://www.emacswiki.org/emacs/ShowWhiteSpace#HighlightChars ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 6:39 ` tomas 2022-01-20 17:58 ` [External] : " Drew Adams @ 2022-01-22 4:37 ` Richard Stallman 2022-01-22 5:16 ` Po Lu 1 sibling, 1 reply; 104+ messages in thread From: Richard Stallman @ 2022-01-22 4:37 UTC (permalink / raw) To: tomas; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Last time a similar discussion came around (that time about > direction-change Unicode characters in source code used for malicious > purposes) white-space-mode was mentioned as a place where to put > visualization of "such things". Isn't there a campaign that objects to that term "white space", accusing the term of racism? ;-}. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-22 4:37 ` Richard Stallman @ 2022-01-22 5:16 ` Po Lu 0 siblings, 0 replies; 104+ messages in thread From: Po Lu @ 2022-01-22 5:16 UTC (permalink / raw) To: Richard Stallman; +Cc: tomas, emacs-devel Richard Stallman <rms@gnu.org> writes: > > Last time a similar discussion came around (that time about > > direction-change Unicode characters in source code used for malicious > > purposes) white-space-mode was mentioned as a place where to put > > visualization of "such things". > Isn't there a campaign that objects to that term "white space", > accusing the term of racism? ;-}. If there is, I sincerely hope it will not (as with any other political campaign unrelated to free software) affect our development of Emacs, where people have used the term "whitespace" for decades. The smiley probably means that you're joking. But I'm not sure about that. Thanks. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 4:54 ` Phil Sainty 2022-01-20 6:39 ` tomas @ 2022-01-20 7:57 ` Eli Zaretskii 1 sibling, 0 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-01-20 7:57 UTC (permalink / raw) To: Phil Sainty; +Cc: emacs-devel > Date: Thu, 20 Jan 2022 17:54:24 +1300 > From: Phil Sainty <psainty@orcon.net.nz> > > I've remembered a question on emacs.stackexchange.com about > customizing the glyphless char display, and doing so on a > per-mode basis; so just in case this is useful to anyone: > > https://emacs.stackexchange.com/a/65109 Yes, making it possible for glyphless display to be buffer-local is a useful feature we should have. Patches to that effect are welcome. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 3:17 ` Richard Stallman 2022-01-20 4:54 ` Phil Sainty @ 2022-01-20 6:35 ` Tim Cross 2022-01-20 7:39 ` tomas 2022-01-20 8:20 ` Eli Zaretskii 2022-01-20 7:48 ` Eli Zaretskii 2 siblings, 2 replies; 104+ messages in thread From: Tim Cross @ 2022-01-20 6:35 UTC (permalink / raw) To: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Explanation to Eli: I understand that these 0-width characters have > legitimate, useful purposes. It is good that we support them. > > The issue I've raised, which was explained in the text I cited, is > that _allegedly_ it is possible to use them maliciously, by inserting > a sequence of them to function as a sort of watermark that users > normally won't even see. > > > You can highlight them like so: > > > (set-face-background 'glyphless-char "red") > > > I've had that configured ever since > > https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40 > > > If you're not expecting zero-width characters in text in general, > > I think it's a good setting. > > I think I will try that, just in case someone sends me some of those. > Thanks. > > Should we make this the default? I think it is likely that most Emacs users > will see only malicious zero-width characters, and not useful ones. > > Is there a way we could detect automatically when these zero-width > characters are being used in a legit way for their intended purpose, > and in that case, display them as zero-width for real? > > That way, they would work right when used properly, and ring an alarm > (metaphorically) when used in a fishy way. 
> > > Emacs by default displays ZWJ and ZWNJ characters (and any other > > zero-width characters) as thin 1-pixel spaces on GUI frames, and as > > simple spaces on TTY frames. So Emacs users are likely to see these > > "hidden" sequences of characters on display. > > I wonder if we could do something clever to show when there is a > sequence of multiple different 1-pixel characters? For instance, > maybe give different colors to different characters, so that a > sequence of several shows as a funny spectrum? > > This could alert the user that "someone's messing with you here". > > There are many possible variants of the details -- I don't know what > would be best, or what would be easy, but people could try various > methods. Just to add some context here which some might find useful. At one point, I worked for an organisation which had real concerns about sensitive information being released (mainly to the press) and wanted to be able to track down the source when it occurred. Essentially, this technique was used. All electronic documents, when distributed to the approved list of recipients, had a unique id stamp using zero-width characters. When I left, the organisation was also experimenting with adding similar 'marks' to emails sent via the organisation's email server. So this practice is definitely occurring. It is probably more prevalent in PDF and Word documents, but I guess could be in plain text email messages as well. This technique (and related ones) doesn't need high technical expertise either. We had a similar problem at a University I worked at where students used this technique to defeat the anti-plagiarism software the uni used. The software used basic text matching and students started to defeat it both by using zero-width characters to break patterns and by using Unicode characters with glyphs that looked like standard characters, allowing the document to print and look correct, but also breaking pattern matching. 
Of course, once you are aware this is going on, you can improve the pattern matching and add checks to detect this type of activity. Personally, I was always amazed at the lengths people went to in order to defeat the anti-plagiarism software. It always seemed it would be easier not to plagiarise and to cite when appropriate. It is a big challenge to find a way to alert users to this possible unwanted 'tagging', but at the same time, allow legitimate use. For example, in org-mode, it can sometimes be difficult to combine different markup and other syntax - often it is because of a corner case which is difficult to address with font-locking regexp. Adding a zero-width space is sometimes sufficient to work around the ambiguity in the regexp. Point is, anything which makes such use visually noticeable will also make the technique less useful for addressing this issue. ^ permalink raw reply [flat|nested] 104+ messages in thread
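The "improved pattern matching" mentioned above can be illustrated with a short Python sketch (not Emacs code, and not the software the university used). It assumes a hand-picked set of zero-width characters and a tiny, purely illustrative look-alike table; the names `ZERO_WIDTH`, `LOOKALIKES`, and `normalize_for_matching` are hypothetical, and a real checker would need a far more complete mapping.

```python
# Characters treated as zero-width for this sketch (an assumed, incomplete set):
# ZWSP, ZWNJ, ZWJ, WORD JOINER, and ZERO WIDTH NO-BREAK SPACE (BOM).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# A handful of Cyrillic letters that resemble Latin ones (illustrative only).
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
})


def normalize_for_matching(text: str) -> str:
    """Return TEXT with zero-width characters dropped and look-alikes folded."""
    stripped = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return stripped.translate(LOOKALIKES)
```

Matching would then run on `normalize_for_matching(document)` rather than on the raw text. Note that this throws away legitimate joiners as well, so it only suits comparison, not display or editing.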
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 6:35 ` Tim Cross @ 2022-01-20 7:39 ` tomas 2022-01-20 8:20 ` Eli Zaretskii 1 sibling, 0 replies; 104+ messages in thread From: tomas @ 2022-01-20 7:39 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 782 bytes --] On Thu, Jan 20, 2022 at 05:35:23PM +1100, Tim Cross wrote: [...] > Just to add some context here which some might find useful. > > At one point, I worked for an organisation which had real concerns about > sensitive information being released (mainly to the press) and wanted to > be able to track down the source when it occurred [...] Interesting. Related, but not the same: the famous yellow dots from colour laser printers [1]. This shows that techniques of this kind are already deeply ingrained in "industry practice". Now a good challenge would be to come up with a set of criteria which can discriminate between "good" and "bad" use of normally invisible characters. Cheers [1] https://en.wikipedia.org/wiki/Machine_Identification_Code -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 6:35 ` Tim Cross 2022-01-20 7:39 ` tomas @ 2022-01-20 8:20 ` Eli Zaretskii 1 sibling, 0 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-01-20 8:20 UTC (permalink / raw) To: Tim Cross; +Cc: emacs-devel > From: Tim Cross <theophilusx@gmail.com> > Date: Thu, 20 Jan 2022 17:35:23 +1100 > > It is a big challenge to find out a way to alert users to this possible > unwanted 'tagging', but at the same time, allow legitimate use. For > example, in org-mode, it can sometimes be difficult to combine different > markup and other syntax - often it is because of a corner case which is > difficult to address with font-locking regexp. Adding a zero-width space > is sometimes sufficient to work around the ambiguity in the regexp. A single zero-width character should never be flagged, at least not by default, because its uses are mostly legitimate and important. ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 3:17 ` Richard Stallman 2022-01-20 4:54 ` Phil Sainty 2022-01-20 6:35 ` Tim Cross @ 2022-01-20 7:48 ` Eli Zaretskii 2022-01-20 8:17 ` Lars Ingebrigtsen 2022-01-21 4:14 ` Richard Stallman 2 siblings, 2 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-01-20 7:48 UTC (permalink / raw) To: rms; +Cc: psainty, luangruo, emacs-devel > From: Richard Stallman <rms@gnu.org> > Date: Wed, 19 Jan 2022 22:17:31 -0500 > Cc: luangruo@yahoo.com, emacs-devel@gnu.org > > > If you're not expecting zero-width characters in text in general, > > I think it's a good setting. > > I think I will try that, just in case someone sends me some of those. > Thanks. > > Should we make this the default? I think it is likely that most Emacs users > will see only malicious zero-width characters, and not useful ones. "Most users" is not a good argument when for some users these characters are a must. As I explained, these characters, when used for their intended purpose, are necessary for correct shaping of text, which increasingly includes even plain-ASCII text. So I will object to any simplistic default like that. We should flag suspicious uses of those characters (which means sequences of several of them in a row), not lone characters. The new textsec.el library is developing the capabilities for detecting such suspicious uses, and we should use that as the basis for any defaults. Users who want to flag _any_ use of zero-width characters are free to do so in their own customizations, of course. > Is there a way we could detect automatically when these zero-width > characters are being used in a legit way for their intended purpose, > and in that case, display them as zero-width for real? That is the subject of the new textsec.el package that Lars is working on now. 
> > Emacs by default displays ZWJ and ZWNJ characters (and any other > > zero-width characters) as thin 1-pixel spaces on GUI frames, and as > > simple spaces on TTY frames. So Emacs users are likely to see these > > "hidden" sequences of characters on display. > > I wonder if we could do something clever to show when there is a > sequence of multiple different 1-pixel characters? For instance, > maybe give different colors to different characters, so that a > sequence of several shows as a funny spectrum? textsec.el should provide facilities for that. ^ permalink raw reply [flat|nested] 104+ messages in thread
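The rule of thumb in this message (flag runs of several zero-width characters, leave lone ones alone) can be sketched in a few lines of Python. This is only an illustration of the heuristic and not how textsec.el works; the character set, the default threshold of 3, and the name `find_suspicious_runs` are all assumptions made for the example.

```python
import re

# Zero-width characters considered by this sketch (an assumed, incomplete set):
# ZWSP, ZWNJ, ZWJ, WORD JOINER, ZERO WIDTH NO-BREAK SPACE.
ZERO_WIDTH_CLASS = "[\u200b\u200c\u200d\u2060\ufeff]"


def find_suspicious_runs(text: str, min_run: int = 3):
    """Return (start, length) for each run of at least MIN_RUN zero-width chars.

    Isolated joiners are never reported, since those are the legitimate
    text-shaping uses discussed in the thread.
    """
    pattern = re.compile(ZERO_WIDTH_CLASS + "{%d,}" % min_run)
    return [(m.start(), len(m.group())) for m in pattern.finditer(text)]
```

A caller could highlight the reported spans or merely warn that they exist. The threshold is a judgment call: too low and legitimate text is flagged, too high and a short mark slips through.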
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 7:48 ` Eli Zaretskii @ 2022-01-20 8:17 ` Lars Ingebrigtsen 2022-01-21 4:14 ` Richard Stallman 1 sibling, 0 replies; 104+ messages in thread From: Lars Ingebrigtsen @ 2022-01-20 8:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, rms, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > As I explained, these characters, when used > for their intended purpose, are necessary for correct shaping of text, > which increasingly includes even plain-ASCII text. So I will object > to any simplistic default like that. Yup. But for people that are paranoid about this stuff, there's `glyphless-display-mode' that they can enable in buffers they worry about. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-20 7:48 ` Eli Zaretskii 2022-01-20 8:17 ` Lars Ingebrigtsen @ 2022-01-21 4:14 ` Richard Stallman 1 sibling, 0 replies; 104+ messages in thread From: Richard Stallman @ 2022-01-21 4:14 UTC (permalink / raw) To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Should we make this the default? I think it is likely that most Emacs users > > will see only malicious zero-width characters, and not useful ones. > "Most users" is not a good argument when for some users these > characters are a must. I don't follow the argument. Since some users actually use zero-width characters, that seems to give us two choices (at least): * Leave zero-width characters unflagged by default. * Flag zero-width characters by default, and those users can turn that off. I don't know which is better -- I think it depends partly on what fraction of all users find the zero-width characters useful. > > Is there a way we could detect automatically when these zero-width > > characters are being used in a legit way for their intended purpose, > > and in that case, display them as zero-width for real? > That is the subject of the new textsec.el package that Lars is working > on now. That sounds good. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-19 4:15 Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Richard Stallman 2022-01-19 4:47 ` Po Lu @ 2022-01-19 8:20 ` Eli Zaretskii 2022-01-19 17:36 ` T.V Raman 2 siblings, 0 replies; 104+ messages in thread From: Eli Zaretskii @ 2022-01-19 8:20 UTC (permalink / raw) To: rms; +Cc: emacs-devel > From: Richard Stallman <rms@gnu.org> > Date: Tue, 18 Jan 2022 23:15:59 -0500 > > Unicode allows user tracking by means of invisible text marking. Any > string can be converted into its binary form and then recoded into a > string of zero-width characters, which can then be invisibly inserted > into the text. If the text is posted elsewhere, the zero-width > character string can be extracted and the process reversed to figure > out the identity of the person who copied it. > > which seems to be about a special case of confusables, and it makes me > wonder whether Emacs does, or could, show users when Unicode confusion > occurs, or prevent or fix it somehow. AFAIU, there's no confusion here, "just" injection of hidden information into plain text. "Confusion" is when the user is presented with some text that looks like something else. Here the problematic part is not presented at all. > First, is that issue of invisible characters real? Yes. The idea is to use 2 "normal" characters to serve as binary zero and binary one, which would then allow you to inject hidden text by combinations of these two. Of course, the technique is very inefficient and will need many such characters to inject any meaningful text. > Second, does Emacs do anything now such that these tricks > won't succeed? Emacs by default displays ZWJ and ZWNJ characters (and any other zero-width characters) as thin 1-pixel spaces on GUI frames, and as simple spaces on TTY frames. 
So Emacs users are likely to see these "hidden" sequences of characters on display. > If the problem exists in Emacs now, could we prevent it? I see a few > ways to try. I don't know whether they would work well. > > * Indicate the different encodings on the screen somehow. > > * Canonicalize such sequences (perhaps when reading text into Emacs), > so that different encodings of the same text become identical. > > * Use a stand-alone canonicalizer program. I don't think I understand your proposals. They seem to be based on some idea that these characters are "encodings" of something, and that this encoding can be "canonicalized"? If so, I think this interpretation is a mistake: there's no encoding going on here. These zero-width characters' role is to help the text-shaping engine to shape the characters around them correctly, according to the rules of the script of those surrounding characters. When those zero-width characters are used for the purpose of hiding text, they appear as sequences of zero-width characters without any reason, and in particular the characters that surround them are likely to be whitespace characters, which don't need any joiners to shape them. The job of a feature that detects this is to discern between these two use cases, and flag the suspicious one. In any case, I don't think these solutions could work by examining single characters. ZWJ and ZWNJ are important characters in some scripts, so we cannot mangle them based on considering isolated characters. We must consider sequences of such characters when we design a feature that makes them stand out, because only on that level we can distinguish between legitimate uses of those characters and suspicious uses. I think we should introduce a minor mode that detects those sequences and makes them stand out on display, with or without some warning message in the echo-area. People who want to be aware of any such potentially hidden text will turn that on. 
We could also turn it on automatically in email and eww. Patches are welcome; I believe we already have the infrastructure in the new textsec.el package. ^ permalink raw reply [flat|nested] 104+ messages in thread
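To make the two-character scheme described in this message concrete, here is a minimal Python sketch. It assumes ZWNJ stands for binary 0 and ZWJ for binary 1, and that the mark is placed after the first space; these choices, and the names `encode_mark`, `decode_mark`, and `embed`, are hypothetical and not taken from any real watermarking tool.

```python
ZERO = "\u200c"  # ZERO WIDTH NON-JOINER stands for bit 0 (assumption)
ONE = "\u200d"   # ZERO WIDTH JOINER stands for bit 1 (assumption)


def encode_mark(identifier: str) -> str:
    """Turn IDENTIFIER into a string of zero-width characters, 8 per byte."""
    bits = "".join(f"{byte:08b}" for byte in identifier.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def decode_mark(text: str) -> str:
    """Collect the ZERO/ONE characters found in TEXT and decode them."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def embed(text: str, identifier: str) -> str:
    """Insert the mark after the first space, or at the end if none exists."""
    head, sep, tail = text.partition(" ")
    return head + sep + encode_mark(identifier) + tail
```

Even a three-character identifier costs 24 invisible characters, which is the inefficiency noted above, and it is why such a mark shows up as a long run next to whitespace rather than as an isolated joiner.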
* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? 2022-01-19 4:15 Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Richard Stallman 2022-01-19 4:47 ` Po Lu 2022-01-19 8:20 ` Eli Zaretskii @ 2022-01-19 17:36 ` T.V Raman 2 siblings, 0 replies; 104+ messages in thread From: T.V Raman @ 2022-01-19 17:36 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset=gb18030, Size: 1964 bytes --] Richard Stallman <rms@gnu.org> writes: This is indeed worrisome and has been around for a while. There is an even more insidious form of this hack where Unicode chars that "appear like English letters" can be used -- and a quick visual scan will miss it -- the trick is often used by spammers in domain names within URLs. As an example, there are Cyrillic letters that "look like" Roman letters. > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > There is a thread now about confusables. > > I read this, > > Unicode allows user tracking by means of invisible text marking. Any > string can be converted into its binary form and then recoded into a > string of zero-width characters, which can then be invisibly inserted > into the text. If the text is posted elsewhere, the zero-width > character string can be extracted and the process reversed to figure > out the identity of the person who copied it. > > which seems to be about a special case of confusables, and it makes me > wonder whether Emacs does, or could, show users when Unicode confusion > occurs, or prevent or fix it somehow. > > First, is that issue of invisible characters real? 
> > Second, does Emacs do anything now such that these tricks > won't succeed? > > If the problem exists in Emacs now, could we prevent it? I see a few > ways to try. I don't know whether they would work well. > > * Indicate the different encodings on the screen somehow. > > * Canonicalize such sequences (perhaps when reading text into Emacs), > so that different encodings of the same text become identical. > > * Use a stand-alone canonicalizer program. -- Thanks, --Raman (I Search, I Find, I Misplace, I Research) Id: kg:/m/0285kf1 ^ permalink raw reply [flat|nested] 104+ messages in thread