Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?

unofficial mirror of emacs-devel@gnu.org 

* Can watermarking Unicode text using invisible differences  sneak through Emacs, or can Emacs detect it?
@ 2022-01-19  4:15 Richard Stallman
  2022-01-19  4:47 ` Po Lu
                   ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Richard Stallman @ 2022-01-19  4:15 UTC (permalink / raw)
  To: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

There is a thread now about confusables.

I read this,

   Unicode allows user tracking by means of invisible text marking. Any
   string can be converted into its binary form and then recoded into a
   string of zero-width characters, which can then be invisibly inserted
   into the text. If the text is posted elsewhere, the zero-width
   character string can be extracted and the process reversed to figure
   out the identity of the person who copied it.

which seems to be about a special case of confusables, and it makes me
wonder whether Emacs does, or could, show users when Unicode confusion
occurs, or prevent or fix it somehow.
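
A minimal sketch of the scheme that passage describes, assuming one
particular encoding (ZERO WIDTH SPACE for a 0 bit, ZERO WIDTH NON-JOINER
for a 1 bit).  The function and constant names are hypothetical, and
real watermarking tools may well use other characters:

```elisp
;; Hypothetical helpers for illustration; none of these names exist in Emacs.
(defconst my-wm-zero #x200B "ZERO WIDTH SPACE, used here for a 0 bit.")
(defconst my-wm-one  #x200C "ZERO WIDTH NON-JOINER, used here for a 1 bit.")

(defun my-wm-encode (id)
  "Return string ID recoded as a run of zero-width characters."
  (mapconcat
   (lambda (byte)
     ;; Collect the 8 bits of BYTE, most significant bit first.
     (let (bits)
       (dotimes (i 8)
         (push (if (zerop (logand byte (ash 1 i))) my-wm-zero my-wm-one)
               bits))
       (apply #'string bits)))
   (encode-coding-string id 'utf-8)
   ""))

(defun my-wm-decode (text)
  "Return the string hidden in TEXT, skipping every visible character."
  (let ((byte 0) (count 0) bytes)
    (dolist (ch (append text nil))
      (when (memq ch (list my-wm-zero my-wm-one))
        ;; Shift the next bit into the byte being assembled.
        (setq byte (logior (ash byte 1) (if (eq ch my-wm-one) 1 0))
              count (1+ count))
        (when (= count 8)
          (push byte bytes)
          (setq byte 0 count 0))))
    (decode-coding-string (apply #'unibyte-string (nreverse bytes)) 'utf-8)))
```

Evaluating (my-wm-decode (concat "He" (my-wm-encode "u7") "llo"))
recovers the hidden identifier from a string whose visible letters are
still just "Hello".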

First, is that issue of invisible characters real?

Second, does Emacs do anything now such that these tricks
won't succeed?

If the problem exists in Emacs now, could we prevent it?  I see a few
ways to try.  I don't know whether they would work well.

* Indicate the different encodings on the screen somehow.

* Canonicalize such sequences (perhaps when reading text into Emacs),
so that different encodings of the same text become identical.

* Use a stand-alone canonicalizer program.
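
One way to prototype the first idea without altering the text is to
report format characters rather than delete them, since some of them
are meaningful in certain scripts.  The sketch below assumes that
Unicode general category Cf is a usable approximation of "invisible";
`my-format-chars' is a hypothetical name, not an existing Emacs
command:

```elisp
(defun my-format-chars (text)
  "Return a list of (INDEX . NAME) for each Cf character in TEXT.
INDEX is the zero-based position of the character in TEXT."
  (let ((index 0) found)
    (dolist (ch (append text nil))
      ;; General category Cf covers ZWSP, ZWNJ, ZWJ, bidi marks, etc.
      (when (eq (get-char-code-property ch 'general-category) 'Cf)
        (push (cons index (get-char-code-property ch 'name)) found))
      (setq index (1+ index)))
    (nreverse found)))
```

A canonicalizer could be built on the same test, but deciding which of
the reported characters are safe to drop is the hard part and is not
attempted here.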

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-19  4:15 Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Richard Stallman
@ 2022-01-19  4:47 ` Po Lu
  2022-01-19 10:05   ` Phil Sainty
  2022-01-19  8:20 ` Eli Zaretskii
  2022-01-19 17:36 ` T.V Raman
  2 siblings, 1 reply; 104+ messages in thread
From: Po Lu @ 2022-01-19  4:47 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> If the problem exists in Emacs now, could we prevent it?  I see a few
> ways to try.  I don't know whether they would work well.

I think the "zero width characters" alluded to are displayed by Emacs as
1 pixel wide spaces, so when enough of them to be meaningful for
tracking are inserted into a piece of text, they make for a noticeable
blank area when displayed by Emacs.

Thanks.


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-19  4:47 ` Po Lu
@ 2022-01-19 10:05   ` Phil Sainty
  2022-01-19 11:43     ` Eli Zaretskii
  2022-01-20  3:17     ` Richard Stallman
  0 siblings, 2 replies; 104+ messages in thread
From: Phil Sainty @ 2022-01-19 10:05 UTC (permalink / raw)
  To: Po Lu; +Cc: Richard Stallman, emacs-devel

On 2022-01-19 17:47, Po Lu wrote:
> I think the "zero width characters" alluded to are displayed
> by Emacs as 1 pixel wide spaces

You can highlight them like so:

(set-face-background 'glyphless-char "red")

I've had that configured ever since
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40

If you're not expecting zero-width characters in text in general,
I think it's a good setting.
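
A related knob, sketched here on the assumption that the characters of
interest fall in the `format-control' group of
`glyphless-char-display-control': showing them as hex codes instead of
thin spaces makes each one identifiable, not merely visible.

```elisp
;; Show format-control characters (for example U+200B) as hex codes.
;; customize-set-variable is used so that the option's setter runs;
;; a plain setq would not update the display.
(customize-set-variable
 'glyphless-char-display-control
 (cons '(format-control . hex-code)
       (assq-delete-all 'format-control
                        (copy-alist glyphless-char-display-control))))
```

The other groups in the alist are left as they were.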


-Phil





* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-19 10:05   ` Phil Sainty
@ 2022-01-19 11:43     ` Eli Zaretskii
  2022-01-21  4:13       ` Richard Stallman
  2022-01-20  3:17     ` Richard Stallman
  1 sibling, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-19 11:43 UTC (permalink / raw)
  To: Phil Sainty; +Cc: luangruo, rms, emacs-devel

> Date: Wed, 19 Jan 2022 23:05:51 +1300
> From: Phil Sainty <psainty@orcon.net.nz>
> Cc: Richard Stallman <rms@gnu.org>, emacs-devel@gnu.org
> 
> On 2022-01-19 17:47, Po Lu wrote:
> > I think the "zero width characters" alluded to are displayed
> > by Emacs as 1 pixel wide spaces
> 
> You can highlight them like so:
> 
> (set-face-background 'glyphless-char "red")
> 
> I've had that configured ever since
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40
> 
> If you're not expecting zero-width characters in text in general,
> I think it's a good setting.

Users and readers of certain scripts cannot use such a simplistic
solution, which is basically only suitable for plain ASCII text.  (And
even there it is slowly becoming inappropriate, what with the growing
popularity of ligatures, let alone Emoji.) Emacs should be able to do
better.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-19 11:43     ` Eli Zaretskii
@ 2022-01-21  4:13       ` Richard Stallman
  2022-01-21  7:49         ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-21  4:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Users and readers of certain scripts cannot use such a simplistic
  > solution, which is basically only suitable for plain ASCII text.

I am no expert on this issue, but I do edit languages such as French
and Spanish which use non-ASCII characters.  It seems to work fine.  I
never insert zero-width characters, at least not knowingly.  Would
they be inserted without my knowing?

If not, I think that some non-ASCII text works fine.

  >   (And
  > even there it is slowly becoming inappropriate, what with the growing
  > popularity of ligatures, let alone Emoji.)

Emoji show up on my terminal as diamonds, since it can't display them.
So do the ligatures.  Ideally we could display the ligatures as two
letters.

                                               Emacs should be able to do
  > better.

It would be very nice to do better.  What would we do?

Perhaps we should convert ligatures on file input into digraphs,
and convert digraphs on file output into ligatures when using some
coding system.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-21  4:13       ` Richard Stallman
@ 2022-01-21  7:49         ` Eli Zaretskii
  2022-01-22  4:37           ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-21  7:49 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Thu, 20 Jan 2022 23:13:30 -0500
> 
>   > Users and readers of certain scripts cannot use such a simplistic
>   > solution, which is basically only suitable for plain ASCII text.
> 
> I am no expert on this issue, but I do edit languages such as French
> and Spanish which use non-ASCII characters.  It seems to work fine.  I
> never insert zero-width characters, at least not knowingly.  Would
> they be inserted without my knowing?
> 
> If not, I think that some non-ASCII text works fine.

You are using a very restricted subset of non-ASCII characters, and
only on text-mode terminals, so you may never meet these characters or
see their GUI effects.  But we are talking about the Emacs defaults,
not about what is good enough for your personal usage limited to your
use patterns and display capabilities.

> Emoji show up on my terminal as diamonds, since it can't display them.
> So do the ligatures.  Ideally we could display the ligatures as two
> letters.

The way to tell the display engine (any display engine, not just that
of Emacs) not to ligate is to have the ZWNJ character between the
characters that we don't want ligated.  That's one of the legitimate
uses of that zero-width character.
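
As a small illustration of that legitimate use (the helper name is
hypothetical, and whether a given renderer honors the character depends
on its shaping engine and font):

```elisp
(defun my-unligated (first second)
  "Return FIRST and SECOND joined by ZERO WIDTH NON-JOINER (U+200C).
The result has the same visible letters but asks the renderer not to
form a ligature across the join."
  (concat first (string #x200C) second))
```

A tool that blindly strips every zero-width character would undo this
kind of markup as well.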

>                                                Emacs should be able to do
>   > better.
> 
> It would be very nice to do better.  What would we do?

See textsec.el.

> Perhaps we should convert ligatures on file input into digraphs,
> and convert digraphs on file output into ligatures when using some
> coding system.

People nowadays _do_ want to see ligatures, so disabling them by
default would be a step back.


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-21  7:49         ` Eli Zaretskii
@ 2022-01-22  4:37           ` Richard Stallman
  2022-01-22  6:58             ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-22  4:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > Emoji show up on my terminal as diamonds, since it can't display them.
  > > So do the ligatures.  Ideally we could display the ligatures as two
  > > letters.

  > The way to tell the display engine (any display engine, not just that
  > of Emacs) not to ligate is to have the ZWNJ character between the
  > characters that we don't want ligated.  That's one of the legitimate
  > uses of that zero-width character.

I don't think we are talking about the same thing.  You're talking
about a way of modifying a particular document saying, "Don't display
a ligature right here."

I'm asking about a feature whereby a user can direct Emacs not to use
ligatures in display on a certain terminal.  The idea is, when using a
terminal that can't display ligatures, Emacs should always display
multiple letters instead of a ligature.

  > > Perhaps we should convert ligatures on file input into digraphs,
  > > and convert digraphs on file output into ligatures when using some
  > > coding system.

  > People nowadays _do_ want to see ligatures, so disabling them by
  > default would be a step back.

People may be glad to see ligatures, on terminals that can display
ligatures.  I am talking about terminals which can't display
ligatures.

I doubt any user wants to see a diamond instead of `fi'.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-22  4:37           ` Richard Stallman
@ 2022-01-22  6:58             ` Eli Zaretskii
  2022-01-24  4:33               ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-22  6:58 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Fri, 21 Jan 2022 23:37:59 -0500
> 
> I'm asking about a feature whereby a user can direct Emacs not to use
> ligatures in display on a certain terminal.  The idea is, when using a
> terminal that can't display ligatures, Emacs should always display
> multiple letters instead of a ligature.

We don't have a way of determining whether a terminal can display
ligatures.  Increasingly, terminal emulators acquire these
capabilities, and we have already a couple that display ligatures and
Emoji sequences.  But there are no methods known to us to query the
terminal whether such support exists and/or which ligatures are
supported.

The user can disable auto-composition-mode or customize
composition-function-table to disable some or all of the text-shaping
features.
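
A sketch of both options for an init file.  The second form assumes
that some composition rules keyed on the letter f have been installed
(by default there may be none), so treat the character as a
placeholder:

```elisp
;; Turn automatic composition off in every buffer.
(global-auto-composition-mode -1)

;; Or, more selectively, drop the composition rules keyed on one
;; character (here ?f) and leave the rest of the table alone.
(set-char-table-range composition-function-table ?f nil)
```

The first form also disables shaping for scripts that need it, so the
selective form is usually the gentler choice.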

> I doubt any user wants to see a diamond instead of `fi'.

Is this what really happens for you, on your terminal?


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-22  6:58             ` Eli Zaretskii
@ 2022-01-24  4:33               ` Richard Stallman
  2022-01-24  5:06                 ` Po Lu
  2022-01-24 12:14                 ` Eli Zaretskii
  0 siblings, 2 replies; 104+ messages in thread
From: Richard Stallman @ 2022-01-24  4:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > We don't have a way of determining whether a terminal can display
  > ligatures.

Could we do it via terminfo?  We can define any capabilities we like.

  > > I doubt any user wants to see a diamond instead of `fi'.

  > Is this what really happens for you, on your terminal?

Yes.  A few days ago I put point on a diamond, typed C-u C-x =,
and was told it was a ligature for `fi'.

I didn't save details of what text I was looking at,
but I suspect it was a web page that a script fetched and
emailed to me.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-24  4:33               ` Richard Stallman
@ 2022-01-24  5:06                 ` Po Lu
  2022-01-25  4:17                   ` Richard Stallman
  2022-01-24 12:14                 ` Eli Zaretskii
  1 sibling, 1 reply; 104+ messages in thread
From: Po Lu @ 2022-01-24  5:06 UTC (permalink / raw)
  To: Richard Stallman; +Cc: Eli Zaretskii, psainty, emacs-devel

Richard Stallman <rms@gnu.org> writes:

>   > We don't have a way of determining whether a terminal can display
>   > ligatures.

> Could we do it via terminfo?  We can define any capabilities we like.

At the very least, it would require the cooperation of the terminal
emulators, because these days they typically don't declare what they are
(much less whether or not they support ligatures), instead masquerading
as `xterm'.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-24  5:06                 ` Po Lu
@ 2022-01-25  4:17                   ` Richard Stallman
  2022-01-25  4:58                     ` Po Lu
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-25  4:17 UTC (permalink / raw)
  To: Po Lu; +Cc: psainty, eliz, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > At the very least, it would require the cooperation of the terminal
  > emulators, because these days they typically don't declare what they are
  > (much less whether or not they support ligatures), instead masquerading
  > as `xterm'.

I presume any terminal that calls itself that
is operating on a graphics console and can display ligatures
when using a suitable font.

My text terminal says TERM=linux.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)






* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-25  4:17                   ` Richard Stallman
@ 2022-01-25  4:58                     ` Po Lu
  0 siblings, 0 replies; 104+ messages in thread
From: Po Lu @ 2022-01-25  4:58 UTC (permalink / raw)
  To: Richard Stallman; +Cc: psainty, eliz, emacs-devel

Richard Stallman <rms@gnu.org> writes:

>   > At the very least, it would require the cooperation of the terminal
>   > emulators, because these days they typically don't declare what they are
>   > (much less whether or not they support ligatures), instead masquerading
>   > as `xterm'.

> I presume any terminal that calls itself that is operating on a
> graphics console and can display ligatures when using a suitable font.

Not really.  For example, xterm itself doesn't support ligatures at all,
but VTE (the GNOME terminal emulator, which confusingly also reports
itself as xterm) does.

Thanks.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-24  4:33               ` Richard Stallman
  2022-01-24  5:06                 ` Po Lu
@ 2022-01-24 12:14                 ` Eli Zaretskii
  2022-01-25  4:16                   ` Richard Stallman
                                     ` (2 more replies)
  1 sibling, 3 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-24 12:14 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Sun, 23 Jan 2022 23:33:59 -0500
> 
>   > We don't have a way of determining whether a terminal can display
>   > ligatures.
> 
> Could we do it via terminfo?  We can define any capabilities we like.

I don't think this is feasible.  But before we discuss this, I think
we need to clear some fundamental misunderstanding about this, see
below.

>   > > I doubt any user wants to see a diamond instead of `fi'.
> 
>   > Is this what really happens for you, on your terminal?
> 
> Yes.  A few days ago I put point on a diamond, typed C-u C-x =,
> and was told it was a ligature for `fi'.

How did that ligature get written to the screen?  Was it present
literally in some text that Emacs displayed?  If not, how did it come
into existence, in the form of a diamond?  Emacs doesn't produce such
ligatures on TTY frames.

> I didn't save details of what text I was looking at, but I suspect
> it was a web page that a script fetched and emailed to me.

If that web page included a literal fi ligature, there's little we can
do in Emacs, because we don't produce that character.  Of course, one
can set up a display table where ligatures like fi are displayed as
two characters, but that is a separate issue, very far from what we
were discussing in this thread.  So let's please leave the literal fi
display alone, because it will take us far away from the original
issue.

The original issue is with sequences of characters that are supposed
to be composed on display, because that's where the zero-width
characters play their role.  When several characters are supposed to
be composed on a text-mode display, Emacs simply writes them to the
terminal one after another, and expects the terminal to display them
as a ligature.  The only difference between what Emacs does in this
case and what it does when no character composition is expected is
that in the former case Emacs expects the terminal to produce just one
glyph that takes just one column on display.  Emacs never actually
writes the ligature's code to the TTY, unless that code is literally
present in the text.

So I don't see how querying the terminal about ligature support will
help us in the case we are discussing, nor do I see how that is
relevant.  In any case, ligature support is not just the ability of a
terminal, it also requires certain features from the font used to
display text, and on TTY frames Emacs doesn't know which font is being
used where.  Moreover, there's any number of possible ligatures, and
which ones are supported depends on the font, so a question like "are
ligatures supported" has no meaningful answer unless you also specify
the font and the particular ligature.


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-24 12:14                 ` Eli Zaretskii
@ 2022-01-25  4:16                   ` Richard Stallman
  2022-01-25  6:35                     ` Eli Zaretskii
  2022-01-25  4:16                   ` New feature: displaying ligature characters in the buffer Richard Stallman
  2022-01-25 11:08                   ` Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Kévin Le Gouguec
  2 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-25  4:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > How did that ligature get written to the screen?  Was it present
  > literally in some text that Emacs displayed?

It was in the buffer.  That is why I was able to examine it with C-u C-x =.
I suppose it was present in text that I visited in Emacs.

  >   So let's please leave the literal fi
  > display alone, because it will take us far away from the original
  > issue.

Thank you for clearly describing these two cases.
I did not know that the ligature case people were discussing
was limited to composition of characters -- that it was different
from the case of displaying a ligature character actually in the buffer.
What people said was sketchy and I had to try to fill in what
was not said.

  >   When several characters are supposed to
  > be composed on a text-mode display, Emacs simply writes them to the
  > terminal one after another, and expects the terminal to display them
  > as a ligature.  The only difference between what Emacs does in this
  > case and what it does when no character composition is expected is
  > that in the former case Emacs expects the terminal to produce just one
  > glyph that takes just one column on display.

In that case, I think it would be good to be able to tell Emacs,
when using a text-only terminal, not to try to compose ligatures.
Not to expect that sequence to display as one column.  Is that possible?

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-25  4:16                   ` Richard Stallman
@ 2022-01-25  6:35                     ` Eli Zaretskii
  2022-01-25 12:12                       ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-25  6:35 UTC (permalink / raw)
  To: rms, Richard Stallman; +Cc: psainty, luangruo, emacs-devel

On January 25, 2022 6:16:29 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote:
>
>   >   When several characters are supposed to
>   > be composed on a text-mode display, Emacs simply writes them to the
>   > terminal one after another, and expects the terminal to display them
>   > as a ligature.  The only difference between what Emacs does in this
>   > case and what it does when no character composition is expected is
>   > that in the former case Emacs expects the terminal to produce just one
>   > glyph that takes just one column on display.
> 
> In that case, I think it would be good to be able to tell Emacs,
> when using a text-only terminal, not to try to compose ligatures.
> Not to expect that sequence to display as one column.  Is that possible?
> 

This should already work: auto-composition-mode's value can be a symbol, to allow disabling that mode on incapable terminals.  Maybe you need to customize the value to disable compositions on your console.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-25  6:35                     ` Eli Zaretskii
@ 2022-01-25 12:12                       ` Eli Zaretskii
  0 siblings, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-25 12:12 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel

> Date: Tue, 25 Jan 2022 08:35:42 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> 
> On January 25, 2022 6:16:29 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote:
> >
> >   >   When several characters are supposed to
> >   > be composed on a text-mode display, Emacs simply writes them to the
> >   > terminal one after another, and expects the terminal to display them
> >   > as a ligature.  The only difference between what Emacs does in this
> >   > case and what it does when no character composition is expected is
> >   > that in the former case Emacs expects the terminal to produce just one
> >   > glyph that takes just one column on display.
> > 
> > In that case, I think it would be good to be able to tell Emacs,
> > when using a text-only terminal, not to try to compose ligatures.
> > Not to expect that sequence to display as one column.  Is that possible?
> > 
> 
> This should already work: auto-composition-mode's value can be a symbol, to allow disabling that mode on incapable terminals.

Sorry, not a symbol, but a string -- the name of the terminal type as
returned by tty-type.
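
A sketch of that setting for the Linux text console, assuming the
terminal reports its type as "linux" (some Emacs versions may already
arrange this when they initialize that terminal type):

```elisp
;; Disable automatic composition only on terminals whose type, as
;; returned by `tty-type', is "linux".
(setq-default auto-composition-mode "linux")

;; To find the right string, evaluate this on the terminal in
;; question; it returns nil on graphical frames.
(tty-type)
```

Frames on other terminal types, and graphical frames, keep composing
as before.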




* New feature: displaying ligature characters in the buffer
  2022-01-24 12:14                 ` Eli Zaretskii
  2022-01-25  4:16                   ` Richard Stallman
@ 2022-01-25  4:16                   ` Richard Stallman
  2022-01-25  6:31                     ` Eli Zaretskii
  2022-01-25 11:08                   ` Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Kévin Le Gouguec
  2 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-25  4:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

When the buffer contains a ligature character, it would be a good thing for
Emacs to determine that the terminal doesn't support ligatures, and in
that case to arrange to display those ligature characters using two letters.
This should happen by default.

Other pre-composed characters could likewise be displayed in two columns
using non-composed characters.

Emacs needs to know which compositions the terminal can display.  I
expect it will handle a fairly limited set.  So a new TERMINFO field
could specify which characters work, and Emacs could convert that into
a binary array for quick lookup.  TERM=linux could have a TERMINFO
field to say that ligatures don't work.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: New feature: displaying ligature characters in the buffer
  2022-01-25  4:16                   ` New feature: displaying ligature characters in the buffer Richard Stallman
@ 2022-01-25  6:31                     ` Eli Zaretskii
  2022-01-27  4:12                       ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-25  6:31 UTC (permalink / raw)
  To: rms, Richard Stallman; +Cc: psainty, luangruo, emacs-devel

On January 25, 2022 6:16:33 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote:
> 
> When the buffer contains a ligature character, it would be a good thing for
> Emacs to determine that the terminal doesn't support ligatures, and in
> that case to arrange to display those ligature characters using two letters.
> This should happen by default.
> 
> Other pre-composed characters could likewise be displayed in two columns
> using non-composed characters.
> 
> Emacs needs to know which compositions the terminal can display.  I
> expect it will handle a fairly limited set.  So a new TERMINFO field
> could specify which characters work, and Emacs could convert that into
> a binary array for quick lookup.  TERM=linux could have a TERMINFO
> field to say that ligatures don't work.
> 

This is supposed to be working already, up to a point, see terminal_glyph_code in terminal.c.  I'm guessing that the diamond glyphs you see for some ligatures is the way your terminal "supports" these characters.  Or maybe it lies to Emacs about which characters it supports, or maybe the code which queries the terminal about supported characters doesn't work in your case for some other reason.

I don't think I agree that this must work by default.  That's certainly the desire, but the capabilities of the linux terminal and the way they are reported are a mess, and the use case is quite marginal nowadays.  Ligatures are no different for this purpose from any other non-ASCII character that the console cannot display. We have the latin1-display feature that you can turn on if your console doesn't cope well enough with non-ASCII characters.  And if you want to set up display of ASCII equivalents for just a small set of characters, you can use the latin1-display-char function to do that in your .emacs, in a way that suits the capabilities of your particular type and version of the linux console.
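
A sketch of such a setup for the Latin ligature characters
U+FB00..U+FB06.  Deriving the replacement text from NFKC normalization
is an assumption of this example, not something latin1-disp.el does
itself, and `my-ligature-expansion' is a hypothetical helper:

```elisp
(require 'latin1-disp)
(require 'ucs-normalize)

(defun my-ligature-expansion (ch)
  "Return the plain letters that ligature character CH stands for."
  (ucs-normalize-NFKC-string (string ch)))

;; Display each ligature character as its component letters.  Only the
;; display changes; the buffer text is left untouched.
(dolist (ch (number-sequence #xFB00 #xFB06))
  (latin1-display-char ch (my-ligature-expansion ch)))
```

Because this goes through the standard display table, it affects every
frame, not just those on the console.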


* Re: New feature: displaying ligature characters in the buffer
  2022-01-25  6:31                     ` Eli Zaretskii
@ 2022-01-27  4:12                       ` Richard Stallman
  2022-01-27  7:58                         ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-27  4:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > I'm guessing that the diamond glyphs you see for some ligatures
    > is the way your terminal "supports" these characters.  Or maybe
    > it lies to Emacs about which characters it supports, or maybe
    > the code which queries the terminal about supported characters
    > doesn't work in your case for some other reason.

Those sound possible.
How can I diagnose with GDB what is in fact going on?

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: New feature: displaying ligature characters in the buffer
  2022-01-27  4:12                       ` Richard Stallman
@ 2022-01-27  7:58                         ` Eli Zaretskii
  0 siblings, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-27  7:58 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org
> Date: Wed, 26 Jan 2022 23:12:57 -0500
> 
>     > I'm guessing that the diamond glyphs you see for some ligatures
>     > are the way your terminal "supports" these characters.  Or maybe
>     > it lies to Emacs about which characters it supports, or maybe
>     > the code which queries the terminal about supported characters
>     > doesn't work in your case for some other reason.
> 
> Those sound possible.
> How can I diagnose with GDB what is in fact going on?

The code responsible for that is in terminal.c, functions
terminal_glyph_code and calculate_glyph_code_table.  The latter is
called when we first want to find out whether a certain character can
be displayed by the terminal, which probably happens during startup.
I'd begin by establishing whether the ioctl used by
calculate_glyph_code_table succeeds, and if so, whether the terminal
tells us that the ligature codepoints do have glyphs in the terminal's
font.  The relevant Unicode codepoints are U+FB00..U+FB06.

Another issue could be with the terminal encoding: terminal_glyph_code
only queries the terminal for supported glyphs if
terminal-coding-system is UTF-8 -- is that what you have?

Or maybe the HAVE_STRUCT_UNIPAIR_UNICODE preprocessor condition
doesn't work on your system, in which case these functions return
trivial results.

The Lisp interface to this is internal-char-font (which on TTY frames
calls terminal_glyph_code).  Does it return the same non-negative
number for all of the ligature codepoints in the above range?  If it
does, then it could be an indication that the terminal displays the
same diamond glyph for all of them, i.e. doesn't really support them.
If internal-char-font returns a negative number, it means the terminal
cannot support those ligatures, and our processing of that is somehow
incorrect or assumes something that doesn't happen; see
char-displayable-p.
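
The checks above can be sketched as a form to evaluate on the TTY frame in question.  This is only a diagnostic aid under assumptions: internal-char-font is an internal function, so calling it with a nil position and a character may not behave the same in every Emacs version, and on GUI frames it returns font information rather than a number.

```elisp
;; Evaluate on the TTY frame being diagnosed, e.g. with C-j in *scratch*.
(list
 ;; terminal_glyph_code only queries the terminal when this is UTF-8.
 (terminal-coding-system)
 ;; One (CODEPOINT GLYPH-CODE DISPLAYABLE) entry per ligature.
 (mapcar (lambda (ch)
           (list (format "U+%04X" ch)
                 (internal-char-font nil ch)
                 (char-displayable-p ch)))
         (number-sequence #xFB00 #xFB06)))
```

Identical non-negative glyph codes for all seven code points would point at the "same diamond for everything" case; negative codes would point at the char-displayable-p side.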

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-24 12:14                 ` Eli Zaretskii
  2022-01-25  4:16                   ` Richard Stallman
  2022-01-25  4:16                   ` New feature: displaying ligature characters in the buffer Richard Stallman
@ 2022-01-25 11:08                   ` Kévin Le Gouguec
  2022-01-25 12:38                     ` Eli Zaretskii
  2 siblings, 1 reply; 104+ messages in thread
From: Kévin Le Gouguec @ 2022-01-25 11:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, rms, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>>   > > I doubt any user wants to see a diamond instead of `fi'.
>> 
>>   > Is this what really happens for you, on your terminal?
>> 
>> Yes.  A few days ago I put point on a diamond, typed C-u C-x =,
>> and was told it was a ligature for `fi'.
>
> How did that ligature get written to the screen?  Was it present
> literally in some text that Emacs displayed?  If not, how did it come
> into existence, in the form of a diamond?  Emacs doesn't produce such
> ligatures on TTY frames.

(Apologies for the noise, but I thought it might be worth checking since
even after re-reading Richard's messages I'm not 100% sure everyone is
talking about the same thing.

Richard, would you happen to be talking about literal U+FB01 or U+FB03
characters (ﬁ or ﬃ, respectively, named "LATIN SMALL LIGATURE FI/FFI"),
rather than the kind of ligature Emacs produces when configured to do so
with "fi" and "ffi"?

I don't know much about how ligatures are set up in Emacs, but I too am
surprised that it would attempt to produce them in a terminal frame.
OTOH the aforementioned Unicode characters are indeed displayed as
diamonds on my TTY.

Again, sorry for the noise if we are indeed talking about bona-fide
ligatures)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-25 11:08                   ` Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Kévin Le Gouguec
@ 2022-01-25 12:38                     ` Eli Zaretskii
  2022-01-26  3:39                       ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-25 12:38 UTC (permalink / raw)
  To: Kévin Le Gouguec; +Cc: psainty, luangruo, rms, emacs-devel

> From: Kévin Le Gouguec <kevin.legouguec@gmail.com>
> Cc: rms@gnu.org,  psainty@orcon.net.nz,  luangruo@yahoo.com,
>   emacs-devel@gnu.org
> Date: Tue, 25 Jan 2022 12:08:46 +0100
> 
> Richard, would you happen to be talking about literal U+FB01 or U+FB03
> characters (ﬁ or ﬃ, respectively, named "LATIN SMALL LIGATURE FI/FFI"),
> rather than the kind of ligature Emacs produces when configured to do so
> with "fi" and "ffi"?

As I explained in that message, Emacs doesn't produce any ligatures
when it displays on TTY frames.  It expects the terminal to display
the ligatures when it receives the characters that should ligate.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-25 12:38                     ` Eli Zaretskii
@ 2022-01-26  3:39                       ` Richard Stallman
  2022-01-26  5:38                         ` Eli Zaretskii
  2022-01-26  8:20                         ` Andreas Schwab
  0 siblings, 2 replies; 104+ messages in thread
From: Richard Stallman @ 2022-01-26  3:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > Richard, would you happen to be talking about literal U+FB01 or U+FB03
  > > characters (ﬁ or ﬃ, respectively, named "LATIN SMALL LIGATURE FI/FFI"),
  > > rather than the kind of ligature Emacs produces when configured to do so
  > > with "fi" and "ffi"?

I don't have that message any more, so I can't check.  But the text I
quoted above displays with a diamond, and the output of C-u C-x = on
it matches my rather vague memories of what C-u C-x = displayed then.

I didn't know that there were two different kinds of ligatures in Unicode.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-26  3:39                       ` Richard Stallman
@ 2022-01-26  5:38                         ` Eli Zaretskii
  2022-01-28 13:04                           ` Richard Stallman
  2022-01-26  8:20                         ` Andreas Schwab
  1 sibling, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-26  5:38 UTC (permalink / raw)
  To: rms, Richard Stallman; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

On January 26, 2022 5:39:17 AM GMT+02:00, Richard Stallman <rms@gnu.org> wrote:
> 
> I didn't know that there were two different kinds of ligatures in Unicode.

It is not specific to ligatures.  Some character sequences that are supposed to be composed on display have precomposed variants with their own codepoints.  This is generally for legacy reasons.  The most widely known example is Latin characters with diacritics, such as ç and à.
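
As an illustration of the relation between precomposed characters and the sequences they stand for, here is a short sketch using the ucs-normalize library that ships with Emacs.  The choice of sample characters is mine; the strings are written with \u escapes so that the example does not depend on how this message is encoded.

```elisp
(require 'ucs-normalize)

;; Canonical decomposition of a precomposed letter: c-cedilla becomes
;; "c" followed by U+0327 COMBINING CEDILLA.
(ucs-normalize-NFD-string "\u00e7")

;; The fi ligature has only a compatibility decomposition, so the
;; compatibility form (NFKD) is the one that yields the letters "fi".
(ucs-normalize-NFKD-string "\ufb01")

;; The underlying Unicode property can also be inspected directly.
(get-char-code-property #xFB01 'decomposition)
```

The distinction matters for display fallbacks: a table built only from canonical decompositions would cover letters with diacritics but miss the ligatures.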



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-26  5:38                         ` Eli Zaretskii
@ 2022-01-28 13:04                           ` Richard Stallman
  2022-01-28 13:31                             ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-28 13:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > The most widely known example is Latin characters with diacritics, such as ç and à.

Since my terminal handles many of those characters, they work ok for
me.  But there are some it does not support.  Many Vietnamese
characters, for instance.

If this feature is implemented to handle ligatures, it could handle
the letters with diacritics too.  That would be as easy as populating
the table of the sequences they stand for.  The terminfo item for
`linux' could indicate which characters are ok to display unchanged
and which ones need to be displayed as the equivalent digraphs (or
trigraphs).

All the work is in the basic feature that converts some Unicode codes
into sequences for display.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-28 13:04                           ` Richard Stallman
@ 2022-01-28 13:31                             ` Eli Zaretskii
  2022-01-30  4:17                               ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-28 13:31 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Fri, 28 Jan 2022 08:04:49 -0500
> 
>     > The most widely known example is Latin characters with diacritics, such as ç and à.
> 
> Since my terminal handles many of those characters, they work ok for
> me.  But there are some it does not support.  Many Vietnamese
> characters, for instance.
> 
> If this feature is implemented to handle ligatures, it could handle
> the letters with diacritics too.  That would be as easy as populating
> the table of the sequences they stand for.

IIUC what you mean by "this feature", we already have that in
latin1-disp.el.  It just isn't automatic, because most terminals and
terminal emulators don't have a way of reporting which sequences they
are capable of composing.  So we leave it to the users to determine
whether they need this kind of "ASCII-fied" display.

I asked whether adding a command that specifically targets ligatures
like "fi" would be useful -- can you answer that?

> The terminfo item for `linux' could indicate which characters are ok
> to display unchanged and which ones need to be displayed as the
> equivalent digraphs (or trigraphs).

Given the enormously large number of such sequences, I doubt that
terminfo is the right means for determining which sequences are
supported.  We have a solution for the Linux console, and for the rest
we allow the user to customize the value of auto-composition-mode to
disable it if the terminal misbehaves with these sequences.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-28 13:31                             ` Eli Zaretskii
@ 2022-01-30  4:17                               ` Richard Stallman
  2022-01-30  7:36                                 ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-30  4:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > Since my terminal handles many of those characters, they work ok for
  > > me.  But there are some it does not support.  Many Vietnamese
  > > characters, for instance.
  > > 
  > > If this feature is implemented to handle ligatures, it could handle
  > > the letters with diacritics too.  That would be as easy as populating
  > > the table of the sequences they stand for.

  > IIUC what you mean by "this feature", we already have that in
  > latin1-disp.el.

It is the same general idea, but (according to the comments at the
start) it handles only the characters in the ISO 8859 character sets.
It should handle all the Unicode characters that could sensibly be
represented as characters to be composed, including ligatures and all
Latin and Greek characters with diacritics.  Maybe some others can be
handled too.

I customized the variable to enable that mode but I don't know how to make it actually do
anything.  Maybe it needs something else to truly enable it.

I inserted ẵ (latin small letter a with breve and tilde); it does not
do anything special to that.

  > Given the enormously large number of such sequences, I doubt that
  > terminfo is the right means for determining which sequences are
  > supported.  We have a solution for the Linux console,

We do?  What is it?

  >  and for the rest
  > we allow the user to customize the value of auto-composition-mode to
  > disable it if the terminal misbehaves with these sequences.

We are talking about different issues.  I am talking about how to display
complex characters _in the buffer_.  NOT generated by auto-composition.

I don't think auto-composition does anything in my Emacs.  If I insert
an f and an i in the buffer, they display as two characters, f followed by i.
Not as a ligature.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-30  4:17                               ` Richard Stallman
@ 2022-01-30  7:36                                 ` Eli Zaretskii
  2022-01-31  4:02                                   ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-30  7:36 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Sat, 29 Jan 2022 23:17:21 -0500
> 
>   > IIUC what you mean by "this feature", we already have that in
>   > latin1-disp.el.
> 
> It is the same general idea, but (according to the comments at the
> start) it handles only the characters in the ISO 8859 character sets.

That comment is obsolete; I've updated it now.  There are facilities
in that package that display much more than ISO 8859 characters, see
latin1-display-ucs-per-lynx.

> It should handle all the Unicode characters that could sensibly be
> represented as characters to be composed, including ligatures and all
> Latin and Greek characters with diacritics.  Maybe some others can be
> handled too.

Ligatures are currently not there, and I think it would make sense to
have that as a separate command, as I suggested in another email
(which you still didn't respond to).  I'm waiting for your response
before I decide whether to install such a feature.  The question I
asked was:

  Would it be good enough to have a command that will arrange for these
  ligatures to be displayed as their ASCII equivalents, using the
  facilities in latin1-disp.el?  Such a command could be invoked either
  manually or from your init file.  latin1-disp.el also provides a
  special face to display such equivalents, so you could have them stand
  out on display if you want.

> I customized the variable to enable that mode but I don't know how to make it actually do
> anything.  Maybe it needs something else to truly enable it.

If you customized latin1-display, then it only affects characters that
your terminal doesn't support.  The code dynamically discovers which
characters are those when you activate the feature.  See this fragment
from the setup function:

  (defun latin1-display-setup (set &optional _force)
    "Set up Latin-1 display for characters in the given SET.
  SET must be a member of `latin1-display-sets'.  Normally, check
  whether a font for SET is available and don't set the display if it is."
    (cond
     ((eq set 'latin-2)
      (latin1-display-identities set)
      (mapc
       (lambda (l)
	 (or (char-displayable-p (car l))  <<<<<<<<<<<<<<<<<<<<<<<<<<
	     (apply 'latin1-display-char l)))

> I inserted ẵ (latin small letter a with breve and tilde); it does not
> do anything special to that.

ẵ is not supported by latin1-display, as it is not an ISO 8859
character.  You need to turn on a more thorough feature.  Try this:

  M-x latin1-display-ucs-per-lynx RET

>   > Given the enormously large number of such sequences, I doubt that
>   > terminfo is the right means for determining which sequences are
>   > supported.  We have a solution for the Linux console,
> 
> We do?  What is it?

The same code I pointed to in response to your other message (about
displaying ligatures as diamonds): terminal_glyph_code and its
subroutine calculate_glyph_code_table (in terminal.c).

> I don't think auto-composition does anything in my Emacs.  If I insert
> an f and an i in the buffer, they display as two characters, f followed by i.
> Not as a ligature.

We haven't yet installed composition rules for ASCII ligatures,
because we need first to resolve some basic problems with them (see
etc/TODO for the details).  I could show you how to install such a
composition rule, but I don't think it will do anything on your
console, since it doesn't support ligatures.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-30  7:36                                 ` Eli Zaretskii
@ 2022-01-31  4:02                                   ` Richard Stallman
  2022-01-31 13:05                                     ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-31  4:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Ligatures are currently not there, and I think it would make sense to
  > have that as a separate command, as I suggested in another email
  > (which you still didn't respond to).  I'm waiting for your response
  > before I decide whether to install such a feature.  The question I
  > asked was:

  >   Would it be good enough to have a command that will arrange for these
  >   ligatures to be displayed as their ASCII equivalents, using the
  >   facilities in latin1-disp.el?

I'm not sure, because I don't know what that would be like in
practice.  If I could see it actually handle some characters, I would
probably see how to answer.

  > ẵ is not supported by latin1-display, as it is not an ISO 8859
  > character.  You need to turn on a more thorough feature.  Try this:

  >   M-x latin1-display-ucs-per-lynx RET

I just gave that command, but it doesn't do anything for the ẵ
character.

  >   I could show you how to install such a
  > composition rule, but I don't think it will do anything on your
  > console, since it doesn't support ligatures.

I don't WANT autocomposition on my Linux terminal.
I'm talking about how to display a ligature character
that actually appears in the buffer.  If latin1-display is the way,
it ought to handle ligature characters too.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-31  4:02                                   ` Richard Stallman
@ 2022-01-31 13:05                                     ` Eli Zaretskii
  2022-02-01  5:06                                       ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-31 13:05 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Sun, 30 Jan 2022 23:02:14 -0500
> 
>   >   Would it be good enough to have a command that will arrange for these
>   >   ligatures to be displayed as their ASCII equivalents, using the
>   >   facilities in latin1-disp.el?
> 
> I'm not sure, because I don't know what that would be like in
> practice.  If I could see it actually handle some characters, I would
> probably see how to answer.

Please try the patch below.

>   > ẵ is not supported by latin1-display, as it is not an ISO 8859
>   > character.  You need to turn on a more thorough feature.  Try this:
> 
>   >   M-x latin1-display-ucs-per-lynx RET
> 
> I just gave that command, but it doesn't do anything for the ẵ
> character.

The patch below should fix this as well, I hope.

>   >   I could show you how to install such a
>   > composition rule, but I don't think it will do anything on your
>   > console, since it doesn't support ligatures.
> 
> I don't WANT autocomposition on my Linux terminal.

Well, you keep mentioning auto-composition, so I'm answering your
implied questions about that (the above was in response to you saying
that typing f and i didn't produce any ligatures on your terminal).

Here's the patch I suggest to try:

diff --git a/lisp/international/latin1-disp.el b/lisp/international/latin1-disp.el
index 96a54cc..1f639ed 100644
--- a/lisp/international/latin1-disp.el
+++ b/lisp/international/latin1-disp.el
@@ -764,12 +764,11 @@ latin1-display-ucs-per-lynx
 isn't changed if the display can render Unicode characters."
   (interactive "p")
   (if (> arg 0)
-      (unless (char-displayable-p #x101) ; a with macron
-	;; It doesn't look as though we have a Unicode font.
-	(let ((latin1-display-format "%s"))
-	  (mapc
-	   (lambda (l)
-	     (apply 'latin1-display-char l))
+      (let ((latin1-display-format "%s"))
+	(mapc
+	 (lambda (l)
+           (or (char-displayable-p (car l))
+	       (apply 'latin1-display-char l)))
 	   ;; Table derived by running Lynx on a suitable list of
 	   ;; characters in a utf-8 file, except for some added by
 	   ;; hand at the end.
@@ -3183,7 +3182,7 @@ latin1-display-ucs-per-lynx
 	     (?\､ ",")
 	     ;; Not from Lynx
 	     (?ï»¿ "")
-	     (?ï¿½ "?")))))
+	     (?ï¿½ "?"))))
     (aset standard-display-table
 	  (make-char 'mule-unicode-0100-24ff) nil)
     (aset standard-display-table
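
A sketch of how the patched command could be tried from Lisp rather than with M-x.  It assumes the patched latin1-disp.el is the one that gets loaded and that latin1-display-face is left at its default; the code point U+1EB5 is the "a with breve and tilde" discussed earlier in the thread.

```elisp
(require 'latin1-disp)

;; A positive argument turns the feature on.
(latin1-display-ucs-per-lynx 1)

;; Show what U+1EB5 is now mapped to.  The result is nil if the
;; terminal reported the character as displayable, or if no display
;; table was needed at all.
(and standard-display-table
     (aref standard-display-table #x1EB5))
```

A non-nil result is a vector of the characters that will be shown in place of the original one.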



^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-31 13:05                                     ` Eli Zaretskii
@ 2022-02-01  5:06                                       ` Richard Stallman
  2022-02-01 14:57                                         ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-01  5:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > 
  > > I'm not sure, because I don't know what that would be like in
  > > practice.  If I could see it actually handle some characters, I would
  > > probably see how to answer.

  > Please try the patch below.

With that patch, and using latin1-display-ucs-per-lynx, it does display
that character as text.  It uses the string `a)?' -- that doesn't seem to
make sense to indicate a macron and a tilde, but at least the underlying
mechanism seems to work.

Then I inserted the fi ligature and it displays as two letters, f and i.
So it seems to do the job, for ligatures.

Thanks.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-01  5:06                                       ` Richard Stallman
@ 2022-02-01 14:57                                         ` Eli Zaretskii
  2022-02-02  3:58                                           ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-01 14:57 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Tue, 01 Feb 2022 00:06:21 -0500
> 
>   > Please try the patch below.
> 
> With that patch, and using latin1-display-ucs-per-lynx, it does display
> that character as text.  It uses the string `a)?'

Actually, it shows a(?.

> -- that doesn't seem to make sense to indicate a macron and a tilde,
> but at least the underlying mechanism seems to work.

If you rotate "(?" 90 degrees counter-clockwise, you'll get something
resembling the breve and the tilde above it.

If you can suggest a better "ASCII art" for this, we could use it.
What we have now comes from Lynx, but we aren't wedded to it.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-01 14:57                                         ` Eli Zaretskii
@ 2022-02-02  3:58                                           ` Richard Stallman
  2022-02-02 12:28                                             ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-02  3:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > If you rotate "(?" 90 degrees counter-clockwise, you'll get something
  > resembling the breve and the tilde above it.

I'd never have thought of trying that.

How about using ã¯?  That takes up only two character spaces
and it makes sense without turning your head 90 degrees.
It could test whether the terminal can display ã and macron, and if not,
fall back on some other alternative.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-02  3:58                                           ` Richard Stallman
@ 2022-02-02 12:28                                             ` Eli Zaretskii
  2022-02-03  4:23                                               ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-02 12:28 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Tue, 01 Feb 2022 22:58:49 -0500
> 
>   > If you rotate "(?" 90 degrees counter-clockwise, you'll get something
>   > resembling the breve and the tilde above it.
> 
> I'd never have thought of trying that.
> 
> How about using ã¯?

That doesn't seem to resemble the original.  Moreover, the
feature as implemented only uses ASCII characters in the translations.

> That takes up only two character spaces and it makes sense without
> turning your head 90 degrees.  It could test whether the terminal
> can display ã and macron, and if not, fall back on some other
> alternative.

We could have alternatives like that, but it would have to be a
better-looking alternative, since replacing one imperfect emulation
with another that's not better doesn't sound like an improvement to
me.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-02 12:28                                             ` Eli Zaretskii
@ 2022-02-03  4:23                                               ` Richard Stallman
  2022-02-03  7:53                                                 ` Eli Zaretskii
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-03  4:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > How about using ã¯?

  > That doesn't seem to resemble the original.

Sorry, somehow I misremembered and thought the character was a with
macron and tilde.  Was it actually a with breve and tilde?

Then the natural visual representations would be ă~ (a with breve,
then tilde) and ã˘ (a with tilde, then breve).

  >   Moreover, the
  > feature as implemented only uses ASCII characters in the translations.

Since the linux console does handle many modified letters, a display method
that takes advantage of them will be good on linux consoles.

At least, on my machine it does that.  I don't know how much variation
there is or how much this can be configured.  I can try to find someone
who knows.

  > We could have alternatives like that, but it would have to be a
  > better-looking alternative, since replacing one imperfect emulation
  > with another that's not better doesn't sound like an improvement to
  > me.

It is much better.

I would never have guessed what a(? meant, just from seeing it.
ă~ I could guess -- it's an a, with a breve and a tilde.
Once I know it's a single character, because C-f moves over it,
it would have to be a-with-breve-and-tilde.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03  4:23                                               ` Richard Stallman
@ 2022-02-03  7:53                                                 ` Eli Zaretskii
  2022-02-03  8:16                                                   ` Yuri Khan
  2022-02-04  3:52                                                   ` Richard Stallman
  2022-02-03 20:28                                                 ` Tomas Hlavaty
  2022-02-04  3:52                                                 ` Richard Stallman
  2 siblings, 2 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-03  7:53 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Wed, 02 Feb 2022 23:23:56 -0500
> 
>   > > How about using ã¯?
> 
>   > That doesn't seem to remind anything like the original.
> 
> Sorry, somehow I misremembered and thought the character was a with
> macron and tilde.  Was it actually a with breve and tilde?

Yes.

> Then the natural visual representations would be ă~ (a with breve,
> then tilde) and ã˘ (a with tilde, then breve),

Either one would be fine, but the former is better, I think, since ~
is an ASCII character, and so is universally supported.

>   > We could have alternatives like that, but it would have to be a
>   > better-looking alternative, since replacing one imperfect emulation
>   > with another that's not better doesn't sound like an improvement to
>   > me.
> 
> It is much better.
> 
> I would never have guessed what a(? meant, just from seeing it.
> ă~ I could guess -- it's an a, with a breve and a tilde.
> Once I know it's a single character, because C-f moves over it,
> it would have to be a-with-breve-and-tilde.

Feel free to propose alternatives for such characters, which could be
better represented by an accented character followed by the rest of
accents expressed as ASCII equivalents (there's not a lot of them,
btw).  I don't have access to a Linux console to see which ones could
be supported.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03  7:53                                                 ` Eli Zaretskii
@ 2022-02-03  8:16                                                   ` Yuri Khan
  2022-02-03  9:26                                                     ` Eli Zaretskii
  2022-02-04  3:52                                                   ` Richard Stallman
  1 sibling, 1 reply; 104+ messages in thread
From: Yuri Khan @ 2022-02-03  8:16 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Phil Sainty, Po Lu, Kévin Le Gouguec, Richard Stallman,
	Emacs developers

On Thu, 3 Feb 2022 at 15:09, Eli Zaretskii <eliz@gnu.org> wrote:

> > Sorry, somehow I misremembered and thought the character was a with
> > macron and tilde.  Was it actually a with breve and tilde?
> > Then the natural visual representations would be ă~ (a with breve,
> > then tilde) and ã˘ (a with tilde, then breve),
>
> Either one would be fine, but the former is better, I think, since ~
> is an ASCII character, and so is universally supported.

The ordering of diacritics on the same side of the base character is
considered significant in Unicode, so ă~ and ã˘ would be
representations of different grapheme clusters — “a with breve and
tilde” and “a with tilde and breve”, respectively.

The issue of any characters used not being universally available is
still valid, of course.
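
A minimal sketch of the ordering point, for illustration only.  Python
and its standard `unicodedata` module are an assumption of this sketch;
nobody in the thread used them.  Combining breve (U+0306) and combining
tilde (U+0303) have the same canonical combining class, so
normalization does not reorder them and the two sequences stay
distinct:

```python
import unicodedata

BREVE = "\u0306"   # COMBINING BREVE
TILDE = "\u0303"   # COMBINING TILDE

breve_then_tilde = "a" + BREVE + TILDE
tilde_then_breve = "a" + TILDE + BREVE

# Both marks attach above the base letter and share one combining
# class, so canonical ordering keeps them in the order given.
same_class = unicodedata.combining(BREVE) == unicodedata.combining(TILDE)


def canonically_equal(s, t):
    """Compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


print(same_class)
print(canonically_equal(breve_then_tilde, tilde_then_breve))
```

The second value printed is False: the sequences are not canonically
equivalent, which is why the order of the marks matters.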




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03  8:16                                                   ` Yuri Khan
@ 2022-02-03  9:26                                                     ` Eli Zaretskii
  0 siblings, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-03  9:26 UTC (permalink / raw)
  To: Yuri Khan; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Thu, 3 Feb 2022 15:16:27 +0700
> Cc: Richard Stallman <rms@gnu.org>, Phil Sainty <psainty@orcon.net.nz>, Po Lu <luangruo@yahoo.com>, 
> 	Emacs developers <emacs-devel@gnu.org>, Kévin Le Gouguec <kevin.legouguec@gmail.com>
> 
> On Thu, 3 Feb 2022 at 15:09, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > > Sorry, somehow I misremembered and thought the character was a with
> > > macron and tilde.  Was it actually a with breve and tilde?
> > > Then the natural visual representations would be ă~ (a with breve,
> > > then tilde) and ã˘ (a with tilde, then breve),
> >
> > Either one would be fine, but the former is better, I think, since ~
> > is an ASCII character, and so is universally supported.
> 
> The ordering of diacritics on the same side of the base character is
> considered significant in Unicode, so ă~ and ã˘ would be
> representations of different grapheme clusters — “a with breve and
> tilde” and “a with tilde and breve”, respectively.

I know, but I think for an emulation it could be okay to ignore this
subtlety.  The real character is always available in "C-u C-x =".


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03  7:53                                                 ` Eli Zaretskii
  2022-02-03  8:16                                                   ` Yuri Khan
@ 2022-02-04  3:52                                                   ` Richard Stallman
  2022-02-04  4:56                                                     ` Yuri Khan
  2022-02-04  8:10                                                     ` Eli Zaretskii
  1 sibling, 2 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-04  3:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > Then the natural visual representations would be ă~ (a with breve,
  > > then tilde) and ã˘ (a with tilde, then breve),

  > Either one would be fine, but the former is better, I think, since ~
  > is an ASCII character, and so is universally supported.

That's a good point, but Yuri Khan <yuri.v.khan@gmail.com> said:

  > The ordering of diacritics on the same side of the base character is
  > considered significant in Unicode, so ă~ and ã˘ would be
  > representations of different grapheme clusters — “a with breve and
  > tilde” and “a with tilde and breve”, respectively.

Are there really two different Unicode characters like that?

C-x 8 RET recognizes
LATIN SMALL LETTER A WITH BREVE AND TILDE
but it does not recognize
LATIN SMALL LETTER A WITH TILDE AND BREVE

Is the failure to handle the latter a bug?

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)






* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-04  3:52                                                   ` Richard Stallman
@ 2022-02-04  4:56                                                     ` Yuri Khan
  2022-02-06  4:13                                                       ` Richard Stallman
  2022-02-04  8:10                                                     ` Eli Zaretskii
  1 sibling, 1 reply; 104+ messages in thread
From: Yuri Khan @ 2022-02-04  4:56 UTC (permalink / raw)
  To: Richard Stallman
  Cc: Phil Sainty, Po Lu, Eli Zaretskii, Emacs developers,
	Kévin Le Gouguec

On Fri, 4 Feb 2022 at 10:53, Richard Stallman <rms@gnu.org> wrote:

> C-x 8 RET recognizes
> LATIN SMALL LETTER A WITH BREVE AND TILDE
> but it does not recognize
> LATIN SMALL LETTER A WITH TILDE AND BREVE
>
> Is the failure to handle the latter a bug?

Not necessarily. Unicode does not assign single-character codes to all
possible letter and diacritic combinations. Instead, it has various
combining diacritics that apply to the nearest preceding non-combining
character. So a with tilde and breve could be encoded as a sequence
<Latin small letter A> <combining tilde> <combining breve> or <Latin
small letter A with tilde> <combining breve>.
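
The difference between a precomposed character and a combining
sequence can be checked with a short sketch.  Python's standard
`unicodedata` module is used here as an assumption for illustration;
it is not part of the thread.  The sketch looks up both names and
builds the two encodings described above:

```python
import unicodedata

# The precomposed character exists as a single code point.
precomposed = unicodedata.lookup("LATIN SMALL LETTER A WITH BREVE AND TILDE")

# No precomposed character is assigned for the opposite order of marks.
try:
    unicodedata.lookup("LATIN SMALL LETTER A WITH TILDE AND BREVE")
    reversed_name_exists = True
except KeyError:
    reversed_name_exists = False

# The two encodings of "a with tilde and breve" described above.
fully_decomposed = "a\u0303\u0306"   # a, combining tilde, combining breve
partly_composed = "\u00e3\u0306"     # a-with-tilde, combining breve


def nfc(s):
    """Return the NFC (composed) normalization of s."""
    return unicodedata.normalize("NFC", s)


print(f"U+{ord(precomposed):04X}", reversed_name_exists)
print(nfc(fully_decomposed) == partly_composed)
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", precomposed)])
```

Under these assumptions, "a with tilde and breve" never composes to one
code point, while "a with breve and tilde" composes to U+1EB5.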


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-04  4:56                                                     ` Yuri Khan
@ 2022-02-06  4:13                                                       ` Richard Stallman
  0 siblings, 0 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-06  4:13 UTC (permalink / raw)
  To: Yuri Khan; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > C-x 8 RET recognizes
  > > LATIN SMALL LETTER A WITH BREVE AND TILDE
  > > but it does not recognize
  > > LATIN SMALL LETTER A WITH TILDE AND BREVE
  > >
  > > Is the failure to handle the latter a bug?

  > Not necessarily. Unicode does not assign single-character codes to all
  > possible letter and diacritic combinations. Instead, it has various
  > combining diacritics that apply to the nearest preceding non-combining
  > character.

I suspect we are miscommunicating.

I am not talking about compositions using combining diacritics.  I'm
talking about individual Unicode code points that can appear, as such,
in a file.  Such as the code point #x1eb5, which is the character
`ẵ'.

I think Emacs already handles composition, and perhaps does that well
enough.  (I can't tell, because my Linux console does not display
compositions.)

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-04  3:52                                                   ` Richard Stallman
  2022-02-04  4:56                                                     ` Yuri Khan
@ 2022-02-04  8:10                                                     ` Eli Zaretskii
  2022-02-06  4:13                                                       ` Richard Stallman
  1 sibling, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-04  8:10 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Thu, 03 Feb 2022 22:52:16 -0500
> 
>   > > Then the natural visual representations would be ă~ (a with breve,
>   > > then tilde) and ã˘ (a with tilde, then breve),
> 
>   > Either one would be fine, but the former is better, I think, since ~
>   > is an ASCII character, and so is universally supported.
> 
> That's a good point, but Yuri Khan <yuri.v.khan@gmail.com> said:
> 
>   > The ordering of diacritics on the same side of the base character is
>   > considered significant in Unicode, so ă~ and ã˘ would be
>   > representations of different grapheme clusters — “a with breve and
>   > tilde” and “a with tilde and breve”, respectively.

That's a tangent, not directly relevant to the issue at hand.  If/when
we bump into different characters that differ only by the order of the
diacriticals, we will resolve each such case one by one.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-04  8:10                                                     ` Eli Zaretskii
@ 2022-02-06  4:13                                                       ` Richard Stallman
  0 siblings, 0 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-06  4:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > >   > The ordering of diacritics on the same side of the base character is
  > >   > considered significant in Unicode, so ă~ and ã˘ would be
  > >   > representations of different grapheme clusters — “a with breve and
  > >   > tilde” and “a with tilde and breve”, respectively.

  > That's a tangent, not directly relevant to the issue at hand.  If/when
  > we bump into different characters that differ only by the order of the
  > diacriticals, we will resolve each such case one by one.

Ok with me.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03  4:23                                               ` Richard Stallman
  2022-02-03  7:53                                                 ` Eli Zaretskii
@ 2022-02-03 20:28                                                 ` Tomas Hlavaty
  2022-02-04  7:07                                                   ` Eli Zaretskii
  2022-02-05  4:20                                                   ` Richard Stallman
  2022-02-04  3:52                                                 ` Richard Stallman
  2 siblings, 2 replies; 104+ messages in thread
From: Tomas Hlavaty @ 2022-02-03 20:28 UTC (permalink / raw)
  To: rms, Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

On Wed 02 Feb 2022 at 23:23, Richard Stallman <rms@gnu.org> wrote:
>   > > How about using ã¯?

I see two boxes.

> Then the natural visual representations would be ă~ (a with breve,
> then tilde)

I see one box and tilde.

> and ã˘ (a with tilde, then breve),

I see two boxes.

This is on linux console.
I guess it depends on the console font.
My console font is ter-i24n.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03 20:28                                                 ` Tomas Hlavaty
@ 2022-02-04  7:07                                                   ` Eli Zaretskii
  2022-02-05  4:20                                                   ` Richard Stallman
  1 sibling, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-04  7:07 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel

> From: Tomas Hlavaty <tom@logand.com>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Thu, 03 Feb 2022 21:28:17 +0100
> 
> On Wed 02 Feb 2022 at 23:23, Richard Stallman <rms@gnu.org> wrote:
> >   > > How about using ã¯?
> 
> I see two boxes.
> 
> > Then the natural visual representations would be ă~ (a with breve,
> > then tilde)
> 
> I see one box and tilde.

The idea, as explained up-thread, is to use only those non-ASCII
characters that are supported by the terminal, with a run-time test
for each one of them.  So if your terminal doesn't support those, they
will not be used; latin1-display-ucs-per-lynx will instead use the
purely-ASCII emulation.
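
The fallback idea can be sketched roughly as follows.  This is not the
code of latin1-display-ucs-per-lynx; it is a hypothetical Python model,
and the names `pick_display` and `can_display` are invented for the
illustration.  The predicate stands in for whatever per-character
run-time test the terminal allows:

```python
def pick_display(candidates, ascii_fallback, can_display):
    """Return the first candidate whose characters are all displayable.

    candidates     -- preferred replacement strings, best first
    ascii_fallback -- pure-ASCII emulation used when nothing else fits
    can_display    -- predicate taking one character (a stand-in for a
                      per-character run-time test on the terminal)
    """
    for text in candidates:
        if all(can_display(ch) for ch in text):
            return text
    return ascii_fallback


# Hypothetical terminal whose font covers ASCII plus a-with-breve only.
supported = set(map(chr, range(0x20, 0x7F))) | {"\u0103"}
can_display = supported.__contains__

# Candidates for U+1EB5: a-breve followed by ASCII tilde, then
# a-tilde followed by the spacing breve U+02D8.
choice = pick_display(["\u0103~", "\u00e3\u02d8"], "a(?", can_display)
print(choice)
```

With a predicate that accepts only ASCII, the same call returns the
ASCII emulation `a(?`.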




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03 20:28                                                 ` Tomas Hlavaty
  2022-02-04  7:07                                                   ` Eli Zaretskii
@ 2022-02-05  4:20                                                   ` Richard Stallman
  2022-02-05 13:55                                                     ` Tomas Hlavaty
  1 sibling, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-05  4:20 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > I guess it depends on the console font.

I guess it does.

  > My console font is ter-i24n.

How can I tell which font my console uses?

The Lisp function `char-displayable-p' seems to report correctly which
characters my console can display.  Is it correct for yours too?


-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)






* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-05  4:20                                                   ` Richard Stallman
@ 2022-02-05 13:55                                                     ` Tomas Hlavaty
  2022-02-05 14:06                                                       ` Eli Zaretskii
  2022-02-06  4:16                                                       ` Richard Stallman
  0 siblings, 2 replies; 104+ messages in thread
From: Tomas Hlavaty @ 2022-02-05 13:55 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel

On Fri 04 Feb 2022 at 23:20, Richard Stallman <rms@gnu.org> wrote:
> How can I tell which font my console uses?

On Debian-based systems:

dpkg-reconfigure console-setup
ls /usr/share/consolefonts/
setfont /usr/share/consolefonts/Lat2-Terminus24x12.psf.gz

sed -i 's/FONTFACE="Fixed"/FONTFACE="Terminus"/' /etc/default/console-setup
sed -i 's/FONTSIZE="8x16"/FONTSIZE="28x14"/' /etc/default/console-setup
setupcon

On NixOS:

i18n.consoleFont = "Lat2-Terminus16";

setfont Lat2-Terminus24x12

> The Lisp function `char-displayable-p' seems to report correctly which
> characters my console can display.  Is it correct for yours too?

Here are the cases which I see as boxes:

(char-displayable-p ?ã¯)
=> (invalid-read-syntax "?")
(char-displayable-p ?ã)
=> unicode
(char-displayable-p ?¯)
=> unicode

(char-displayable-p ?ă)
=> unicode

(char-displayable-p ?ã˘)
=> (invalid-read-syntax "?")
(char-displayable-p ?ã)
=> unicode
(char-displayable-p ?˘)
=> unicode

char-displayable-p returns non-nil even though I see boxes.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-05 13:55                                                     ` Tomas Hlavaty
@ 2022-02-05 14:06                                                       ` Eli Zaretskii
  2022-02-05 14:12                                                         ` Eli Zaretskii
                                                                           ` (2 more replies)
  2022-02-06  4:16                                                       ` Richard Stallman
  1 sibling, 3 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-05 14:06 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel

> From: Tomas Hlavaty <tom@logand.com>
> Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com,
> 	emacs-devel@gnu.org, kevin.legouguec@gmail.com
> Date: Sat, 05 Feb 2022 14:55:19 +0100
> 
> char-displayable-p returns non-nil even though I see boxes.

Is this the Linux console or is this a terminal emulator?

If it's a console, does it support the ioctl issued by
calculate_glyph_code_table?

Also, these calls are incorrect:

 (char-displayable-p ?ã¯)
 (char-displayable-p ?ã˘)

char-displayable-p accepts a single character, not 2.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-05 14:06                                                       ` Eli Zaretskii
@ 2022-02-05 14:12                                                         ` Eli Zaretskii
  2022-02-06  1:29                                                           ` Tomas Hlavaty
  2022-02-06  1:10                                                         ` Tomas Hlavaty
  2022-02-06  4:16                                                         ` Richard Stallman
  2 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-05 14:12 UTC (permalink / raw)
  To: tom; +Cc: psainty, luangruo, emacs-devel, rms, kevin.legouguec

> Date: Sat, 05 Feb 2022 16:06:42 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, kevin.legouguec@gmail.com,
>  rms@gnu.org, emacs-devel@gnu.org
> 
> > char-displayable-p returns non-nil even though I see boxes.
> 
> Is this the Linux console or is this a terminal emulator?
> 
> If it's a console, does it support the ioctl issued by
> calculate_glyph_code_table?

I guess the answer is NO, because if it did, you'd see t as the return
value, not 'unicode'.

So then it isn't surprising that you get false positives when your
terminal-coding-system is UTF-8: that coding-system can encode any
character, and Emacs has no way of knowing which of them actually have
glyphs in the console font if the console doesn't support the
GIO_UNIMAP ioctl we use to find out which glyphs are actually
available.
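
A toy model of that reasoning, in Python.  It is only an assumption
about the shape of the logic, not Emacs's implementation; `displayable`
and `glyph_table` are hypothetical names.  When no glyph information is
available, the only thing left to check is whether the terminal's
encoding can encode the character, and UTF-8 can encode everything:

```python
def displayable(ch, encoding, glyph_table=None):
    """Toy model of a terminal displayability check.

    glyph_table -- set of characters the font is known to cover, or
                   None when the console cannot be queried
    Returns True when the font is known to have the glyph, the encoding
    name when only encodability could be checked, and None otherwise.
    """
    if glyph_table is not None:
        return True if ch in glyph_table else None
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return None
    return encoding


font = {"a", "~"}  # hypothetical font coverage

print(displayable("\u0103", "utf-8"))         # no font info, encodability only
print(displayable("\u0103", "utf-8", font))   # font info available
print(displayable("\u0103", "latin-1"))       # narrower encoding
```

In this model the first call is a false positive of the kind reported
above: it returns a non-nil-like value although the font has no glyph.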




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-05 14:12                                                         ` Eli Zaretskii
@ 2022-02-06  1:29                                                           ` Tomas Hlavaty
  2022-02-06  8:30                                                             ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Tomas Hlavaty @ 2022-02-06  1:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, rms, kevin.legouguec

On Sat 05 Feb 2022 at 16:12, Eli Zaretskii <eliz@gnu.org> wrote:
>> If it's a console, does it support the ioctl issued by
>> calculate_glyph_code_table?
>
> I guess the answer is NO, because if it did, you'd see t as the return
> value, not 'unicode'.

ok, but is it rare or surprising, that my linux console does not support
the discussed ioctl?  I did not do anything with the linux console, I am
simply using it the way it is, except choosing the font.

I found it surprising that Richard did not see the boxes I saw and that
he actually could read it.

I also found it surprising that you mentioned a run-time test to
detect supported characters and choose replacements accordingly.
That would be great if feasible.

What is the recommended setting for the Linux console,
in order to see fewer boxes and more readable characters?

> So then it isn't surprising that you get false positives when your
> terminal-coding-system is UTF-8:

There is no terminal-coding-system variable in my Emacs 27.2.
How can it be UTF-8?

M-x occur on ~/.emacs shows me these utf-8 settings:
current-language-environment "UTF-8"
prefer-coding-system 'utf-8

> that coding-system can encode any
> character, and Emacs has no way of knowing which of them actually have
> glyphs in the console font if the console doesn't support the
> GIO_UNIMAP ioctl we use to find out which glyphs are actually
> available.

Maybe the ioctl is not a sufficient source of information; Emacs would
need to query the current console font and then read the available glyphs
from the font?


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-06  1:29                                                           ` Tomas Hlavaty
@ 2022-02-06  8:30                                                             ` Eli Zaretskii
  2022-02-06 10:38                                                               ` Tomas Hlavaty
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-06  8:30 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: psainty, luangruo, emacs-devel, rms, kevin.legouguec

> From: Tomas Hlavaty <tom@logand.com>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, rms@gnu.org,
> 	emacs-devel@gnu.org
> Date: Sun, 06 Feb 2022 02:29:18 +0100
> 
> On Sat 05 Feb 2022 at 16:12, Eli Zaretskii <eliz@gnu.org> wrote:
> >> If it's a console, does it support the ioctl issued by
> >> calculate_glyph_code_table?
> >
> > I guess the answer is NO, because if it did, you'd see t as the return
> > value, not 'unicode'.
> 
> ok, but is it rare or surprising, that my linux console does not support
> the discussed ioctl?

I don't know, I'm not an expert on that.  I thought the Linux console
always supports that.  Perhaps someone else could chime in.  Failing
that, how about asking about this on the forum dedicated to your
GNU/Linux distribution, or the specific console you are using?

> I also found surprising you saying something about runtime test to
> detect supported characters and choosing replacements accordingly.
> That would be great if feasible.

That test is part of char-displayable-p.

> > So then it isn't surprising that you get false positives when your
> > terminal-coding-system is UTF-8:
> 
> There is no terminal-coding-system variable in my Emacs 27.2.

It's a function, not a variable.

> How can it be UTF-8?

Most probably because your locale specifies UTF-8 as the codeset.

> M-x occur on ~/.emacs shows me these utf-8 settings:
> current-language-environment "UTF-8"
> prefer-coding-system 'utf-8

These are evidence that your locale indeed specifies UTF-8.

> > that coding-system can encode any
> > character, and Emacs has no way of knowing which of them actually have
> > glyphs in the console font if the console doesn't support the
> > GIO_UNIMAP ioctl we use to find out which glyphs are actually
> > available.
> 
> Maybe the ioctl is not a sufficient source of information; Emacs would
> need to query the current console font and then read the available glyphs
> from the font?

We don't know of any way of querying the console except via that
ioctl.  If someone can suggest a more reliable or more widely
supported method, let them please speak up.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-06  8:30                                                             ` Eli Zaretskii
@ 2022-02-06 10:38                                                               ` Tomas Hlavaty
  2022-02-06 10:44                                                                 ` Eli Zaretskii
  2022-02-06 10:54                                                                 ` Andreas Schwab
  0 siblings, 2 replies; 104+ messages in thread
From: Tomas Hlavaty @ 2022-02-06 10:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: luangruo, emacs-devel, rms, kevin.legouguec

On Sun 06 Feb 2022 at 10:30, Eli Zaretskii <eliz@gnu.org> wrote:
>> There is no terminal-coding-system variable in my Emacs 27.2.
>
> It's a function, not a variable.

Thanks, I missed that:

(terminal-coding-system)
=> utf-8-unix

>> > So then it isn't surprising that you get false positives when your
>> > terminal-coding-system is UTF-8:

How can I change terminal-coding-system say to latin2 (to match my
console font)?




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-06 10:38                                                               ` Tomas Hlavaty
@ 2022-02-06 10:44                                                                 ` Eli Zaretskii
  2022-02-06 10:54                                                                 ` Andreas Schwab
  1 sibling, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-06 10:44 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: luangruo, emacs-devel, rms, kevin.legouguec

> From: Tomas Hlavaty <tom@logand.com>
> Cc: luangruo@yahoo.com, kevin.legouguec@gmail.com, rms@gnu.org,
> 	emacs-devel@gnu.org
> Date: Sun, 06 Feb 2022 11:38:36 +0100
> 
> >> > So then it isn't surprising that you get false positives when your
> >> > terminal-coding-system is UTF-8:
> 
> How can I change terminal-coding-system say to latin2 (to match my
> console font)?

Use set-terminal-coding-system.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-06 10:38                                                               ` Tomas Hlavaty
  2022-02-06 10:44                                                                 ` Eli Zaretskii
@ 2022-02-06 10:54                                                                 ` Andreas Schwab
  1 sibling, 0 replies; 104+ messages in thread
From: Andreas Schwab @ 2022-02-06 10:54 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: luangruo, Eli Zaretskii, kevin.legouguec, rms, emacs-devel

On Feb 06 2022, Tomas Hlavaty wrote:

> How can I change terminal-coding-system say to latin2 (to match my
> console font)?

The terminal coding system has nothing to do with how the text is
displayed, it's how the terminal interprets the bytes it receives.  On
modern systems, this is always UTF-8.
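
To illustrate the distinction (a sketch in Python, which is an
assumption of this example and not something from the thread): a coding
system maps bytes to characters, and says nothing about whether a font
has a glyph for the resulting character.

```python
# One byte, two meanings: the coding system decides which character a
# byte sequence stands for, not whether a font can draw that character.
byte = b"\xe3"
as_latin1 = byte.decode("iso-8859-1")   # a with tilde
as_latin2 = byte.decode("iso-8859-2")   # a with breve

# Under UTF-8 the same two characters are sent as different byte pairs.
utf8_tilde = "\u00e3".encode("utf-8")
utf8_breve = "\u0103".encode("utf-8")

print(as_latin1, as_latin2, utf8_tilde, utf8_breve)
```

So switching the coding system to latin2 would change how bytes are
interpreted, but by itself it would not add glyphs to the console font.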

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-05 14:06                                                       ` Eli Zaretskii
  2022-02-05 14:12                                                         ` Eli Zaretskii
@ 2022-02-06  1:10                                                         ` Tomas Hlavaty
  2022-02-06  4:16                                                         ` Richard Stallman
  2 siblings, 0 replies; 104+ messages in thread
From: Tomas Hlavaty @ 2022-02-06  1:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, rms, emacs-devel

On Sat 05 Feb 2022 at 16:06, Eli Zaretskii <eliz@gnu.org> wrote:
>> char-displayable-p returns non-nil even though I see boxes.
>
> Is this the Linux console or is this a terminal emulator?

Linux console.

> If it's a console, does it support the ioctl issued by
> calculate_glyph_code_table?

How can I tell?

> Also, these calls are incorrect:
>
>  (char-displayable-p ?ã¯)
>  (char-displayable-p ?ã˘)
>
> char-displayable-p accepts a single character, not 2.

I see boxes, not characters.  I wasn't sure if 1 box ~ 1 char, always.
That's why I sent also results with the two boxes separated, one test
for each part.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-05 14:06                                                       ` Eli Zaretskii
  2022-02-05 14:12                                                         ` Eli Zaretskii
  2022-02-06  1:10                                                         ` Tomas Hlavaty
@ 2022-02-06  4:16                                                         ` Richard Stallman
  2 siblings, 0 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-06  4:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, tom, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Also, these calls are incorrect:

  >  (char-displayable-p ?ã¯)
  >  (char-displayable-p ?ã˘)

I put a single Unicode character in the buffer to send the email.
I guess it got munged in transmission.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)






* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-05 13:55                                                     ` Tomas Hlavaty
  2022-02-05 14:06                                                       ` Eli Zaretskii
@ 2022-02-06  4:16                                                       ` Richard Stallman
  2022-02-06 11:29                                                         ` Tomas Hlavaty
  1 sibling, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-06  4:16 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > setfont /usr/share/consolefonts/Lat2-Terminus24x12.psf.gz

Would that alter the settings?  I don't want to do that!
I want to find out how it IS set, not change it. 

Here's what I have in /etc/default/console-setup

    # FONT='lat9w-08.psf.gz brl-8x8.psf'
    # FONT_MAP=/usr/share/consoletrans/lat9u.uni

Can you explain what they mean?


-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)






* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-06  4:16                                                       ` Richard Stallman
@ 2022-02-06 11:29                                                         ` Tomas Hlavaty
  0 siblings, 0 replies; 104+ messages in thread
From: Tomas Hlavaty @ 2022-02-06 11:29 UTC (permalink / raw)
  To: rms; +Cc: luangruo, eliz, kevin.legouguec, emacs-devel

On Sat 05 Feb 2022 at 23:16, Richard Stallman <rms@gnu.org> wrote:
>   > setfont /usr/share/consolefonts/Lat2-Terminus24x12.psf.gz
>
> Would that alter the settings?

Yes, but only per console and only temporarily (it does not persist
across reboots).

In case you want to persist it, the sed command I provided might do that
on your distro.

> I don't want to do that!
> I want to find out how it IS set, not change it.

I am not aware of a "getfont" thing.

It seems to depend on things like the distro.  That is why I provided
hints on where to find the configuration and also a distro-independent
way of setting it (so that one knows what the font is).  If you do not
want to change it, you have to look into your distro-specific
configuration.

ls /usr/share/consolefonts/

might give you a list of fonts to choose from (it does not on my
distro).

grep FONT /etc/default/console-setup

might give you info about the default font configuration (it does not on
my distro)

> Here's what I have in /etc/default/console-setup
>
>     # FONT='lat9w-08.psf.gz brl-8x8.psf'
>     # FONT_MAP=/usr/share/consoletrans/lat9u.uni
>
> Can you explain what they mean?

I am not an expert on this, just a user, so I'll try to guess:

- There are two fonts used.
  I did not know it was possible and I do not know how this works.
- There is a file describing the supported characters.
  Maybe this map tells the console which characters to take from which
  font?
  iirc the console supports up to 256 characters.
- latin9 seems to be the character set in the first font.
- The fonts are very small.  You still have very good eyes:-)

I have one font only and no map file.  My font seems to be for latin2.
Interestingly, my native language should be covered by latin2 but still,
some accented characters are displayed properly but some as boxes.


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-03  4:23                                               ` Richard Stallman
  2022-02-03  7:53                                                 ` Eli Zaretskii
  2022-02-03 20:28                                                 ` Tomas Hlavaty
@ 2022-02-04  3:52                                                 ` Richard Stallman
  2022-02-04  8:03                                                   ` Eli Zaretskii
  2 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-04  3:52 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

It would be useful to be able to analyze and construct complex
characters -- for instance, to operate on a-with-breve-and-tilde
and find out that it represents an a with two diacritics.

So I propose a function, `diacriticize'.  Its arguments are
characters, and if they can be graphically combined to make a single
character, that's what diacriticize returns.  Otherwise, it returns
nil.

  (diacriticize ?a ?~ ?˘) => ?ẵ
  (diacriticize ?a ?Z) => nil

It could have an inverse function, criticanalyze, which given the
character code for a character that is (in spirit) a composition,
would return the characters it consists of:

(criticanalyze ?ẵ) => (?a ?~ ?˘)

With these functions, latin1-display could figure out automatically
which conversions to make.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-04  3:52                                                 ` Richard Stallman
@ 2022-02-04  8:03                                                   ` Eli Zaretskii
  2022-02-06  4:13                                                     ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-04  8:03 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com,
> 	emacs-devel@gnu.org, kevin.legouguec@gmail.com
> Date: Thu, 03 Feb 2022 22:52:07 -0500
> 
> It would be useful to be able to analyze and construct complex
> characters -- for instance, to operate on a-with-breve-and-tilde
> and find out that it represents an a with two diacritics.

This already exists, see below.  But you seem to have something
different in mind:

> So I propose a function, `diacriticize'.  Its arguments are
> characters, and if they can be graphically combined to make a single
> character, that's what diacriticize returns.  Otherwise, it returns
> nil.
> 
>   (diacriticize ?a ?~ ?˘) => ?ẵ
>   (diacriticize ?a ?Z) => nil
> 
> It could have an inverse function, criticanalyze, which given the
> character code for a character that is (in spirit) a composition,
> would return the characters it consists of:
> 
> (criticanalyze ?ẵ) => (?a ?~ ?˘)
> 
> With these functions, latin1-display could figure out automatically
> which conversions to make.

I don't understand the specification of these functions.  How would
diacriticize decide/know that ?~ is equivalent to the ?̃ (U+0303
COMBINING TILDE) that is part of ?ã ?  We do have infrastructure in
place to decompose characters like ã into the base character ?a and
the combining diacritic(s): the call (ucs-normalize-NFD-string "ã")
returns a string of 2 characters, ?a and ?̃.  But how do you propose
to make the leap from ?̃ to ?~ ?
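
To make the gap concrete, here is a small sketch, assuming the
`ucs-normalize' library that ships with Emacs (the variable name is
hypothetical, and the character is written as an escape so that it
survives transmission).  The decomposition yields U+0303, whose name
is COMBINING TILDE, and that is a different character from the ASCII
tilde:

```elisp
(require 'ucs-normalize)

;; Decompose U+00E3 (a with tilde) into its canonical components.
(defvar my/decomposed (append (ucs-normalize-NFD-string "\u00e3") nil)
  "Hypothetical variable holding the code points of the decomposition.")

my/decomposed                                        ; => (97 771)
(get-char-code-property (nth 1 my/decomposed) 'name) ; => "COMBINING TILDE"
(get-char-code-property ?~ 'name)                    ; => "TILDE"
(= ?~ (nth 1 my/decomposed))                         ; => nil
```

Nothing in the decomposition data itself links 771 (U+0303) to 126
(the ASCII tilde), so any such link has to come from somewhere else.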




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-04  8:03                                                   ` Eli Zaretskii
@ 2022-02-06  4:13                                                     ` Richard Stallman
  2022-02-06  8:56                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-06  4:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > I don't understand the specification of these functions.  How would
  > diacriticize decide/know that ?~ is equivalent to the ?̃ (U+0303
  > COMBINING TILDE) that is part of ?ã ?

You know more about Unicode than I do, so I'm sure it is true _in some
sense_ that "U+0303 (COMBINING TILDE) is part of ?ã".

But I have doubts that that particular sense is the one that is
pertinent to the job `diacriticize' is meant to do.

I think you mean that one can represent the glyph image `ã' in Unicode
as a composition using a sequence of `a' and COMBINING TILDE.  Please
tell me if I am mistaken.

The ã in this sentence is not a composition.  It is a single
Unicode character, which is also in Latin-1.  I don't think that
COMBINING TILDE is "part of it".

COMBINING TILDE can be used to create its glyph image by composition,
but as to what is graphically part of that glyph image, I think
that is ordinary `~'.

    the call (ucs-normalize-NFD-string "ã")
    returns a string of 2 characters, ?a and ?̃.

Interesting.  I think it would be easy to implement `diacriticize' with that.

                                                 But how do you propose
    to make the leap from ?̃ to ?~ ?

(defconst unicode-combining-chars-alist '(... (?~ . ?̃ ) ...))

... (car (rassq combining-char unicode-combining-chars-alist)) ...

Indeed, I think this does the job for `criticanalyze'.

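(defun criticanalyze (char)
  ;; Decompose CHAR, then map each combining character back to its
  ;; spacing counterpart when the alist has one.
  (let ((composition (ucs-normalize-NFD-string (char-to-string char))))
    (mapcar (lambda (c) (or (car (rassq c unicode-combining-chars-alist)) c))
            composition)))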

There is probably an equally simple way to handle `diacriticize'.
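
One possible sketch of the composing direction, under the assumption
that the arguments after the base are already combining characters
(such as U+0303) rather than spacing ones like `~'.  It relies on
`ucs-normalize-NFC-string' from the `ucs-normalize' library; the name
`my/diacriticize' is hypothetical:

```elisp
(require 'ucs-normalize)

(defun my/diacriticize (base &rest combining)
  "Return the single character composed from BASE and COMBINING, or nil.
COMBINING are combining characters such as #x303 (COMBINING TILDE)."
  (let ((composed (ucs-normalize-NFC-string
                   (apply #'string base combining))))
    ;; Success means canonical composition collapsed everything
    ;; into exactly one character.
    (and (= (length composed) 1)
         (aref composed 0))))

(my/diacriticize ?a #x303)        ; => 227  (U+00E3, a with tilde)
(my/diacriticize ?a #x306 #x303)  ; => 7861 (U+1EB5, a with breve and tilde)
(my/diacriticize ?a ?Z)           ; => nil
```

Accepting spacing characters such as `~' would additionally need the
alist lookup shown above to translate them first.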

I proposed those two functions because I thought we had no way
for Lisp programs to get info about this.  Since we already have one,
maybe we don't need those two functions.  Popping back to the question
of `latin1-disp.el', it could use the `ucs-...' functions directly
to figure out what substitutions to make.

However, `ucs-normalize-NFD-string' does not know anything about
ligatures.  Given the fi ligature, it returns the fi ligature.  So it
can't be the sole method for `latin1-display' to find useful
substitutions.  We would have to tell it the list of ligatures.

It already uses `char-displayable-p' to determine at run time which
characters could use display substitutions.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-06  4:13                                                     ` Richard Stallman
@ 2022-02-06  8:56                                                       ` Eli Zaretskii
  2022-02-07  5:11                                                         ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-06  8:56 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Sat, 05 Feb 2022 23:13:37 -0500
> 
>   > I don't understand the specification of these functions.  How would
>   > diacriticize decide/know that ?~ is equivalent to the ?̃ (U+0303
>   > COMBINING TILDE) that is part of ?ã ?
> 
> You know more about Unicode than I do, so I'm sure it is true _in some
> sense_ that "U+0303 (COMBINING TILDE) is part of ?ã".
> 
> But I have doubts that that particular sense is the one that is
> pertinent to the job `diacriticize' is meant to do.
> 
> I think you mean that one can represent the glyph image `ã' in Unicode
> as a composition using a sequence of `a' and COMBINING TILDE.  Please
> tell me if I am mistaken.

You are not mistaken.  The character 'ã' can be "decomposed" into 2
characters, 'a' and COMBINING TILDE.  This is called "canonical
decomposition" in Unicode.

> The ã in this sentence is not a composition.  It is a single
> Unicode character, which is also in Latin-1.  I don't think that
> COMBINING TILDE is "part of it".

It is, in the sense that the original character can be decomposed.

>                                                  But how do you propose
>     to make the leap from ?̃ to ?~ ?
> 
> 
> 
> (defconst unicode-combining-chars-alist '(... (?~ . ?̃ ) ...))

So you mean we should create a database of ASCII characters that
approximate the combining diacriticals?  But if so, how is it better
than having a database of complete characters and their ASCII
equivalents, like we have now in latin1-disp.el?  Your proposal may
make the database smaller (and even that mostly only for Latin
characters), but a database of complete characters makes it easier to
make sure the results are optimal, because you see the original
complete character and the complete equivalent, instead of "composing"
them in your head for all the combinations.

I think reasonable appearance is more important than memory
consumption in this case, and other than that, your proposal just
means replacing one database by another, right?

> However, `ucs-normalize-NFD-string' does not know anything about
> ligatures.  Given the fi ligature, it returns the fi ligature.

You need a different kind of decomposition for that, called
"compatibility decomposition":

  (ucs-normalize-NFKD-string "ﬁ") => "fi"

You can use ucs-normalize-NFKD-string for the job of
ucs-normalize-NFD-string as well:

  (append (ucs-normalize-NFKD-string "ã") nil) => (97 771)

(I used 'append' here to make it evident that the result of the
decomposition is 2 characters, not one, since the Emacs display will
by default combine them into the same glyph as the original non-ASCII
character, and an innocent reader could think the decomposition didn't
work.)
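
The difference between the two decompositions can be sketched side by
side as follows.  The helper `my/decomposition-chars' is a
hypothetical name; the characters are written as escapes so the
example does not depend on how this message is displayed:

```elisp
(require 'ucs-normalize)

(defun my/decomposition-chars (string &optional compat)
  "Return the characters of STRING after decomposition, as a list.
Use compatibility decomposition (NFKD) when COMPAT is non-nil,
canonical decomposition (NFD) otherwise."
  (append (if compat
              (ucs-normalize-NFKD-string string)
            (ucs-normalize-NFD-string string))
          nil))

(my/decomposition-chars "\ufb01")    ; => (64257)   ligature kept by NFD
(my/decomposition-chars "\ufb01" t)  ; => (102 105) "fi" under NFKD
(my/decomposition-chars "\u00e3" t)  ; => (97 771)  a + COMBINING TILDE
```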


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-06  8:56                                                       ` Eli Zaretskii
@ 2022-02-07  5:11                                                         ` Richard Stallman
  2022-02-07 13:16                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-07  5:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > So you mean we should create a database of ASCII characters that
  > approximate the combining diacriticals?  But if so, how is it better
  > than having a database of complete characters and their ASCII
  > equivalents, like we have now in latin1-disp.el?

I think there are only around 20 diacritics.  There must be hundreds
of letters-with-diacritics.  The method I've proposed can handle
everything automatically, given a table about the 20-odd diacritics.
That's a great simplification from a table of hundreds of elements,
set up by hand.

  >  but a database of complete characters makes it easier to
  > make sure the results are optimal, because you see the original
  > complete character and the complete equivalent,

I don't follow you here.  In particular, what does "complete
equivalent" mean?  Concretely how would a result be "less than
optimal"?  Can you illustrate with an example?

  > I think reasonable appearance is more important than memory
  > consumption in this case,

What makes an appearance more or less reasonable when we're talking
about replacing one character with two or three that express
_symbolically_ which character it is?  I don't get it.

  > You can use ucs-normalize-NFKD-string for the job of
  > ucs-normalize-NFD-string as well:

  >   (append (ucs-normalize-NFKD-string "ã") nil) => (97 771)

Great!  That does most of the job, I think.

  > (I used 'append' here to make it evident that the result of the
  > decomposition is 2 characters, not one, since the Emacs display will
  > by default combine them into the same glyph as the original non-ASCII
  > character,

Not on a Linux console, I think.  When I have f and i in the buffer,
Emacs does not convert them into a ligature.  The only time it has to
try to deal with a ligature is when there is a Unicode ligature
code point in the buffer.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-07  5:11                                                         ` Richard Stallman
@ 2022-02-07 13:16                                                           ` Eli Zaretskii
  2022-02-08  3:55                                                             ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-07 13:16 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Mon, 07 Feb 2022 00:11:28 -0500
> 
>   > So you mean we should create a database of ASCII characters that
>   > approximate the combining diacriticals?  But if so, how is it better
>   > than having a database of complete characters and their ASCII
>   > equivalents, like we have now in latin1-disp.el?
> 
> I think there are only around 20 diacritics.

You are thinking of some subset, I think.  The real number is more
like 80, and that's even if we only take the diacritics relevant to
Latin characters, and disregard the Cyrillic, Greek, and others.

> There must be hundreds of letters-with-diacritics.  The method I've
> proposed can handle everything automatically, given a table about
> the 20-odd diacritics.  That's a great simplification from a table
> of hundreds of elements, set up by hand.

Setting by hand was already done, and we have it in latin1-disp.el, so
it isn't as if we need to weigh two jobs against each other.

>   >  but a database of complete characters makes it easier to
>   > make sure the results are optimal, because you see the original
>   > complete character and the complete equivalent,
> 
> I don't follow you here.  In particular, what does "complete
> equivalent" mean?

For example, "o?'" instead of "o" + "?" + "'" (to emulate ?\ṍ).  With
the former, you see the entire string that will be shown; with the
latter, you need to imagine it (and all the other combinations that
use one or both of these diacritics).

Also, characters that have two diacritics are just part of the
problem.  What would you do with the likes of ?\ǿ (which we currently
represent as "o/'")?  Its base character, ø, doesn't have a
decomposition in Unicode.

IOW, your proposal solves only some (small) part of the problem at
best, whereas having complete strings in the database is needed anyway
for the rest.
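
A short sketch of that limitation, assuming the `ucs-normalize'
library (characters are given as escapes): the accented letter loses
only its acute accent, and the stroked o that remains is not split
further by either kind of decomposition.

```elisp
(require 'ucs-normalize)

;; U+01FF (o with stroke and acute) splits into U+00F8 and U+0301 ...
(append (ucs-normalize-NFD-string "\u01ff") nil)   ; => (248 769)
;; ... but U+00F8 (o with stroke) comes back unchanged.
(append (ucs-normalize-NFD-string "\u00f8") nil)   ; => (248)
;; Compatibility decomposition does not split it either.
(append (ucs-normalize-NFKD-string "\u00f8") nil)  ; => (248)
```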

>   > I think reasonable appearance is more important than memory
>   > consumption in this case,
> 
> What makes an appearance more or less reasonable when we're talking
> about replacing one character with two or three that express
> _symbolically_ which character it is?  I don't get it.

The appearance should (a) make sense, and (b) be consistent: for
example, U+030C COMBINING CARON should always be represented by the
same ASCII equivalent.  I don't see how you could fulfill these two
conditions without reviewing all the relevant combinations and
iteratively fixing whatever needs fixing.

>   > (I used 'append' here to make it evident that the result of the
>   > decomposition is 2 characters, not one, since the Emacs display will
>   > by default combine them into the same glyph as the original non-ASCII
>   > character,
> 
> Not on a Linux console, I think.  When I have f and i in the buffer,
> Emacs does not convert them into a ligature.  The only time it has to
> try to deal with a ligature is when there is a Unicode ligature
> code point in the buffer.

Once again, on a TTY frame Emacs does NOT produce the ligatures nor
combine base characters with the diacritics, it expects the terminal
to do that.  I've written the above remark because you are not the
only one who reads this discussion, and most other people do use GUI
displays, where the characters would (potentially confusingly) combine
on display.


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-07 13:16                                                           ` Eli Zaretskii
@ 2022-02-08  3:55                                                             ` Richard Stallman
  2022-02-08 12:20                                                               ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-08  3:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > I think there are only around 20 diacritics.

  > You are thinking of some subset, I think.  The real number is more
  > like 80,

I am amazed.  Where can I see a list that shows more of them?

  >   That's a great simplification from a table
  > > of hundreds of elements, set up by hand.

  > Setting by hand was already done, and we have it in latin1-disp.el so

Do you mean, the table that presents a-with-breve-and-tilde as `a)?'?
I don't think that works well.

  > > I don't follow you here.  In particular, what does "complete
  > > equivalent" mean?

  > For example, "o?'" instead of "o" + "?" + "'" (to emulate ?\ṍ).

I don't understand the difference between "o?'" and "o" + "?" + "'".
They look like two ways of describing the same sequence of three characters.
Though ? would never make me think of tilde unless you told me.

                                                                     With
  > the former, you see the entire string that will be shown; with the
  > latter, you need to imagine it

I can't follow that, since you're talking about two things that look
identical to me.

  >   What would you do with the likes of ?\ǿ (which we currently
  > represent as "o/'")?  Its base character, ø, doesn't have a
  > decomposition in Unicode.

For my terminal, I'd like it to send ø literally since my terminal
can display that.  `ø'' would be a good way to display it.
But on a terminal that can't display ø, `o/'' would be a good choice.

  > > Not on a Linux console, I think.  When I have f and i in the buffer,
  > > Emacs does not convert them into a ligature.  The only time it has to
  > > try to deal with a ligature is when there is a Unicode ligature
  > > code point in the buffer.

  > Once again, on a TTY frame Emacs does NOT produce the ligatures nor
  > combine base characters with the diacritics.

You have told me this several times, and I believe you.  But how does
it relate to the case I am talking about?  I don't see a relationship.

I was looking at a buffer containing a ligature character.  It must
have come from a message or file that I looked at in that buffer.  I
suppose Emacs did not _produce_ it, but it was in my buffer and I had
to use C-u C-x = to see what it was.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-08  3:55                                                             ` Richard Stallman
@ 2022-02-08 12:20                                                               ` Eli Zaretskii
  2022-02-09  4:06                                                                 ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-08 12:20 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Mon, 07 Feb 2022 22:55:54 -0500
> 
>   > > I think there are only around 20 diacritics.
> 
>   > You are thinking of some subset, I think.  The real number is more
>   > like 80,
> 
> I am amazed.  Where can I see a list that shows more of them?

Type "C-x 8 RET COMBINING", press TAB, then filter out of the
candidates those which pertain to Cyrillic, Greek, and other specific
scripts, leaving just Latin and those which don't belong to specific
scripts.

>   >   That's a great simplification from a table
>   > > of hundreds of elements, set up by hand.
> 
>   > Setting by hand was already done, and we have it in latin1-disp.el so
> 
> Do you mean, the table that presents a-with-breve-and-tilde as `a)?'?
> I don't think that works well.

I think it works as well as it could, but in any case, seeing all the
combinations explicitly is needed to provide reasonable results.

>   > > I don't follow you here.  In particular, what does "complete
>   > > equivalent" mean?
> 
>   > For example, "o?'" instead of "o" + "?" + "'" (to emulate ?\ṍ).
> 
> I don't understand the difference between "o?'" and "o" + "?" + "'".

Your proposal is to have separate rules to produce the equivalent of
each diacritic, so you will never see "o?'" as a whole, only its
components separately.  I denoted the whole string by "o?'" and the
separately produced components by "o" + "?" + "'".

>   >   What would you do with the likes of ?\ǿ (which we currently
>   > represent as "o/'")?  Its base character, ø, doesn't have a
>   > decomposition in Unicode.
> 
> For my terminal, I'd like it to send ø literally since my terminal
> can display that.  `ø'' would be a good way to display it.
> But on a terminal that can't display ø, `o/'' would be a good choice.

My point is that there isn't a mechanical way of producing "o/" from
ø, because Unicode decompositions don't support that.

>   > > Not on a Linux console, I think.  When I have f and i in the buffer,
>   > > Emacs does not convert them into a ligature.  The only time it has to
>   > > try to deal with a ligature is when there is a Unicode ligature
>   > > code point in the buffer.
> 
>   > Once again, on a TTY frame Emacs does NOT produce the ligatures nor
>   > combine base characters with the diacritics.
> 
> You have told me this several times, and I believe you.  But how does
> it relate to the case I am talking about?  I don't see a relationship.

As I said, that remark was for other people, those who will read my
email on GUI displays.




* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-08 12:20                                                               ` Eli Zaretskii
@ 2022-02-09  4:06                                                                 ` Richard Stallman
  2022-02-09 13:50                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-09  4:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Type "C-x 8 RET COMBINING", press TAB, then filter out of the
  > candidates those which pertain to Cyrillic, Greek, and other specific
  > scripts, leaving just Latin and those which don't belong to specific
  > scripts.

Amazing!

I see that it is possible to compute automatically the alist I proposed,
from these names.  A program can look at each of these
COMBINING names, delete `COMBINING ', and look up the rest as a
character name.  No need to make the correspondence alist by hand.

However, some of those COMBINING forms have no non-COMBINING counterpart.
For instance, there is COMBINING ZIGZAG ABOVE, but no ZIGZAG ABOVE.
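
The name-based idea could be sketched roughly as follows.  This
assumes Emacs 26 or later, where `ucs-names' returns a hash table and
`char-from-name' exists; the function name is hypothetical, and this
is only an illustration, not a proposed implementation:

```elisp
(defun my/combining-spacing-alist ()
  "Return an alist of (SPACING . COMBINING) character pairs.
Each pair is found by stripping the \"COMBINING \" prefix from a
character name and looking up what remains as a character name."
  (let ((prefix "COMBINING ")
        result)
    (maphash
     (lambda (name char)
       (when (string-prefix-p prefix name)
         (let ((spacing (char-from-name
                         (substring name (length prefix)))))
           ;; Skip names such as COMBINING ZIGZAG ABOVE that have
           ;; no counterpart without the prefix.
           (when spacing
             (push (cons spacing char) result)))))
     (ucs-names))
    result))

(car (rassq #x303 (my/combining-spacing-alist)))  ; => 126, i.e. ?~
```

A purely name-based match also pairs things that are not diacritics
in the usual sense, for instance COMBINING LATIN SMALL LETTER A with
plain `a', so the result would still need some review.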

How do you represent an uncombined zigzag-above in Unicode?
Put it after SPACE as a combination?

  > Your proposal is to have separate rules to produce the equivalent of
  > each diacritic, so you will never see "o?'", only its components
  > separately.

Yes.  On a terminal that can't display that letter, I'd like Emacs to
display it as a trigraph of one letter and two diacritics.

They should be non-combining diacritics, so that display won't try
to combine them.

  > My point is that there isn't a mechanical way of producing "o/" from
  > ø, because Unicode decompositions don't support that.

It wouldn't be very hard to add a list of extra decompositions that
are not known to Unicode itself.
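
A sketch of what such an extra list could look like, in Python for
illustration only. The table name `EXTRA_DECOMPOSITIONS`, the helper
`decompose`, and the two entries are hypothetical; the only fact taken
from Unicode is that LATIN SMALL LETTER O WITH STROKE has no
decomposition of its own.

```python
import unicodedata

# Hypothetical hand-maintained additions; the entries are examples only.
EXTRA_DECOMPOSITIONS = {
    "\u00f8": "o/",   # LATIN SMALL LETTER O WITH STROKE
    "\u00d8": "O/",   # LATIN CAPITAL LETTER O WITH STROKE
}

def decompose(ch):
    """Return the canonical decomposition of CH if Unicode defines one,
    otherwise consult the extra table, otherwise return CH unchanged."""
    nfd = unicodedata.normalize("NFD", ch)
    if nfd != ch:
        return nfd
    return EXTRA_DECOMPOSITIONS.get(ch, ch)
```

Characters that Unicode does decompose, such as LATIN SMALL LETTER E WITH
ACUTE, never reach the extra table.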

  > > You have told me this several times, and I believe you.  But how does
  > > it relate to the case I am talking about?  I don't see a relationship.

  > As I said, that remark was for other people, those who will read my
  > email on GUI displays.

It is good to know that we don't have a misunderstanding about that point.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-09  4:06                                                                 ` Richard Stallman
@ 2022-02-09 13:50                                                                   ` Eli Zaretskii
  2022-02-10  3:57                                                                     ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-09 13:50 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Tue, 08 Feb 2022 23:06:26 -0500
> 
> However, some of those COMBINING forms have no non-COMBINING counterpart.
> For instance, there is COMBINING ZIGZAG ABOVE, but no ZIGZAG ABOVE.
> 
> How do you represent an uncombined zigzag-above in Unicode?
> Put it after SPACE as a combination?

If I understand correctly what you want, you should use U+25CC DOTTED
CIRCLE before the combining character, not SPACE.

>   > My point is that there isn't a mechanical way of producing "o/" from
>   > ø, because Unicode decompositions don't support that.
> 
> It wouldn't be very hard to add a list of extra decompositions that
> are not known to Unicode itself.

Sure, but that means we'd need some manually-maintained database
anyway.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-09 13:50                                                                   ` Eli Zaretskii
@ 2022-02-10  3:57                                                                     ` Richard Stallman
  2022-02-10  6:26                                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-10  3:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > How do you represent an uncombined zigzag-above in Unicode?
  > > Put it after SPACE as a combination?

  > If I understand correctly what you want, you should use U+25CC DOTTED
  > CIRCLE before the combining character, not SPACE.

Maybe that is right, but I don't understand it.
Why is that right?

Anyway, if DOTTED CIRCLE + COMBINING ZIGZAG ABOVE is the right way
to represent a noncombining ZIGZAG ABOVE, Emacs can use that.

  > > It wouldn't be very hard to add a list of extra decompositions that
  > > are not known to Unicode itself.

  > Sure, but that means we'd need some manually-maintained database
  > anyway.

Maybe so.  But these automatic methods will make a good simplification
even if they don't simplify everything perfectly.


-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-10  3:57                                                                     ` Richard Stallman
@ 2022-02-10  6:26                                                                       ` Eli Zaretskii
  2022-02-12  3:57                                                                         ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-10  6:26 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Wed, 09 Feb 2022 22:57:51 -0500
> 
>   > If I understand correctly what you want, you should use U+25CC DOTTED
>   > CIRCLE before the combining character, not SPACE.
> 
> Maybe that is right, but I don't understand it.
> Why is that right?

The dotted circle is the accepted method of showing stand-alone
combining characters.  It is used everywhere, and U+25CC exists for
that very purpose.
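
For illustration, a small Python sketch of that convention. The helper
name `show_standalone` is made up for this example, and treating every
character in a general category starting with "M" as a combining mark is
a simplifying assumption.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"

def show_standalone(mark):
    """Prefix a combining mark with U+25CC so that it has a visible base.
    Characters that are not marks are returned unchanged."""
    if not unicodedata.category(mark).startswith("M"):
        return mark
    return DOTTED_CIRCLE + mark

zigzag = unicodedata.lookup("COMBINING ZIGZAG ABOVE")
sample = show_standalone(zigzag)   # DOTTED CIRCLE followed by the mark
```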

>   > > It wouldn't be very hard to add a list of extra decompositions that
>   > > are not known to Unicode itself.
> 
>   > Sure, but that means we'd need some manually-maintained database
>   > anyway.
> 
> Maybe so.  But these automatic methods will make a good simplification
> even if they don't simplify everything perfectly.

We disagree about whether this is a significant simplification.
Looking at the giant Lynx-derived database in latin1-disp.el, I fail
to see how making a small part of it auto-generated would be a win.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-10  6:26                                                                       ` Eli Zaretskii
@ 2022-02-12  3:57                                                                         ` Richard Stallman
  2022-02-12  7:36                                                                           ` Eli Zaretskii
  2022-02-12 20:10                                                                           ` Tomas Hlavaty
  0 siblings, 2 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-12  3:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > The dotted circle is the accepted method of showing stand-alone
  > combining characters.

Sorry, I didn't know about this.  I don't argue against it.

Presenting characters that can't be displayed can use DOTTED CIRCLE,
in combination with the diacritics, if the terminal can display that.
Otherwise it should use some other character, such as SPACE.

  > We disagree about whether this is a significant simplification.
  > Looking at the giant Lynx-derived database in latin1-disp.el, I fail
  > to see how making a small part of it auto-generated would be a win.

I had not seen that list before.

Earlier you said that the a(? translation was made by hand, so I am
somewhat confused now.

Most of those entries are for characters without diacritics, it seems,
and I'm not talking about those.  My objection is to some translations
of letters with diacritics.  Their meanings are not guessable.  I want
to replace them with sequences people will be able to understand at
first sight.

If the easiest way to do that is by editing that list, ok.
But maybe those characters don't need to be in the list at all.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-12  3:57                                                                         ` Richard Stallman
@ 2022-02-12  7:36                                                                           ` Eli Zaretskii
  2022-02-14  4:13                                                                             ` Richard Stallman
  2022-02-12 20:10                                                                           ` Tomas Hlavaty
  1 sibling, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-12  7:36 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Fri, 11 Feb 2022 22:57:07 -0500
> 
>   > We disagree about whether this is a significant simplification.
>   > Looking at the giant Lynx-derived database in latin1-disp.el, I fail
>   > to see how making a small part of it auto-generated would be a win.
> 
> I had not seen that list before.
> 
> Earlier you said that the a(? translation was made by hand, so I am
> somewhat confused now.
> 
> Most of those entries are for characters without diacritics, it seems,
> and I'm not talking about those.  My objection is to some translations
> of letters with diacritics.  Their meanings are not guessable.  I want
> to replace them with sequences people will be able to understand at
> first sight.

Then please suggest replacements you consider to be better, and let's
make those replacements.  We are not bound by what Lynx does, we just
used that as a source.

> If the easiest way to do that is by editing that list, ok.

Yes, that's what I was trying to say all the time: let's edit that
list directly.  That way, we get to see all the entries, and can
easily judge whether the method of expressing the diacritics is
consistent and looks reasonably good.

> But maybe those characters don't need to be in the list at all.

This was your proposal about generating some of the entries.  I think
it will be harder to maintain, because the effects of a change in
expressing some diacritic are not immediately evident -- you don't see
all of the affected characters.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-12  7:36                                                                           ` Eli Zaretskii
@ 2022-02-14  4:13                                                                             ` Richard Stallman
  2022-02-14 12:07                                                                               ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-14  4:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Then please suggest replacements you consider to be better, and let's
  > make those replacements.  We are not bound by what Lynx does, we just
  > used that as a source.

For each character whose name has the form

  LATIN (SMALL|CAPITAL) LETTER \1 WITH \2

if the terminal can't display that, it should display \1
(in the appropriate case), followed by the graphical form of \2.

Thus, for

  LATIN CAPITAL LETTER A WITH MACRON

it should display as `A' followed by a macron.

For each character whose name has the form

  latin (small|capital) letter \1 with \2 and \3

if the terminal can't display that, but it can display \1 with \2,
it should display

  LATIN (SMALL|CAPITAL) LETTER \1 WITH \2

followed by the graphical form of \3.
  
Otherwise, it should display

  \1 followed by the graphical form of \2 followed by the graphical form of \3.
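
The rules above can be sketched as follows, in Python rather than Emacs
Lisp and only as an illustration. The names `expand`, `graphical_form`
and the `can_display` predicate are hypothetical. One assumption goes
beyond the rules as stated: when no character is named exactly \2 (for
example ACUTE), the sketch retries with " ACCENT" appended, since the
spacing character is called ACUTE ACCENT. The "?" placeholder for an
unknown diacritic is likewise an arbitrary choice.

```python
import re
import unicodedata

NAME_RE = re.compile(
    r"^LATIN (SMALL|CAPITAL) LETTER (\S+) WITH (.+?)(?: AND (.+))?$")

def lookup_or_none(name):
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None

def graphical_form(diacritic):
    """Spacing character for a diacritic name, e.g. 'BREVE' or 'ACUTE'."""
    return (lookup_or_none(diacritic)
            or lookup_or_none(diacritic + " ACCENT")
            or "?")  # placeholder: no spacing form found by name

def expand(ch, can_display=lambda c: False):
    """Expand CH following the name-based rules; other characters pass through."""
    m = NAME_RE.match(unicodedata.name(ch, ""))
    if m is None:
        return ch
    case, letter, d1, d2 = m.groups()
    base = lookup_or_none(f"LATIN {case} LETTER {letter}")
    if base is None:
        return ch
    if d2 is None:
        return base + graphical_form(d1)
    # Two diacritics: prefer "\1 WITH \2" plus \3 when that can be shown.
    partial = lookup_or_none(f"LATIN {case} LETTER {letter} WITH {d1}")
    if partial is not None and can_display(partial):
        return partial + graphical_form(d2)
    return base + graphical_form(d1) + graphical_form(d2)
```

For LATIN CAPITAL LETTER A WITH MACRON this yields `A` followed by MACRON.
The `can_display` predicate stands in for whatever test the terminal
code would really use.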


-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-14  4:13                                                                             ` Richard Stallman
@ 2022-02-14 12:07                                                                               ` Eli Zaretskii
  2022-02-15  4:33                                                                                 ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-14 12:07 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Sun, 13 Feb 2022 23:13:16 -0500
> 
>   > Then please suggest replacements you consider to be better, and let's
>   > make those replacements.  We are not bound by what Lynx does, we just
>   > used that as a source.
> 
> For each character whose name has the form
> 
>   LATIN (SMALL|CAPITAL) LETTER \1 WITH \2
> 
> if the terminal can't display that, it should display \1
> (in the appropriate case), followed by the graphical form of \2.
> 
> Thus, for
> 
>   LATIN CAPITAL LETTER A WITH MACRON
> 
> it should display as `A' followed by a macron.
> 
> For each character whose name has the form
> 
>   latin (small|capital) letter \1 with \2 and \3
> 
> if the terminal can't display that, but it can display \1 with \2,
> it should display
> 
>   LATIN (SMALL|CAPITAL) LETTER \1 WITH \2
> 
> followed by the graphical form of \3.
>   
> Otherwise, it should display
> 
>   \1 followed by the graphical form of \2 followed by the graphical form of \3.

Thanks, but I thought you'd propose replacements for the equivalents
of the diacritics (i.e. those "graphical forms"), not an algorithm to
combine them.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-14 12:07                                                                               ` Eli Zaretskii
@ 2022-02-15  4:33                                                                                 ` Richard Stallman
  2022-02-15 13:32                                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-15  4:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Thanks, but I thought you'd propose replacements for the equivalents
  > of the diacritics (i.e. those "graphical forms"), not an algorithm to
  > combine them.

Sorry, I don't understand either half of that sentence.

I thought you were asking me to present replacements for the
display equivalents in the long alist, so I tried to do that
in a systematic way.

Are you asking for a list of correspondences from
COMBINING TILDE to TILDE?

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-15  4:33                                                                                 ` Richard Stallman
@ 2022-02-15 13:32                                                                                   ` Eli Zaretskii
  2022-02-16  4:14                                                                                     ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-15 13:32 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, emacs-devel@gnu.org,
> 	kevin.legouguec@gmail.com
> Date: Mon, 14 Feb 2022 23:33:50 -0500
> 
>   > Thanks, but I thought you'd propose replacements for the equivalents
>   > of the diacritics (i.e. those "graphical forms"), not an algorithm to
>   > combine them.
> 
> Sorry, I don't understand either half of that sentence.
> 
> I thought you were asking me to present replacements for the
> display equivalents in the long alist, so I tried to do that
> in a systematic way.
> 
> Are you asking for a list of correspondences from
> COMBINING TILDE to TILDE?

No, for replacements for the cdr part of the likes of

     	   (?\ẵ "a(?")

You said that you didn't like (? as the equivalent of the two
diacritics ̆̃, so I suggested that you propose alternative equivalents
which you'd like better.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-15 13:32                                                                                   ` Eli Zaretskii
@ 2022-02-16  4:14                                                                                     ` Richard Stallman
  2022-02-16 12:10                                                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-02-16  4:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > No, for replacements for the cdr part of the likes of

  >      	   (?\ẵ "a(?")

  > You said that you didn't like (? as the equivalent of the two
  > diacritics ̆̃, so I suggested that you propose alternative equivalents
  > which you'd like better.

That is the question I answered -- in full generality.
Instead of sending you a very long list of replacements,
I sent simple general rules to handle ALL characters
that have the form LETTER + DIACRITIC, and ALL characters
that have the form LETTER + DIACRITIC1 + DIACRITIC2.

The character you just cited is LATIN SMALL LETTER A WITH BREVE AND TILDE.
The rule is

    For each character whose name has the form

      latin (small|capital) letter \1 with \2 and \3

    if the terminal can't display that, but it can display \1 with \2,
    it should display

      LATIN (SMALL|CAPITAL) LETTER \1 WITH \2

    followed by the graphical form of \3.

    Otherwise, it should display

      \1 followed by the graphical form of \2 followed by the graphical form of \3.

Following this rule, Emacs would display ă~, if the terminal can handle that,
otherwise a˘~ .

Have I said it clearly this time?

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-16  4:14                                                                                     ` Richard Stallman
@ 2022-02-16 12:10                                                                                       ` Eli Zaretskii
  2022-02-19  4:54                                                                                         ` Richard Stallman
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-02-16 12:10 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
> 	kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Tue, 15 Feb 2022 23:14:25 -0500
> 
>   > No, for replacements for the cdr part of the likes of
> 
>   >      	   (?\ẵ "a(?")
> 
>   > You said that you didn't like (? as the equivalent of the two
>   > diacritics ̆̃, so I suggested that you propose alternative equivalents
>   > which you'd like better.
> 
> That is the question I answered -- in full generality.
> Instead of sending you a very long list of replacements,
> I sent simple general rules to handle ALL characters
> that have the form LETTER + DIACRITIC, and ALL characters
> that have the form LETTER + DIACRITIC1 + DIACRITIC2.
> 
> The character you just cited is LATIN SMALL LETTER A WITH BREVE AND TILDE.
> The rule is
> 
>     For each character whose name has the form
> 
>       latin (small|capital) letter \1 with \2 and \3
> 
>     if the terminal can't display that, but it can display \1 with \2,
>     it should display
> 
>       LATIN (SMALL|CAPITAL) LETTER \1 WITH \2
> 
>     followed by the graphical form of \3.
> 
>     Otherwise, it should display
> 
>       \1 followed by the graphical form of \2 followed by the graphical form of \3.

I understand the general rule, but I hoped you had specific
suggestions for those \2 and \3 placeholders.

It now sounds like you actually suggest displaying the diacritic in
its non-combining variety, as I understand from this example:

> Following this rule, Emacs would display ă~, if the terminal can handle that,
> otherwise a˘~ .

But in that case, it goes against the spirit of this feature, which
expresses non-ASCII characters with equivalent strings composed of
ASCII characters.  Since ˘ is non-ASCII, chances are that the terminal
which cannot display ă will be unable to display ˘ as well.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-16 12:10                                                                                       ` Eli Zaretskii
@ 2022-02-19  4:54                                                                                         ` Richard Stallman
  0 siblings, 0 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-19  4:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > Following this rule, Emacs would display ă~, if the terminal can handle that,
  > > otherwise a˘~ .

  > But in that case, it goes against the spirit of this feature, which
  > expresses non-ASCII characters with equivalent strings composed of
  > ASCII characters.  Since ˘ is non-ASCII, chances are that the terminal
  > which cannot display ă will be unable to display ˘ as well.

That's a valid point: on a terminal which can't display the BREVE
character, this method would not work.

There could be an additional fallback method of using `(' instead of
BREVE.  I can't think of any other ASCII character that would be
better than that one.
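
A sketch of such an additional fallback stage, in Python for
illustration. Only the BREVE entry reflects the suggestion above; the
table name `ASCII_FALLBACK`, the function `ascii_form`, and the use of
"?" for anything without an entry are assumptions of this example.

```python
# Spacing diacritic -> ASCII stand-in.  Only BREVE -> "(" comes from the
# message above; further entries would be for the maintainers to choose.
ASCII_FALLBACK = {
    "\u02d8": "(",   # BREVE
}

def ascii_form(text):
    """Replace each non-ASCII character in TEXT with its ASCII stand-in,
    or with '?' (an arbitrary placeholder) when the table has no entry."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(ASCII_FALLBACK.get(ch, "?"))
    return "".join(out)
```

Applied to the expansion `a˘~`, this gives `a(~`; the tilde is already
ASCII and passes through unchanged.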

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-12  3:57                                                                         ` Richard Stallman
  2022-02-12  7:36                                                                           ` Eli Zaretskii
@ 2022-02-12 20:10                                                                           ` Tomas Hlavaty
  2022-02-14  4:14                                                                             ` Richard Stallman
  1 sibling, 1 reply; 104+ messages in thread
From: Tomas Hlavaty @ 2022-02-12 20:10 UTC (permalink / raw)
  To: rms, Eli Zaretskii; +Cc: psainty, luangruo, kevin.legouguec, emacs-devel

On Fri 11 Feb 2022 at 22:57, Richard Stallman <rms@gnu.org> wrote:
> Earlier you said that the a(? translation was made by hand, so I am
> somewhat confused now.
>
> Most of those entries are for characters without diacritics, it seems,
> and I'm not talking about those.  My objection is to some translations
> of letters with diacritics.  Their meanings are not guessable.  I want
> to replace them with sequences people will be able to understand at
> first sight.

In my native language, dropping diacritics for ASCII rather than
imitating them with additional characters is usually more readable.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-02-12 20:10                                                                           ` Tomas Hlavaty
@ 2022-02-14  4:14                                                                             ` Richard Stallman
  0 siblings, 0 replies; 104+ messages in thread
From: Richard Stallman @ 2022-02-14  4:14 UTC (permalink / raw)
  To: Tomas Hlavaty; +Cc: psainty, luangruo, eliz, emacs-devel, kevin.legouguec

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > In my native language, dropping diacritics for ASCII rather than
  > imitating them with additional characters is usually more readable.

That could be an option.  If we implement what I asked for, it would
be easy to implement a variant that deletes all diacritics from the
expansions.
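
As an illustration of that variant, a short Python sketch that drops
diacritics by way of canonical decomposition. The function name
`strip_diacritics` is made up for this example, and the approach only
covers characters that Unicode decomposes (so ø, for instance, is left
alone).

```python
import unicodedata

def strip_diacritics(text):
    """Decompose TEXT canonically (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```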
-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-26  3:39                       ` Richard Stallman
  2022-01-26  5:38                         ` Eli Zaretskii
@ 2022-01-26  8:20                         ` Andreas Schwab
  2022-01-27  4:13                           ` Richard Stallman
  1 sibling, 1 reply; 104+ messages in thread
From: Andreas Schwab @ 2022-01-26  8:20 UTC (permalink / raw)
  To: Richard Stallman
  Cc: psainty, luangruo, Eli Zaretskii, kevin.legouguec, emacs-devel

On Jan 25 2022, Richard Stallman wrote:

> I didn't know that there were two different kinds of ligatures in Unicode.

It is misleading at best to call them ligatures.  They are just random
Unicode code points that happen to be absent from the font that the
terminal uses.
-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-26  8:20                         ` Andreas Schwab
@ 2022-01-27  4:13                           ` Richard Stallman
  2022-01-27  6:39                             ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-27  4:13 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: psainty, luangruo, eliz, kevin.legouguec, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > It is misleading at best to call them ligatures.  They are just random
  > Unicode code points that happen to be absent from the font that the
  > terminal uses.

I disagree.  These Unicode values are not ordinary, not just like any
other.  They are special because each of them represents a ligature of
two ASCII characters, which could just as well be presented as a
series of two characters.

When it is impossible to display the character's ligature, it would be
more useful to display the two ASCII characters than to display an
unhelpful diamond.

We should try to do what is most helpful, not be quick to give up.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-27  4:13                           ` Richard Stallman
@ 2022-01-27  6:39                             ` Eli Zaretskii
  2022-01-27  8:13                               ` Kévin Le Gouguec
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-27  6:39 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, kevin.legouguec, schwab, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com,
> 	emacs-devel@gnu.org, kevin.legouguec@gmail.com
> Date: Wed, 26 Jan 2022 23:13:47 -0500
> 
> I disagree.  These Unicode values are not ordinary, not just like any
> other.  They are special because each of them represents a ligature of
> two ASCII characters, which could just as well be presented as a
> series of two characters.

Note that there are precomposed codepoints for ligatures of 3 ASCII
characters as well, for example U+FB03 LATIN SMALL LIGATURE FFI.

> When it is impossible to display the character's ligature, it would be
> more useful to display the two ASCII characters than to display an
> unhelpful diamond.
> 
> We should try to do what is most helpful, not be quick to give up.

Would it be good enough to have a command that will arrange for these
ligatures to be displayed as their ASCII equivalents, using the
facilities in latin1-disp.el?  Such a command could be invoked either
manually or from your init file.  latin1-disp.el also provides a
special face to display such equivalents, so you could have them stand
out on display if you want.
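
For illustration only, a Python sketch of how the ASCII equivalents of
such precomposed ligatures could be derived from Unicode's compatibility
decompositions instead of being listed by hand. This is not how
latin1-disp.el works; the helper name `ligature_expansion` and the test
on the character name are assumptions of this example.

```python
import unicodedata

def ligature_expansion(ch):
    """Return the compatibility expansion of CH when its name marks it as
    a ligature and Unicode provides a decomposition; otherwise None."""
    if "LIGATURE" not in unicodedata.name(ch, ""):
        return None
    expanded = unicodedata.normalize("NFKD", ch)
    return expanded if expanded != ch else None
```

U+FB03 LATIN SMALL LIGATURE FFI expands to the three letters `ffi` this
way, which covers the 3-character case mentioned above.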



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-27  6:39                             ` Eli Zaretskii
@ 2022-01-27  8:13                               ` Kévin Le Gouguec
  2022-01-27  9:55                                 ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Kévin Le Gouguec @ 2022-01-27  8:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, schwab, rms, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Richard Stallman <rms@gnu.org>
>> Cc: eliz@gnu.org, psainty@orcon.net.nz, luangruo@yahoo.com,
>> 	emacs-devel@gnu.org, kevin.legouguec@gmail.com
>> Date: Wed, 26 Jan 2022 23:13:47 -0500
>> 
>> When it is impossible to display the character's ligature, it would be
>> more useful to display the two ASCII characters than to display an
>> unhelpful diamond.
>> 
>> We should try to do what is most helpful, not be quick to give up.
>
> Would it be good enough to have a command that will arrange for these
> ligatures to be displayed as their ASCII equivalents, using the
> facilities in latin1-disp.el?  Such a command could be invoked either
> manually or from your init file.  latin1-disp.el also provides a
> special face to display such equivalents, so you could have them stand
> out on display if you want.

Reading the documentation of the various glyphless-* knobs, I wonder if
it would make sense to provide another group for
glyphless-char-display-control?  'no-font is not helpful on my TTY (IIUC
because terminal-coding-system says "utf-8-unix"?).

Maybe 'no-display, meaning (null (char-displayable-p CHAR))?  That would
at least allow users to tell Emacs to use the 'hex-code method, which
would be more immediately informative than the diamond.

Though not by a lot.  Maybe adding a new method?  Something like
'char-name?  Obviously it'd be ugly to see…

> Please refer to the o\N{LATIN SMALL LIGATURE FFI}cial documentation

… but (1) it would be more informative (though maybe not less confusing)
than "Please refer to the o◆cial documentation", (2) it would also serve
as a decent fallback for symbols and emojis, which we see more and more
on this list.

I'm thinking of situations like <E1mxiYQ-0002ul-5B@fencepost.gnu.org>;
not saying we should encourage using symbols over words, but TTY users
would probably appreciate this kind of fallback?

(I hope at least some of this message makes sense; apologies if not)
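
A rough illustration of the 'char-name idea, written in Python rather than as a glyphless-char-display method: characters that a caller-supplied predicate reports as undisplayable are replaced by their Unicode name, with a hex code as a last resort.  The function `name_fallback` and its `displayable` parameter are hypothetical; `displayable` merely stands in for something like char-displayable-p.

```python
import unicodedata


def name_fallback(text, displayable):
    """Replace characters rejected by DISPLAYABLE with a readable escape.

    DISPLAYABLE is a hypothetical predicate taking one character; it
    stands in for whatever the display engine knows about the terminal.
    """
    pieces = []
    for ch in text:
        if displayable(ch):
            pieces.append(ch)
            continue
        name = unicodedata.name(ch, None)
        if name is not None:
            pieces.append("\\N{" + name + "}")
        else:
            # Unnamed character: fall back to its hex code.
            pieces.append("\\u%04X" % ord(ch))
    return "".join(pieces)


if __name__ == "__main__":
    # Pretend the terminal can only show ASCII.
    print(name_fallback("o\ufb03cial \u2680", str.isascii))
```

The same fallback covers ligatures, symbols and emojis alike, at the cost of rather long replacement strings.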

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-27  8:13                               ` Kévin Le Gouguec
@ 2022-01-27  9:55                                 ` Eli Zaretskii
  2022-01-27 10:29                                   ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-27  9:55 UTC (permalink / raw)
  To: Kévin Le Gouguec; +Cc: psainty, luangruo, schwab, rms, emacs-devel

> From: Kévin Le Gouguec <kevin.legouguec@gmail.com>
> Cc: rms@gnu.org,  schwab@linux-m68k.org,  psainty@orcon.net.nz,
>   luangruo@yahoo.com,  emacs-devel@gnu.org
> Date: Thu, 27 Jan 2022 09:13:37 +0100
> 
> Reading the documentation of the various glyphless-* knobs, I wonder if
> it would make sense to provide another group for
> glyphless-char-display-control?  'no-font is not helpful on my TTY (IIUC
> because terminal-coding-system says "utf-8-unix"?).
> 
> Maybe 'no-display, meaning (null (char-displayable-p CHAR))?

Isn't that what glyphless-char-display-control already does on a TTY
for no-font?  We just need to set up the table for such characters.

But I don't think that displaying the hex code is the best alternative
for this particular use case, as displaying the ASCII equivalents is
much better.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-27  9:55                                 ` Eli Zaretskii
@ 2022-01-27 10:29                                   ` Eli Zaretskii
  2022-01-27 17:36                                     ` Kévin Le Gouguec
  0 siblings, 1 reply; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-27 10:29 UTC (permalink / raw)
  To: kevin.legouguec; +Cc: psainty, luangruo, schwab, rms, emacs-devel

> Date: Thu, 27 Jan 2022 11:55:25 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, schwab@linux-m68k.org,
>  rms@gnu.org, emacs-devel@gnu.org
> 
> > Reading the documentation of the various glyphless-* knobs, I wonder if
> > it would make sense to provide another group for
> > glyphless-char-display-control?  'no-font is not helpful on my TTY (IIUC
> > because terminal-coding-system says "utf-8-unix"?).
> > 
> > Maybe 'no-display, meaning (null (char-displayable-p CHAR))?
> 
> Isn't that what glyphless-char-display-control already does on a TTY
> for no-font?  We just need to set up the table for such characters.

Or maybe we should install the below?

diff --git a/src/term.c b/src/term.c
index 4c7a90a..ddf0e8e 100644
--- a/src/term.c
+++ b/src/term.c
@@ -1632,9 +1632,13 @@ produce_glyphs (struct it *it)
     }
   else
     {
-      Lisp_Object charset_list = FRAME_TERMINAL (it->f)->charset_list;
+      struct terminal *t = FRAME_TERMINAL (it->f);
+      Lisp_Object charset_list = t->charset_list, char_glyph;
 
-      if (char_charset (it->char_to_display, charset_list, NULL))
+      if (char_charset (it->char_to_display, charset_list, NULL)
+	  && (char_glyph = terminal_glyph_code (t, it->char_to_display),
+	      NILP (char_glyph)
+	      || (FIXNUMP (char_glyph) && XFIXNUM (char_glyph) >= 0)))
 	{
 	  it->pixel_width = CHARACTER_WIDTH (it->char_to_display);
 	  it->nglyphs = it->pixel_width;



^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-27 10:29                                   ` Eli Zaretskii
@ 2022-01-27 17:36                                     ` Kévin Le Gouguec
  2022-01-27 18:38                                       ` Eli Zaretskii
  0 siblings, 1 reply; 104+ messages in thread
From: Kévin Le Gouguec @ 2022-01-27 17:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, schwab, rms, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Thu, 27 Jan 2022 11:55:25 +0200
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: psainty@orcon.net.nz, luangruo@yahoo.com, schwab@linux-m68k.org,
>>  rms@gnu.org, emacs-devel@gnu.org
>> 
>> > Reading the documentation of the various glyphless-* knobs, I wonder if
>> > it would make sense to provide another group for
>> > glyphless-char-display-control?  'no-font is not helpful on my TTY (IIUC
>> > because terminal-coding-system says "utf-8-unix"?).
>> > 
>> > Maybe 'no-display, meaning (null (char-displayable-p CHAR))?
>> 
>> Isn't that what glyphless-char-display-control already does on a TTY
>> for no-font?  We just need to set up the table for such characters.
>
> Or maybe we should install the below?

Ah, yes, that does improve the situation quite a bit here (if that
matters: Linux 5.16.1 on openSUSE Tumbleweed; 5.10.92 on Debian 11): all
the characters that we discussed here and showed up as diamonds (ﬃ, ⚀)
now show up as \uHHHH escape sequences.

(I just noticed that there seems to be a TTY glyph for ﬁ on Tumbleweed,
but not on Debian 11 🤔)

>> But I don't think that displaying the hex code is the best alternative
>> for this particular use case, as displaying the ASCII equivalents is
>> much better.

Right, wholehearted agreement.  I only mentioned the hex codes because
they seemed (1) more informative than the infamous diamonds, (2) less
effort to implement than displaying the ASCII equivalent, (3) also
applicable to other characters (e.g. symbols & emojis).

With your patch, none of these points remain relevant.  Thanks!



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-27 17:36                                     ` Kévin Le Gouguec
@ 2022-01-27 18:38                                       ` Eli Zaretskii
  0 siblings, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-27 18:38 UTC (permalink / raw)
  To: Kévin Le Gouguec; +Cc: psainty, luangruo, schwab, rms, emacs-devel

> From: Kévin Le Gouguec <kevin.legouguec@gmail.com>
> Cc: psainty@orcon.net.nz,  luangruo@yahoo.com,  schwab@linux-m68k.org,
>   rms@gnu.org,  emacs-devel@gnu.org
> Date: Thu, 27 Jan 2022 18:36:37 +0100
> 
> > Or maybe we should install the below?
> 
> Ah, yes, that does improve the situation quite a bit here (if that
> matters: Linux 5.16.1 on openSUSE Tumbleweed; 5.10.92 on Debian 11): all
> the characters that we discussed here and showed up as diamonds (ﬃ, ⚀)
> now show up as \uHHHH escape sequences.

Thanks, installed.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-19 10:05   ` Phil Sainty
  2022-01-19 11:43     ` Eli Zaretskii
@ 2022-01-20  3:17     ` Richard Stallman
  2022-01-20  4:54       ` Phil Sainty
                         ` (2 more replies)
  1 sibling, 3 replies; 104+ messages in thread
From: Richard Stallman @ 2022-01-20  3:17 UTC (permalink / raw)
  To: Phil Sainty; +Cc: luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

Explanation to Eli: I understand that these 0-width characters have
legitimate, useful purposes.  It is good that we support them.

The issue I've raised, which was explained in the text I cited, is
that _allegedly_  it is possible to use them maliciously, by inserting
a sequence of them to function as a sort of watermark that users
normally won't even see.

  > You can highlight them like so:

  > (set-face-background 'glyphless-char "red")

  > I've had that configured ever since
  > https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40

  > If you're not expecting zero-width characters in text in general,
  > I think it's a good setting.

I think I will try that, just in case someone sends me some of those.
Thanks.

Should we make this the default?  I think it is likely that most Emacs users
will see only malicious zero-width characters, and not useful ones.

Is there a way we could detect automatically when these zero-width
characters are being used in a legit way for their intended purpose,
and in that case, display them as zero-width for real?

That way, they would work right when used properly, and ring an alarm
(metaphorically) when used in a fishy way.

  > Emacs by default displays ZWJ and ZWNJ characters (and any other
  > zero-width characters) as thin 1-pixel spaces on GUI frames, and as
  > simple spaces on TTY frames.  So Emacs users are likely to see these
  > "hidden" sequences of characters on display.

I wonder if we could do something clever to show when there is a
sequence of multiple different 1-pixel characters?  For instance,
maybe give different colors to different characters, so that a
sequence of several shows as a funny spectrum?

This could alert the user that "someone's messing with you here".

There are many possible variants of the details -- I don't know what
would be best, or what would be easy, but people could try various
methods.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  3:17     ` Richard Stallman
@ 2022-01-20  4:54       ` Phil Sainty
  2022-01-20  6:39         ` tomas
  2022-01-20  7:57         ` Eli Zaretskii
  2022-01-20  6:35       ` Tim Cross
  2022-01-20  7:48       ` Eli Zaretskii
  2 siblings, 2 replies; 104+ messages in thread
From: Phil Sainty @ 2022-01-20  4:54 UTC (permalink / raw)
  To: emacs-devel

I've remembered a question on emacs.stackexchange.com about
customizing the glyphless char display, and doing so on a
per-mode basis; so just in case this is useful to anyone:

https://emacs.stackexchange.com/a/65109

(I don't think that's part of a general solution to this
question, but it might be interesting to someone.)




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  4:54       ` Phil Sainty
@ 2022-01-20  6:39         ` tomas
  2022-01-20 17:58           ` [External] : " Drew Adams
  2022-01-22  4:37           ` Richard Stallman
  2022-01-20  7:57         ` Eli Zaretskii
  1 sibling, 2 replies; 104+ messages in thread
From: tomas @ 2022-01-20  6:39 UTC (permalink / raw)
  To: emacs-devel

On Thu, Jan 20, 2022 at 05:54:24PM +1300, Phil Sainty wrote:
> I've remembered a question on emacs.stackexchange.com about
> customizing the glyphless char display, and doing so on a
> per-mode basis; so just in case this is useful to anyone:
> 
> https://emacs.stackexchange.com/a/65109
> 
> (I don't think that's part of a general solution to this
> question, but it might be interesting to someone.)

Last time a similar discussion came around (that time about
direction-change Unicode characters in source code used for malicious
purposes) white-space-mode was mentioned as a place where to put
visualization of "such things".

This time it seems even more appropriate.

Cheers
-- 
t

^ permalink raw reply	[flat|nested] 104+ messages in thread

* RE: [External] : Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  6:39         ` tomas
@ 2022-01-20 17:58           ` Drew Adams
  2022-01-22  4:37           ` Richard Stallman
  1 sibling, 0 replies; 104+ messages in thread
From: Drew Adams @ 2022-01-20 17:58 UTC (permalink / raw)
  To: tomas@tuxteam.de, emacs-devel@gnu.org

> > I've remembered a question on emacs.stackexchange.com about
> > customizing the glyphless char display, and doing so on a
> > per-mode basis; so just in case this is useful to anyone:
> > https://emacs.stackexchange.com/a/65109
> 
> Last time a similar discussion came around (that time about
> direction-change Unicode characters in source code used for malicious
> purposes) white-space-mode was mentioned as a place where to put
> visualization of "such things".
> 
> This time it seems even more appropriate.

Yes, `whitespace-mode` can help, but it's
somewhat limited wrt highlighting different
characters (or sets or ranges of characters)
differently.

My library `highlight-chars.el` can help
more with this kind of thing.

code:
https://www.emacswiki.org/emacs/download/highlight-chars.el

Description:
https://www.emacswiki.org/emacs/ShowWhiteSpace#HighlightChars


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  6:39         ` tomas
  2022-01-20 17:58           ` [External] : " Drew Adams
@ 2022-01-22  4:37           ` Richard Stallman
  2022-01-22  5:16             ` Po Lu
  1 sibling, 1 reply; 104+ messages in thread
From: Richard Stallman @ 2022-01-22  4:37 UTC (permalink / raw)
  To: tomas; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Last time a similar discussion came around (that time about
  > direction-change Unicode characters in source code used for malicious
  > purposes) white-space-mode was mentioned as a place where to put
  > visualization of "such things".

Isn't there a campaign that objects to that term "white space",
accusing the term of racism?  ;-}.


-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-22  4:37           ` Richard Stallman
@ 2022-01-22  5:16             ` Po Lu
  0 siblings, 0 replies; 104+ messages in thread
From: Po Lu @ 2022-01-22  5:16 UTC (permalink / raw)
  To: Richard Stallman; +Cc: tomas, emacs-devel

Richard Stallman <rms@gnu.org> writes:

>   > Last time a similar discussion came around (that time about
>   > direction-change Unicode characters in source code used for malicious
>   > purposes) white-space-mode was mentioned as a place where to put
>   > visualization of "such things".

> Isn't there a campaign that objects to that term "white space",
> accusing the term of racism?  ;-}.

If there is, I sincerely hope it will not (as with any other political
campaign unrelated to free software) affect our development of Emacs,
where people have used the term "whitespace" for decades.

The smiley probably means that you're joking.  But I'm not sure about
that.

Thanks.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  4:54       ` Phil Sainty
  2022-01-20  6:39         ` tomas
@ 2022-01-20  7:57         ` Eli Zaretskii
  1 sibling, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-20  7:57 UTC (permalink / raw)
  To: Phil Sainty; +Cc: emacs-devel

> Date: Thu, 20 Jan 2022 17:54:24 +1300
> From: Phil Sainty <psainty@orcon.net.nz>
> 
> I've remembered a question on emacs.stackexchange.com about
> customizing the glyphless char display, and doing so on a
> per-mode basis; so just in case this is useful to anyone:
> 
> https://emacs.stackexchange.com/a/65109

Yes, making it possible for glyphless display to be buffer-local is a
useful feature we should have.  Patches to that effect are welcome.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  3:17     ` Richard Stallman
  2022-01-20  4:54       ` Phil Sainty
@ 2022-01-20  6:35       ` Tim Cross
  2022-01-20  7:39         ` tomas
  2022-01-20  8:20         ` Eli Zaretskii
  2022-01-20  7:48       ` Eli Zaretskii
  2 siblings, 2 replies; 104+ messages in thread
From: Tim Cross @ 2022-01-20  6:35 UTC (permalink / raw)
  To: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> Explanation to Eli: I understand that these 0-width characters have
> legitimate, useful purposes.  It is good that we support them.
>
> The issue I've raised, which was explained in the text I cited, is
> that _allegedly_  it is possible to use them maliciously, by inserting
> a sequence of them to function as a sort of watermark that users
> normally won't even see.
>
>   > You can highlight them like so:
>
>   > (set-face-background 'glyphless-char "red")
>
>   > I've had that configured ever since
>   > https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40
>
>   > If you're not expecting zero-width characters in text in general,
>   > I think it's a good setting.
>
> I think I will try that, just in case someone sends me some of those.
> Thanks.
>
> Should we make this the default?  I think it is likely that most Emacs users
> will see only malicious zero-width characters, and not useful ones.
>
> Is there a way we could detect automatically when these zero-width
> characters are being used in a legit way for their intended purpose,
> and in that case, display them as zero-width for real?
>
> That way, they would work right when used properly, and ring an alarm
> (metaphorically) when used in a fishy way.
>
>   > Emacs by default displays ZWJ and ZWNJ characters (and any other
>   > zero-width characters) as thin 1-pixel spaces on GUI frames, and as
>   > simple spaces on TTY frames.  So Emacs users are likely to see these
>   > "hidden" sequences of characters on display.
>
> I wonder if we could do something clever to show when there is a
> sequence of multiple different 1-pixel characters?  For instance,
> maybe give different colors to different characters, so that a
> sequence of several shows as a funny spectrum?
>
> This could alert the user that "someone's messing with you here".
>
> There are many possible variants of the details -- I don't know what
> would be best, or what would be easy, but people could try various
> methods.

Just to add some context here which some might find useful.

At one point, I worked for an organisation which had real concerns about
sensitive information being released (mainly to the press) and wanted to
be able to track down the source when it occurred. Essentially, this
technique was used. All electronic documents, when distributed to the
approved list of recipients, had a unique id stamp using zero-width
characters. When I left, the organisation was also experimenting with
adding similar 'marks' to emails sent via the organisation's email
server. So this practice is definitely occurring. It is probably more
prevalent in PDF and Word documents, but I guess it could be in plain text
email messages as well.

This technique (and related ones) doesn't need high technical expertise
either. We had a similar problem at a university I worked at, where
students used this technique to defeat the anti-plagiarism software the
uni used. The software used basic text matching, and students started to
defeat it both by using zero-width characters to break patterns and by
using Unicode characters with glyphs that looked like standard characters,
allowing the document to print and look correct while also breaking
pattern matching. Of course, once you are aware this is going on, you
can improve the pattern matching and add checks to detect this type of
activity. Personally, I was always amazed at the lengths people went to
in order to defeat the anti-plagiarism software. It always seemed it would
be easier not to plagiarise and to cite when appropriate.

It is a big challenge to find a way to alert users to this possible
unwanted 'tagging' while at the same time allowing legitimate use. For
example, in org-mode, it can sometimes be difficult to combine different
markup and other syntax, often because of a corner case which is
difficult to address with font-locking regexps. Adding a zero-width space
is sometimes sufficient to work around the ambiguity in the regexp.
The point is, anything which makes such use visually noticeable will also
make the technique less useful for addressing this issue.
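
The pattern-matching evasion mentioned above is easy to reproduce.  The following is only a sketch in Python, not the software in question: it assumes a matcher based on plain substring search and shows how removing a small, hand-picked set of zero-width characters before matching restores the hit.  The set `ZERO_WIDTH` and the function name are illustrative choices.  Look-alike glyphs from other scripts are not handled; that would need a confusables table such as the one published with Unicode TS #39.

```python
# Assumption: this hand-picked set is what we treat as "zero-width" here.
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
}


def strip_zero_width(text):
    """Return TEXT with the characters listed in ZERO_WIDTH removed."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


if __name__ == "__main__":
    submitted = "copied para\u200bgraph"
    # A naive substring search misses the phrase...
    print("copied paragraph" in submitted)
    # ...but finds it once the zero-width characters are removed.
    print("copied paragraph" in strip_zero_width(submitted))
```

Stripping is appropriate for a matcher, which only wants to compare visible text; it would be the wrong default for an editor, where the same characters may be there on purpose.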

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  6:35       ` Tim Cross
@ 2022-01-20  7:39         ` tomas
  2022-01-20  8:20         ` Eli Zaretskii
  1 sibling, 0 replies; 104+ messages in thread
From: tomas @ 2022-01-20  7:39 UTC (permalink / raw)
  To: emacs-devel

On Thu, Jan 20, 2022 at 05:35:23PM +1100, Tim Cross wrote:

[...]

> Just to add some context here which some might find useful.
> 
> At one point, I worked for an organisation which had real concerns about
> sensitive information being released (mainly to the press) and wanted to
> be able to track down the source when it occurred [...]

Interesting. Related, but not the same: the famous yellow dots from
colour laser printers [1]. This shows that this kind of technique is
already deeply ingrained in "industry practice".

Now a good challenge would be to come up with a set of criteria which
can discriminate between "good" and "bad" use of normally invisible
characters.

Cheers

[1] https://en.wikipedia.org/wiki/Machine_Identification_Code

-- 
t

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  6:35       ` Tim Cross
  2022-01-20  7:39         ` tomas
@ 2022-01-20  8:20         ` Eli Zaretskii
  1 sibling, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-20  8:20 UTC (permalink / raw)
  To: Tim Cross; +Cc: emacs-devel

> From: Tim Cross <theophilusx@gmail.com>
> Date: Thu, 20 Jan 2022 17:35:23 +1100
> 
> It is a big challenge to find a way to alert users to this possible
> unwanted 'tagging' while at the same time allowing legitimate use. For
> example, in org-mode, it can sometimes be difficult to combine different
> markup and other syntax, often because of a corner case which is
> difficult to address with font-locking regexps. Adding a zero-width space
> is sometimes sufficient to work around the ambiguity in the regexp.

A single zero-width character should never be flagged, at least not by
default, because its uses are mostly legitimate and important.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  3:17     ` Richard Stallman
  2022-01-20  4:54       ` Phil Sainty
  2022-01-20  6:35       ` Tim Cross
@ 2022-01-20  7:48       ` Eli Zaretskii
  2022-01-20  8:17         ` Lars Ingebrigtsen
  2022-01-21  4:14         ` Richard Stallman
  2 siblings, 2 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-20  7:48 UTC (permalink / raw)
  To: rms; +Cc: psainty, luangruo, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Date: Wed, 19 Jan 2022 22:17:31 -0500
> Cc: luangruo@yahoo.com, emacs-devel@gnu.org
> 
>   > If you're not expecting zero-width characters in text in general,
>   > I think it's a good setting.
> 
> I think I will try that, just in case someone sends me some of those.
> Thanks.
> 
> Should we make this the default?  I think it is likely that most Emacs users
> will see only malicious zero-width characters, and not useful ones.

"Most users" is not a good argument when for some users these
characters are a must.  As I explained, these characters, when used
for their intended purpose, are necessary for correct shaping of text,
which increasingly includes even plain-ASCII text.  So I will object
to any simplistic default like that.  We should flag suspicious uses
of those characters (which means sequences of several of them in a
row), not lone characters.  The new textsec.el library is developing
the capabilities for detecting such suspicious uses, and we should use
that as the basis for any defaults.

Users who want to flag _any_ use of zero-width characters are free to
do so in their own customizations, of course.
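
As a sketch of the "sequences, not lone characters" criterion (in Python, and not a description of what textsec.el actually does), one could report only runs of two or more consecutive zero-width characters.  The character set, the threshold and the name `suspicious_runs` are all assumptions made for this example.

```python
import re

# Assumption: the characters treated as zero-width for this check.
ZERO_WIDTH_CHARS = "\u200b\u200c\u200d\u2060\ufeff"


def suspicious_runs(text, min_len=2):
    """Return (start, length) for each run of at least MIN_LEN
    consecutive zero-width characters in TEXT."""
    pattern = re.compile("[%s]{%d,}" % (ZERO_WIDTH_CHARS, min_len))
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # A lone ZWJ between two letters is left alone.
    print(suspicious_runs("a\u200db"))
    # Three zero-width characters in a row are reported.
    print(suspicious_runs("a\u200c\u200d\u200cb"))
```

A fixed threshold is crude: a real check would presumably also look at the surrounding script, since a joiner next to Latin letters is more surprising than one inside Arabic or Indic text.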

> Is there a way we could detect automatically when these zero-width
> characters are being used in a legit way for their intended purpose,
> and in that case, display them as zero-width for real?

That is the subject of the new textsec.el package that Lars is working
on now.

>   > Emacs by default displays ZWJ and ZWNJ characters (and any other
>   > zero-width characters) as thin 1-pixel spaces on GUI frames, and as
>   > simple spaces on TTY frames.  So Emacs users are likely to see these
>   > "hidden" sequences of characters on display.
> 
> I wonder if we could do something clever to show when there is a
> sequence of multiple different 1-pixel characters?  For instance,
> maybe give different colors to different characters, so that a
> sequence of several shows as a funny spectrum?

textsec.el should provide facilities for that.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  7:48       ` Eli Zaretskii
@ 2022-01-20  8:17         ` Lars Ingebrigtsen
  2022-01-21  4:14         ` Richard Stallman
  1 sibling, 0 replies; 104+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-20  8:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, rms, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> As I explained, these characters, when used
> for their intended purpose, are necessary for correct shaping of text,
> which increasingly includes even plain-ASCII text.  So I will object
> to any simplistic default like that.

Yup.

But for people that are paranoid about this stuff, there's
`glyphless-display-mode' that they can enable in buffers they worry
about.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-20  7:48       ` Eli Zaretskii
  2022-01-20  8:17         ` Lars Ingebrigtsen
@ 2022-01-21  4:14         ` Richard Stallman
  1 sibling, 0 replies; 104+ messages in thread
From: Richard Stallman @ 2022-01-21  4:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, luangruo, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > Should we make this the default?  I think it is likely that most Emacs users
  > > will see only malicious zero-width characters, and not useful ones.

  > "Most users" is not a good argument when for some users these
  > characters are a must.

I don't follow the argument.  Since some users actually use zero-width
characters, that seems to give us two choices (at least):

* Leave zero-width characters unflagged by default.

* Flag zero-width characters by default, and those users can
turn that off.

I don't know which is better -- I think it depends partly on what
fraction of all users find the zero-width characters useful.

  > > Is there a way we could detect automatically when these zero-width
  > > characters are being used in a legit way for their intended purpose,
  > > and in that case, display them as zero-width for real?

  > That is the subject of the new textsec.el package that Lars is working
  > on now.

That sounds good.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Can watermarking Unicode text using invisible differences  sneak through Emacs, or can Emacs detect it?
  2022-01-19  4:15 Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Richard Stallman
  2022-01-19  4:47 ` Po Lu
@ 2022-01-19  8:20 ` Eli Zaretskii
  2022-01-19 17:36 ` T.V Raman
  2 siblings, 0 replies; 104+ messages in thread
From: Eli Zaretskii @ 2022-01-19  8:20 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Date: Tue, 18 Jan 2022 23:15:59 -0500
> 
>    Unicode allows user tracking by means of invisible text marking. Any
>    string can be converted into its binary form and then recoded into a
>    string of zero-width characters, which can then be invisibly inserted
>    into the text. If the text is posted elsewhere, the zero-width
>    character string can be extracted and the process reversed to figure
>    out the identity of the person who copied it.
> 
> which seems to be about a special case of confusables, and it makes me
> wonder whether Emacs does, or could, show users when Unicode confusion
> occurs, or prevent or fix it somehow.

AFAIU, there's no confusion here, "just" injection of hidden
information into plain text.  "Confusion" is when the user is
presented with some text that looks like something else.  Here the
problematic part is not presented at all.

> First, is that issue of invisible characters real?

Yes.  The idea is to use 2 "normal" characters to serve as binary zero
and binary one, which would then allow you to inject hidden text by
combinations of these two.  Of course, the technique is very
inefficient and will need many such characters to inject any
meaningful text.
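
As a rough illustration of the technique described above, here is a
minimal Python sketch.  It assumes ZWNJ (U+200C) stands for binary zero
and ZWJ (U+200D) for binary one; the helper names hide, reveal and
watermark are hypothetical and chosen only for this example.  It also
shows the inefficiency: every byte of hidden payload costs eight
zero-width characters.

```python
# Sketch only: the bit assignment and function names are assumptions.
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, assumed to mean binary 0
ZWJ = "\u200d"   # ZERO WIDTH JOINER, assumed to mean binary 1


def hide(payload: str) -> str:
    """Recode PAYLOAD as a string of zero-width characters, 8 per byte."""
    bits = "".join(f"{byte:08b}" for byte in payload.encode("utf-8"))
    return "".join(ZWJ if bit == "1" else ZWNJ for bit in bits)


def reveal(text: str) -> str:
    """Collect the zero-width characters in TEXT and decode them."""
    bits = "".join("1" if ch == ZWJ else "0"
                   for ch in text if ch in (ZWNJ, ZWJ))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def watermark(cover: str, payload: str) -> str:
    """Insert the hidden PAYLOAD right after the first word of COVER."""
    head, sep, tail = cover.partition(" ")
    return head + hide(payload) + sep + tail


marked = watermark("Hello world", "id42")
print(len("Hello world"), len(marked))  # the marked text is 32 chars longer
print(reveal(marked))
```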

> Second, does Emacs do anything now such that these tricks
> won't succeed?

Emacs by default displays ZWJ and ZWNJ characters (and any other
zero-width characters) as thin 1-pixel spaces on GUI frames, and as
simple spaces on TTY frames.  So Emacs users are likely to see these
"hidden" sequences of characters on display.

> If the problem exists in Emacs now, could we prevent it?  I see a few
> ways to try.  I don't know whether they would work well.
> 
> * Indicate the different encodings on the screen somehow.
> 
> * Canonicalize such sequences (perhaps when reading text into Emacs),
> so that different encodings of the same text become identical.
> 
> * Use a stand-alone canonicalizer program.

I don't think I understand your proposals.  They seem to be based on
some idea that these characters are "encodings" of something, and that
this encoding can be "canonicalized"?  If so, I think this
interpretation is a mistake: there's no encoding going on here.  These
zero-width characters' role is to help the text-shaping engine to
shape the characters around them correctly, according to the rules of
the script of those surrounding characters.  When those zero-width
characters are used for the purpose of hiding text, they appear as
sequences of zero-width characters without any reason, and in
particular the characters that surround them are likely to be
whitespace characters, which don't need any joiners to shape them.
The job of a feature that detects this is to discern between these two
use cases, and flag the suspicious one.

In any case, I don't think these solutions could work by examining
single characters.  ZWJ and ZWNJ are important characters in some
scripts, so we cannot mangle them based on considering isolated
characters.  We must consider sequences of such characters when we
design a feature that makes them stand out, because only at that level
can we distinguish between legitimate uses of those characters and
suspicious uses.
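
The sequence-level check described above could be sketched as follows,
in Python for brevity.  This is only a heuristic under stated
assumptions: the set of zero-width characters, the max_run threshold
and the name suspicious_runs are all hypothetical, and nothing here
reflects how textsec.el actually works.  A run is flagged when it is
longer than a joiner would normally need, or when it sits next to
whitespace (or the edge of the text), where no shaping is required.

```python
import re

# Assumed set of zero-width characters: ZWSP, ZWNJ, ZWJ, WORD JOINER, BOM.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"
RUN = re.compile(f"[{ZERO_WIDTH}]+")


def suspicious_runs(text, max_run=1):
    """Return (position, length) for each suspicious zero-width run."""
    hits = []
    for m in RUN.finditer(text):
        # Treat the start and end of the text like whitespace.
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        too_long = len(m.group()) > max_run
        beside_space = before.isspace() or after.isspace()
        if too_long or beside_space:
            hits.append((m.start(), len(m.group())))
    return hits


# A single ZWNJ between two Persian letters: a legitimate use, not flagged.
print(suspicious_runs("\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"))
# Eight joiners in a row before a space: flagged.
print(suspicious_runs("Hello" + "\u200c\u200d" * 4 + " world"))
```

The threshold of one is deliberately strict; scripts or emoji sequences
that legitimately need adjacent joiners would require a looser setting.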

I think we should introduce a minor mode that detects those sequences
and makes them stand out on display, with or without some warning
message in the echo-area.  People who want to be aware of any such
potentially hidden text will turn that on.  We could also turn it on
automatically in email and eww.  Patches are welcome; I believe we
already have the infrastructure in the new textsec.el package.

* Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
  2022-01-19  4:15 Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Richard Stallman
  2022-01-19  4:47 ` Po Lu
  2022-01-19  8:20 ` Eli Zaretskii
@ 2022-01-19 17:36 ` T.V Raman
  2 siblings, 0 replies; 104+ messages in thread
From: T.V Raman @ 2022-01-19 17:36 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:


This is indeed worrisome and has been around for a while. There is an
even more insidious form of this hack, where Unicode characters that
"appear like English letters" can be used, and a quick visual scan
will miss them. Spammers often use the trick in domain names within
URLs; for example, there are Cyrillic letters that "look like" Roman
letters.
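
One crude way to catch such look-alikes is to check whether a single
word mixes scripts.  The Python sketch below assumes that the first
word of a character's Unicode name (for example LATIN or CYRILLIC)
approximates its script; that is a simplification, not the real Script
property, and the names scripts_of and looks_mixed are hypothetical.

```python
import unicodedata


def scripts_of(word):
    """Approximate the set of scripts used by the letters in WORD."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Assumption: the first word of the name names the script.
                scripts.add(name.split(" ")[0])
    return scripts


def looks_mixed(word):
    """True if WORD appears to mix letters from more than one script."""
    return len(scripts_of(word)) > 1


print(looks_mixed("paypal"))        # all Latin letters
print(looks_mixed("p\u0430ypal"))   # second letter is CYRILLIC SMALL LETTER A
```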
> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> There is a thread now about confusables.
>
> I read this,
>
>    Unicode allows user tracking by means of invisible text marking. Any
>    string can be converted into its binary form and then recoded into a
>    string of zero-width characters, which can then be invisibly inserted
>    into the text. If the text is posted elsewhere, the zero-width
>    character string can be extracted and the process reversed to figure
>    out the identity of the person who copied it.
>
> which seems to be about a special case of confusables, and it makes me
> wonder whether Emacs does, or could, show users when Unicode confusion
> occurs, or prevent or fix it somehow.
>
> First, is that issue of invisible characters real?
>
> Second, does Emacs do anything now such that these tricks
> won't succeed?
>
> If the problem exists in Emacs now, could we prevent it?  I see a few
> ways to try.  I don't know whether they would work well.
>
> * Indicate the different encodings on the screen somehow.
>
> * Canonicalize such sequences (perhaps when reading text into Emacs),
> so that different encodings of the same text become identical.
>
> * Use a stand-alone canonicalizer program.

-- 

Thanks,

--Raman(I Search, I Find, I Misplace, I Research)
Id: kg:/m/0285kf1



end of thread, other threads:[~2022-02-19  4:54 UTC | newest]

Thread overview: 104+ messages
2022-01-19  4:15 Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Richard Stallman
2022-01-19  4:47 ` Po Lu
2022-01-19 10:05   ` Phil Sainty
2022-01-19 11:43     ` Eli Zaretskii
2022-01-21  4:13       ` Richard Stallman
2022-01-21  7:49         ` Eli Zaretskii
2022-01-22  4:37           ` Richard Stallman
2022-01-22  6:58             ` Eli Zaretskii
2022-01-24  4:33               ` Richard Stallman
2022-01-24  5:06                 ` Po Lu
2022-01-25  4:17                   ` Richard Stallman
2022-01-25  4:58                     ` Po Lu
2022-01-24 12:14                 ` Eli Zaretskii
2022-01-25  4:16                   ` Richard Stallman
2022-01-25  6:35                     ` Eli Zaretskii
2022-01-25 12:12                       ` Eli Zaretskii
2022-01-25  4:16                   ` New feature: displaying ligature characters in the buffer Richard Stallman
2022-01-25  6:31                     ` Eli Zaretskii
2022-01-27  4:12                       ` Richard Stallman
2022-01-27  7:58                         ` Eli Zaretskii
2022-01-25 11:08                   ` Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it? Kévin Le Gouguec
2022-01-25 12:38                     ` Eli Zaretskii
2022-01-26  3:39                       ` Richard Stallman
2022-01-26  5:38                         ` Eli Zaretskii
2022-01-28 13:04                           ` Richard Stallman
2022-01-28 13:31                             ` Eli Zaretskii
2022-01-30  4:17                               ` Richard Stallman
2022-01-30  7:36                                 ` Eli Zaretskii
2022-01-31  4:02                                   ` Richard Stallman
2022-01-31 13:05                                     ` Eli Zaretskii
2022-02-01  5:06                                       ` Richard Stallman
2022-02-01 14:57                                         ` Eli Zaretskii
2022-02-02  3:58                                           ` Richard Stallman
2022-02-02 12:28                                             ` Eli Zaretskii
2022-02-03  4:23                                               ` Richard Stallman
2022-02-03  7:53                                                 ` Eli Zaretskii
2022-02-03  8:16                                                   ` Yuri Khan
2022-02-03  9:26                                                     ` Eli Zaretskii
2022-02-04  3:52                                                   ` Richard Stallman
2022-02-04  4:56                                                     ` Yuri Khan
2022-02-06  4:13                                                       ` Richard Stallman
2022-02-04  8:10                                                     ` Eli Zaretskii
2022-02-06  4:13                                                       ` Richard Stallman
2022-02-03 20:28                                                 ` Tomas Hlavaty
2022-02-04  7:07                                                   ` Eli Zaretskii
2022-02-05  4:20                                                   ` Richard Stallman
2022-02-05 13:55                                                     ` Tomas Hlavaty
2022-02-05 14:06                                                       ` Eli Zaretskii
2022-02-05 14:12                                                         ` Eli Zaretskii
2022-02-06  1:29                                                           ` Tomas Hlavaty
2022-02-06  8:30                                                             ` Eli Zaretskii
2022-02-06 10:38                                                               ` Tomas Hlavaty
2022-02-06 10:44                                                                 ` Eli Zaretskii
2022-02-06 10:54                                                                 ` Andreas Schwab
2022-02-06  1:10                                                         ` Tomas Hlavaty
2022-02-06  4:16                                                         ` Richard Stallman
2022-02-06  4:16                                                       ` Richard Stallman
2022-02-06 11:29                                                         ` Tomas Hlavaty
2022-02-04  3:52                                                 ` Richard Stallman
2022-02-04  8:03                                                   ` Eli Zaretskii
2022-02-06  4:13                                                     ` Richard Stallman
2022-02-06  8:56                                                       ` Eli Zaretskii
2022-02-07  5:11                                                         ` Richard Stallman
2022-02-07 13:16                                                           ` Eli Zaretskii
2022-02-08  3:55                                                             ` Richard Stallman
2022-02-08 12:20                                                               ` Eli Zaretskii
2022-02-09  4:06                                                                 ` Richard Stallman
2022-02-09 13:50                                                                   ` Eli Zaretskii
2022-02-10  3:57                                                                     ` Richard Stallman
2022-02-10  6:26                                                                       ` Eli Zaretskii
2022-02-12  3:57                                                                         ` Richard Stallman
2022-02-12  7:36                                                                           ` Eli Zaretskii
2022-02-14  4:13                                                                             ` Richard Stallman
2022-02-14 12:07                                                                               ` Eli Zaretskii
2022-02-15  4:33                                                                                 ` Richard Stallman
2022-02-15 13:32                                                                                   ` Eli Zaretskii
2022-02-16  4:14                                                                                     ` Richard Stallman
2022-02-16 12:10                                                                                       ` Eli Zaretskii
2022-02-19  4:54                                                                                         ` Richard Stallman
2022-02-12 20:10                                                                           ` Tomas Hlavaty
2022-02-14  4:14                                                                             ` Richard Stallman
2022-01-26  8:20                         ` Andreas Schwab
2022-01-27  4:13                           ` Richard Stallman
2022-01-27  6:39                             ` Eli Zaretskii
2022-01-27  8:13                               ` Kévin Le Gouguec
2022-01-27  9:55                                 ` Eli Zaretskii
2022-01-27 10:29                                   ` Eli Zaretskii
2022-01-27 17:36                                     ` Kévin Le Gouguec
2022-01-27 18:38                                       ` Eli Zaretskii
2022-01-20  3:17     ` Richard Stallman
2022-01-20  4:54       ` Phil Sainty
2022-01-20  6:39         ` tomas
2022-01-20 17:58           ` [External] : " Drew Adams
2022-01-22  4:37           ` Richard Stallman
2022-01-22  5:16             ` Po Lu
2022-01-20  7:57         ` Eli Zaretskii
2022-01-20  6:35       ` Tim Cross
2022-01-20  7:39         ` tomas
2022-01-20  8:20         ` Eli Zaretskii
2022-01-20  7:48       ` Eli Zaretskii
2022-01-20  8:17         ` Lars Ingebrigtsen
2022-01-21  4:14         ` Richard Stallman
2022-01-19  8:20 ` Eli Zaretskii
2022-01-19 17:36 ` T.V Raman
