unofficial mirror of emacs-devel@gnu.org 
* Unicode confusables and reordering characters considered harmful
@ 2021-11-02 12:57 Vasilij Schneidermann
  2021-11-02 13:18 ` Po Lu
                   ` (6 more replies)
  0 siblings, 7 replies; 172+ messages in thread
From: Vasilij Schneidermann @ 2021-11-02 12:57 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1582 bytes --]

There's a paper going around that demonstrates how two Unicode features
can be used to trick source code auditors into misinterpreting program
logic. The authors have suggested that language specifications should be
amended, implementations should warn or raise errors and editor tooling
should display visual warnings. Both issues are tracked as
CVE-2021-42574 and CVE-2021-42694.

The first issue is about bidirectional reordering characters. If bidi
text rendering is not needed, it's easy enough to work around with
`(setq-default bidi-display-reordering nil)`. Some people already make
use of this to speed up redisplay. Maybe there's a better solution, such
as automatically detecting whether the user is working with a RTL script
and only then enabling bidi text rendering.

The second issue is about mixed-script confusable characters. Emacs does
not appear to have a workaround for that. I've come across the
uni-confusables package in GNU ELPA, but it merely sets up character
tables. The only mention of confusables I can find in the Emacs sources
is for `help-uni-confusables` which contains a much smaller list for
quotation marks, used in help buffers and elisp buffers. A
possible solution would be to implement the Unicode confusables
algorithm and expose it in the uni-confusables package.
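
To make the idea concrete, here is a rough Python sketch of a mixed-script
check on a single identifier. It is not the Unicode confusables algorithm
from TR39 (it does not read the confusables data at all), and the function
names `scripts_in` and `is_mixed_script` are made up for illustration. It
assumes that the first word of a character's Unicode name ("LATIN",
"CYRILLIC", ...) is a good enough stand-in for the Script property, which
Python's standard library does not expose.

```python
import unicodedata


def scripts_in(identifier):
    """Return the approximate set of scripts used by the letters of IDENTIFIER.

    Assumption: the first word of the Unicode character name stands in
    for the real Script property.  Non-letters (digits, "_") are ignored.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(identifier):
    """Return True if the letters of IDENTIFIER come from more than one script."""
    return len(scripts_in(identifier)) > 1


print(is_mixed_script("admin"))        # False: all LATIN
print(is_mixed_script("\u0430dmin"))   # True: CYRILLIC SMALL LETTER A + LATIN
```

A real implementation would presumably use the TR39 data files instead of
character names; this only shows the shape of the check.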

Vasilij

https://trojansource.codes/
https://www.trojansource.codes/trojan-source.pdf
https://github.com/nickboucher/trojan-source
https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/
https://unicode.org/reports/tr39/#Confusable_Detection

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 12:57 Unicode confusables and reordering characters considered harmful Vasilij Schneidermann
@ 2021-11-02 13:18 ` Po Lu
  2021-11-02 13:54   ` Uwe Brauer
  2021-11-02 14:31   ` Unicode confusables and reordering characters considered harmful Eli Zaretskii
  2021-11-02 13:42 ` tomas
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 172+ messages in thread
From: Po Lu @ 2021-11-02 13:18 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: emacs-devel

Vasilij Schneidermann <mail@vasilij.de> writes:

> The first issue is about bidirectional reordering characters. If bidi
> text rendering is not needed, it's easy enough to work around with
> `(setq-default bidi-display-reordering nil)`. Some people already make
> use of this to speed up redisplay. Maybe there's a better solution, such
> as automatically detecting whether the user is working with a RTL script
> and only then enable bidi text rendering.

Isn't bidi-display-reordering obsolete and present only as a debugging
option?  Or am I misunderstanding something?




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 12:57 Unicode confusables and reordering characters considered harmful Vasilij Schneidermann
  2021-11-02 13:18 ` Po Lu
@ 2021-11-02 13:42 ` tomas
  2021-11-02 14:57   ` Stefan Kangas
  2021-11-02 14:30 ` Eli Zaretskii
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 172+ messages in thread
From: tomas @ 2021-11-02 13:42 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 396 bytes --]

On Tue, Nov 02, 2021 at 01:57:20PM +0100, Vasilij Schneidermann wrote:
> There's a paper going around that demonstrates how two Unicode features
> can be used to trick source code auditors into misinterpreting program
> logic.

"Trojan source", yes. A discussion was started already at help-gnu-emacs,
Message-ID: <CANc-5Uy_au4VV2AGWO1pYHZHVTfHFqCmig06GdN5CfHfrBu1tA@mail.gmail.com>

Cheers
 - t


* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 13:18 ` Po Lu
@ 2021-11-02 13:54   ` Uwe Brauer
  2021-11-02 14:53     ` Eli Zaretskii
  2021-11-02 14:31   ` Unicode confusables and reordering characters considered harmful Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Uwe Brauer @ 2021-11-02 13:54 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

>>> "PL" == Po Lu <luangruo@yahoo.com> writes:

> Vasilij Schneidermann <mail@vasilij.de> writes:
>> The first issue is about bidirectional reordering characters. If bidi
>> text rendering is not needed, it's easy enough to work around with
>> `(setq-default bidi-display-reordering nil)`. Some people already make
>> use of this to speed up redisplay. Maybe there's a better solution, such
>> as automatically detecting whether the user is working with a RTL script
>> and only then enable bidi text rendering.

> Isn't bidi-display-reordering obsolete and present only as a debugging
> option?  Or am I misunderstanding something?

I was about to ask about this article, especially since Emacs was also
investigated. The authors don't mention this variable when concluding
that Emacs could be fooled. Hm.


* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 12:57 Unicode confusables and reordering characters considered harmful Vasilij Schneidermann
  2021-11-02 13:18 ` Po Lu
  2021-11-02 13:42 ` tomas
@ 2021-11-02 14:30 ` Eli Zaretskii
  2021-11-02 14:43 ` Clément Pit-Claudel
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 14:30 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: emacs-devel

> Date: Tue, 2 Nov 2021 13:57:20 +0100
> From: Vasilij Schneidermann <mail@vasilij.de>
> 
> The first issue is about bidirectional reordering characters. If bidi
> text rendering is not needed, it's easy enough to work around with
> `(setq-default bidi-display-reordering nil)`. Some people already make
> use of this to speed up redisplay. Maybe there's a better solution, such
> as automatically detecting whether the user is working with a RTL script
> and only then enable bidi text rendering.

  https://lists.gnu.org/archive/html/help-gnu-emacs/2021-11/msg00023.html

"We have the technology."




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 13:18 ` Po Lu
  2021-11-02 13:54   ` Uwe Brauer
@ 2021-11-02 14:31   ` Eli Zaretskii
  2021-11-02 15:13     ` Uwe Brauer
  1 sibling, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 14:31 UTC (permalink / raw)
  To: Po Lu; +Cc: emacs-devel, mail

> From: Po Lu <luangruo@yahoo.com>
> Cc: emacs-devel@gnu.org
> Date: Tue, 02 Nov 2021 21:18:08 +0800
> 
> Vasilij Schneidermann <mail@vasilij.de> writes:
> 
> > The first issue is about bidirectional reordering characters. If bidi
> > text rendering is not needed, it's easy enough to work around with
> > `(setq-default bidi-display-reordering nil)`. Some people already make
> > use of this to speed up redisplay. Maybe there's a better solution, such
> > as automatically detecting whether the user is working with a RTL script
> > and only then enable bidi text rendering.
> 
> Isn't bidi-display-reordering obsolete and present only as a debugging
> option?

It is.  But this is a free world: people are free to do whatever they
want (to their cost), and we cannot stop them from doing that.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 12:57 Unicode confusables and reordering characters considered harmful Vasilij Schneidermann
                   ` (2 preceding siblings ...)
  2021-11-02 14:30 ` Eli Zaretskii
@ 2021-11-02 14:43 ` Clément Pit-Claudel
  2021-11-03 15:07   ` Reini Urban
  2021-11-02 14:57 ` Stefan Kangas
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 172+ messages in thread
From: Clément Pit-Claudel @ 2021-11-02 14:43 UTC (permalink / raw)
  To: emacs-devel

On 11/2/21 8:57 AM, Vasilij Schneidermann wrote:
> There's a paper going around that demonstrates how two Unicode features
> can be used to trick source code auditors into misinterpreting program
> logic. The authors have suggested that language specifications should be
> amended, implementations should warn or raise errors and editor tooling
> should display visual warnings. Both issues are tracked as
> CVE-2021-42574 and CVE-2021-42694.

There is a good summary of the issue and relevant mitigations at https://research.swtch.com/trojan (it argues against compiler fixes and in favor of IDE enhancements.)




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 13:54   ` Uwe Brauer
@ 2021-11-02 14:53     ` Eli Zaretskii
  2021-11-02 15:16       ` Eli Zaretskii
                         ` (2 more replies)
  0 siblings, 3 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 14:53 UTC (permalink / raw)
  To: Uwe Brauer; +Cc: emacs-devel

> From: Uwe Brauer <oub@mat.ucm.es>
> Date: Tue, 02 Nov 2021 14:54:34 +0100
> 
> I was about to ask about this article especially since emacs was also
> investigated.

"Investigated" is a very string word for what I found there about
Emacs.  They didn't even bother to mention which editors display the
formatting controls and which make them invisible (and thus much
harder to find, even if one is vigilant).

Another issue with this paper is that they dismiss the importance of
leaving the formatting controls on display and the seemingly-erratic
cursor motion around the problematic text, saying that "this requires
more attention than is given by most developers", but then go ahead
and propose to flag the problematic text on display, which somehow
should attract enough attention and overcome those problems.  So much
for unbiased logic...




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 12:57 Unicode confusables and reordering characters considered harmful Vasilij Schneidermann
                   ` (3 preceding siblings ...)
  2021-11-02 14:43 ` Clément Pit-Claudel
@ 2021-11-02 14:57 ` Stefan Kangas
  2021-11-05 18:53 ` Unicode confusables " Vasilij Schneidermann
  2021-11-10 15:47 ` Unicode confusables and reordering characters " Dmitry Gutov
  6 siblings, 0 replies; 172+ messages in thread
From: Stefan Kangas @ 2021-11-02 14:57 UTC (permalink / raw)
  To: Vasilij Schneidermann, emacs-devel

Vasilij Schneidermann <mail@vasilij.de> writes:

> There's a paper going around that demonstrates how two Unicode features
> can be used to trick source code auditors into misinterpreting program
> logic. The authors have suggested that language specifications should be
> amended, implementations should warn or raise errors and editor tooling
> should display visual warnings. Both issues are tracked as
> CVE-2021-42574 and CVE-2021-42694.

This is the list of solutions proposed on https://trojansource.codes/

(1) Compilers, interpreters, and build pipelines supporting Unicode should
    throw errors or warnings for unterminated bidirectional control
    characters in comments or string literals, and for identifiers with
    mixed-script confusable characters.

(2) Language specifications should formally disallow unterminated
    bidirectional control characters in comments and string literals.

(3) Code editors and repository frontends should make bidirectional control
    characters and mixed-script confusable characters perceptible with
    visual symbols or warnings.
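
The "unterminated bidirectional control characters" check in (1) and (2)
can be sketched in a few lines of Python. This is only an illustration
under simplifying assumptions, not what any compiler actually does: the
name `has_unterminated_bidi` is hypothetical, the caller is assumed to have
already extracted the text of one comment or string literal, and embeddings
and isolates are counted independently rather than following the full
bidirectional algorithm.

```python
# Characters that open an embedding or override; closed by PDF (U+202C).
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
# Characters that open an isolate; closed by PDI (U+2069).
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def has_unterminated_bidi(span):
    """Return True if SPAN opens a bidi embedding or isolate it never closes."""
    embeddings = 0
    isolates = 0
    for ch in span:
        if ch in EMBEDDING_OPENERS:
            embeddings += 1
        elif ch in ISOLATE_OPENERS:
            isolates += 1
        elif ch == PDF and embeddings > 0:
            embeddings -= 1
        elif ch == PDI and isolates > 0:
            isolates -= 1
    return embeddings > 0 or isolates > 0


# A balanced isolate is fine; an override left open at the end is flagged.
print(has_unterminated_bidi("\u2067abc\u2069"))                               # False
print(has_unterminated_bidi("user\u202e \u2066// Check if admin\u2069 \u2066"))  # True
```

Flagging only unbalanced controls, rather than every control, is what
keeps legitimate right-to-left text from triggering the warning.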




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 13:42 ` tomas
@ 2021-11-02 14:57   ` Stefan Kangas
  0 siblings, 0 replies; 172+ messages in thread
From: Stefan Kangas @ 2021-11-02 14:57 UTC (permalink / raw)
  To: tomas, Vasilij Schneidermann; +Cc: emacs-devel

<tomas@tuxteam.de> writes:

> "Trojan source", yes. A discussion was started already at help-gnu-emacs,
> Message-ID: <CANc-5Uy_au4VV2AGWO1pYHZHVTfHFqCmig06GdN5CfHfrBu1tA@mail.gmail.com>

Direct link:
https://lists.gnu.org/archive/html/help-gnu-emacs/2021-11/msg00019.html




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 14:31   ` Unicode confusables and reordering characters considered harmful Eli Zaretskii
@ 2021-11-02 15:13     ` Uwe Brauer
  0 siblings, 0 replies; 172+ messages in thread
From: Uwe Brauer @ 2021-11-02 15:13 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 309 bytes --]



> It is.  But this is a free world: people are free to do whatever they
> want (to their cost), and we cannot stop them from doing that.

Just for the record: I already asked one of the authors whether they
bothered to check bidi-display-reordering.

Truth be told, I phrased it a bit more politely.



* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 14:53     ` Eli Zaretskii
@ 2021-11-02 15:16       ` Eli Zaretskii
  2021-11-02 15:21         ` Uwe Brauer
  2021-11-02 16:24       ` Clément Pit-Claudel
  2021-11-02 17:24       ` [authors: default bidi-display-reordering is set to t] (was: Unicode confusables and reordering characters considered harmful) Uwe Brauer
  2 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 15:16 UTC (permalink / raw)
  To: oub; +Cc: emacs-devel

> Date: Tue, 02 Nov 2021 16:53:32 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
> 
> "Investigated" is a very string word for what I found there about
                           ^^^^^^
I meant "strong", of course.  Sorry.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 15:16       ` Eli Zaretskii
@ 2021-11-02 15:21         ` Uwe Brauer
  0 siblings, 0 replies; 172+ messages in thread
From: Uwe Brauer @ 2021-11-02 15:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: oub, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 337 bytes --]

>>> "EZ" == Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Tue, 02 Nov 2021 16:53:32 +0200
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: emacs-devel@gnu.org
>> 
>> "Investigated" is a very string word for what I found there about
>                            ^^^^^^
> I meant "strong", of course.  Sorry.

A BUG I suppose 😇


* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 14:53     ` Eli Zaretskii
  2021-11-02 15:16       ` Eli Zaretskii
@ 2021-11-02 16:24       ` Clément Pit-Claudel
  2021-11-02 16:47         ` Eli Zaretskii
  2021-11-02 19:17         ` Yuri Khan
  2021-11-02 17:24       ` [authors: default bidi-display-reordering is set to t] (was: Unicode confusables and reordering characters considered harmful) Uwe Brauer
  2 siblings, 2 replies; 172+ messages in thread
From: Clément Pit-Claudel @ 2021-11-02 16:24 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

On 11/2/21 10:53 AM, Eli Zaretskii wrote:
>> From: Uwe Brauer <oub@mat.ucm.es>
>> Date: Tue, 02 Nov 2021 14:54:34 +0100
>>
>> I was about to ask about this article especially since emacs was also
>> investigated.
> 
> "Investigated" is a very string word for what I found there about
> Emacs.  They didn't even bother to mention which editors display the
> formatting controls and which make them invisible (and thus much
> harder to find, even if one is vigilant).

`emacs -Q` doesn't display these formatting controls for me (screenshot and corresponding file attached).  Should I report a bug?



[-- Attachment #2: emacs-bidi.png --]
[-- Type: image/png, Size: 18202 bytes --]

[-- Attachment #3: emacs-bidi.py --]
[-- Type: text/x-python, Size: 59 bytes --]

if access_level != "user‮ ⁦// Check if admin⁩ ⁦" {


* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 16:24       ` Clément Pit-Claudel
@ 2021-11-02 16:47         ` Eli Zaretskii
  2021-11-02 17:01           ` Stefan Kangas
  2021-11-02 18:16           ` Clément Pit-Claudel
  2021-11-02 19:17         ` Yuri Khan
  1 sibling, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 16:47 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Tue, 2 Nov 2021 12:24:21 -0400
> 
> > "Investigated" is a very string word for what I found there about
> > Emacs.  They didn't even bother to mention which editors display the
> > formatting controls and which make them invisible (and thus much
> > harder to find, even if one is vigilant).
> 
> `emacs -Q` doesn't display these formatting controls for me (screenshot and corresponding file attached).  Should I report a bug?

It does display them, look closer.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 16:47         ` Eli Zaretskii
@ 2021-11-02 17:01           ` Stefan Kangas
  2021-11-02 17:10             ` Eli Zaretskii
  2021-11-02 18:16           ` Clément Pit-Claudel
  1 sibling, 1 reply; 172+ messages in thread
From: Stefan Kangas @ 2021-11-02 17:01 UTC (permalink / raw)
  To: Eli Zaretskii, Clément Pit-Claudel; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> `emacs -Q` doesn't display these formatting controls for me
>> (screenshot and corresponding file attached).  Should I report a bug?
>
> It does display them, look closer.

I also can't see it in the screenshot, or in my local Emacs.  Could you
describe what we're missing?




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 17:01           ` Stefan Kangas
@ 2021-11-02 17:10             ` Eli Zaretskii
  2021-11-02 18:43               ` Stefan Kangas
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 17:10 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: cpitclaudel, emacs-devel

> From: Stefan Kangas <stefankangas@gmail.com>
> Date: Tue, 2 Nov 2021 10:01:51 -0700
> Cc: emacs-devel@gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> `emacs -Q` doesn't display these formatting controls for me
> >> (screenshot and corresponding file attached).  Should I report a bug?
> >
> > It does display them, look closer.
> 
> I also can't see it in the screenshot, or in my local Emacs.  Could you
> describe what we're missing?

I don't know what you are missing.  How are you looking for them, and
what do you expect to see?




* [authors: default bidi-display-reordering is set to t] (was: Unicode confusables and reordering characters considered harmful)
  2021-11-02 14:53     ` Eli Zaretskii
  2021-11-02 15:16       ` Eli Zaretskii
  2021-11-02 16:24       ` Clément Pit-Claudel
@ 2021-11-02 17:24       ` Uwe Brauer
  2021-11-02 17:37         ` Eli Zaretskii
  2 siblings, 1 reply; 172+ messages in thread
From: Uwe Brauer @ 2021-11-02 17:24 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 915 bytes --]

>>> "EZ" == Eli Zaretskii <eliz@gnu.org> writes:

>> From: Uwe Brauer <oub@mat.ucm.es>
>> Date: Tue, 02 Nov 2021 14:54:34 +0100
>> 
>> I was about to ask about this article especially since emacs was also
>> investigated.

> "Investigated" is a very string word for what I found there about
> Emacs.  They didn't even bother to mention which editors display the
> formatting controls and which make them invisible (and thus much
> harder to find, even if one is vigilant).

One of the authors just answered my question about setting the variable 
bidi-display-reordering and I quote 

,----
| I kept all of the default configurations in each of our tests, so whichever
| the default config is would be what we used.
`----

I just tested: emacs -Q 

Then that variable is set to t.

Maybe that is not a good idea?

I wonder how the test would turn out with the variable set to nil.





* Re: [authors: default bidi-display-reordering is set to t] (was: Unicode confusables and reordering characters considered harmful)
  2021-11-02 17:24       ` [authors: default bidi-display-reordering is set to t] (was: Unicode confusables and reordering characters considered harmful) Uwe Brauer
@ 2021-11-02 17:37         ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 17:37 UTC (permalink / raw)
  To: Uwe Brauer; +Cc: emacs-devel

> From: Uwe Brauer <oub@mat.ucm.es>
> Date: Tue, 02 Nov 2021 18:24:24 +0100
> 
> I just tested: emacs -Q 
> 
> Then that variable is set to t.
> 
> Maybe that is not a good idea?

The display engine will not work reliably if you reset it to nil.
We keep that variable for debugging purposes; users are well advised
not to change its value, certainly not in production work.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 16:47         ` Eli Zaretskii
  2021-11-02 17:01           ` Stefan Kangas
@ 2021-11-02 18:16           ` Clément Pit-Claudel
  2021-11-02 18:37             ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Clément Pit-Claudel @ 2021-11-02 18:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 11/2/21 12:47 PM, Eli Zaretskii wrote:
>> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
>> Date: Tue, 2 Nov 2021 12:24:21 -0400
>>
>>> "Investigated" is a very string word for what I found there about
>>> Emacs.  They didn't even bother to mention which editors display the
>>> formatting controls and which make them invisible (and thus much
>>> harder to find, even if one is vigilant).
>>
>> `emacs -Q` doesn't display these formatting controls for me (screenshot and corresponding file attached).  Should I report a bug?
> 
> It does display them, look closer.

Sorry, I don't see it.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 18:16           ` Clément Pit-Claudel
@ 2021-11-02 18:37             ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 18:37 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: emacs-devel

> From: Clément Pit-Claudel <cpitclaudel@gmail.com>
> Date: Tue, 2 Nov 2021 14:16:55 -0400
> Cc: emacs-devel@gnu.org
> 
> >> `emacs -Q` doesn't display these formatting controls for me (screenshot and corresponding file attached).  Should I report a bug?
> > 
> > It does display them, look closer.
> 
> Sorry, I don't see it.

Please describe how and where you are looking for them, and what you
expect to see there.  We have some basic misunderstanding here.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 17:10             ` Eli Zaretskii
@ 2021-11-02 18:43               ` Stefan Kangas
  2021-11-02 18:49                 ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Stefan Kangas @ 2021-11-02 18:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> > It does display them, look closer.
>>
>> I also can't see it in the screenshot, or in my local Emacs.  Could you
>> describe what we're missing?
>
> I don't know what you are missing.  How are you looking for them, and
> what do you expect to see?

I expect to see some visual indication that the "formatting controls"
are there.  As for how I do it, I just look at the screenshot, and I
don't understand in what way they are displayed.  It looks
indistinguishable from text without such controls.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 18:43               ` Stefan Kangas
@ 2021-11-02 18:49                 ` Eli Zaretskii
  2021-11-02 19:12                   ` Stefan Monnier
  2021-11-02 19:26                   ` Stefan Kangas
  0 siblings, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 18:49 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: cpitclaudel, emacs-devel

> From: Stefan Kangas <stefankangas@gmail.com>
> Date: Tue, 2 Nov 2021 11:43:18 -0700
> Cc: cpitclaudel@gmail.com, emacs-devel@gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> > It does display them, look closer.
> >>
> >> I also can't see it in the screenshot, or in my local Emacs.  Could you
> >> describe what we're missing?
> >
> > I don't know what you are missing.  How are you looking for them, and
> > what do you expect to see?
> 
> I expect to see some visual indication that the "formatting controls"
> are there.  As for how I do it, I just look at the screenshot, and I
> don't understand in what way they are displayed.  It looks
> indistinguishable from text without such controls.

You cannot see those characters on a screenshot, for the same reason
you cannot see any whitespace characters on a screenshot: they are
only discernible when you move the cursor through them.  Which is why I
asked how you are looking for them.  If you are looking for them in a
screenshot, you will never see them.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 18:49                 ` Eli Zaretskii
@ 2021-11-02 19:12                   ` Stefan Monnier
  2021-11-02 19:36                     ` Eli Zaretskii
  2021-11-03  0:28                     ` Gregory Heytings
  2021-11-02 19:26                   ` Stefan Kangas
  1 sibling, 2 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-02 19:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Kangas, cpitclaudel, emacs-devel

> You cannot see those characters on a screenshot, for the same reason
> you cannot see any whitespace characters on a screenshot: they are
> only discernible when you move cursor through them.  Which is why I
> asked how are you looking for them.  If you are looking for them in a
> screenshot, you will never see them.

But that's the core of the vulnerability: if you just look at the screen
(and just scroll through it) you will have an incorrect understanding of
what the code does.

It's good that such bidi override chars are displayed as a thin space,
but it's mostly useful to make it possible to edit them (or to `C-x =`
on them), but I don't think it makes a significant difference in terms of
the security issues introduced by the presence of those chars in the code.


        Stefan





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 16:24       ` Clément Pit-Claudel
  2021-11-02 16:47         ` Eli Zaretskii
@ 2021-11-02 19:17         ` Yuri Khan
  2021-11-02 19:37           ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Yuri Khan @ 2021-11-02 19:17 UTC (permalink / raw)
  To: Clément Pit-Claudel; +Cc: Emacs developers

On Tue, 2 Nov 2021 at 23:31, Clément Pit-Claudel <cpitclaudel@gmail.com> wrote:

> `emacs -Q` doesn't display these formatting controls for me (screenshot and corresponding file attached).  Should I report a bug?

On a different note, the code in the screenshot is immediately sus
because the alleged comment is highlighted with string face.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 18:49                 ` Eli Zaretskii
  2021-11-02 19:12                   ` Stefan Monnier
@ 2021-11-02 19:26                   ` Stefan Kangas
  2021-11-02 19:44                     ` Eli Zaretskii
  2021-11-02 19:49                     ` Stefan Monnier
  1 sibling, 2 replies; 172+ messages in thread
From: Stefan Kangas @ 2021-11-02 19:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> You cannot see those characters on a screenshot, for the same reason
> you cannot see any whitespace characters on a screenshot: they are
> only discernible when you move cursor through them.  Which is why I
> asked how are you looking for them.  If you are looking for them in a
> screenshot, you will never see them.

Now I see what you mean.  Yes, they are clearly "visible" when you move
the cursor through them, in the sense that the cursor will jump.  (Does
this not happen in other text editors?)

Could we add some additional visual indication or warning for such
characters, in the light of this discussion?  I think there is a clear
risk that users will not step the cursor through some code they copied
from a website, or a patch with tens or hundreds of lines.  This means
that they might miss the indication that we have.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:12                   ` Stefan Monnier
@ 2021-11-02 19:36                     ` Eli Zaretskii
  2021-11-02 19:47                       ` Stefan Monnier
  2021-11-02 20:18                       ` Unicode confusables and reordering characters considered harmful Tim Cross
  2021-11-03  0:28                     ` Gregory Heytings
  1 sibling, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 19:36 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cpitclaudel, stefan, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Stefan Kangas <stefan@marxist.se>,  cpitclaudel@gmail.com,
>   emacs-devel@gnu.org
> Date: Tue, 02 Nov 2021 15:12:56 -0400
> 
> > You cannot see those characters on a screenshot, for the same reason
> > you cannot see any whitespace characters on a screenshot: they are
> > only discernible when you move cursor through them.  Which is why I
> > asked how are you looking for them.  If you are looking for them in a
> > screenshot, you will never see them.
> 
> But that's the core of the vulnerability: if you just look at the screen
> (and just scroll through it) you will have an incorrect understanding of
> what the code does.

If you want a more prominent display, customize
glyphless-char-display-control to show format-control characters as
acronyms, say, or as hex-code.
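
For readers outside Emacs, the effect of a hex-code display can be
approximated with a short Python sketch. It is not an Emacs configuration
and says nothing about how glyphless-char-display-control is implemented;
the function name `reveal_format_controls` and the `<U+XXXX>` notation are
arbitrary choices. It assumes that "format-control characters" can be
approximated by Unicode general category Cf, which covers the bidi controls
but also things like the soft hyphen.

```python
import unicodedata


def reveal_format_controls(text):
    """Return TEXT with every format character (category Cf) spelled out.

    Each such character is replaced by a visible "<U+XXXX>" marker, so
    invisible controls such as U+202E show up when the text is printed.
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append("<U+{:04X}>".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


print(reveal_format_controls("user\u202e \u2066// Check if admin"))
# user<U+202E> <U+2066>// Check if admin
```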

And anyway, my point was that Emacs deviates from Unicode here, which
says not to show these controls at all, and by deviating it gives the
user some defense against these problems.  I did say originally the
defense was "weak", so if you want to point out that this is a weak
defense, you are preaching to the choir.

> It's good that such bidi override chars are displayed as a thin space,
> but it's mostly useful to make it possible to edit them (or to `C-x =`
> on them), but I don't think it makes a significant difference in terms of
> the security issues introduced by the presence of those chars in the code.

In most cases, there's no need to make these controls stand out,
because situations where this presents security risks are extremely
rare, to put it mildly, and OTOH having them stand out more by default
will make it harder to read text with completely legitimate uses of
these controls (example: TUTORIAL.he).  Therefore, IMNSHO it's okay to
have this off by default (and have a way of turning that on in case of
increased paranoia).  Moreover, I think adding features that detect
the suspicious uses of this functionality will better serve our users
than just showing the controls more prominently, because it will have
a much lower probability of false positives, and will avoid getting in
the way of reading legitimate text which uses these controls for valid
reasons.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:17         ` Yuri Khan
@ 2021-11-02 19:37           ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 19:37 UTC (permalink / raw)
  To: Yuri Khan; +Cc: cpitclaudel, emacs-devel

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Wed, 3 Nov 2021 02:17:29 +0700
> Cc: Emacs developers <emacs-devel@gnu.org>
> 
> On Tue, 2 Nov 2021 at 23:31, Clément Pit-Claudel <cpitclaudel@gmail.com> wrote:
> 
> > `emacs -Q` doesn't display these formatting controls for me (screenshot and corresponding file attached).  Should I report a bug?
> 
> On a different note, the code in the screenshot is immediately sus
> because the alleged comment is highlighted with string face.

Yes, the authors mention that, and then go on to dismiss that as an
indication that will be noticed.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:26                   ` Stefan Kangas
@ 2021-11-02 19:44                     ` Eli Zaretskii
  2021-11-02 19:49                     ` Stefan Monnier
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 19:44 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: cpitclaudel, emacs-devel

> From: Stefan Kangas <stefankangas@gmail.com>
> Date: Tue, 2 Nov 2021 12:26:29 -0700
> Cc: cpitclaudel@gmail.com, emacs-devel@gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > You cannot see those characters on a screenshot, for the same reason
> > you cannot see any whitespace characters on a screenshot: they are
> > only discernible when you move cursor through them.  Which is why I
> > asked how are you looking for them.  If you are looking for them in a
> > screenshot, you will never see them.
> 
> Now I see what you mean.  Yes, they are clearly "visible" when you move
> the cursor through them, in the sense that the cursor will jump.

Not only will it jump, it will display as a thin 1-pixel shape when
point is at the position of these control characters.  Just move it
slowly so as not to miss that.

> (Does this not happen in other text editors?)

It should, because almost all (if not all) editors provide "logical"
(as opposed to "visual") cursor motion as the default.  The paper
mentions (and dismisses) that.

> Could we add some additional visual indication or warning for such
> characters, in the light of this discussion?

That was why I wrote bidi-find-overridden-directionality several years
ago.  My opinion about this should be clear to you from that fact
alone.  That function has zero uses.  What does this tell you about
the _real_ importance of this kind of problem in people's eyes, even
those very people who argued back then we _must_ have a feature like
that?




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:36                     ` Eli Zaretskii
@ 2021-11-02 19:47                       ` Stefan Monnier
  2021-11-02 19:51                         ` Eli Zaretskii
  2021-11-02 20:18                       ` Unicode confusables and reordering characters considered harmful Tim Cross
  1 sibling, 1 reply; 172+ messages in thread
From: Stefan Monnier @ 2021-11-02 19:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stefan, cpitclaudel, emacs-devel

> In most cases, there's no need to make these controls stand out,
> because situations where this presents security risks are extremely
> rare, to put it mildly, and OTOH having them stand out more by default
> will make it harder to read text with completely legitimate uses of
> these controls (example: TUTORIAL.he).

Fully agreed.  That's the problem: how to define the problematic cases
in a precise enough way that it doesn't rule out lots of
legitimate cases.


        Stefan





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:26                   ` Stefan Kangas
  2021-11-02 19:44                     ` Eli Zaretskii
@ 2021-11-02 19:49                     ` Stefan Monnier
  1 sibling, 0 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-02 19:49 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Eli Zaretskii, cpitclaudel, emacs-devel

> Could we add some additional visual indication or warning for such
> characters, in the light of this discussion?  I think there is a clear
> risk that users will not step the cursor through some code they copied
> from a website, or a patch with tens or hundreds of lines.  This means
> that they might miss the indication that we have.

That's only one very specific attack vector.  There are many more.
I don't think it makes much sense to try and fix this one without trying
to fix the others.


        Stefan





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:47                       ` Stefan Monnier
@ 2021-11-02 19:51                         ` Eli Zaretskii
  2021-11-02 21:28                           ` Unicode confusables and reordering characters considered harmful, a simple solution Daniel Brooks
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-02 19:51 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cpitclaudel, stefan, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: stefan@marxist.se,  cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Tue, 02 Nov 2021 15:47:27 -0400
> 
> > In most cases, there's no need to make these controls stand out,
> > because situations where this presents security risks are extremely
> > rare, to put it mildly, and OTOH having them stand out more by default
> > will make it harder to read text with completely legitimate uses of
> > these controls (example: TUTORIAL.he).
> 
> Fully agreed.  That's the problem: how to define the problematic cases
> in a precise enough way that it doesn't rule out lots of
> legitimate cases.

That's what bidi-find-overridden-directionality already does, albeit
not yet for the specific examples in that paper.  But Someone™ should
write a minor mode or an optional display feature which uses that
function to highlight the problematic stretches of text on display,
using the function's output for finding such stretches of text.
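
A rough sketch of the kind of command such a mode could start from.
The names `my-bidi-flag-first-override' and `my-bidi-override' are made
up for illustration; only `bidi-find-overridden-directionality' is the
existing function, called here with an explicit nil OBJECT to mean the
current buffer.  As said above, it may not catch the specific examples
in the paper.

```elisp
(defun my-bidi-flag-first-override ()
  "Highlight the line holding the first overridden character, if any.
Return the buffer position found, or nil when nothing was found."
  (interactive)
  ;; Drop highlights left by a previous run (hypothetical property name).
  (remove-overlays (point-min) (point-max) 'my-bidi-override t)
  (let ((pos (bidi-find-overridden-directionality
              (point-min) (point-max) nil)))
    (when pos
      (save-excursion
        (goto-char pos)
        (let ((ov (make-overlay (line-beginning-position)
                                (line-end-position))))
          (overlay-put ov 'my-bidi-override t)
          (overlay-put ov 'face 'font-lock-warning-face))))
    pos))
```

A real minor mode would repeat the search over the whole buffer and
re-run it after changes; this only shows how the function's return
value can drive the highlighting.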




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:36                     ` Eli Zaretskii
  2021-11-02 19:47                       ` Stefan Monnier
@ 2021-11-02 20:18                       ` Tim Cross
  1 sibling, 0 replies; 172+ messages in thread
From: Tim Cross @ 2021-11-02 20:18 UTC (permalink / raw)
  To: emacs-devel


Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Cc: Stefan Kangas <stefan@marxist.se>,  cpitclaudel@gmail.com,
>>   emacs-devel@gnu.org
>> Date: Tue, 02 Nov 2021 15:12:56 -0400
>> 
>> > You cannot see those characters on a screenshot, for the same reason
>> > you cannot see any whitespace characters on a screenshot: they are
>> > only discernible when you move cursor through them.  Which is why I
>> > asked how are you looking for them.  If you are looking for them in a
>> > screenshot, you will never see them.
>> 
>> But that's the core of the vulnerability: if you just look at the screen
>> (and just scroll through it) you will have an incorrect understanding of
>> what the code does.
>
> If you want a more prominent display, customize
> glyphless-char-display-control to show format-control characters as
> acronyms, say, or as hex-code.
>
> And anyway, my point was that Emacs deviates from Unicode here, which
> says not to show these controls at all, and by deviating it gives the
> user some defense against these problems.  I did say originally the
> defense was "weak", so if you want to point out that this is a weak
> defense, you are preaching to the choir.
>
>> It's good that such bidi override chars are displayed as a thin space,
>> but it's mostly useful to make it possible to edit them (or to `C-x =`
>> on them), but I don't think it makes a significant difference in terms of
>> the security issues introduced by the presence of those chars in the code.
>
> In most cases, there's no need to make these controls stand out,
> because situations where this presents security risks are extremely
> rare, to put it mildly, and OTOH having them stand out more by default
> will make it harder to read text with completely legitimate uses of
> these controls (example: TUTORIAL.he).  Therefore, IMNSHO it's okay to
> have this off by default (and have a way of turning that on in case of
> increased paranoia).  Moreover, I think adding features that detect
> the suspicious uses of this functionality will better serve our users
> than just showing the controls more prominently, because it will have
> a much lower probability of false positives, and will avoid getting in
> the way of reading legitimate text which uses these controls for valid
> reasons.


Totally agree. There is no point having additional visual clues for a
security issue which is not well understood by most users. It will just
cause user frustration and as pointed out, in some cases make legitimate
uses look worse. On the other hand, having features which warn the user
of suspicious instances is likely to be more useful and informative,
even to those who are not terribly aware of the issue (I'm assuming such
feature would provide some sort of warning and a reference to where to
find more detailed explanation). 




* Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-02 19:51                         ` Eli Zaretskii
@ 2021-11-02 21:28                           ` Daniel Brooks
  2021-11-03 13:30                             ` Eli Zaretskii
  2021-11-03 17:41                             ` Yuri Khan
  0 siblings, 2 replies; 172+ messages in thread
From: Daniel Brooks @ 2021-11-02 21:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stefan, Stefan Monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Cc: stefan@marxist.se,  cpitclaudel@gmail.com,  emacs-devel@gnu.org
>> Date: Tue, 02 Nov 2021 15:47:27 -0400
>> 
>> > In most cases, there's no need to make these controls stand out,
>> > because situations where this presents security risks are extremely
>> > rare, to put it mildly, and OTOH having them stand out more by default
>> > will make it harder to read text with completely legitimate uses of
>> > these controls (example: TUTORIAL.he).
>> 
>> Fully agreed.  That's the problem: how to define the problematic cases
>> in a precise enough way that it doesn't rule out lots of
>> legitimate cases.
>
> That's what bidi-find-overridden-directionality already does, albeit
> not yet for the specific examples in that paper.  But Someone™ should
> write a minor mode or an optional display feature which uses that
> function to highlight the problematic stretches of text on display,
> using the function's output for finding such stretches of text.

We already have it; it is called whitespace-mode. It’s not perfect, but
this morning I customized mine to make these characters more obvious:

(custom-set-variables
 '(whitespace-display-mappings
   '((space-mark 32 [183] [46])
     (space-mark 160 [164] [95])
     (newline-mark 10 [36 10])
     (tab-mark 9 [187 9] [92 9])
     (space-mark #x202A [#x21D2]) ; ⇒ LEFT-TO-RIGHT EMBEDDING
     (space-mark #x202B [#x21D0]) ; ⇐ RIGHT-TO-LEFT EMBEDDING
     (space-mark #x202D [#x2192]) ; → LEFT-TO-RIGHT OVERRIDE
     (space-mark #x202E [#x2190]) ; ← RIGHT-TO-LEFT OVERRIDE
     (space-mark #x2066 [#x21E5]) ; ⇥ LEFT-TO-RIGHT ISOLATE
     (space-mark #x2067 [#x21E4]) ; ⇤ RIGHT-TO-LEFT ISOLATE
     (space-mark #x2068 [#x21A7]) ; ↧ FIRST STRONG ISOLATE
     (space-mark #x202C [#x21D1]) ; ⇑ POP DIRECTIONAL FORMATTING
     (space-mark #x2069 [#x2912]) ; ⤒ POP DIRECTIONAL ISOLATE
     )))

I didn’t spend much time thinking about which arrows to pick; these
seemed right to me. They are all using 'space-mark as the kind, but I
would like to extend whitespace-mode with a new kind specifically for
these characters, so that I can give them a custom face as well.

Here is some sample lisp code that I tried it on:

(defun main ()
  (let ((is_admin nil))
    ‮⁦ ; begin admins only⁩⁦(when is_admin
      (print "You are an admin."))‮⁦ ; end admins only⁩(
)

Syntax highlighting is certainly a big clue that something is odd about
this code, as the conditional is displayed in the comment face. It was
however a nice little puzzle to figure out how to get the permutation of
characters that I wanted.

I will however note that Elisp, as currently implemented, is probably
immune to this attack. The directional characters are incorrectly¹
treated as identifiers when they are outside of a comment; if you
actually run this you will get a void-variable warning which is very
confusing at first because the variable name is invisible. Great fun.
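
A small sketch of how one might spot such an invisible identifier.
The names `my-bidi-controls' and `my-symbol-bidi-controls' are
hypothetical helpers, not existing Emacs functions:

```elisp
(defconst my-bidi-controls
  '(#x202A #x202B #x202C #x202D #x202E #x2066 #x2067 #x2068 #x2069)
  "Code points of the bidi embedding, override and isolate controls.")

(defun my-symbol-bidi-controls (symbol)
  "Return the bidi control characters in SYMBOL's name, in order."
  (delq nil
        (mapcar (lambda (ch) (and (memq ch my-bidi-controls) ch))
                (symbol-name symbol))))

;; A symbol whose name starts with RIGHT-TO-LEFT OVERRIDE.  It is built
;; with `intern' so that the example does not depend on reader details.
(my-symbol-bidi-controls (intern "\u202Eis_admin"))
```

Calling it on the symbol reported by a puzzling void-variable error
should show whether the name carries any of these controls.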

I suggest that we include something along these lines in Emacs, and turn
on whitespace-mode by default in all programming modes. If I recall
correctly, the default configuration of whitespace-mode is fairly
inoffensive. I would recommend keeping it so except that we make the
face for BIDI control characters pretty obvious; perhaps a red
background or something.

By only enabling it by default in programming modes, we avoid bothering
users of prose–oriented modes where using these characters is
benign. Maybe we could have an override for programming languages such
as Elisp that we think are immune to this attack, but I don’t really
think we need to go that far.

db48x

¹ I say that this is incorrect because they are classified by Unicode as
control characters rather than as letters or numbers. The Elisp
specification, such as it exists, probably doesn’t say anything about
them.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 19:12                   ` Stefan Monnier
  2021-11-02 19:36                     ` Eli Zaretskii
@ 2021-11-03  0:28                     ` Gregory Heytings
  2021-11-03  1:07                       ` Stefan Monnier
  2021-11-03 13:31                       ` Eli Zaretskii
  1 sibling, 2 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03  0:28 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, Stefan Kangas, cpitclaudel, emacs-devel



>
> But that's the core of the vulnerability: if you just look at the screen 
> (and just scroll through it) you will have an incorrect understanding of 
> what the code does.
>
> It's good that such bidi override chars are displayed as a thin space, 
> but it's mostly useful to make it possible to edit them (or to `C-x =` 
> on them), but I don't think it makes a significant difference in terms of 
> the security issues introduced by the presence of those chars in the 
> code.
>

Given that the vulnerability is limited to source code, in which AFAIU 
there's no legitimate use of such characters, would the following not be 
enough?

(defun make-bidi-reordering-characters-apparent ()
   (setq buffer-display-table (make-display-table))
   (aset buffer-display-table ?‪ [?⭤])
   (aset buffer-display-table ?‫ [?⭤])
   (aset buffer-display-table ?‭ [?⭤])
   (aset buffer-display-table ?‮ [?⭤])
   (aset buffer-display-table ?⁦ [?⭤])
   (aset buffer-display-table ?⁧ [?⭤])
   (aset buffer-display-table ?⁨ [?⭤])
   (aset buffer-display-table ?‬ [?⭤])
   (aset buffer-display-table ?⁩ [?⭤])
   (font-lock-add-keywords nil '(("⭤" . 'font-lock-warning-face))))

(add-hook 'prog-mode-hook #'make-bidi-reordering-characters-apparent)


* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  0:28                     ` Gregory Heytings
@ 2021-11-03  1:07                       ` Stefan Monnier
  2021-11-03  1:59                         ` Daniel Brooks
                                           ` (2 more replies)
  2021-11-03 13:31                       ` Eli Zaretskii
  1 sibling, 3 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-03  1:07 UTC (permalink / raw)
  To: Gregory Heytings; +Cc: Eli Zaretskii, Stefan Kangas, cpitclaudel, emacs-devel

> Given that the vulnerability is limited to source code, in which AFAIU
> there's no legitimate use of such characters, would the following not
> be enough?

I'm pretty sure there are legitimate uses of such characters in source code.
Maybe there are significant parts of the world where this is extremely rare,
but we shouldn't generalize too quickly.


        Stefan





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  1:07                       ` Stefan Monnier
@ 2021-11-03  1:59                         ` Daniel Brooks
  2021-11-03 13:35                           ` Eli Zaretskii
  2021-11-03  9:59                         ` Gregory Heytings
  2021-11-03 13:33                         ` Eli Zaretskii
  2 siblings, 1 reply; 172+ messages in thread
From: Daniel Brooks @ 2021-11-03  1:59 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Gregory Heytings, Stefan Kangas, Eli Zaretskii, cpitclaudel,
	emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Given that the vulnerability is limited to source code, in which AFAIU
>> there's no legitimate use of such characters, would the following not
>> be enough?
>
> I'm pretty sure there are legitimate uses of such characters in source code.
> Maybe there are significant parts of the world where this is extremely rare,
> but we shouldn't generalize too quickly.

Yea, strings and comments both need to be able to contain pretty much
arbitrary prose; they’ll need to allow these characters for the same
reasons you need them in prose.

One recommendation the paper made was that languages should allow them,
but give a syntax error if they reorder the comment or string delimiters
relative to other text.
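
As a rough approximation of that idea (not the paper's algorithm), one
could check whether the text of a single comment or string opens a
directional embedding, override or isolate that it never closes.  The
function name `my-bidi-unterminated-p' is hypothetical, and the
counting is deliberately simplistic: in the real bidi algorithm a POP
DIRECTIONAL ISOLATE also ends embeddings opened inside the isolate, so
this sketch can report false positives.

```elisp
(defun my-bidi-unterminated-p (text)
  "Return non-nil if TEXT leaves a bidi embedding, override or isolate open."
  (let ((embeddings 0)
        (isolates 0))
    (dolist (ch (string-to-list text))
      (cond
       ;; LRE, RLE, LRO, RLO open an embedding or override.
       ((memq ch '(#x202A #x202B #x202D #x202E))
        (setq embeddings (1+ embeddings)))
       ;; LRI, RLI, FSI open an isolate.
       ((memq ch '(#x2066 #x2067 #x2068))
        (setq isolates (1+ isolates)))
       ;; PDF closes the innermost embedding or override.
       ((and (= ch #x202C) (> embeddings 0))
        (setq embeddings (1- embeddings)))
       ;; PDI closes the innermost isolate.
       ((and (= ch #x2069) (> isolates 0))
        (setq isolates (1- isolates)))))
    (or (> embeddings 0) (> isolates 0))))
```

A comment whose text makes this return non-nil is one whose directional
formatting can leak past the comment delimiter, which is the situation
the paper's examples rely on.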

But I definitely agree that they should be marked very visibly when used
in source code; see my own suggestion for using whitespace-mode.

db48x




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  1:07                       ` Stefan Monnier
  2021-11-03  1:59                         ` Daniel Brooks
@ 2021-11-03  9:59                         ` Gregory Heytings
  2021-11-03 11:19                           ` Stefan Kangas
                                             ` (2 more replies)
  2021-11-03 13:33                         ` Eli Zaretskii
  2 siblings, 3 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03  9:59 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, Stefan Kangas, cpitclaudel, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 973 bytes --]


>> Given that the vulnerability is limited to source code, in which AFAIU 
>> there's no legitimate use of such characters, would the following not 
>> be enough?
>
> I'm pretty sure there are legitimate uses of such characters in source 
> code. Maybe there are significant parts of the world where this is 
> extremely rare, but we shouldn't generalize too quickly.
>

There's some data that shows that this is extremely rare in general: the 
Rust Security Response WG analyzed the 70322 crates and found only 5 in 
which these codepoints were present (see [1]).  That's ~0.01 %.

Moreover such highlighting does not make the source code or text 
unreadable, even in those few legitimate cases.

Therefore I suggest experimenting with the attached patch for a month 
or so, to see if there are objections.  I used the 
{left,right,up,down}wards arrows, which are visible in both GUI and TUI 
interfaces.

[1] https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-diff; name=Make-bidi-reordering-characters-visible.patch, Size: 2221 bytes --]

From 74b7318fc223e5a64d20660522c01c03cf9a7022 Mon Sep 17 00:00:00 2001
From: Gregory Heytings <gregory@heytings.org>
Date: Wed, 3 Nov 2021 09:53:47 +0000
Subject: [PATCH] Make bidi reordering characters visible

* lisp/progmodes/prog-mode.el (fontify-bidi-reordering-characters,
make-bidi-reordering-characters-visible): New functions.
(prog-mode): Use the new functions.
---
 lisp/progmodes/prog-mode.el | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/lisp/progmodes/prog-mode.el b/lisp/progmodes/prog-mode.el
index db350a5f70..0005d3d4d7 100644
--- a/lisp/progmodes/prog-mode.el
+++ b/lisp/progmodes/prog-mode.el
@@ -289,6 +289,24 @@ turn-on-prettify-symbols-mode
              (local-variable-p 'prettify-symbols-alist))
     (prettify-symbols-mode 1)))
 
+(defun fontify-bidi-reordering-characters ()
+  (font-lock-add-keywords nil '(("⁩\\|‬\\|⁨\\|⁧\\|⁦\\|‫\\|‪\\|‮\\|‭" . 'font-lock-warning-face))))
+
+(defun make-bidi-reordering-characters-visible ()
+  (setq buffer-display-table (or buffer-display-table
+                                 standard-display-table
+                                 (make-display-table)))
+  (aset buffer-display-table ?‪ [?→])
+  (aset buffer-display-table ?‫ [?←])
+  (aset buffer-display-table ?‭ [?→])
+  (aset buffer-display-table ?‮ [?←])
+  (aset buffer-display-table ?⁦ [?→])
+  (aset buffer-display-table ?⁧ [?←])
+  (aset buffer-display-table ?⁨ [?↓])
+  (aset buffer-display-table ?‬ [?↑])
+  (aset buffer-display-table ?⁩ [?↑])
+  (add-hook 'font-lock-mode-hook #'fontify-bidi-reordering-characters))
+
 ;;;###autoload
 (define-globalized-minor-mode global-prettify-symbols-mode
   prettify-symbols-mode turn-on-prettify-symbols-mode)
@@ -300,7 +318,8 @@ prog-mode
   (setq-local parse-sexp-ignore-comments t)
   (add-hook 'context-menu-functions 'prog-context-menu 10 t)
   ;; Any programming language is always written left to right.
-  (setq bidi-paragraph-direction 'left-to-right))
+  (setq bidi-paragraph-direction 'left-to-right)
+  (make-bidi-reordering-characters-visible))
 
 (provide 'prog-mode)
 
-- 
2.33.0



* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  9:59                         ` Gregory Heytings
@ 2021-11-03 11:19                           ` Stefan Kangas
  2021-11-03 11:31                             ` Gregory Heytings
  2021-11-03 13:44                             ` Eli Zaretskii
  2021-11-03 11:29                           ` Andreas Schwab
  2021-11-03 13:37                           ` Eli Zaretskii
  2 siblings, 2 replies; 172+ messages in thread
From: Stefan Kangas @ 2021-11-03 11:19 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: Eli Zaretskii, Clément Pit-Claudel, Stefan Monnier,
	Emacs developers

Gregory Heytings <gregory@heytings.org> writes:

> There's some data that shows that this is extremely rare in general: the
> Rust Security Response WG analyzed the 70322 crates and found only 5 in
> which these codepoints were present (see [1]).  That's ~0.01 %.
>
> Moreover such highlighting does not make the source code or text
> unreadable, even in those few legitimate cases.

Depending on how you define it, there is at least one major world
language (Arabic) that has a RTL script, and other major languages
such as Urdu, Farsi and Hebrew also use it (and a couple of others
too).  So I think we should consider to what extent your proposal
might hurt users of such languages.

Are these characters important to write comments and strings in any of
those languages?  Will your proposal make it harder to type in such
languages?  If yes, are there less invasive solutions?

The Rust data point is relevant, but in my opinion not sufficient to
outweigh the above considerations.  But even if that wasn't the case,
we would still need to consider languages like C, Fortran, PHP,
JavaScript, etc.  We are, after all, talking about hundreds of
millions of native speakers of the mentioned languages, a certain
proportion of which will be Emacs users interested in writing strings
and comments in their own language.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  9:59                         ` Gregory Heytings
  2021-11-03 11:19                           ` Stefan Kangas
@ 2021-11-03 11:29                           ` Andreas Schwab
  2021-11-03 18:47                             ` Stefan Monnier
  2021-11-03 13:37                           ` Eli Zaretskii
  2 siblings, 1 reply; 172+ messages in thread
From: Andreas Schwab @ 2021-11-03 11:29 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: Eli Zaretskii, Stefan Kangas, Stefan Monnier, cpitclaudel,
	emacs-devel

On Nov 03 2021, Gregory Heytings wrote:

> diff --git a/lisp/progmodes/prog-mode.el b/lisp/progmodes/prog-mode.el
> index db350a5f70..0005d3d4d7 100644
> --- a/lisp/progmodes/prog-mode.el
> +++ b/lisp/progmodes/prog-mode.el
> @@ -289,6 +289,24 @@ turn-on-prettify-symbols-mode
>               (local-variable-p 'prettify-symbols-alist))
>      (prettify-symbols-mode 1)))
>  
> +(defun fontify-bidi-reordering-characters ()
> +  (font-lock-add-keywords nil '(("⁩\\|‬\\|⁨\\|⁧\\|⁦\\|‫\\|‪\\|‮\\|‭" . 'font-lock-warning-face))))
> +
> +(defun make-bidi-reordering-characters-visible ()
> +  (setq buffer-display-table (or buffer-display-table
> +                                 standard-display-table
> +                                 (make-display-table)))
> +  (aset buffer-display-table ?‪ [?→])
> +  (aset buffer-display-table ?‫ [?←])
> +  (aset buffer-display-table ?‭ [?→])
> +  (aset buffer-display-table ?‮ [?←])
> +  (aset buffer-display-table ?⁦ [?→])
> +  (aset buffer-display-table ?⁧ [?←])
> +  (aset buffer-display-table ?⁨ [?↓])
> +  (aset buffer-display-table ?‬ [?↑])
> +  (aset buffer-display-table ?⁩ [?↑])

A perfect example of how legitimate use of these characters can mess up
your source. :-)

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 11:19                           ` Stefan Kangas
@ 2021-11-03 11:31                             ` Gregory Heytings
  2021-11-03 12:20                               ` Stefan Monnier
  2021-11-03 13:45                               ` Eli Zaretskii
  2021-11-03 13:44                             ` Eli Zaretskii
  1 sibling, 2 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 11:31 UTC (permalink / raw)
  To: Stefan Kangas
  Cc: Eli Zaretskii, emacs-devel, Clément Pit-Claudel,
	Stefan Monnier


>> There's some data that shows that this is extremely rare in general: 
>> the Rust Security Response WG analyzed the 70322 crates and found only 
>> 5 in which these codepoints were present (see [1]).  That's ~0.01 %.
>>
>> Moreover such highlighting does not make the source code or text 
>> unreadable, even in those few legitimate cases.
>
> Depending on how you define it, there is at least one major world 
> language (Arabic) that has a RTL script, and other major languages such 
> as Urdu, Farsi and Hebrew also use it (and a couple of others too).  So 
> I think we should consider to what extent your proposal might hurt users 
> of such languages.
>
> Are these characters important to write comments and strings in any of 
> those languages?  Will your proposal make it harder to type in such 
> languages?  If yes, are there less invasive solutions?
>

Thanks for your comments!

AFAIK, these specific characters are not necessary to write comments and 
strings in these languages.  Here are two random files which use RTL 
strings and comments, and in which these characters are not used:

https://raw.githubusercontent.com/01walid/goarabic/master/stringutils_test.go
https://raw.githubusercontent.com/AbdullahDiaa/garabic/main/garabic.go




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 11:31                             ` Gregory Heytings
@ 2021-11-03 12:20                               ` Stefan Monnier
  2021-11-03 12:41                                 ` tomas
  2021-11-03 13:46                                 ` Eli Zaretskii
  2021-11-03 13:45                               ` Eli Zaretskii
  1 sibling, 2 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-03 12:20 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: Stefan Kangas, Eli Zaretskii, Clément Pit-Claudel,
	emacs-devel

> AFAIK, these specific characters are not necessary to write comments and
> strings in these languages.  Here are two random files which use RTL strings
> and comments, and in which these characters are not used:

I was more worried about the fact that, while highlighting those chars
might be helpful to warn about accidental uses of them, if attackers
want to trick the reader, I'm pretty sure they can get similar results
without having to use those special LTR/RTL override chars:

    int hi = 5;
    int שָׁלוֹם = hi;
    int hello = 10;
    int السّلامعليك = hello;
    myfun(שָׁלוֹם ,السّلامعليكم)

There's no override here, but did I call `myfun` with args 5 and 10 or
did I call it with args 10 and 5?
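
A minimal sketch of how one could at least find the lines where this
kind of reordering can happen without any control characters: look for
characters whose Unicode bidi class is strong right-to-left.  The
function name `my-strong-rtl-p' is hypothetical; `get-char-code-property'
and the `bidi-class' property are the existing Emacs facilities.

```elisp
(defun my-strong-rtl-p (text)
  "Return non-nil if TEXT contains a character of bidi class R or AL."
  (let ((found nil))
    (dolist (ch (string-to-list text) found)
      ;; R covers e.g. Hebrew letters, AL covers Arabic letters.
      (when (memq (get-char-code-property ch 'bidi-class) '(R AL))
        (setq found t)))))
```

This only says that a line mixes in right-to-left text; it cannot tell
a legitimate identifier from a misleading one.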

[ OK, admittedly, for a bidi-idiot like me, it looks like neither since
  the Arabic shaping of the two occurrences of the identifier actually looks
  different (and I truly have no clue why that is here), so I'm led to
  believe that the second is a reference to a non-existing
  variable ;-)  ]


        Stefan





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 12:20                               ` Stefan Monnier
@ 2021-11-03 12:41                                 ` tomas
  2021-11-03 13:15                                   ` Eli Zaretskii
  2021-11-03 13:46                                 ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: tomas @ 2021-11-03 12:41 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1806 bytes --]

On Wed, Nov 03, 2021 at 08:20:01AM -0400, Stefan Monnier wrote:
> > AFAIK, these specific characters are not necessary to write comments and
> > strings in these languages.  Here are two random files which use RTL strings
> > and comments, and in which these characters are not used:
> 
> I was more worried about the fact that, while highlighting those chars
> might be helpful to warn about accidental uses of them, if attackers
> want to trick the reader, I'm pretty sure they can get similar results
> without having to use those special LTR/RTL override chars:
> 
>     int hi = 5;
>     int שָׁלוֹם = hi;
>     int hello = 10;
>     int السّلامعليك = hello;
>     myfun(שָׁלוֹם ,السّلامعليكم)
> 
> There's no override here, but did I call `myfun` with args 5 and 10 or
> did I call it with args 10 and 5?
> 
> [ OK, admittedly, for a bidi-idiot like me, it looks like neither since
>   the Arabic shaping of the two occurrences of the identifier actually looks
>   different (and I truly have no clue why that is here), so I'm led to
>   believe that the second is a reference to a non-existing
>   variable ;-)  ]

Most probably, yes. The second instance had one letter more, the "mim" (م)
at the end (which, for some funny reason, seems to have evaporated when my
mailer quoted your message: in the above quote, they now /look/ equal,
although when I copy/paste them, the mim re-appears. Go figure).

As a full disclosure, I have to admit that I'm using mutt with vim as an
editor (gah! :), so I chalk that up to differences between the viewer and
the editor: it seems vim just hides that one).

But you raise an interesting point: in an R to L stretch, is the order
of the arguments also R to L, or L to R?

Cheers
 - t

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 12:41                                 ` tomas
@ 2021-11-03 13:15                                   ` Eli Zaretskii
  2021-11-03 14:46                                     ` tomas
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:15 UTC (permalink / raw)
  To: tomas; +Cc: emacs-devel

> Date: Wed, 3 Nov 2021 13:41:19 +0100
> From: <tomas@tuxteam.de>
> 
> But you raise an interesting point: in an R to L stretch, is the order
> of the arguments also R to L, or L to R?

The order of the arguments is always the "logical" order, i.e. the
order of increasing buffer positions, because that's how the
compiler/interpreter reads the program text.
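
A minimal sketch of that point in Python, assuming two short stand-in
identifiers (not the ones from Stefan's example) and a deliberately naive
argument splitter: whatever order a bidi-aware display draws the
arguments in, a parser walks the code points in increasing position.

```python
# Hypothetical stand-in identifiers; any strong RTL letters would do.
first = "\u05d0\u05d1"    # Hebrew alef, bet
second = "\u0633\u0644"   # Arabic seen, lam

# Same shape as the call in the example: RTL name, " ,", RTL name.
call = "myfun(" + first + " ," + second + ")"

# Naive argument splitter: it only looks at positions in the string
# (the "logical" order), never at how the text is rendered.
inner = call[call.index("(") + 1 : call.rindex(")")]
args = [a.strip() for a in inner.split(",")]

# The first argument in logical order is the one stored first.
first_is_first = args[0] == first
```

On a bidi-aware display the two names may appear swapped, but `args[0]`
is still the identifier that comes first in the file.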



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-02 21:28                           ` Unicode confusables and reordering characters considered harmful, a simple solution Daniel Brooks
@ 2021-11-03 13:30                             ` Eli Zaretskii
  2021-11-03 17:41                             ` Yuri Khan
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:30 UTC (permalink / raw)
  To: Daniel Brooks; +Cc: cpitclaudel, stefan, monnier, emacs-devel

> From: Daniel Brooks <db48x@db48x.net>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  cpitclaudel@gmail.com,
>   stefan@marxist.se,  emacs-devel@gnu.org
> Date: Tue, 02 Nov 2021 14:28:16 -0700
> 
> > That's what bidi-find-overridden-directionality already does, albeit
> > not yet for the specific examples in that paper.  But Someone™ should
> > write a minor mode or an optional display feature which uses that
> > function to highlight the problematic stretches of text on display,
> > using the function's output for finding such stretches of text.
> 
> We already have it; it is called whitespace-mode. It’s not perfect, but
> this morning I customized mine to make these characters more obvious:

The idea of detecting these problems is not just to highlight specific
characters, because those characters are generally harmless when used
for valid purposes.  Blindly highlighting them will just distract
people and sometimes annoy them, which in some cases will cause them
to turn off the annoying feature.

Here, try your customizations on the following completely innocent
text:

   abcd ‮⁧⁩‬xyz

or on this:

   Char: ‮‬‎ (8238, #o20056, #x202e, file ...) point=2080 of 4903 (42%) column=8

(The latter is what "C-x =" produces in Emacs, and for a good reason.)

Why would we want such legitimate uses of these characters to light up
on display like a Christmas tree?

The idea is to identify the suspicious uses of these formatting
controls, and highlight only those suspicious uses.  That would give
our users a much more reliable tool that could perhaps even be turned
on by default in most, if not all, buffers.

Let's not settle for a simplistic implementation just because it's
easier.

> I suggest that we include something along these lines in Emacs, and turn
> on whitespace-mode by default in all programming modes.

Sorry, no.  We have a much better facility implemented already, so
let's use it instead.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  0:28                     ` Gregory Heytings
  2021-11-03  1:07                       ` Stefan Monnier
@ 2021-11-03 13:31                       ` Eli Zaretskii
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:31 UTC (permalink / raw)
  To: Gregory Heytings; +Cc: cpitclaudel, stefan, monnier, emacs-devel

> Date: Wed, 03 Nov 2021 00:28:54 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Eli Zaretskii <eliz@gnu.org>, Stefan Kangas <stefan@marxist.se>, 
>     cpitclaudel@gmail.com, emacs-devel@gnu.org
> 
> Given that the vulnerability is limited to source code, in which AFAIU 
> there's no legitimate use of such characters, would the following not be 
> enough?

No.  I tried to explain in a previous message why, with examples.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  1:07                       ` Stefan Monnier
  2021-11-03  1:59                         ` Daniel Brooks
  2021-11-03  9:59                         ` Gregory Heytings
@ 2021-11-03 13:33                         ` Eli Zaretskii
  2 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:33 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: gregory, stefan, cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Eli Zaretskii <eliz@gnu.org>,  Stefan Kangas <stefan@marxist.se>,
>   cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Tue, 02 Nov 2021 21:07:05 -0400
> 
> > Given that the vulnerability is limited to source code, in which AFAIU
> > there's no legitimate use of such characters, would the following not
> > be enough?
> 
> I'm pretty sure there are legitimate uses of such characters in source code.
> Maybe there are significant parts of the world where this is extremely rare,
> but we shouldn't generalize too quickly.

Agreed.  Especially since some thought was already invested in this,
and we have some of those ideas implemented since Emacs 25.1.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  1:59                         ` Daniel Brooks
@ 2021-11-03 13:35                           ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:35 UTC (permalink / raw)
  To: Daniel Brooks; +Cc: gregory, stefan, monnier, cpitclaudel, emacs-devel

> From: Daniel Brooks <db48x@db48x.net>
> Cc: Gregory Heytings <gregory@heytings.org>,  Eli Zaretskii <eliz@gnu.org>,
>   Stefan Kangas <stefan@marxist.se>,  cpitclaudel@gmail.com,
>   emacs-devel@gnu.org
> Date: Tue, 02 Nov 2021 18:59:09 -0700
> 
> One recommendation the paper made was that languages should allow them,
> but give a syntax error if they reorder the comment or string delimiters
> relative to other text.

This would mean the compilers/interpreters will need to implement the
Unicode Bidirectional Algorithm, which is IMO an unreasonable
requirement, since those tools don't care about display.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03  9:59                         ` Gregory Heytings
  2021-11-03 11:19                           ` Stefan Kangas
  2021-11-03 11:29                           ` Andreas Schwab
@ 2021-11-03 13:37                           ` Eli Zaretskii
  2021-11-03 18:53                             ` Manuel Giraud
  2 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:37 UTC (permalink / raw)
  To: Gregory Heytings; +Cc: cpitclaudel, stefan, monnier, emacs-devel

> Date: Wed, 03 Nov 2021 09:59:46 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Eli Zaretskii <eliz@gnu.org>, Stefan Kangas <stefan@marxist.se>, 
>     cpitclaudel@gmail.com, emacs-devel@gnu.org
> 
> There's some data that shows that this is extremely rare in general: the 
> Rust Security Response WG analyzed the 70322 crates and found only 5 in 
> which these codepoints were present (see [1]).  That's ~0.01 %.
> 
> Moreover such highlighting does not make the source code or text 
> unreadable, even in those few legitimate cases.

The statistics are irrelevant when you actually need to deal with such
a rare case.

> Therefore I suggest to experiment with the attached patch during a month 
> or so, and see if there are objections.

I object already.  We shouldn't settle for easy half-solutions when we
are fully capable of implementing the much more complete and accurate
ones.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 11:19                           ` Stefan Kangas
  2021-11-03 11:31                             ` Gregory Heytings
@ 2021-11-03 13:44                             ` Eli Zaretskii
  2021-11-03 14:29                               ` Gregory Heytings
  1 sibling, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:44 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: gregory, emacs-devel, cpitclaudel, monnier

> From: Stefan Kangas <stefan@marxist.se>
> Date: Wed, 3 Nov 2021 12:19:58 +0100
> Cc: Eli Zaretskii <eliz@gnu.org>,
>  Clément Pit-Claudel <cpitclaudel@gmail.com>,
>  Stefan Monnier <monnier@iro.umontreal.ca>,
>  Emacs developers <emacs-devel@gnu.org>
> 
> Depending on how you define it, there is at least one major world
> language (Arabic) that has a RTL script, and other major languages
> such as Urdu, Farsi and Hebrew also use it (and a couple of others
> too).  So I think we should consider to what extent your proposal
> might hurt users of such languages.
> 
> Are these characters important to write comments and strings in any of
> those languages?

Yes, definitely.  Especially when the comments mix RTL characters with
ASCII punctuation and separators (which have "weak" directionality,
and change their actual directionality depending on the surrounding
strong directional text).  This happens quite frequently, because
comments can include arithmetic operators and other similar symbol and
punctuation characters.  Without the formatting controls, this could
make comments and strings almost unreadable in some cases.
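
As a rough illustration, Python's standard `unicodedata` module exposes
the Bidi_Class of each character, which shows that letters are strong
while ASCII operators and separators are not.  The helper name
`non_strong` is made up for this sketch; note that the Unicode tables
split the non-strong characters further into "weak" (e.g. ES, CS) and
"neutral" (e.g. ON, WS) classes.

```python
import unicodedata

# Bidi classes with a fixed direction of their own.
STRONG = {"L", "R", "AL"}

def bidi_classes(text):
    """Pair every character with its Unicode Bidi_Class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

def non_strong(text):
    """Characters whose displayed direction depends on their neighbours."""
    return [ch for ch in text if unicodedata.bidirectional(ch) not in STRONG]

# A Latin letter, a plus sign, a Hebrew letter, a comma, a parenthesis.
sample = "a+\u05d0,("
classes = bidi_classes(sample)
floating = non_strong(sample)
```

Here only `a` and the Hebrew letter have a direction of their own; the
operator, comma and parenthesis take theirs from the context.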

> Will your proposal make it harder to type in such languages?

Yes, in some cases.

> If yes, are there less invasive solutions?

Yes: detect the situations where the use of these controls is
suspicious.  For example, the current implementation of
bidi-find-overridden-directionality detects when characters that
normally have left-to-right directionality (example: 'a') are forced
to behave as strong right-to-left characters instead -- this is
something "normal" human-readable text should rarely if ever need to
do, and OTOH its potential to confuse is very high.
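
The kind of check described here can be sketched in Python.  This is
not the Emacs implementation (bidi-find-overridden-directionality lives
in the display engine); it is a simplified scan under stated
assumptions: a single stack, reset at newlines, no depth limits, and a
hypothetical function name `overridden_positions`.

```python
import unicodedata

LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def overridden_positions(text):
    """Indices of strong characters forced the other way by LRO/RLO."""
    stack = []  # (kind, override): kind "e" = embedding/override, "i" = isolate
    hits = []
    for i, ch in enumerate(text):
        if ch == "\n":
            stack.clear()                 # controls do not span paragraphs
        elif ch in (LRE, RLE):
            stack.append(("e", None))     # embedding: no override
        elif ch == LRO:
            stack.append(("e", "L"))
        elif ch == RLO:
            stack.append(("e", "R"))
        elif ch in (LRI, RLI, FSI):
            stack.append(("i", None))     # isolate: override is suspended
        elif ch == PDF:
            if stack and stack[-1][0] == "e":
                stack.pop()
        elif ch == PDI:
            # Close everything up to and including the nearest isolate.
            if any(kind == "i" for kind, _ in stack):
                while stack.pop()[0] != "i":
                    pass
        else:
            override = stack[-1][1] if stack else None
            cls = unicodedata.bidirectional(ch)
            if (override == "R" and cls == "L") or \
               (override == "L" and cls in ("R", "AL")):
                hits.append(i)
    return hits
```

With this sketch, Latin letters wrapped in RLO...PDF are reported, while
Hebrew text inside an ordinary RLE...PDF embedding is not, which is the
distinction between suspicious and legitimate uses.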



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 11:31                             ` Gregory Heytings
  2021-11-03 12:20                               ` Stefan Monnier
@ 2021-11-03 13:45                               ` Eli Zaretskii
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:45 UTC (permalink / raw)
  To: Gregory Heytings; +Cc: cpitclaudel, stefan, monnier, emacs-devel

> Date: Wed, 03 Nov 2021 11:31:37 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Eli Zaretskii <eliz@gnu.org>, 
>     Clément Pit-Claudel <cpitclaudel@gmail.com>, 
>     Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> AFAIK, these specific characters are not necessary to write comments and 
> strings in these languages.  Here are two random files which use RTL 
> strings and comments, and in which these characters are not used:
> 
> https://raw.githubusercontent.com/01walid/goarabic/master/stringutils_test.go
> https://raw.githubusercontent.com/AbdullahDiaa/garabic/main/garabic.go

Two examples that do NOT use these controls are not a proof they are
not needed at all.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 12:20                               ` Stefan Monnier
  2021-11-03 12:41                                 ` tomas
@ 2021-11-03 13:46                                 ` Eli Zaretskii
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 13:46 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: gregory, stefan, cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Stefan Kangas <stefan@marxist.se>,  Eli Zaretskii <eliz@gnu.org>,
>   Clément Pit-Claudel <cpitclaudel@gmail.com>,
>   emacs-devel@gnu.org
> Date: Wed, 03 Nov 2021 08:20:01 -0400
> 
> I was more worried about the fact that, while highlighting those chars
> might be helpful to warn about accidental uses of them, if attackers
> want to trick the reader, I'm pretty sure they can get similar results
> without having to use those special LTR/RTL override chars:
> 
>     int hi = 5;
>     int שָׁלוֹם = hi;
>     int hello = 10;
>     int السّلامعليك = hello;
>     myfun(שָׁלוֹם ,السّلامعليكم)
> 
> There's no override here, but did I call `myfun` with args 5 and 10 or
> did I call it with args 10 and 5?

If we want, we can detect such cases as well.  It's quite easy,
actually, because the display engine has that information handy.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 13:44                             ` Eli Zaretskii
@ 2021-11-03 14:29                               ` Gregory Heytings
  2021-11-03 14:37                                 ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 14:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, Stefan Kangas, monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 401 bytes --]


>> Will your proposal make it harder to type in such languages?
>
> Yes, in some cases.
>

I don't see why they would become harder to type in in any way.  The 
directionality characters would be highlighted, that's all (and it would 
be easy to add a toggle command to (un)highlight them on demand, see 
attached).  This functionality is orthogonal to the one you propose AFAIU, 
both could coexist.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-diff; name=Make-bidi-reordering-characters-visible.patch, Size: 3032 bytes --]

From cc1ccc0cddf6d37df473121c25d7d1692c205cae Mon Sep 17 00:00:00 2001
From: Gregory Heytings <gregory@heytings.org>
Date: Wed, 3 Nov 2021 14:25:46 +0000
Subject: [PATCH] Make bidi reordering characters visible

* lisp/progmodes/prog-mode.el (bidi-reordering-characters-visible,
bidi-reordering-characters-fontify,
bidi-reordering-character-toggle-visibility): New functions.
---
 lisp/progmodes/prog-mode.el | 41 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/lisp/progmodes/prog-mode.el b/lisp/progmodes/prog-mode.el
index db350a5f70..0ac5b53f72 100644
--- a/lisp/progmodes/prog-mode.el
+++ b/lisp/progmodes/prog-mode.el
@@ -289,6 +289,44 @@ turn-on-prettify-symbols-mode
              (local-variable-p 'prettify-symbols-alist))
     (prettify-symbols-mode 1)))
 
+(defvar-local bidi-reordering-characters-visible nil)
+
+(defun bidi-reordering-characters-fontify ()
+  (font-lock-add-keywords nil '(("⁩\\|‬\\|⁨\\|⁧\\|⁦\\|‫\\|‪\\|‮\\|‭" . 'font-lock-warning-face))))
+
+(defun bidi-reordering-characters-visible ()
+  (setq buffer-display-table (or buffer-display-table
+                                 standard-display-table
+                                 (make-display-table)))
+  (bidi-reordering-character-toggle-visibility)
+  (add-hook 'font-lock-mode-hook #'bidi-reordering-characters-fontify))
+
+;;;###autoload
+(defun bidi-reordering-character-toggle-visibility ()
+  (interactive)
+  (setq bidi-reordering-characters-visible
+        (not bidi-reordering-characters-visible))
+  (when bidi-reordering-characters-visible
+    (aset buffer-display-table ?‪ [?→])
+    (aset buffer-display-table ?‫ [?←])
+    (aset buffer-display-table ?‭ [?→])
+    (aset buffer-display-table ?‮ [?←])
+    (aset buffer-display-table ?⁦ [?→])
+    (aset buffer-display-table ?⁧ [?←])
+    (aset buffer-display-table ?⁨ [?↓])
+    (aset buffer-display-table ?‬ [?↑])
+    (aset buffer-display-table ?⁩ [?↑]))
+  (unless bidi-reordering-characters-visible
+    (aset buffer-display-table ?‪ nil)
+    (aset buffer-display-table ?‫ nil)
+    (aset buffer-display-table ?‭ nil)
+    (aset buffer-display-table ?‮ nil)
+    (aset buffer-display-table ?⁦ nil)
+    (aset buffer-display-table ?⁧ nil)
+    (aset buffer-display-table ?⁨ nil)
+    (aset buffer-display-table ?‬ nil)
+    (aset buffer-display-table ?⁩ nil)))
+
 ;;;###autoload
 (define-globalized-minor-mode global-prettify-symbols-mode
   prettify-symbols-mode turn-on-prettify-symbols-mode)
@@ -300,7 +338,8 @@ prog-mode
   (setq-local parse-sexp-ignore-comments t)
   (add-hook 'context-menu-functions 'prog-context-menu 10 t)
   ;; Any programming language is always written left to right.
-  (setq bidi-paragraph-direction 'left-to-right))
+  (setq bidi-paragraph-direction 'left-to-right)
+  (bidi-reordering-characters-visible))
 
 (provide 'prog-mode)
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 14:29                               ` Gregory Heytings
@ 2021-11-03 14:37                                 ` Eli Zaretskii
  2021-11-03 16:01                                   ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 14:37 UTC (permalink / raw)
  To: Gregory Heytings; +Cc: cpitclaudel, stefan, monnier, emacs-devel

> Date: Wed, 03 Nov 2021 14:29:21 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Stefan Kangas <stefan@marxist.se>, emacs-devel@gnu.org, 
>     cpitclaudel@gmail.com, monnier@iro.umontreal.ca
> 
> >> Will your proposal make it harder to type in such languages?
> >
> > Yes, in some cases.
> 
> I don't see why they would become harder to type in in any way.

Because it makes the text harder to read.

> The 
> directionality characters would be highlighted, that's all (and it would 
> be easy to add a toggle command to (un)highlight them on demand, see 
> attached).  This functionality is orthogonal to the one you propose AFAIU, 
> both could coexist.

It could, but I don't think we should install such features.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 13:15                                   ` Eli Zaretskii
@ 2021-11-03 14:46                                     ` tomas
  2021-11-03 17:13                                       ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: tomas @ 2021-11-03 14:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 884 bytes --]

On Wed, Nov 03, 2021 at 03:15:34PM +0200, Eli Zaretskii wrote:
> > Date: Wed, 3 Nov 2021 13:41:19 +0100
> > From: <tomas@tuxteam.de>
> > 
> > But you raise an interesting point: in an R to L stretch, is the order
> > of the arguments also R to L, or L to R?
> 
> The order of the arguments is always the "logical" order, i.e. the
> order of increasing buffer positions, because that's how the
> compiler/interpreter reads the program text.

That makes sense. So in Stefan's case...

> [Stefan Monnier]
>
>     int hi = 5;
>     int שָׁלוֹם = hi;
>     int hello = 10;
>     int السّلامعليك = hello;
>     myfun(שָׁלוֹם ,السّلامعليكم)
>
> There's no override here, but did I call `myfun` with args 5 and 10 or
> did I call it with args 10 and 5?

... the args are 10 and 5, because the whole arg list is RTL.

Cheers
 - t

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 14:43 ` Clément Pit-Claudel
@ 2021-11-03 15:07   ` Reini Urban
  2021-11-03 15:43     ` Stefan Monnier
  2021-11-03 17:24     ` Eli Zaretskii
  0 siblings, 2 replies; 172+ messages in thread
From: Reini Urban @ 2021-11-03 15:07 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2585 bytes --]

On Tue, Nov 2, 2021 at 4:08 PM Clément Pit-Claudel <cpitclaudel@gmail.com>
wrote:

> There is a good summary of the issue and relevant mitigations at
> https://research.swtch.com/trojan (it argues against compiler fixes and
> in favor of IDE enhancements.)
>

No, this summary is awful.
The issue is that libc, the C standard committee, linux and most others are
ignoring the unicode identifier security guidelines.
Identifiers must be identifiable, but strings should not be touched.

Identifiers are all names, pathnames, variable names, user names, ... but
not arbitrary strings.
IDE's are just one place to fix it (that's why glib does it), but the core
is more important.

The ones who do care about it, like java (the compiler), my cperl (the
compiler and runtime, because it is dynamic), rust (the compiler), glib
(the library), do follow these guidelines.
All C compilers and most others are insecure. Linux Filesystems are
insecure. The old APPLE Filesystem was secure, the new is again insecure.
Also the libc's cannot deal with de-normalized characters at all. grep,
sed, coreutils all have outstanding unorm patches, because libunicode is
too slow: it iterates over the string via callbacks.

In short you need to normalize each identifier, check for proper
XID_Start/XID_Continue,
check your document for mixed scripts (several combinations are allowed,
several disallowed,
HAN unification did a good job, but greek vs cyrillic is the worst), and
forbid bidi changes.
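
A rough Python sketch of those four checks, under loud assumptions: NFC
is picked as the normalization form, `str.isidentifier()` stands in for
the XID_Start/XID_Continue test, the script of a character is guessed
from the first word of its Unicode name (the standard library has no
Script property), and any mix of scripts is flagged even though some
combinations are allowed.  The names `rough_script` and
`check_identifier` are made up for this illustration.

```python
import unicodedata

# Explicit bidi formatting controls and directional marks.
BIDI_CONTROLS = set(
    "\u202a\u202b\u202c\u202d\u202e"   # LRE RLE PDF LRO RLO
    "\u2066\u2067\u2068\u2069"         # LRI RLI FSI PDI
    "\u200e\u200f\u061c"               # LRM RLM ALM
)

def rough_script(ch):
    # Assumption: first word of the character name approximates the script.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def check_identifier(ident):
    """Return a list of problems found in IDENT (empty if none)."""
    problems = []
    if any(ch in BIDI_CONTROLS for ch in ident):
        problems.append("bidi control")
    if unicodedata.normalize("NFC", ident) != ident:
        problems.append("not NFC-normalized")
    if not ident.isidentifier():
        problems.append("not XID_Start XID_Continue*")
    scripts = {rough_script(ch) for ch in ident if ch.isalpha()}
    if len(scripts) > 1:
        problems.append("mixed scripts: " + ", ".join(sorted(scripts)))
    return problems
```

For example, `check_identifier("p\u0430ypal")` (with a Cyrillic "а")
reports a Latin/Cyrillic mix, while a plain ASCII name passes.  A real
implementation would use the script and confusables data from UTS #39
rather than character names.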

The C standard recently complained that making identifiers secure would
require the full Unicode database, which is wrong.
You need the normalization code (one or two tiny tables), the script lists
(tiny), and the XID_Start/Continue lists (small).
Further you need an api to start a document (to init scripts) with an
optional script param (the language).
Scripts just need a byte, the Start/Cont two bits. Sorted lists are the
best representation. (musl does it unsorted, glibc an insecure table-lookup)
gnulib is really the best place to add these features, even if libunicode
is too slow.

I started adding u8id support two years ago to my safeclib and my ctl, but
was too busy lately. It works fine and fast enough in rust, java and cperl.
I have good support in the wchar_t part of safelibc (wcsnorm, wcsfc, but no
scripts), but not the u8 part yet. glibc and musl don't care about u8
replacing wchar_t yet.

https://unicode.org/reports/tr36/
https://unicode.org/reports/tr39/
http://perl11.github.io/blog/foldcase.html
-- 
Reini Urban

[-- Attachment #2: Type: text/html, Size: 3538 bytes --]

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 15:07   ` Reini Urban
@ 2021-11-03 15:43     ` Stefan Monnier
  2021-11-04  7:50       ` Reini Urban
  2021-11-03 17:24     ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Stefan Monnier @ 2021-11-03 15:43 UTC (permalink / raw)
  To: Reini Urban; +Cc: emacs-devel

> No, this summary is awful.
> The issue is that libc, the C standard committee, linux and most others are
> ignoring the unicode identifier security guidelines.
> Identifiers must be identifiable, but strings should not be touched.

What do those rules say about code like:

    int hi = 5;
    int שָׁלוֹם = hi;
    int hello = 10;
    int السّلامعليك = hello;
    myfun(שָׁלוֹם ,السّلامعليكم)

IMO this code is fundamentally valid: we should allow
programmers to write identifiers in their native tongue.

Do the security guidelines require override chars to force the
`, ` to be in LTR, so as to fix the ordering problem (and would the
result be more or less clear to someone familiar with those RTL
scripts ;-0 )?


        Stefan




^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 14:37                                 ` Eli Zaretskii
@ 2021-11-03 16:01                                   ` Gregory Heytings
  2021-11-03 17:44                                     ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 16:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stefan, monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 729 bytes --]


>> The directionality characters would be highlighted, that's all (and it 
>> would be easy to add a toggle command to (un)highlight them on demand, 
>> see attached).  This functionality is orthogonal to the one you propose 
>> AFAIU, both could coexist.
>
> It could, but I don't think we should install such features.
>

Why not?  Is this not simply an improved (and built-in) version of your 
initial suggestion to use glyphless-char-display?  Except that 
glyphless-char-display is global (whereas buffer-display-table is 
buffer-local), and that a tofu is not as visible as a character 
highlighted with the font-lock-warning-face.

Here's an improved patch which also highlights these characters in strings 
and comments.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-diff; name=Make-bidi-reordering-characters-visible.patch, Size: 3301 bytes --]

From c15d21b833f32896138afffaa9d7c9d917d52ea7 Mon Sep 17 00:00:00 2001
From: Gregory Heytings <gregory@heytings.org>
Date: Wed, 3 Nov 2021 15:58:02 +0000
Subject: [PATCH] Make bidi reordering characters visible

* lisp/progmodes/prog-mode.el (bidi-reordering-characters-visible,
bidi-reordering-characters-fontify,
bidi-reordering-character-toggle-visibility): New functions.
---
 lisp/progmodes/prog-mode.el | 47 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/lisp/progmodes/prog-mode.el b/lisp/progmodes/prog-mode.el
index db350a5f70..e4f63b0645 100644
--- a/lisp/progmodes/prog-mode.el
+++ b/lisp/progmodes/prog-mode.el
@@ -289,6 +289,50 @@ turn-on-prettify-symbols-mode
              (local-variable-p 'prettify-symbols-alist))
     (prettify-symbols-mode 1)))
 
+(defvar-local bidi-reordering-characters-visible nil
+  "Internal variable used by `bidi-reordering-characters-visible'.")
+
+(defun bidi-reordering-characters-fontify ()
+  "Fontify bidi reordering characters with `font-lock-warning-face'."
+  (font-lock-add-keywords
+   nil
+   '(("⁩\\|‬\\|⁨\\|⁧\\|⁦\\|‫\\|‪\\|‮\\|‭" . (0 'font-lock-warning-face t)))))
+
+(defun bidi-reordering-characters-visible ()
+  "Display bidi reordering characters as arrows."
+  (setq buffer-display-table (or buffer-display-table
+                                 standard-display-table
+                                 (make-display-table)))
+  (bidi-reordering-character-toggle-visibility)
+  (add-hook 'font-lock-mode-hook #'bidi-reordering-characters-fontify))
+
+;;;###autoload
+(defun bidi-reordering-character-toggle-visibility ()
+  "Toggle the visibility of bidi reordering characters."
+  (interactive)
+  (setq bidi-reordering-characters-visible
+        (not bidi-reordering-characters-visible))
+  (when bidi-reordering-characters-visible
+    (aset buffer-display-table ?‪ [?→])
+    (aset buffer-display-table ?‫ [?←])
+    (aset buffer-display-table ?‭ [?→])
+    (aset buffer-display-table ?‮ [?←])
+    (aset buffer-display-table ?⁦ [?→])
+    (aset buffer-display-table ?⁧ [?←])
+    (aset buffer-display-table ?⁨ [?↓])
+    (aset buffer-display-table ?‬ [?↑])
+    (aset buffer-display-table ?⁩ [?↑]))
+  (unless bidi-reordering-characters-visible
+    (aset buffer-display-table ?‪ nil)
+    (aset buffer-display-table ?‫ nil)
+    (aset buffer-display-table ?‭ nil)
+    (aset buffer-display-table ?‮ nil)
+    (aset buffer-display-table ?⁦ nil)
+    (aset buffer-display-table ?⁧ nil)
+    (aset buffer-display-table ?⁨ nil)
+    (aset buffer-display-table ?‬ nil)
+    (aset buffer-display-table ?⁩ nil)))
+
 ;;;###autoload
 (define-globalized-minor-mode global-prettify-symbols-mode
   prettify-symbols-mode turn-on-prettify-symbols-mode)
@@ -300,7 +344,8 @@ prog-mode
   (setq-local parse-sexp-ignore-comments t)
   (add-hook 'context-menu-functions 'prog-context-menu 10 t)
   ;; Any programming language is always written left to right.
-  (setq bidi-paragraph-direction 'left-to-right))
+  (setq bidi-paragraph-direction 'left-to-right)
+  (bidi-reordering-characters-visible))
 
 (provide 'prog-mode)
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 14:46                                     ` tomas
@ 2021-11-03 17:13                                       ` Eli Zaretskii
  2021-11-03 17:34                                         ` tomas
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 17:13 UTC (permalink / raw)
  To: tomas; +Cc: emacs-devel

> Date: Wed, 3 Nov 2021 15:46:06 +0100
> From: tomas@tuxteam.de
> Cc: emacs-devel@gnu.org
> 
> >     int hi = 5;
> >     int שָׁלוֹם = hi;
> >     int hello = 10;
> >     int السّلامعليك = hello;
> >     myfun(שָׁלוֹם ,السّلامعليكم)
> >
> > There's no override here, but did I call `myfun` with args 5 and 10 or
> > did I call it with args 10 and 5?
> 
> ... the args are 10 and 5, because the whole arg list is RTL.

No, they are 5 and 10 (assuming you read this left to right ;-)
Move the cursor with C-f from myfun, and you will see which one is the
first and which one the second.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 15:07   ` Reini Urban
  2021-11-03 15:43     ` Stefan Monnier
@ 2021-11-03 17:24     ` Eli Zaretskii
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 17:24 UTC (permalink / raw)
  To: Reini Urban; +Cc: emacs-devel

> From: Reini Urban <reini.urban@gmail.com>
> Date: Wed, 3 Nov 2021 16:07:51 +0100
> 
> The issue is that libc, the C standard committee, linux and most others are ignoring the unicode identifier
> security guidelines.
> Identifiers must be identifiable, but strings should not be touched.
> 
> Identifiers are all names, pathnames, variable names, user names, ... but not arbitrary strings.
> IDE's are just one place to fix it (that's why glib does it), but the core is more important.
> 
> The ones who do care about, like java (the compiler), my cperl (the compiler and runtime, because it is
> dynamic), rust (the compiler), glib (the library), do follow these guidelines.
> All C compilers and most others are insecure. Linux Filesystems are insecure. The old APPLE Filesystem
> was secure, the new is again insecure.
> Also the libc's cannot deal with de-normalized characters at all. grep, sed, coreutils all have outstanding
> unorm patches, because libunicode is too slow. Because it iterates over the string via callbacks.
> 
> In short you need to normalize each identifier, check for proper XID_Start/XID_Continue, 
> check your document for mixed scripts (several combinations are allowed, several disallowed, 
> HAN unification did a good job, but greek vs cyrillic is the worst), and forbid bidi changes.

I'm not sure I follow: the examples in the original paper which
sparked all this brouhaha didn't touch any identifiers.  All the
identifiers in those examples were perfectly compliant with the
Unicode guidelines, AFAIR.  What the examples did was insert
directional format controls so as to reorder _punctuation_ characters,
in a way that changes the visual appearance and the interpreted
semantics of the code.  All of the format controls were inserted
within whitespace, not inside any identifiers.

So I'm not sure how what you describe is relevant to the issue at hand;
could you perhaps explain?
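
For reference, the controls in question can be located with a plain regexp search, whatever the surrounding identifiers are. This is only a sketch: `my-bidi-control-regexp` and `my-bidi-controls-in-string` are hypothetical names introduced here, and the character ranges are the nine explicit formatting controls (U+202A..U+202E and U+2066..U+2069).

```elisp
;; Hypothetical names, for illustration only; not part of Emacs.
(defconst my-bidi-control-regexp "[\u202A-\u202E\u2066-\u2069]"
  "Regexp matching the nine explicit bidi formatting controls.")

(defun my-bidi-controls-in-string (string)
  "Return a list of (INDEX . NAME) for each bidi control in STRING."
  (let ((start 0)
        found)
    (while (string-match my-bidi-control-regexp string start)
      (let ((pos (match-beginning 0)))
        (push (cons pos (get-char-code-property (aref string pos) 'name))
              found))
      (setq start (match-end 0)))
    (nreverse found)))

(my-bidi-controls-in-string "if (x)\u202E /* c */")
;; => ((6 . "RIGHT-TO-LEFT OVERRIDE"))
```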




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 17:13                                       ` Eli Zaretskii
@ 2021-11-03 17:34                                         ` tomas
  0 siblings, 0 replies; 172+ messages in thread
From: tomas @ 2021-11-03 17:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1261 bytes --]

On Wed, Nov 03, 2021 at 07:13:23PM +0200, Eli Zaretskii wrote:
> > Date: Wed, 3 Nov 2021 15:46:06 +0100
> > From: tomas@tuxteam.de
> > Cc: emacs-devel@gnu.org
> > 
> > >     int hi = 5;
> > >     int שָׁלוֹם = hi;
> > >     int hello = 10;
> > >     int السّلامعليك = hello;
> > >     myfun(שָׁלוֹם ,السّلامعليكم)
> > >
> > > There's no override here, but did I call `myfun` with args 5 and 10 or
> > > did I call it with args 10 and 5?
> > 
> > ... the args are 10 and 5, because the whole arg list is RTL.
> 
> No, they are 5 and 10

Er - you are right, of course.

>                        (assuming you read this left to right ;-)

:-D

> Move the cursor with C-f from myfun, and you will see which one is the
> first and which one the second.

Yes. I got confused at another point. No, wait! Now I see what
happened to me: vim, where I'm editing this, shows things in
the wrong order! It's not doing RTL at all. So while I thought
I was looking at what happens in Emacs, and not paying attention
to what my mailer shows -- somehow I was picking it up, subliminally.

So sorry for the noise -- and thanks for teaching me something.
And for your very special humour :)

Cheers
 - t



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-02 21:28                           ` Unicode confusables and reordering characters considered harmful, a simple solution Daniel Brooks
  2021-11-03 13:30                             ` Eli Zaretskii
@ 2021-11-03 17:41                             ` Yuri Khan
  2021-11-03 17:56                               ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Yuri Khan @ 2021-11-03 17:41 UTC (permalink / raw)
  To: Daniel Brooks
  Cc: Eli Zaretskii, Emacs developers, Stefan Kangas,
	Clément Pit-Claudel, Stefan Monnier

On Wed, 3 Nov 2021 at 04:29, Daniel Brooks <db48x@db48x.net> wrote:

> We already have it; it is called whitespace-mode. It’s not perfect, but
> this morning I customized mine to make these characters more obvious:
>
> (custom-set-variables
>  '(whitespace-display-mappings
>    '((space-mark 32 [183] [46])
>      (space-mark 160 [164] [95])
>      (newline-mark 10 [36 10])
>      (tab-mark 9 [187 9] [92 9])
>      (space-mark #x202A [#x21D2]) ; ⇒ LEFT-TO-RIGHT EMBEDDING
>      (space-mark #x202B [#x21D0]) ; ⇐ RIGHT-TO-LEFT EMBEDDING
>      (space-mark #x202D [#x2192]) ; → LEFT-TO-RIGHT OVERRIDE
>      (space-mark #x202E [#x2190]) ; ← RIGHT-TO-LEFT OVERRIDE
>      (space-mark #x2066 [#x21E5]) ; ⇥ LEFT-TO-RIGHT ISOLATE
>      (space-mark #x2067 [#x21E4]) ; ⇤ RIGHT-TO-LEFT ISOLATE
>      (space-mark #x2068 [#x21A7]) ; ↧ FIRST STRONG ISOLATE
>      (space-mark #x202C [#x21D1]) ; ⇑ POP DIRECTIONAL FORMATTING
>      (space-mark #x2069 [#x2912]) ; ⤒ POP DIRECTIONAL ISOLATE
>      )))

I like! I already enable whitespace-mode in prog-mode and text-mode
(and that leaves out read-only modes such as Info and Dired where I
don’t want visible whitespace, and various unclassified modes such as
conf-mode where I do but haven’t gotten around to setting them all
up).




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 16:01                                   ` Gregory Heytings
@ 2021-11-03 17:44                                     ` Eli Zaretskii
  2021-11-03 17:53                                       ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 17:44 UTC (permalink / raw)
  To: Gregory Heytings; +Cc: cpitclaudel, stefan, monnier, emacs-devel

> Date: Wed, 03 Nov 2021 16:01:20 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: stefan@marxist.se, emacs-devel@gnu.org, cpitclaudel@gmail.com, 
>     monnier@iro.umontreal.ca
> 
> > It could, but I don't think we should install such features.
> 
> Why not?

Because it doesn't pass my quality-control tests.

> Is this not simply an improved (and built-in) version of your 
> initial suggestion to use glyphless-char-display?

glyphless-char-display already exists and is built-in.  People who
want to have these characters stand out for some reason can use it,
that's why it is open to customization.  What you propose is "yet
another mechanism similar to glyphless-char-display", and I don't see
why we should have this for a small group of characters.  We already
have a small mess because display-tables and glyphless-char-display
produce a race; I certainly don't think we should introduce yet
another, third mechanism for a similar purpose.

I could consider adding an additional METHOD to those we already
support there, if someone thinks the existing ones are insufficient.
Not sure if this is needed, but it at least makes some sense, assuming
the proposal would be reasonable and not limited to a small group of
codepoints.

> Except that glyphless-char-display is global (whereas
> buffer-display-table is buffer-local), and that a tofu is not as
> visible as a character highlighted with the font-lock-warning-face.

Patches to allow glyphless display to be controlled on a buffer-local
basis would be welcome, I think this would be a good enhancement.
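
For those who want to try the existing mechanism on just these characters, here is a minimal sketch. It assumes a hypothetical helper name, `my-bidi-controls-as-hex-code`, and picks the `hex-code` display method as an example. Since `glyphless-char-display` is global, this affects every buffer.

```elisp
;; Hypothetical helper, for illustration only; not part of Emacs.
;; `glyphless-char-display' is a global char-table, so the change below
;; applies to all buffers.
(defun my-bidi-controls-as-hex-code ()
  "Display the explicit bidi formatting controls as hex-code boxes."
  (dolist (range '((#x202A . #x202E) (#x2066 . #x2069)))
    (set-char-table-range glyphless-char-display range 'hex-code)))

(my-bidi-controls-as-hex-code)
```

Other methods accepted by the char-table, such as `empty-box` or `thin-space`, can be substituted for `hex-code` in the same way.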




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 17:44                                     ` Eli Zaretskii
@ 2021-11-03 17:53                                       ` Gregory Heytings
  0 siblings, 0 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 17:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stefan, monnier, emacs-devel


>>> It could, but I don't think we should install such features.
>>
>> Why not?
>
> Because it doesn't pass my quality-control tests.
>

That's a reason ;-)

>> Is this not simply an improved (and built-in) version of your initial 
>> suggestion to use glyphless-char-display?
>
> glyphless-char-display already exists and is built-in.  People who want 
> to have these characters stand out for some reason can use it, that's 
> why it is open to customization.  What you propose is "yet another 
> mechanism similar to glyphless-char-display", and I don't see why we 
> should have this for a small group of characters.  We already have a 
> small mess because display-tables and glyphless-char-display produce a 
> race; I certainly don't think we should introduce yet another, third 
> mechanism for a similar purpose.
>

I don't really see why this is a third mechanism, buffer-display-table 
already exists and is built-in, too.  Both are char tables.  I always 
thought that glyphless-char-display was more or less the global equivalent 
of buffer-display-table, but apparently I'm missing something.

> I could consider adding an additional METHOD to those we already support 
> there, if someone thinks the existing ones are insufficient. Not sure if 
> this is needed, but it at least makes some sense, assuming the proposal 
> would be reasonable and not limited to a small group of codepoints.
>
>> Except that glyphless-char-display is global (whereas
>> buffer-display-table is buffer-local), and that a tofu is not as
>> visible as a character highlighted with the font-lock-warning-face.
>
> Patches to allow glyphless display to be controlled on a buffer-local
> basis would be welcome, I think this would be a good enhancement.
>




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 17:41                             ` Yuri Khan
@ 2021-11-03 17:56                               ` Eli Zaretskii
  2021-11-03 18:20                                 ` Juri Linkov
                                                   ` (2 more replies)
  0 siblings, 3 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 17:56 UTC (permalink / raw)
  To: Yuri Khan; +Cc: db48x, cpitclaudel, stefan, monnier, emacs-devel

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Thu, 4 Nov 2021 00:41:07 +0700
> Cc: Eli Zaretskii <eliz@gnu.org>, Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stefan Kangas <stefan@marxist.se>, Stefan Monnier <monnier@iro.umontreal.ca>, 
> 	Emacs developers <emacs-devel@gnu.org>
> 
> >      (space-mark #x2068 [#x21A7]) ; ↧ FIRST STRONG ISOLATE

This one doesn't make sense, at least if one knows what FSI means and
does.

> >      (space-mark #x202C [#x21D1]) ; ⇑ POP DIRECTIONAL FORMATTING
> >      (space-mark #x2069 [#x2912]) ; ⤒ POP DIRECTIONAL ISOLATE
> >      )))
> 
> I like! I already enable whitespace-mode in prog-mode and text-mode
> (and that leaves out read-only modes such as Info and Dired where I
> don’t want visible whitespace, and various unclassified modes such as
> conf-mode where I do but haven’t gotten around to setting them all
> up).

The problem with these remappings is that you then get to somehow
discern between the remapped characters and the real characters which
look identical on display.

Also, this will disrupt alignment and make text using these controls
much harder to read.  E.g., the few places in TUTORIAL.he which use
those controls are barely readable after turning the above on.

Anyway, if one wants to be able to highlight certain characters on
display, one could also use highlight-regexp, I think.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 17:56                               ` Eli Zaretskii
@ 2021-11-03 18:20                                 ` Juri Linkov
  2021-11-03 19:02                                   ` Gregory Heytings
  2021-11-03 18:45                                 ` Yuri Khan
  2021-11-03 21:13                                 ` Daniel Brooks
  2 siblings, 1 reply; 172+ messages in thread
From: Juri Linkov @ 2021-11-03 18:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, Yuri Khan

> Anyway, if one wants to be able to highlight certain characters on
> display, one could also use highlight-regexp, I think.

Or markchars.el with markchars-what customized to markchars-confusables.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 17:56                               ` Eli Zaretskii
  2021-11-03 18:20                                 ` Juri Linkov
@ 2021-11-03 18:45                                 ` Yuri Khan
  2021-11-03 19:09                                   ` Eli Zaretskii
  2021-11-03 21:13                                 ` Daniel Brooks
  2 siblings, 1 reply; 172+ messages in thread
From: Yuri Khan @ 2021-11-03 18:45 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Daniel Brooks, Clément Pit-Claudel, Stefan Kangas,
	Stefan Monnier, Emacs developers

On Thu, 4 Nov 2021 at 00:56, Eli Zaretskii <eliz@gnu.org> wrote:

> The problem with these remappings is that you then get to somehow
> discern between the remapped characters and the real characters which
> look identical on display.

Real characters are fontified as whichever syntax unit they belong to.
Remapped characters are fontified as whitespace-space-face or
whitespace-hspace-face depending on whether you add them to
whitespace-space-regexp or whitespace-hspace-regexp. (I’m interpreting
Daniel as wanting to add two more customization points, perhaps
whitespace-bidi-regexp and whitespace-bidi-face, and a new enum value
for use in whitespace-style.)

> Also, this will disrupt alignment

We already have this issue with TABs — when a tab would expand to a
single space, a remapped tab expands to its replacement glyph and a
whole tab-width’s worth of spaces. Yes, it’s slightly annoying.

> and make text using these controls
> much harder to read.  E.g., the few places in TUTORIAL.he which use
> those controls are barely readable after turning the above on.

I tried that and I find it okay. When reading the text, I tune out the
yellow (my whitespace-hspace-face), and when focusing on formatting, I
tune out the gray (default). Same as my mother used to focus on black
Latin or red Cyrillic letters when she was learning to type on a
computer keyboard.

Perhaps a subtler face or glyphs will make the issue less annoying
while keeping its usefulness.

> Anyway, if one wants to be able to highlight certain characters on
> display, one could also use highlight-regexp, I think.

One does not only want to highlight, but also to actually see and
distinguish certain characters, including the case where several
consecutive such characters are present. Unfortunately for alignment,
this requires width.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 11:29                           ` Andreas Schwab
@ 2021-11-03 18:47                             ` Stefan Monnier
  2021-11-03 18:52                               ` Yuri Khan
                                                 ` (2 more replies)
  0 siblings, 3 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-03 18:47 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Gregory Heytings, Eli Zaretskii, Stefan Kangas, cpitclaudel,
	emacs-devel

>> +  (aset buffer-display-table ?‪ [?→])
>> +  (aset buffer-display-table ?‫ [?←])
>> +  (aset buffer-display-table ?‭ [?→])
>> +  (aset buffer-display-table ?‮ [?←])
>> +  (aset buffer-display-table ?⁦ [?→])
>> +  (aset buffer-display-table ?⁧ [?←])
>> +  (aset buffer-display-table ?⁨ [?↓])
>> +  (aset buffer-display-table ?‬ [?↑])
>> +  (aset buffer-display-table ?⁩ [?↑])
>
> A perfect example of how legitimate use of these characters can mess up
> your source. :-)

FWIW, I think the source would be clearer if it used `?\u{CHARNAME}`
instead of the literal chars.


        Stefan "who usually prefers `?<char>` over `?\u<...>`, but not
                when the char is non-printing"





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 18:47                             ` Stefan Monnier
@ 2021-11-03 18:52                               ` Yuri Khan
  2021-11-03 19:19                                 ` Stefan Monnier
  2021-11-03 19:28                               ` Gregory Heytings
  2021-11-03 19:30                               ` Eli Zaretskii
  2 siblings, 1 reply; 172+ messages in thread
From: Yuri Khan @ 2021-11-03 18:52 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Clément Pit-Claudel, Stefan Kangas, Emacs developers,
	Gregory Heytings, Andreas Schwab, Eli Zaretskii

On Thu, 4 Nov 2021 at 01:48, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> FWIW, I think the source would be clearer if it used `?\u{CHARNAME}`
> instead of the literal chars.

Perhaps even ‘?\N{CHARNAME}’ rather than ‘?\uCHARCODE’.

>         Stefan "who usually prefers `?<char>` over `?\u<...>`, but not
>                 when the char is non-printing"

+1.
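
To make the difference between the two notations concrete, here is a small sketch. The constant names are hypothetical and introduced only for illustration. Both forms read as the same character code, but only the named form says which control it is.

```elisp
;; Hypothetical constants, for illustration only.
(defconst my-rlo-by-name ?\N{RIGHT-TO-LEFT OVERRIDE}
  "RLO written with the character-name escape.")

(defconst my-rlo-by-code ?\u202E
  "RLO written with the four-hex-digit code point escape.")

(= my-rlo-by-name my-rlo-by-code #x202E)
;; => t
```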




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 13:37                           ` Eli Zaretskii
@ 2021-11-03 18:53                             ` Manuel Giraud
  2021-11-03 19:36                               ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Manuel Giraud @ 2021-11-03 18:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Gregory Heytings, emacs-devel, stefan, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

> I object already.  We shouldn't settle for easy half-solutions when we
> are fully capable of implementing the much more complete and accurate
> ones.

Sorry if I missed them but do you have an example of
bidi-find-overridden-directionality usage? For instance Clément's Python
sample.
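
To show the kind of call being asked about, here is a minimal sketch on a plain string rather than on the Python sample. The wrapper name `my-first-overridden-index` is hypothetical. It assumes the three-argument form (FROM TO OBJECT) and that the function consults the current buffer's `bidi-display-reordering`, which is why the call is made in a temporary buffer with that variable set.

```elisp
;; Hypothetical wrapper, for illustration only; not part of Emacs.
(defun my-first-overridden-index (string)
  "Return the index of the first overridden character in STRING, or nil."
  (with-temp-buffer
    ;; Make sure bidi reordering is on in the buffer used for the check.
    (setq bidi-display-reordering t)
    (bidi-find-overridden-directionality 0 (length string) string)))

;; "def" is Latin text forced right-to-left by RLO ... PDF.
(my-first-overridden-index "abc\u202Edef\u202C")

;; No controls at all: nothing is overridden.
(my-first-overridden-index "abcdef")
;; => nil
```

The first call should return a non-nil index, pointing into the overridden "def" run. On a buffer the same function takes buffer positions and nil as OBJECT.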
-- 
Manuel Giraud




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 18:20                                 ` Juri Linkov
@ 2021-11-03 19:02                                   ` Gregory Heytings
  2021-11-03 19:46                                     ` Eli Zaretskii
  2021-11-04  8:44                                     ` Juri Linkov
  0 siblings, 2 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 19:02 UTC (permalink / raw)
  To: Juri Linkov
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, Eli Zaretskii,
	Yuri Khan


>> Anyway, if one wants to be able to highlight certain characters on 
>> display, one could also use highlight-regexp, I think.
>
> Or markchars.el with markchars-what customized to markchars-confusables.
>

Neither would work AFAICS, because these characters are glyphless. 
Highlighting a glyphless character will not make it more visible.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 18:45                                 ` Yuri Khan
@ 2021-11-03 19:09                                   ` Eli Zaretskii
  2021-11-03 19:35                                     ` Yuri Khan
  2021-11-03 19:54                                     ` Daniel Brooks
  0 siblings, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 19:09 UTC (permalink / raw)
  To: Yuri Khan; +Cc: db48x, cpitclaudel, stefan, monnier, emacs-devel

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Thu, 4 Nov 2021 01:45:17 +0700
> Cc: Daniel Brooks <db48x@db48x.net>, Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stefan Kangas <stefan@marxist.se>, Stefan Monnier <monnier@iro.umontreal.ca>, 
> 	Emacs developers <emacs-devel@gnu.org>
> 
> On Thu, 4 Nov 2021 at 00:56, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > The problem with these remappings is that you then get to somehow
> > discern between the remapped characters and the real characters which
> > look identical on display.
> 
> Real characters are fontified as whichever syntax unit they belong to.
> Remapped characters are fontified as whitespace-space-face or
> whitespace-hspace-face depending on whether you add them to
> whitespace-space-regexp or whitespace-hspace-regexp.

I just used what Daniel posted, and that doesn't display the remapped
characters in any distinct face.  Gotta tinker?

> > Also, this will disrupt alignment
> 
> We already have this issue with TABs — when a tab would expand to a
> single space, a remapped tab expands to its replacement glyph and a
> whole tab-width’s worth of spaces. Yes, it’s slightly annoying.

Yes, it's a general problem with remapping.

> > and make text using these controls
> > much harder to read.  E.g., the few places in TUTORIAL.he which use
> > those controls are barely readable after turning the above on.
> 
> I tried that and I find it okay.

Do you read Hebrew?  Those characters look like line noise there,
whereas the text with the default display is perfectly readable, and
most people won't even know these controls are there (as intended).

> > Anyway, if one wants to be able to highlight certain characters on
> > display, one could also use highlight-regexp, I think.
> 
> One does not only want to highlight, but also to actually see and
> distinguish certain characters

What for?  The vast majority of people won't have any idea what the
effect of each of these controls is, and how it differs from the others.
Even I many times need to talk myself through their effect on display.
The UBA spec weighs in at more than 30 pages of highly technical text,
and I don't expect people to memorize it by heart.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 18:52                               ` Yuri Khan
@ 2021-11-03 19:19                                 ` Stefan Monnier
  0 siblings, 0 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-03 19:19 UTC (permalink / raw)
  To: Yuri Khan
  Cc: Andreas Schwab, Gregory Heytings, Eli Zaretskii, Stefan Kangas,
	Clément Pit-Claudel, Emacs developers

Yuri Khan [2021-11-04 01:52:55] wrote:
> On Thu, 4 Nov 2021 at 01:48, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> FWIW, I think the source would be clearer if it used `?\u{CHARNAME}`
>> instead of the literal chars.
> Perhaps even ‘?\N{CHARNAME}’ rather than ‘?\uCHARCODE’.

That's what I meant, yes, thank you,


        Stefan





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 18:47                             ` Stefan Monnier
  2021-11-03 18:52                               ` Yuri Khan
@ 2021-11-03 19:28                               ` Gregory Heytings
  2021-11-03 19:32                                 ` Stefan Monnier
  2021-11-03 19:51                                 ` Eli Zaretskii
  2021-11-03 19:30                               ` Eli Zaretskii
  2 siblings, 2 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 19:28 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Stefan Kangas, Eli Zaretskii, Andreas Schwab, cpitclaudel,
	emacs-devel

[-- Attachment #1: Type: text/plain, Size: 428 bytes --]


>
> FWIW, I think the source would be clearer if it used `?\u{CHARNAME}` 
> instead of the literal chars.
>

I did not know that feature exists, thank you!

Here's the updated patch, which is indeed much more readable.

And a screenshot of TUTORIAL.he, which doesn't seem unreadable at all (but 
of course that feature would need to be turned off for beginners who read 
TUTORIAL.he, it's meant for source code, not for prose).

[-- Attachment #2: Type: text/x-diff; name=Make-bidi-reordering-characters-visible.patch, Size: 3433 bytes --]

From f9354f99d1d27212659cbee0420b35e883373a5d Mon Sep 17 00:00:00 2001
From: Gregory Heytings <gregory@heytings.org>
Date: Wed, 3 Nov 2021 19:22:07 +0000
Subject: [PATCH] Make bidi reordering characters visible

* lisp/progmodes/prog-mode.el (bidi-reordering-characters-visible,
bidi-reordering-characters-fontify,
bidi-reordering-character-toggle-visibility): New functions.
---
 lisp/progmodes/prog-mode.el | 41 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/lisp/progmodes/prog-mode.el b/lisp/progmodes/prog-mode.el
index db350a5f70..7fcdc7b5e3 100644
--- a/lisp/progmodes/prog-mode.el
+++ b/lisp/progmodes/prog-mode.el
@@ -293,6 +293,44 @@ turn-on-prettify-symbols-mode
 (define-globalized-minor-mode global-prettify-symbols-mode
   prettify-symbols-mode turn-on-prettify-symbols-mode)
 
+(defvar-local bidi-reordering-characters-visible nil
+  "Internal variable used by `bidi-reordering-characters-visible'.")
+
+(defun bidi-reordering-characters-fontify ()
+  "Fontify bidi reordering characters with `font-lock-warning-face'."
+  (font-lock-add-keywords
+   nil
+   '(("\N{LEFT-TO-RIGHT EMBEDDING}\\|\N{RIGHT-TO-LEFT EMBEDDING}\\|\
+\N{LEFT-TO-RIGHT OVERRIDE}\\|\N{RIGHT-TO-LEFT OVERRIDE}\\|\
+\N{LEFT-TO-RIGHT ISOLATE}\\|\N{RIGHT-TO-LEFT ISOLATE}\\|\
+\N{FIRST STRONG ISOLATE}\\|\N{POP DIRECTIONAL FORMATTING}\\|\
+\N{POP DIRECTIONAL ISOLATE}" . (0 'font-lock-warning-face t)))))
+
+(defun bidi-reordering-characters-visible ()
+  "Display bidi reordering characters as arrows."
+  (setq buffer-display-table (or buffer-display-table
+                                 standard-display-table
+                                 (make-display-table)))
+  (bidi-reordering-character-toggle-visibility)
+  (add-hook 'font-lock-mode-hook #'bidi-reordering-characters-fontify))
+
+;;;###autoload
+(defun bidi-reordering-character-toggle-visibility ()
+  "Toggle the visibility of bidi reordering characters."
+  (interactive)
+  (setq bidi-reordering-characters-visible
+        (not bidi-reordering-characters-visible))
+  (let ((v bidi-reordering-characters-visible))
+    (aset buffer-display-table ?\N{LEFT-TO-RIGHT EMBEDDING} (if v [?→] nil))
+    (aset buffer-display-table ?\N{RIGHT-TO-LEFT EMBEDDING} (if v [?←] nil))
+    (aset buffer-display-table ?\N{LEFT-TO-RIGHT OVERRIDE} (if v [?→] nil))
+    (aset buffer-display-table ?\N{RIGHT-TO-LEFT OVERRIDE} (if v [?←] nil))
+    (aset buffer-display-table ?\N{LEFT-TO-RIGHT ISOLATE} (if v [?→] nil))
+    (aset buffer-display-table ?\N{RIGHT-TO-LEFT ISOLATE} (if v [?←] nil))
+    (aset buffer-display-table ?\N{FIRST STRONG ISOLATE} (if v [?↓] nil))
+    (aset buffer-display-table ?\N{POP DIRECTIONAL FORMATTING} (if v [?↑] nil))
+    (aset buffer-display-table ?\N{POP DIRECTIONAL ISOLATE} (if v [?↑] nil))))
+
 ;;;###autoload
 (define-derived-mode prog-mode fundamental-mode "Prog"
   "Major mode for editing programming language source code."
@@ -300,7 +338,8 @@ prog-mode
   (setq-local parse-sexp-ignore-comments t)
   (add-hook 'context-menu-functions 'prog-context-menu 10 t)
   ;; Any programming language is always written left to right.
-  (setq bidi-paragraph-direction 'left-to-right))
+  (setq bidi-paragraph-direction 'left-to-right)
+  (bidi-reordering-characters-visible))
 
 (provide 'prog-mode)
 
-- 
2.33.0


[-- Attachment #3: Type: image/png, Size: 119662 bytes --]


* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 18:47                             ` Stefan Monnier
  2021-11-03 18:52                               ` Yuri Khan
  2021-11-03 19:28                               ` Gregory Heytings
@ 2021-11-03 19:30                               ` Eli Zaretskii
  2021-11-03 19:34                                 ` Andreas Schwab
  2 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 19:30 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: stefan, gregory, schwab, cpitclaudel, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Gregory Heytings <gregory@heytings.org>,  Eli Zaretskii <eliz@gnu.org>,
>  Stefan Kangas <stefan@marxist.se>,  cpitclaudel@gmail.com,
>  emacs-devel@gnu.org
> Date: Wed, 03 Nov 2021 14:47:26 -0400
> 
> >> +  (aset buffer-display-table ?‪ [?→])
> >> +  (aset buffer-display-table ?‫ [?←])
> >> +  (aset buffer-display-table ?‭ [?→])
> >> +  (aset buffer-display-table ?‮ [?←])
> >> +  (aset buffer-display-table ?⁦ [?→])
> >> +  (aset buffer-display-table ?⁧ [?←])
> >> +  (aset buffer-display-table ?⁨ [?↓])
> >> +  (aset buffer-display-table ?‬ [?↑])
> >> +  (aset buffer-display-table ?⁩ [?↑])
> >
> > A perfect example of how legitimate use of these characters can mess up
> > your source. :-)
> 
> FWIW, I think the source would be clearer if it used `?\u{CHARNAME}`
> instead of the literal chars.

Huh?  What's unclear in a line like this one:

  (aset buffer-display-table ?‫ [?←])




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 19:28                               ` Gregory Heytings
@ 2021-11-03 19:32                                 ` Stefan Monnier
  2021-11-03 19:41                                   ` Yuri Khan
  2021-11-03 20:12                                   ` Gregory Heytings
  2021-11-03 19:51                                 ` Eli Zaretskii
  1 sibling, 2 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-03 19:32 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: Andreas Schwab, Eli Zaretskii, Stefan Kangas, emacs-devel,
	cpitclaudel

> +   '(("\N{LEFT-TO-RIGHT EMBEDDING}\\|\N{RIGHT-TO-LEFT EMBEDDING}\\|\
> +\N{LEFT-TO-RIGHT OVERRIDE}\\|\N{RIGHT-TO-LEFT OVERRIDE}\\|\
> +\N{LEFT-TO-RIGHT ISOLATE}\\|\N{RIGHT-TO-LEFT ISOLATE}\\|\
> +\N{FIRST STRONG ISOLATE}\\|\N{POP DIRECTIONAL FORMATTING}\\|\
> +\N{POP DIRECTIONAL ISOLATE}" . (0 'font-lock-warning-face t)))))

A [...] would be a lot more efficient than this "...\\|...\\|...\\|...".
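
Spelled out, the bracket form could look roughly like this. The constant name `my-bidi-controls-char-class` is hypothetical; the nine character names are the ones already used in the patch.

```elisp
;; Hypothetical constant, for illustration only.
(defconst my-bidi-controls-char-class
  (concat "["
          "\N{LEFT-TO-RIGHT EMBEDDING}\N{RIGHT-TO-LEFT EMBEDDING}"
          "\N{LEFT-TO-RIGHT OVERRIDE}\N{RIGHT-TO-LEFT OVERRIDE}"
          "\N{LEFT-TO-RIGHT ISOLATE}\N{RIGHT-TO-LEFT ISOLATE}"
          "\N{FIRST STRONG ISOLATE}\N{POP DIRECTIONAL FORMATTING}"
          "\N{POP DIRECTIONAL ISOLATE}"
          "]")
  "Character class matching any one bidi formatting control.")

(string-match-p my-bidi-controls-char-class "x\N{RIGHT-TO-LEFT OVERRIDE}y")
;; => 1
```

A character class is matched in a single step per position, whereas the alternation tries each branch in turn, which is presumably the efficiency point here.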

> +(defun bidi-reordering-character-toggle-visibility ()
> +  "Toggle the visibility of bidi reordering characters."
> +  (interactive)
> +  (setq bidi-reordering-characters-visible
> +        (not bidi-reordering-characters-visible))

Aka

    (define-minor-mode bidi-reordering-characters-visible
      "Make the bidi reordering characters visible."
      :global t
      ...)
    (define-obsolete-function-alias
      'bidi-reordering-character-toggle-visibility
      #'bidi-reordering-characters-visible "...")
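
To illustrate the shape of such a mode, here is a filled-in sketch under my own assumptions; it is not the patch's code. The names `my-bidi-control-glyphs` and `my-bidi-controls-visible-mode` are hypothetical. The mode is buffer-local rather than `:global t`, because `buffer-display-table` is per-buffer, and it always creates a fresh display table instead of reusing `standard-display-table`, so that other buffers are not affected.

```elisp
;; Hypothetical names, for illustration only; not the code from the patch.
(defvar my-bidi-control-glyphs
  '((?\N{LEFT-TO-RIGHT EMBEDDING} . ?→)
    (?\N{RIGHT-TO-LEFT EMBEDDING} . ?←)
    (?\N{LEFT-TO-RIGHT OVERRIDE} . ?→)
    (?\N{RIGHT-TO-LEFT OVERRIDE} . ?←)
    (?\N{LEFT-TO-RIGHT ISOLATE} . ?→)
    (?\N{RIGHT-TO-LEFT ISOLATE} . ?←)
    (?\N{FIRST STRONG ISOLATE} . ?↓)
    (?\N{POP DIRECTIONAL FORMATTING} . ?↑)
    (?\N{POP DIRECTIONAL ISOLATE} . ?↑))
  "Alist mapping each bidi formatting control to the arrow shown for it.")

(define-minor-mode my-bidi-controls-visible-mode
  "Show bidi formatting controls as arrows in the current buffer."
  :lighter " BidiVis"
  ;; Setting `buffer-display-table' makes it local to this buffer.
  (unless buffer-display-table
    (setq buffer-display-table (make-display-table)))
  (dolist (entry my-bidi-control-glyphs)
    ;; When the mode is on, map the control to a one-glyph vector;
    ;; when it is off, reset the entry to nil (default display).
    (aset buffer-display-table (car entry)
          (and my-bidi-controls-visible-mode
               (vector (cdr entry))))))
```

With this, `M-x my-bidi-controls-visible-mode` toggles the arrows, and adding the mode function to `prog-mode-hook` would enable it for source buffers only.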


-- Stefan





* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 19:30                               ` Eli Zaretskii
@ 2021-11-03 19:34                                 ` Andreas Schwab
  2021-11-03 19:54                                   ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Andreas Schwab @ 2021-11-03 19:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gregory, stefan, Stefan Monnier, cpitclaudel, emacs-devel

On Nov 03 2021, Eli Zaretskii wrote:

>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Cc: Gregory Heytings <gregory@heytings.org>,  Eli Zaretskii <eliz@gnu.org>,
>>  Stefan Kangas <stefan@marxist.se>,  cpitclaudel@gmail.com,
>>  emacs-devel@gnu.org
>> Date: Wed, 03 Nov 2021 14:47:26 -0400
>> 
>> >> +  (aset buffer-display-table ?‪ [?→])
>> >> +  (aset buffer-display-table ?‫ [?←])
>> >> +  (aset buffer-display-table ?‭ [?→])
>> >> +  (aset buffer-display-table ?‮ [?←])
>> >> +  (aset buffer-display-table ?⁦ [?→])
>> >> +  (aset buffer-display-table ?⁧ [?←])
>> >> +  (aset buffer-display-table ?⁨ [?↓])
>> >> +  (aset buffer-display-table ?‬ [?↑])
>> >> +  (aset buffer-display-table ?⁩ [?↑])
>> >
>> > A perfect example of how legitimate use of these characters can mess up
>> > your source. :-)
>> 
>> FWIW, I think the source would be clearer if it used `?\u{CHARNAME}`
>> instead of the literal chars.
>
> Huh?  What's unclear in a line like this one:
>
>   (aset buffer-display-table ?‫ [?←])

Can you explain what `?([←?]' means?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 19:09                                   ` Eli Zaretskii
@ 2021-11-03 19:35                                     ` Yuri Khan
  2021-11-03 20:01                                       ` Eli Zaretskii
  2021-11-03 19:54                                     ` Daniel Brooks
  1 sibling, 1 reply; 172+ messages in thread
From: Yuri Khan @ 2021-11-03 19:35 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Daniel Brooks, Clément Pit-Claudel, Stefan Kangas,
	Stefan Monnier, Emacs developers

On Thu, 4 Nov 2021 at 02:09, Eli Zaretskii <eliz@gnu.org> wrote:

> Do you read Hebrew?

No. I just imagine how I’d perceive the text if I could.

> Those characters look like line noise there,
> whereas the text with the default display is perfectly readable, and
> most people won't even know these controls are there (as intended).

TUTORIAL.he is slightly special, in that both an editor and a
reader[^1] use the same mode (because once in a while the user is
instructed to edit some part of their copy). In most other cases, I
prefer remaps turned on when I’m an editor or reviewer, and off when
I’m a reader.

[^1]: Here, by “editor” and “reader” I mean the human roles, not software.

> > One does not only want to highlight, but also to actually see and
> > distinguish certain characters
>
> What for?  The absolute majority of people won't have any idea what is
> the effect of each of these controls, and how it differs from others.
> Even I many times need to talk myself through their effect on display.
> The UBA spec weighs in at more than 30 pages of highly technical text,
> and I don't expect people to memorize it by heart.

Most people, when in the reader role, probably won’t and shouldn’t have to.

If I’m editing a text in a bidi language, though, I am expected to use
format control characters, and so I must know where they are or are
not. In the same vein, when I edit a program expected to conform to a
coding style, I must know where spaces and tabs are, so I do not
introduce whitespace-only changes or trailing blanks and keep
indentation consistent. Or when I edit anything that will end up as a
web page I want to know which spaces and hyphens are non-breaking, so
the page will wrap correctly no matter how the user resizes their
window and/or zooms the page. (No, I do not trust tools to do these
things right; if they could, we would not need format control
characters at all. I like tools to let me check what they did and
correct if necessary.)



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 18:53                             ` Manuel Giraud
@ 2021-11-03 19:36                               ` Eli Zaretskii
  2021-11-03 21:15                                 ` Manuel Giraud
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 19:36 UTC (permalink / raw)
  To: Manuel Giraud; +Cc: gregory, emacs-devel, stefan, cpitclaudel, monnier

> From: Manuel Giraud <manuel@ledu-giraud.fr>
> Cc: Gregory Heytings <gregory@heytings.org>,  cpitclaudel@gmail.com,
>   stefan@marxist.se,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Wed, 03 Nov 2021 19:53:49 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > I object already.  We shouldn't settle for easy half-solutions when we
> > are fully capable of implementing the much more complete and accurate
> > ones.
> 
> Sorry if I missed them but do you have an example of
> bidi-find-overridden-directionality usage? For instance Clément's Python
> sample.

Try it on the text of this message.

  ‮madam deified kayak‬

And then tell me: Who worshiped whom?
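
A sketch of how the function might be called from Lisp on a string like
the one above.  The three-argument call (FROM, TO, OBJECT) is an
assumption that should be checked against `C-h f
bidi-find-overridden-directionality' in the Emacs at hand, and the helper
name `my-string-has-bidi-override-p' is hypothetical.

```elisp
;; `my-string-has-bidi-override-p' is a hypothetical helper for this sketch.
(defun my-string-has-bidi-override-p (string)
  "Return non-nil if STRING has text whose directionality was overridden.
The value is whatever `bidi-find-overridden-directionality' returns."
  (bidi-find-overridden-directionality 0 (length string) string))

;; The palindromes wrapped in RLO ... PDF, as in the message text.
(my-string-has-bidi-override-p
 "\N{RIGHT-TO-LEFT OVERRIDE}madam deified kayak\N{POP DIRECTIONAL FORMATTING}")

;; Plain left-to-right text has nothing overridden.
(my-string-has-bidi-override-p "madam deified kayak")
```

The first call should report a position inside the overridden text and
the second should return nil.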



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 19:32                                 ` Stefan Monnier
@ 2021-11-03 19:41                                   ` Yuri Khan
  2021-11-03 20:12                                   ` Gregory Heytings
  1 sibling, 0 replies; 172+ messages in thread
From: Yuri Khan @ 2021-11-03 19:41 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Clément Pit-Claudel, Stefan Kangas, Emacs developers,
	Gregory Heytings, Andreas Schwab, Eli Zaretskii

On Thu, 4 Nov 2021 at 02:33, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>
> > +   '(("\N{LEFT-TO-RIGHT EMBEDDING}\\|\N{RIGHT-TO-LEFT EMBEDDING}\\|\
> > +\N{LEFT-TO-RIGHT OVERRIDE}\\|\N{RIGHT-TO-LEFT OVERRIDE}\\|\
> > +\N{LEFT-TO-RIGHT ISOLATE}\\|\N{RIGHT-TO-LEFT ISOLATE}\\|\
> > +\N{FIRST STRONG ISOLATE}\\|\N{POP DIRECTIONAL FORMATTING}\\|\
> > +\N{POP DIRECTIONAL ISOLATE}" . (0 'font-lock-warning-face t)))))
>
> A [...] would be a lot more efficient than this "...\\|...\\|...\\|...".

An (rx (any ?… ?… ?… …)) would be more wrappable/indentable and almost
as performant as a […], though.
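
Spelled out, such an rx form could look roughly like the sketch below.
The name `my-bidi-controls-rx' is hypothetical; the nine characters are
the ones named in the quoted patch.

```elisp
(require 'rx)

;; `my-bidi-controls-rx' is a hypothetical name used only in this sketch.
(defconst my-bidi-controls-rx
  (rx (any ?\N{LEFT-TO-RIGHT EMBEDDING}
           ?\N{RIGHT-TO-LEFT EMBEDDING}
           ?\N{LEFT-TO-RIGHT OVERRIDE}
           ?\N{RIGHT-TO-LEFT OVERRIDE}
           ?\N{LEFT-TO-RIGHT ISOLATE}
           ?\N{RIGHT-TO-LEFT ISOLATE}
           ?\N{FIRST STRONG ISOLATE}
           ?\N{POP DIRECTIONAL FORMATTING}
           ?\N{POP DIRECTIONAL ISOLATE}))
  "Regexp matching any one bidi reordering control character.")

;; FIRST STRONG ISOLATE sits at index 1 of this string.
(string-match-p my-bidi-controls-rx "x\N{FIRST STRONG ISOLATE}") ; => 1
```

Each character stays on its own line under its Unicode name, so the source
never contains the invisible characters themselves.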



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 19:02                                   ` Gregory Heytings
@ 2021-11-03 19:46                                     ` Eli Zaretskii
  2021-11-03 19:58                                       ` Yuri Khan
  2021-11-03 20:21                                       ` Gregory Heytings
  2021-11-04  8:44                                     ` Juri Linkov
  1 sibling, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 19:46 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, yuri.v.khan, stefan, emacs-devel, db48x, monnier,
	juri

> Date: Wed, 03 Nov 2021 19:02:19 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Eli Zaretskii <eliz@gnu.org>, cpitclaudel@gmail.com, stefan@marxist.se, 
>     emacs-devel@gnu.org, db48x@db48x.net, monnier@iro.umontreal.ca, 
>     Yuri Khan <yuri.v.khan@gmail.com>
> 
> 
> >> Anyway, if one wants to be able to highlight certain characters on 
> >> display, one could also use highlight-regexp, I think.
> >
> > Or markchars.el with markchars-what customized to markchars-confusables.
> >
> 
> Neither would work AFAICS, because these characters are glyphless. 
> Highlighting a glyphless character will not make it more visible.

??? Of course, it will make it more visible: if the face has a
distinct background.  The "thin space" display looks like whitespace,
and whitespace can have background color to make it stand out.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 19:28                               ` Gregory Heytings
  2021-11-03 19:32                                 ` Stefan Monnier
@ 2021-11-03 19:51                                 ` Eli Zaretskii
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 19:51 UTC (permalink / raw)
  To: Gregory Heytings; +Cc: stefan, cpitclaudel, schwab, monnier, emacs-devel

> Date: Wed, 03 Nov 2021 19:28:04 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Andreas Schwab <schwab@linux-m68k.org>, Eli Zaretskii <eliz@gnu.org>, 
>     Stefan Kangas <stefan@marxist.se>, emacs-devel@gnu.org, 
>     cpitclaudel@gmail.com
> 
> And a screenshot of TUTORIAL.he, which doesn't seem unreadable at all (but 
> of course that feature would need to be turned off for beginners who read 
> TUTORIAL.he, it's meant for source code, not for prose).

Comments and strings in source code aren't different from prose and
what TUTORIAL.he needs to say.  It doesn't seem unreadable to you
because you cannot read it anyway.  Please believe me that those
arrows are jarring and confusing to anyone who does read the text.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 19:09                                   ` Eli Zaretskii
  2021-11-03 19:35                                     ` Yuri Khan
@ 2021-11-03 19:54                                     ` Daniel Brooks
  2021-11-03 20:08                                       ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Daniel Brooks @ 2021-11-03 19:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel, stefan, monnier, Yuri Khan

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Yuri Khan <yuri.v.khan@gmail.com>
>> Date: Thu, 4 Nov 2021 01:45:17 +0700
>> Cc: Daniel Brooks <db48x@db48x.net>, Clément Pit-Claudel <cpitclaudel@gmail.com>, 
>> 	Stefan Kangas <stefan@marxist.se>, Stefan Monnier <monnier@iro.umontreal.ca>, 
>> 	Emacs developers <emacs-devel@gnu.org>
>> 
>> On Thu, 4 Nov 2021 at 00:56, Eli Zaretskii <eliz@gnu.org> wrote:
>> 
>> > The problem with these remappings is that you then get to somehow
>> > discern between the remapped characters and the real characters which
>> > look identically on display.
>> 
>> Real characters are fontified as whichever syntax unit they belong to.
>> Remapped characters are fontified as whitespace-space-face or
>> whitespace-hspace-face depending on whether you add them to
>> whitespace-space-regexp or whitespace-hspace-regexp.
>
> I just used what Daniel posted, and that doesn't display the remapped
> characters in any distinct face.  Gotta tinker?

Yea, I intend to tinker in order to add a new category that has its own
face and can be toggled on and off separately and so on. I haven’t
actually started yet though.

> Do you read Hebrew?  Those characters look like line noise there,
> whereas the text with the default display is perfectly readable, and
> most people won't even know these controls are there (as intended).

My suggestion is to only enable it by default in _programming modes_. It
should remain disabled in ordinary prose like a TUTORIAL file.

> What for?  The absolute majority of people won't have any idea what is
> the effect of each of these controls, and how it differs from others.
> Even I many times need to talk myself through their effect on display.
> The UBA spec weighs in at more than 30 pages of highly technical text,
> and I don't expect people to memorize it by heart.

I totally agree, but I think that this is not very relevant. The whole
point is for a programmer who is unaware of BiDi in general to go “WTF‽”
when these characters show up in a source file one day, so that they can
have something to ask questions about.

`what-cursor-position' will show the face, once a face is available, and
it also shows the name of the character. Both are good ways for the user
to find more information, and in principle we could have it show other
information as well. We could pull a description from the Unicode
database perhaps, or just add extra help messages for individual
characters. Now that I think about it, maybe we should just show the
docstring for the face right there next to the name. That would save me
a step from time to time, if nothing else.

db48x



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 19:34                                 ` Andreas Schwab
@ 2021-11-03 19:54                                   ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 19:54 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: gregory, stefan, monnier, cpitclaudel, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>,  gregory@heytings.org,
>   stefan@marxist.se,  cpitclaudel@gmail.com,  emacs-devel@gnu.org
> Date: Wed, 03 Nov 2021 20:34:47 +0100
> 
> > Huh?  What's unclear in a line like this one:
> >
> >   (aset buffer-display-table ?‫ [?←])
> 
> Can you explain what `?([←?]' means?

What's to explain?  It's crystal clear.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 19:46                                     ` Eli Zaretskii
@ 2021-11-03 19:58                                       ` Yuri Khan
  2021-11-03 20:21                                       ` Gregory Heytings
  1 sibling, 0 replies; 172+ messages in thread
From: Yuri Khan @ 2021-11-03 19:58 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Clément Pit-Claudel, Stefan Kangas, Emacs developers,
	Daniel Brooks, Gregory Heytings, Stefan Monnier, Juri Linkov

On Thu, 4 Nov 2021 at 02:46, Eli Zaretskii <eliz@gnu.org> wrote:

> ??? Of course, it will make it more visible: if the face has a
> distinct background.  The "thin space" display looks like whitespace,
> and whitespace can have background color to make it stand out.

We could, in principle, say red is for LTR, blue is for RTL, and green
is for pop, and various shades and tints are for embedding and
override and isolate, and try to remember this mapping in addition to
remembering the embedding/override/isolate semantics, but some of us
will just prefer glyphs and deal with the occasional misalignment, and
also will someone please think of the colorblind.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 19:35                                     ` Yuri Khan
@ 2021-11-03 20:01                                       ` Eli Zaretskii
  2021-11-03 20:45                                         ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 20:01 UTC (permalink / raw)
  To: Yuri Khan; +Cc: db48x, cpitclaudel, stefan, monnier, emacs-devel

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Thu, 4 Nov 2021 02:35:04 +0700
> Cc: Daniel Brooks <db48x@db48x.net>, Clément Pit-Claudel <cpitclaudel@gmail.com>, 
> 	Stefan Kangas <stefan@marxist.se>, Stefan Monnier <monnier@iro.umontreal.ca>, 
> 	Emacs developers <emacs-devel@gnu.org>
> 
> On Thu, 4 Nov 2021 at 02:09, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > Do you read Hebrew?
> 
> No. I just imagine how I’d perceive the text if I could.

IME, imagination doesn't help here.

> > > One does not only want to highlight, but also to actually see and
> > > distinguish certain characters
> >
> > What for?  The absolute majority of people won't have any idea what is
> > the effect of each of these controls, and how it differs from others.
> > Even I many times need to talk myself through their effect on display.
> > The UBA spec weighs in at more than 30 pages of highly technical text,
> > and I don't expect people to memorize it by heart.
> 
> Most people, when in the reader role, probably won’t and shouldn’t have to.
> 
> If I’m editing a text in a bidi language, though, I am expected to use
> format control characters

Actually, it is quite rare to need those controls.  Most people who
write RTL scripts every day don't even know those controls exist.

> and so I must know where they are or are not.

Then what we have in glyphless-char-display-control is better, and
doesn't need any changes, just customization of format-control to
display as acronyms.  Consider:

  . you get these characters to stand out
  . they stand out, but in a somewhat subtle way, using a face that
    dims them
  . you clearly and unequivocally see which character is which -- no
    need to guess or remember what exactly does this or that arrow
    mean

> In the same vein, when I edit a program expected to conform to a
> coding style, I must know where spaces and tabs are, so I do not
> introduce whitespace-only changes or trailing blanks and keep
> indentation consistent. Or when I edit anything that will end up as a
> web page I want to know which spaces and hyphens are non-breaking, so
> the page will wrap correctly no matter how the user resizes their
> window and/or zooms the page. (No, I do not trust tools to do these
> things right; if they could, we would not need format control
> characters at all. I like tools to let me check what they did and
> correct if necessary.)

You seem to have some very unusual needs.  I find it hard to believe
that they are representative.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 19:54                                     ` Daniel Brooks
@ 2021-11-03 20:08                                       ` Eli Zaretskii
  2021-11-04  6:00                                         ` Daniel Brooks
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 20:08 UTC (permalink / raw)
  To: Daniel Brooks; +Cc: cpitclaudel, emacs-devel, stefan, monnier, yuri.v.khan

> From: Daniel Brooks <db48x@db48x.net>
> Cc: Yuri Khan <yuri.v.khan@gmail.com>,  cpitclaudel@gmail.com,
>   stefan@marxist.se,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Wed, 03 Nov 2021 12:54:31 -0700
> 
> > Do you read Hebrew?  Those characters look like line noise there,
> > whereas the text with the default display is perfectly readable, and
> > most people won't even know these controls are there (as intended).
> 
> My suggestion is to only enable it by default in _programming modes_. It
> should remain disabled in ordinary prose like a TUTORIAL file.

What about comments and strings?  Are we going to pretend that RTL
scripts aren't used in those?

> > What for?  The absolute majority of people won't have any idea what is
> > the effect of each of these controls, and how it differs from others.
> > Even I many times need to talk myself through their effect on display.
> > The UBA spec weighs in at more than 30 pages of highly technical text,
> > and I don't expect people to memorize it by heart.
> 
> I totally agree, but I think that this is not very relevant. The whole
> point is for a programmer who is unaware of BiDi in general to go “WTF‽”
> when these characters show up in a source file one day, so that they can
> have something to ask questions about.
> 
> `what-cursor-position' will show the face, once a face is available, and
> it also shows the name of the character. Both are good ways for the user
> to find more information, and in principle we could have it show other
> information as well. We could pull a description from the Unicode
> database perhaps, or just add extra help messages for individual
> characters. Now that I think about it, maybe we should just show the
> docstring for the face right there next to the name. That would save me
> a step from time to time, if nothing else.

You are welcome to make such customizations in your Emacs.  My point
is that for a useful feature that doesn't get in the way when those
controls are used for legitimate purposes, and only highlights _text_
(NOT the controls!) whose appearance may have been altered by them for
questionable or suspicious reasons -- for such a useful feature what
you propose is not enough for having it in Emacs for everyone.  It is
a blunt weapon that I would be ashamed to install.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 19:32                                 ` Stefan Monnier
  2021-11-03 19:41                                   ` Yuri Khan
@ 2021-11-03 20:12                                   ` Gregory Heytings
  2021-11-03 22:03                                     ` Gregory Heytings
  1 sibling, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 20:12 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Stefan Kangas, Eli Zaretskii, Andreas Schwab, cpitclaudel,
	emacs-devel


>> +   '(("\N{LEFT-TO-RIGHT EMBEDDING}\\|\N{RIGHT-TO-LEFT EMBEDDING}\\|\
>> +\N{LEFT-TO-RIGHT OVERRIDE}\\|\N{RIGHT-TO-LEFT OVERRIDE}\\|\
>> +\N{LEFT-TO-RIGHT ISOLATE}\\|\N{RIGHT-TO-LEFT ISOLATE}\\|\
>> +\N{FIRST STRONG ISOLATE}\\|\N{POP DIRECTIONAL FORMATTING}\\|\
>> +\N{POP DIRECTIONAL ISOLATE}" . (0 'font-lock-warning-face t)))))
>
> A [...] would be a lot more efficient than this "...\\|...\\|...\\|...".
>
>> +(defun bidi-reordering-character-toggle-visibility ()
>> +  "Toggle the visibility of bidi reordering characters."
>> +  (interactive)
>> +  (setq bidi-reordering-characters-visible
>> +        (not bidi-reordering-characters-visible))
>
> Aka
>
>    (define-minor-mode bidi-reordering-characters-visible
>      "Make the bidi reordering characters visible."
>      :global t
>      ...)
>    (define-obsolete-function-alias
>      'bidi-reordering-character-toggle-visibility
>      #'bidi-reordering-characters-visible "...")
>

Thanks for your comments!  Indeed and indeed.  I did not spend enough time 
tweaking that code, given that Eli already said he doesn't want it.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 19:46                                     ` Eli Zaretskii
  2021-11-03 19:58                                       ` Yuri Khan
@ 2021-11-03 20:21                                       ` Gregory Heytings
  2021-11-03 20:31                                         ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 20:21 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, yuri.v.khan, stefan, emacs-devel, db48x, monnier,
	juri


>>>> Anyway, if one wants to be able to highlight certain characters on 
>>>> display, one could also use highlight-regexp, I think.
>>>
>>> Or markchars.el with markchars-what customized to 
>>> markchars-confusables.
>>
>> Neither would work AFAICS, because these characters are glyphless. 
>> Highlighting a glyphless character will not make it more visible.
>
> ??? Of course, it will make it more visible: if the face has a distinct 
> background.  The "thin space" display looks like whitespace, and 
> whitespace can have background color to make it stand out.
>

I tried various of the predefined colors with highlight-regexp, of course 
those with a distinct background, and none make any of those characters 
more visible.  Not even with a single pixel wide bar.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 20:21                                       ` Gregory Heytings
@ 2021-11-03 20:31                                         ` Eli Zaretskii
  2021-11-03 21:16                                           ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 20:31 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, yuri.v.khan, stefan, emacs-devel, db48x, monnier,
	juri

> Date: Wed, 03 Nov 2021 20:21:50 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: juri@linkov.net, cpitclaudel@gmail.com, stefan@marxist.se, 
>     emacs-devel@gnu.org, db48x@db48x.net, monnier@iro.umontreal.ca, 
>     yuri.v.khan@gmail.com
> 
> 
> >>>> Anyway, if one wants to be able to highlight certain characters on 
> >>>> display, one could also use highlight-regexp, I think.
> >>>
> >>> Or markchars.el with markchars-what customized to 
> >>> markchars-confusables.
> >>
> >> Neither would work AFAICS, because these characters are glyphless. 
> >> Highlighting a glyphless character will not make it more visible.
> >
> > ??? Of course, it will make it more visible: if the face has a distinct 
> > background.  The "thin space" display looks like whitespace, and 
> > whitespace can have background color to make it stand out.
> >
> 
> I tried various of the predefined colors with highlight-regexp, of course 
> those with a distinct background, and none make any of those characters 
> more visible.  Not even with a single pixel wide bar.

That's very strange, because I see them even if I make the region span
only a single such character.  The default background of the region
face is quite pale, and still I see them quite clearly, even with the
default light theme.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 20:01                                       ` Eli Zaretskii
@ 2021-11-03 20:45                                         ` Gregory Heytings
  2021-11-03 20:53                                           ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 20:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, Yuri Khan


>
> Then what we have in glyphless-char-display-control is better, and 
> doesn't need any changes, just customization of format-control to 
> display as acronyms.
>

I must really be missing something, but using buffer-display-table (or 
standard-display-table) does not need any changes whatsoever either.

But now I get what you mean, I thought you were talking about the 
glyphless-char-display table, but you were talking about the 
glyphless-char-display-control variable, which would be set by:

(custom-set-variables '(glyphless-char-display-control '((format-control . hex-code) (no-font . hex-code))))

or:

(custom-set-variables '(glyphless-char-display-control '((format-control . acronym) (no-font . hex-code))))

.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 20:45                                         ` Gregory Heytings
@ 2021-11-03 20:53                                           ` Eli Zaretskii
  2021-11-03 21:23                                             ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-03 20:53 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan

> Date: Wed, 03 Nov 2021 20:45:14 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Yuri Khan <yuri.v.khan@gmail.com>, db48x@db48x.net, cpitclaudel@gmail.com, 
>     stefan@marxist.se, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > Then what we have in glyphless-char-display-control is better, and 
> > doesn't need any changes, just customization of format-control to 
> > display as acronyms.
> 
> I must really be missing something, but using buffer-display-table (or 
> standard-display-table) does not need any changes whatsoever either.

It needs the additional code you presented (or minor mode mentioned by
Stefan).

> But now I get what you mean, I thought you were talking about the 
> glyphless-char-display table, but you were talking about the 
> glyphless-char-display-control variable, which would be set by:
> 
> (custom-set-variables '(glyphless-char-display-control '((format-control . hex-code) (no-font . hex-code))))
> 
> or:
> 
> (custom-set-variables '(glyphless-char-display-control '((format-control . acronym) (no-font . hex-code))))

Yes, that's what I meant.  Sorry for not being more clear.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 17:56                               ` Eli Zaretskii
  2021-11-03 18:20                                 ` Juri Linkov
  2021-11-03 18:45                                 ` Yuri Khan
@ 2021-11-03 21:13                                 ` Daniel Brooks
  2021-11-04  6:52                                   ` Eli Zaretskii
  2 siblings, 1 reply; 172+ messages in thread
From: Daniel Brooks @ 2021-11-03 21:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel, stefan, monnier, Yuri Khan

Eli Zaretskii <eliz@gnu.org> writes:

>> >      (space-mark #x2068 [#x21A7]) ; ↧ FIRST STRONG ISOLATE
>
> This one doesn't make sense, at least if one knows what FSI means and
> does.

Yea, I’m open to suggestions.

db48x



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 19:36                               ` Eli Zaretskii
@ 2021-11-03 21:15                                 ` Manuel Giraud
  2021-11-04  6:56                                   ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Manuel Giraud @ 2021-11-03 21:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gregory, emacs-devel, stefan, cpitclaudel, monnier

Eli Zaretskii <eliz@gnu.org> writes:

> Try it on the text of this message.
>
>   ‮madam deified kayak‬
>
> And then tell me: Who worshiped whom?

OK. On this one, bidi-find-overridden-directionality returns a position,
but I get nil on Clément's example:

if access_level != "user‮ ⁦// Check if admin⁩ ⁦" {

which is the kind of overridden directionality we should have a warning
on, no?
-- 
Manuel Giraud



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 20:31                                         ` Eli Zaretskii
@ 2021-11-03 21:16                                           ` Gregory Heytings
  2021-11-04  7:16                                             ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 21:16 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: juri, cpitclaudel, stefan, emacs-devel, db48x, monnier,
	yuri.v.khan

[-- Attachment #1: Type: text/plain, Size: 767 bytes --]


>> I tried various of the predefined colors with highlight-regexp, of 
>> course those with a distinct background, and none make any of those 
>> characters more visible.  Not even with a single pixel wide bar.
>
> That's very strange, because I see them even if I make the region span 
> only a single such character.  The default background of the region face 
> is quite pale, and still I see them quite clearly, even with the default 
> light theme.
>

Here's a screenshot with emacs -Q (current trunk).  I did M-x 
highlight-regexp RET y RET hi-green RET M-x highlight-regexp <regexp with 
all reordering characters> RET hi-green RET.  There's one reordering 
character between each pair of "abcdefghij" strings.  Even after zooming 
that picture, I see absolutely nothing.

[-- Attachment #2: Type: image/png, Size: 4623 bytes --]

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 20:53                                           ` Eli Zaretskii
@ 2021-11-03 21:23                                             ` Gregory Heytings
  2021-11-04  6:58                                               ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 21:23 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan


>> I must really be missing something, but using buffer-display-table (or 
>> standard-display-table) does not need any changes whatsoever either.
>
> It needs the additional code you presented (or minor mode mentioned by 
> Stefan).
>

Well, the glyphless-char-display-control solution also needs additional 
code: at least the custom-set-variables line, a few lines to update the 
acronyms of several of those characters, and quite a few lines to make it 
possible to do that in a buffer-local way (because we don't want to see 
these tofus everywhere, e.g. in TUTORIAL.he).  And IMO the result (a tofu) 
makes it less clear that "here's a potential danger".




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 20:12                                   ` Gregory Heytings
@ 2021-11-03 22:03                                     ` Gregory Heytings
  2021-11-04  8:50                                       ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-03 22:03 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Andreas Schwab, Eli Zaretskii, Stefan Kangas, cpitclaudel,
	emacs-devel

[-- Attachment #1: Type: text/plain, Size: 188 bytes --]


>
> I did not spend enough time tweaking that code, given that Eli already 
> said he doesn't want it.
>

But I could not resist, I implemented your suggestions anyway ;-)  Thanks 
again!

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-diff; name=Make-bidi-reordering-characters-visible.patch, Size: 3445 bytes --]

From 8a4eda2e417c0d7d4ca4489b9c025258d53f20c1 Mon Sep 17 00:00:00 2001
From: Gregory Heytings <gregory@heytings.org>
Date: Wed, 3 Nov 2021 21:54:03 +0000
Subject: [PATCH] Make bidi reordering characters visible

* lisp/progmodes/prog-mode.el (bidi-reordering-characters-visible):
New minor mode.
(bidi-reordering-characters-visible--fontify,
bidi-reordering-characters-visible--toggle): New helper functions.
(prog-mode): Enable the new minor mode.
---
 lisp/progmodes/prog-mode.el | 43 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/lisp/progmodes/prog-mode.el b/lisp/progmodes/prog-mode.el
index db350a5f70..8c863f56a7 100644
--- a/lisp/progmodes/prog-mode.el
+++ b/lisp/progmodes/prog-mode.el
@@ -293,6 +293,46 @@ turn-on-prettify-symbols-mode
 (define-globalized-minor-mode global-prettify-symbols-mode
   prettify-symbols-mode turn-on-prettify-symbols-mode)
 
+(defun bidi-reordering-characters-visible--fontify ()
+  "Fontify bidi reordering characters with `font-lock-warning-face'."
+  (font-lock-add-keywords
+   nil
+   '(("[\N{left-to-right embedding}\N{right-to-left embedding}\
+\N{left-to-right override}\N{right-to-left override}\
+\N{left-to-right isolate}\N{right-to-left isolate}\
+\N{first strong isolate}\N{pop directional formatting}\
+\N{pop directional isolate}]" . (0 'font-lock-warning-face t)))))
+
+(defun bidi-reordering-character-visible--toggle ()
+  "Toggle the visibility of bidi reordering characters."
+  (let ((v bidi-reordering-characters-visible)
+        (bdt buffer-display-table))
+    (aset bdt ?\N{left-to-right embedding} (if v [?→] nil))
+    (aset bdt ?\N{right-to-left embedding} (if v [?←] nil))
+    (aset bdt ?\N{left-to-right override} (if v [?→] nil))
+    (aset bdt ?\N{right-to-left override} (if v [?←] nil))
+    (aset bdt ?\N{left-to-right isolate} (if v [?→] nil))
+    (aset bdt ?\N{right-to-left isolate} (if v [?←] nil))
+    (aset bdt ?\N{first strong isolate} (if v [?↓] nil))
+    (aset bdt ?\N{pop directional formatting} (if v [?↑] nil))
+    (aset bdt ?\N{pop directional isolate} (if v [?↑] nil))))
+
+;;;###autoload
+(define-minor-mode bidi-reordering-characters-visible
+  "Make the bidi reordering characters visible."
+  :init-value nil
+  (if bidi-reordering-characters-visible
+      (progn
+          (setq buffer-display-table (or buffer-display-table
+                                         standard-display-table
+                                         (make-display-table)))
+          (bidi-reordering-character-visible--toggle)
+          (add-hook 'font-lock-mode-hook
+                    #'bidi-reordering-characters-visible--fontify))
+    (bidi-reordering-character-visible--toggle)
+    (remove-hook 'font-lock-mode-hook
+                 #'bidi-reordering-characters-visible--fontify)))
+
 ;;;###autoload
 (define-derived-mode prog-mode fundamental-mode "Prog"
   "Major mode for editing programming language source code."
@@ -300,7 +340,8 @@ prog-mode
   (setq-local parse-sexp-ignore-comments t)
   (add-hook 'context-menu-functions 'prog-context-menu 10 t)
   ;; Any programming language is always written left to right.
-  (setq bidi-paragraph-direction 'left-to-right))
+  (setq bidi-paragraph-direction 'left-to-right)
+  (bidi-reordering-characters-visible))
 
 (provide 'prog-mode)
 
-- 
2.33.0



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 20:08                                       ` Eli Zaretskii
@ 2021-11-04  6:00                                         ` Daniel Brooks
  2021-11-04  7:44                                           ` Eli Zaretskii
  2021-11-04 19:05                                           ` Stefan Monnier
  0 siblings, 2 replies; 172+ messages in thread
From: Daniel Brooks @ 2021-11-04  6:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, yuri.v.khan, stefan, monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 3500 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Daniel Brooks <db48x@db48x.net>
>> Cc: Yuri Khan <yuri.v.khan@gmail.com>,  cpitclaudel@gmail.com,
>>   stefan@marxist.se,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
>> Date: Wed, 03 Nov 2021 12:54:31 -0700
>> 
>> > Do you read Hebrew?  Those characters look like line noise there,
>> > whereas the text with the default display is perfectly readable, and
>> > most people won't even know these controls are there (as intended).
>> 
>> My suggestion is to only enable it by default in _programming modes_. It
>> should remain disabled in ordinary prose like a TUTORIAL file.
>
> What about comments and strings?  Are we going to pretend that RTL
> scripts aren't used in those?

Of course it will show them in the comments and strings. That’s where
the problem is.

> You are welcome to make such customizations in your Emacs.  My point
> is that for a useful feature that doesn't get in the way when those
> controls are used for legitimate purposes, and only highlights _text_
> (NOT the controls!) whose appearance may have been altered by them for
> questionable or suspicious reasons -- for such a useful feature what
> you propose is not enough for having it in Emacs for everyone.  It is
> a blunt weapon that I would be ashamed to install.

Ok, it is helpful to know your thoughts on the matter.

However, your suggestion of highlighting the text affected by the bidi
override characters while not actually showing those characters visibly
is not something that I would care to use. It shows that there may be a
problem without showing what the cause is. The cause is the presence of
certain characters, and I must be able to see those characters in order
to fix the problem, or even to judge whether there is a problem at
all. Anything short of that is useless to me, and I suspect to many
others as well. Do you hide the tags when you write HTML? Do you hide
the parentheses when you write Lisp? Or the semicolons when you write C?
This is no different.

Furthermore, I have not suggested that showing the characters needs to
preclude any other form of highlighting. If you wish to develop some
additional way of warning the developer, please do so.

However, I suspect that the compilers for most languages currently in
active development will develop their own warnings and error messages as
well. We have plenty of ways for those messages to show up inside Emacs
as highlights.

Rust, for example, has already done so. Here’s an example:

    error: unicode codepoint changing visible direction of text present in comment
      --> src/pathmap/path.rs:10:5
       |
    10 |     /* } if is_admin  begin admins only */
       |     ^^-^^-^^^^^^^^^^--^^^^^^^^^^^^^^^^^^^^
       |     | |  |          ||
       |     | |  |          |'\u{2066}'
       |     | |  |          '\u{2069}'
       |     | |  '\u{2066}'
       |     | '\u{202e}'
       |     this comment contains invisible unicode text flow control codepoints
       |
       = note: `#[deny(text_direction_codepoint_in_comment)]` on by default
       = note: these kind of unicode codepoints change the way text flows on applications that support them, but can cause confusion because they change the order of characters on the screen
       = help: if their presence wasn't intentional, you can remove them

Naturally that already shows up inside of Emacs just fine; see the
attached image.

db48x

[-- Attachment #2: screenshot of a highlighted error inside Emacs --]
[-- Type: image/png, Size: 25995 bytes --]


* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 21:13                                 ` Daniel Brooks
@ 2021-11-04  6:52                                   ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  6:52 UTC (permalink / raw)
  To: Daniel Brooks; +Cc: cpitclaudel, emacs-devel, stefan, monnier, yuri.v.khan

> From: Daniel Brooks <db48x@db48x.net>
> Cc: Yuri Khan <yuri.v.khan@gmail.com>,  cpitclaudel@gmail.com,
>   stefan@marxist.se,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Wed, 03 Nov 2021 14:13:01 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> >      (space-mark #x2068 [#x21A7]) ; ↧ FIRST STRONG ISOLATE
> >
> > This one doesn't make sense, at least if one knows what FSI means and
> > does.
> 
> Yea, I’m open to suggestions.

I already made one: use the characters' acronyms (they are all
3-letter, so not too long), preferably through
glyphless-char-display-control.

But if you must use arrows for some reason I cannot understand, use an
arrow that goes both ways, left and right.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 21:15                                 ` Manuel Giraud
@ 2021-11-04  6:56                                   ` Eli Zaretskii
  2021-11-04 19:04                                     ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  6:56 UTC (permalink / raw)
  To: Manuel Giraud; +Cc: gregory, emacs-devel, stefan, cpitclaudel, monnier

> From: Manuel Giraud <manuel@ledu-giraud.fr>
> Cc: gregory@heytings.org,  cpitclaudel@gmail.com,  stefan@marxist.se,
>   monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Wed, 03 Nov 2021 22:15:50 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Try it on the text of this message.
> >
> >   ‮madam deified kayak‬
> >
> > And then tell me: Who worshiped whom?
> 
> OK. On this one, bidi-find-overridden-directionality returns a position,
> but I get nil on Clément's example:
> 
> if access_level != "user‮ ⁦// Check if admin⁩ ⁦" {
> 
> which is the kind of overridden directionality we should have a warning
> on, no?

Did you read the doc string of bidi-find-overridden-directionality,
which explains what kind of overrides it looks for?  I already said
that it will have to be extended to cover the examples from that
paper.  Those examples override the directionality of punctuation
characters.  By contrast, the original intent of this function was to
detect reordering-caused confusions in URLs, where punctuation
characters don't happen, and if they do, they are not the stuff which
malevolent parties want to reorder.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 21:23                                             ` Gregory Heytings
@ 2021-11-04  6:58                                               ` Eli Zaretskii
  2021-11-04  8:53                                                 ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  6:58 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan

> Date: Wed, 03 Nov 2021 21:23:34 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: yuri.v.khan@gmail.com, db48x@db48x.net, cpitclaudel@gmail.com, 
>     stefan@marxist.se, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> >> I must really be missing something, but using buffer-display-table (or 
> >> standard-display-table) does not need any changes whatsoever either.
> >
> > It needs the additional code you presented (or minor mode mentioned by 
> > Stefan).
> 
> Well, the glyphless-char-display-control solution also needs additional 
> code: at least the custom-set-variables line, a few lines to update the 
> acronyms of several of those characters

No, it only needs the interested user to customize that option and
save the customizations.
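
For illustration only, the saved customization could look roughly like
the sketch below.  It assumes that, in the Emacs version at hand, the
bidi formatting controls belong to the `format-control' group of
glyphless-char-display-control; the group names may differ between
versions, so check C-h v glyphless-char-display-control first.

```elisp
;; Sketch: display format-control characters (assumed here to include
;; the bidi controls) as their acronyms in a small box, while keeping
;; the other groups at their current settings.
(customize-set-variable
 'glyphless-char-display-control
 (cons '(format-control . acronym)
       (assq-delete-all 'format-control
                        (copy-alist glyphless-char-display-control))))
```

Using customize-set-variable rather than a plain setq matters here,
because the option's setter function is what updates the display.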




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-03 21:16                                           ` Gregory Heytings
@ 2021-11-04  7:16                                             ` Eli Zaretskii
  2021-11-04  9:06                                               ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  7:16 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: juri, cpitclaudel, stefan, emacs-devel, db48x, monnier,
	yuri.v.khan

> Date: Wed, 03 Nov 2021 21:16:56 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: cpitclaudel@gmail.com, yuri.v.khan@gmail.com, stefan@marxist.se, 
>     emacs-devel@gnu.org, db48x@db48x.net, monnier@iro.umontreal.ca, 
>     juri@linkov.net
> 
> >> I tried various of the predefined colors with highlight-regexp, of 
> >> course those with a distinct background, and none make any of those 
> >> characters more visible.  Not even with a single pixel wide bar.
> >
> > That's very strange, because I see them even if I make the region span 
> > only a single such character.  The default background of the region face 
> > is quite pale, and still I see them quite clearly, even with the default 
> > light theme.
> >
> 
> Here's a screenshot with emacs -Q (current trunk).  I did M-x 
> highlight-regexp RET y RET hi-green RET M-x highlight-regexp <regexp with 
> all reordering characters> RET hi-green RET.  There's one reordering 
> character between each of the "abcdefghij" string.  I zoomed that picture, 
> I see absolutely nothing.

If you configured highlight-regexp to highlight only the formatting
controls, how come 'y' is highlighted in green on the image you sent?
Is 'y' one of the characters that are supposed to be highlighted?

Or maybe your configuration of highlight-regexp was incorrect?  Or
could there be some subtle bug/misfeature in highlight-regexp (I
didn't try it myself)?

If you shift-highlight one of these formatting control characters, and
only one such character, don't you see a thin whitespace shown in the
background color of the region face?




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  6:00                                         ` Daniel Brooks
@ 2021-11-04  7:44                                           ` Eli Zaretskii
  2021-11-04  9:14                                             ` Gregory Heytings
  2021-11-05  2:23                                             ` Daniel Brooks
  2021-11-04 19:05                                           ` Stefan Monnier
  1 sibling, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  7:44 UTC (permalink / raw)
  To: Daniel Brooks; +Cc: cpitclaudel, yuri.v.khan, stefan, monnier, emacs-devel

> From: Daniel Brooks <db48x@db48x.net>
> Cc: cpitclaudel@gmail.com,  emacs-devel@gnu.org,  stefan@marxist.se,
>   monnier@iro.umontreal.ca,  yuri.v.khan@gmail.com
> Date: Wed, 03 Nov 2021 23:00:28 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: Daniel Brooks <db48x@db48x.net>
> >> Cc: Yuri Khan <yuri.v.khan@gmail.com>,  cpitclaudel@gmail.com,
> >>   stefan@marxist.se,  monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> >> Date: Wed, 03 Nov 2021 12:54:31 -0700
> >> 
> >> > Do you read Hebrew?  Those characters look like line noise there,
> >> > whereas the text with the default display is perfectly readable, and
> >> > most people won't even know these controls are there (as intended).
> >> 
> >> My suggestion is to only enable it by default in _programming modes_. It
> >> should remain disabled in ordinary prose like a TUTORIAL file.
> >
> > What about comments and strings?  Are we going to pretend that RTL
> > scripts aren't used in those?
> 
> Of course it will show them in the comments and strings.

Then this visual noise will get in the way of people's reading those
comments and strings, and, for strings, will make it very hard to
understand what will be presented to the user when those strings are
output in some UI.

> That’s where the problem is.

No, the problem is elsewhere entirely: it's in the punctuation
characters unrelated to strings and comments whose directionality is
overridden, and which thus display in places that cause incorrect
visual interpretation of the program during a casual read.

> > You are welcome to make such customizations in your Emacs.  My point
> > is that for a useful feature that doesn't get in the way when those
> > controls are used for legitimate purposes, and only highlights _text_
> > (NOT the controls!) whose appearance may have been altered by them for
> > questionable or suspicious reasons -- for such a useful feature what
> > you propose is not enough for having it in Emacs for everyone.  It is
> > a blunt weapon that I would be ashamed to install.
> 
> Ok, it is helpful to know your thoughts on the matter.
> 
> However, your suggestion of highlighting the text affected by the bidi
> override characters while not actually showing those characters visibly
> is not something that I would care to use. It shows that there may be a
> problem without showing what the cause is. The cause is the presence of
> certain characters, and I must be able to see those characters in order
> to fix the problem, or even to judge whether there is a problem at
> all.

You misunderstand the cause.  The mere presence of these characters is
NOT the root cause.  These characters are legitimate and helpful when
used as intended.  See TUTORIAL.he for a pertinent example.

The real cause is that these characters are used with the explicit
intent of changing the visual presentation of some code fragment or an
identifier in source code or in a URL.  The challenge, therefore, is
not to make these characters stand out wherever they happen, because
that would also flag their legitimate uses for no good reason.  The
challenge is to flag only those suspicious or malicious uses of these
characters.  And that cannot be done by just changing the visual
appearance of those characters, because their legitimate uses are by
far more frequent than their malicious uses.  To flag only the
suspicious cases, the code which does that needs to examine the
details of the text whose directionality was overridden and detect
those cases where such overriding is suspicious.  For example, when a
character with a strong left-to-right directionality has its
directionality overridden to behave like a right-to-left character, that
is highly suspicious, because it makes no sense to do that in 99.99%
of valid use cases.
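
To make that last check concrete, here is a minimal sketch.  It is not
the code of bidi-find-overridden-directionality and not the full UAX#9
algorithm: the function name is made up for illustration, and only the
stack of explicit formatting characters is modelled, for strong
left-to-right characters under a right-to-left override.

```elisp
(defun my-bidi-forced-rtl-positions (string)
  "Return indices in STRING of strong-LTR characters forced to display as RTL.
Hypothetical helper: a simplified model of the directional status
stack, not the full UAX#9 algorithm."
  (let ((stack nil)    ; entries are (KIND . OVERRIDE)
        (hits nil)
        (i 0))
    (while (< i (length string))
      (let ((ch (aref string i)))
        (cond
         ((eq ch #x202E) (push '(embed . rtl) stack))        ; RLO
         ((eq ch #x202D) (push '(embed . ltr) stack))        ; LRO
         ((memq ch '(#x202A #x202B)) (push '(embed) stack))  ; LRE, RLE
         ((memq ch '(#x2066 #x2067 #x2068))                  ; LRI, RLI, FSI
          (push '(isolate) stack))
         ((eq ch #x202C)                                     ; PDF
          ;; PDF only closes an embedding or override, never an isolate.
          (when (eq (car (car stack)) 'embed)
            (pop stack)))
         ((eq ch #x2069)                                     ; PDI
          ;; PDI closes the nearest isolate and everything opened inside it.
          (when (assq 'isolate stack)
            (while (not (eq (car (car stack)) 'isolate))
              (pop stack))
            (pop stack)))
         ((eq ch ?\n) (setq stack nil))                      ; new paragraph
         ((and (eq (cdr (car stack)) 'rtl)
               (eq (get-char-code-property ch 'bidi-class) 'L))
          (push i hits))))
      (setq i (1+ i)))
    (nreverse hits)))

;; "a", RLO, "bc", PDF, "d": only "b" and "c" are under the override.
(my-bidi-forced-rtl-positions "a\u202Ebc\u202Cd")   ; => (2 3)
```

Note that an isolate resets the override for the text inside it, so a
check of this kind stays silent when only isolates and neutral
characters (spaces, punctuation) sit directly under the override.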

> Anything short of that is useless to me, and I suspect to many
> others as well. Do you hide the tags when you write HTML? Do you hide
> the parentheses when you write Lisp? Or the semicolons when you write C?
> This is no different.

This is VERY different, for the reasons I explained above.  What you
suggest will have a very low signal-to-noise ratio, so having such a
feature in Emacs in general is a bad idea.  And people who for some
reason still want to have that noise in their face can simply
customize glyphless-char-display-control to show those characters as
their acronyms in a small box.

> Furthermore, I have not suggested that showing the characters needs to
> preclude any other form of highlighting. If you wish to develop some
> additional way of warning the developer, please do so.

We are talking about what should be in Emacs.  What you suggest
shouldn't.

> However, I suspect that the compilers for most languages currently in
> active development will develop their own warnings and error messages as
> well. We have plenty of ways for those messages to show up inside Emacs
> as highlights.

That's a tangent.  We are discussing what Emacs should do as a
programmer's editor to flag such suspicious code.  That shouldn't need
a compiler if we can do the job ourselves.  And we can.

> Rust, for example, has already done so. Here’s an example:
> 
>     error: unicode codepoint changing visible direction of text present in comment
>       --> src/pathmap/path.rs:10:5
>        |
>     10 |     /* } if is_admin  begin admins only */
>        |     ^^-^^-^^^^^^^^^^--^^^^^^^^^^^^^^^^^^^^
>        |     | |  |          ||
>        |     | |  |          |'\u{2066}'
>        |     | |  |          '\u{2069}'
>        |     | |  '\u{2066}'
>        |     | '\u{202e}'
>        |     this comment contains invisible unicode text flow control codepoints
>        |
>        = note: `#[deny(text_direction_codepoint_in_comment)]` on by default
>        = note: these kind of unicode codepoints change the way text flows on applications that support them, but can cause confusion because they change the order of characters on the screen
>        = help: if their presence wasn't intentional, you can remove them

Since the Rust compiler evidently does this when it finds these
characters inside comments (and probably also inside strings), IMNSHO
this is a terrible misfeature, because it means code that uses those
controls in legitimate ways cannot be compiled without tweaking
non-default options.  That's a cop-out, not the way to flag the
problematic cases.

> Naturally that already shows up inside of Emacs just fine; see the
> attached image.

I think this is terrible.  At best, it only tells you that something
non-trivial goes on here (but what exactly?).  At worst, it looks like
corruption of the source.  And while in the malicious case treating
that as corruption is not such a bad idea, all the valid uses of these
characters will also look like corruption.  Which means the cure is
probably worse than the disease, because the malicious cases are a
tiny fraction of the valid ones.

It's the same kind of "solution" as the airport security after 9/11:
because there was a bunch of terrorists, we are all now suspect as
potential terrorists, and for that reason we are constantly delayed
for hours and humiliated by endless frisking.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 15:43     ` Stefan Monnier
@ 2021-11-04  7:50       ` Reini Urban
  2021-11-04  8:21         ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Reini Urban @ 2021-11-04  7:50 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2526 bytes --]

On Wed, Nov 3, 2021 at 4:43 PM Stefan Monnier <monnier@iro.umontreal.ca>
wrote:

> > No, this summary is awful.
> > The issue is that libc, the C standard committee, linux and most others
> > are ignoring the unicode identifier security guidelines.
> > Identifiers must be identifiable, but strings should not be touched.
>
> What do those rules say about code like:
>
>     int hi = 5;
>     int שָׁלוֹם = hi;
>     int hello = 10;
>     int السّلامعليك = hello;
>     myfun(שָׁלוֹם ,السّلامعليكم)
>
> IMO this code is fundamentally valid: we should allow
> programmers to write identifiers in their native tongue.
>

Sure, nobody wants to forbid Unicode identifiers. The rules only ensure
that identifiers stay identifiable.
I converted it to Perl (because I dislike Java or Rust), and ran it through
cperl.
The problem is that from an innocent look or code review you won't see any
problem, hence the security risk.
You need to adjust your tools.

But the very first RTL identifier שָׁלוֹם already contains non-identifier
characters.
So I cannot tell you whether this code violates any of the 4 Unicode
mixed-script profiles
(http://www.unicode.org/reports/tr39/#Mixed_Script_Detection 2-5),
or whether any of the unreadable characters are from the recommended scripts
(https://www.unicode.org/reports/tr31/#Table_Recommended_Scripts, so no
exotic or antique scripts).

http://perl11.github.io/cperl/perldata.html#Identifier-parsing


$hi = 5;
$שָׁלוֹם = $hi;
$hello = 10;
$السّلامعليك = $hello;
myfun($שָׁלוֹם, $السّلامعليك);

=> od -c
0000000   $   h   i       =       5   ;  \n   $ 327 251 326 270 327 201
0000020 327 234 327 225 326 271 327 235       =       $   h   i   ;  \n
0000040   $   h   e   l   l   o       =       1   0   ;  \n   $ 330 247
0000060 331 204 330 263 331 221 331 204 330 247 331 205 330 271 331 204
0000100 331 212 331 203       =       $   h   e   l   l   o   ;  \n   m
0000120   y   f   u   n   (   $ 327 251 326 270 327 201 327 234 327 225
0000140 326 271 327 235   ,       $ 330 247 331 204 330 263 331 221 331
0000160 204 330 247 331 205 330 271 331 204 331 212 331 203   )   ;  \n
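
The same inspection can be done from within Emacs.  As a small sketch
(the helper name is invented for illustration): the bytes
327 251 326 270 327 201 327 234 327 225 326 271 327 235 in the dump
above decode to the seven code points spelled out below, and
get-char-code-property reports the general category of each.

```elisp
(defun my-describe-identifier (identifier)
  "Return a list of (CODEPOINT GENERAL-CATEGORY SCRIPT) for IDENTIFIER.
Hypothetical helper, meant for inspection only."
  (mapcar (lambda (ch)
            (list (format "U+%04X" ch)
                  (get-char-code-property ch 'general-category)
                  (aref char-script-table ch)))
          identifier))

;; The first RTL identifier from the dump, written with escapes so
;; that no character is hidden by the display.
(my-describe-identifier "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD")
```

The entries with general category Mn are the combining vowel points;
whether they are accepted inside an identifier depends on the rules of
the language in question.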


> Do the security guidelines require override chars to force the
> `, ` to be in LTR, so as to fix the ordering problem (and would the
> result be more or less clear to someone familiar with those RTL
> scripts ;-0 )?
>
>
>         Stefan
>
>

-- 
Reini Urban



* Re: Unicode confusables and reordering characters considered harmful
  2021-11-04  7:50       ` Reini Urban
@ 2021-11-04  8:21         ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  8:21 UTC (permalink / raw)
  To: Reini Urban; +Cc: monnier, emacs-devel

> From: Reini Urban <reini.urban@gmail.com>
> Date: Thu, 4 Nov 2021 08:50:14 +0100
> Cc: emacs-devel@gnu.org
> 
>      int hi = 5;
>      int שָׁלוֹם = hi;
>      int hello = 10;
>      int السّلامعليك = hello;
>      myfun(שָׁלוֹם ,السّلامعليكم)
> 
>  IMO this code is fundamentally valid: we should allow
>  programmers to write identifiers in their native tongue.
> 
> Sure, nobody wants to forbid Unicode identifiers. The rules only ensure that identifiers stay identifiable.
> I converted it to Perl (because I dislike Java or Rust), and ran it through cperl.
> The problem is that from an innocent look or code review you won't see any problem, hence the security
> risk.
> You need to adjust your tools.
> 
> But the very first RTL identifier שָׁלוֹם already contains non-identifier characters.

Which of its characters are non-identifier, and why?  That identifier
uses characters of a single script, AFAICT.

> So I cannot tell you if this code doesn't violate any of the 4 unicode mixed script profiles
> (http://www.unicode.org/reports/tr39/#Mixed_Script_Detection 2-5)
> Or if any of the unreadable characters are of the recommended scripts:

Which characters in that fragment are "unreadable" for this purpose?




* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-03 19:02                                   ` Gregory Heytings
  2021-11-03 19:46                                     ` Eli Zaretskii
@ 2021-11-04  8:44                                     ` Juri Linkov
  1 sibling, 0 replies; 172+ messages in thread
From: Juri Linkov @ 2021-11-04  8:44 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, Eli Zaretskii,
	Yuri Khan

>>> Anyway, if one wants to be able to highlight certain characters on
>>> display, one could also use highlight-regexp, I think.
>>
>> Or markchars.el with markchars-what customized to markchars-confusables.
>
> Neither would work AFAICS, because these characters are
> glyphless. Highlighting a glyphless character will not make it more
> visible.

Eli pointed out that instead of highlighting glyphless characters,
only suspicious text between glyphless characters should be highlighted:

  For example, when a character with a strong left-to-right directionality
  has its directionality overridden to behave like a right-to-left
  character, that is highly suspicious, because it makes no sense to do
  that in 99.99% of valid use cases.

markchars.el has a rule that highlights adjacent characters from
different scripts, so a new rule could be added that will highlight
text that has no right-to-left characters between directionality
switching characters.
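
A rough sketch of what such a rule could check is below.  This is not
markchars.el code: the function name is hypothetical, and the notion of
a "span" is deliberately simple (the text after an RLE, RLO or RLI up
to the next bidi control or newline).

```elisp
(defun my-bidi-spans-without-rtl (string)
  "Return (START . END) index pairs of suspicious spans in STRING.
A span is the text that follows an RLE, RLO or RLI control, up to the
next bidi control or newline; it is reported when it contains no
character whose bidi class is R or AL.  Hypothetical helper."
  (let ((re "[\u202B\u202E\u2067]\\([^\u202A-\u202E\u2066-\u2069\n]+\\)")
        (start 0)
        (spans nil))
    (while (string-match re string start)
      (let ((beg (match-beginning 1))
            (end (match-end 1))
            (has-rtl nil))
        ;; Look for at least one strong right-to-left character.
        (dotimes (k (- end beg))
          (when (memq (get-char-code-property (aref string (+ beg k))
                                              'bidi-class)
                      '(R AL))
            (setq has-rtl t)))
        (unless has-rtl
          (push (cons beg end) spans))
        (setq start end)))
    (nreverse spans)))

;; "x", RLO, "abc", PDF, "y": the span "abc" has no RTL character.
(my-bidi-spans-without-rtl "x\u202Eabc\u202Cy")   ; => ((2 . 5))
```

A real rule would then put a face on the reported spans; a span made
only of whitespace is also reported by this sketch, which may or may
not be wanted.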




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-03 22:03                                     ` Gregory Heytings
@ 2021-11-04  8:50                                       ` Gregory Heytings
  0 siblings, 0 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04  8:50 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: Stefan Kangas, Eli Zaretskii, Andreas Schwab, cpitclaudel,
	emacs-devel

[-- Attachment #1: Type: text/plain, Size: 430 bytes --]


>> I did not spend enough time tweaking that code, given that Eli already 
>> said he doesn't want it.
>
> But I could not resist, I implemented your suggestions anyway ;-) 
> Thanks again!
>

It just occurred to me that the remove-hook wasn't necessary, and that it 
is also not necessary to have two separate control flows for the 
activation and deactivation of the minor mode.  Final patch attached, it's 
only 37 lines long.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-diff; name=Make-bidi-reordering-characters-visible.patch, Size: 3183 bytes --]

From dff39cf654213122d2a926a77f14d9357d95f2ab Mon Sep 17 00:00:00 2001
From: Gregory Heytings <gregory@heytings.org>
Date: Thu, 4 Nov 2021 08:47:26 +0000
Subject: [PATCH] Make bidi reordering characters visible

* lisp/progmodes/prog-mode.el (bidi-reordering-characters-visible):
New minor mode.
(bidi-reordering-characters-visible--fontify,
bidi-reordering-characters-visible--toggle): New helper functions.
(prog-mode): Enable the new minor mode.
---
 lisp/progmodes/prog-mode.el | 38 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/lisp/progmodes/prog-mode.el b/lisp/progmodes/prog-mode.el
index db350a5f70..471f64ce28 100644
--- a/lisp/progmodes/prog-mode.el
+++ b/lisp/progmodes/prog-mode.el
@@ -293,6 +293,41 @@ turn-on-prettify-symbols-mode
 (define-globalized-minor-mode global-prettify-symbols-mode
   prettify-symbols-mode turn-on-prettify-symbols-mode)
 
+(defun bidi-reordering-characters-visible--fontify ()
+  "Fontify bidi reordering characters with `font-lock-warning-face'."
+  (font-lock-add-keywords
+   nil
+   '(("[\N{left-to-right embedding}\N{right-to-left embedding}\
+\N{left-to-right override}\N{right-to-left override}\
+\N{left-to-right isolate}\N{right-to-left isolate}\
+\N{first strong isolate}\N{pop directional formatting}\
+\N{pop directional isolate}]" . (0 'font-lock-warning-face t)))))
+
+(defun bidi-reordering-characters-visible--toggle ()
+  "Toggle the visibility of bidi reordering characters."
+  (let ((v bidi-reordering-characters-visible)
+        (bdt buffer-display-table))
+    (aset bdt ?\N{left-to-right embedding} (if v [?→] nil))
+    (aset bdt ?\N{right-to-left embedding} (if v [?←] nil))
+    (aset bdt ?\N{left-to-right override} (if v [?→] nil))
+    (aset bdt ?\N{right-to-left override} (if v [?←] nil))
+    (aset bdt ?\N{left-to-right isolate} (if v [?→] nil))
+    (aset bdt ?\N{right-to-left isolate} (if v [?←] nil))
+    (aset bdt ?\N{first strong isolate} (if v [?↓] nil))
+    (aset bdt ?\N{pop directional formatting} (if v [?↑] nil))
+    (aset bdt ?\N{pop directional isolate} (if v [?↑] nil))))
+
+;;;###autoload
+(define-minor-mode bidi-reordering-characters-visible
+  "Make the bidi reordering characters visible."
+  :init-value nil
+  (setq buffer-display-table (or buffer-display-table
+                                 standard-display-table
+                                 (make-display-table)))
+  (bidi-reordering-characters-visible--toggle)
+  (add-hook 'font-lock-mode-hook
+            #'bidi-reordering-characters-visible--fontify))
+
 ;;;###autoload
 (define-derived-mode prog-mode fundamental-mode "Prog"
   "Major mode for editing programming language source code."
@@ -300,7 +335,8 @@ prog-mode
   (setq-local parse-sexp-ignore-comments t)
   (add-hook 'context-menu-functions 'prog-context-menu 10 t)
   ;; Any programming language is always written left to right.
-  (setq bidi-paragraph-direction 'left-to-right))
+  (setq bidi-paragraph-direction 'left-to-right)
+  (bidi-reordering-characters-visible))
 
 (provide 'prog-mode)
 
-- 
2.33.0
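
A note on the display-table setup in the patch: the form
(or buffer-display-table standard-display-table ...) returns the global
table object itself when one exists, so the subsequent aset calls would
presumably affect every buffer that uses that table.  Below is a minimal
sketch of a buffer-local variant, assuming that copying the table with
copy-sequence is acceptable.  The function name
my-show-bidi-controls-locally is hypothetical and is not part of the
patch.

```elisp
(require 'disp-table)

(defun my-show-bidi-controls-locally ()
  "Show bidi formatting controls as arrows, in the current buffer only.
Hypothetical helper for illustration; it is not part of the patch."
  (let ((table (or buffer-display-table
                   ;; `copy-sequence' returns nil for nil, so `or' falls
                   ;; through to a fresh table when no global one exists.
                   (copy-sequence standard-display-table)
                   (make-display-table))))
    (dolist (entry '((#x202a . ?→) (#x202b . ?←)   ; LRE, RLE
                     (#x202d . ?→) (#x202e . ?←)   ; LRO, RLO
                     (#x2066 . ?→) (#x2067 . ?←)   ; LRI, RLI
                     (#x2068 . ?↓)                 ; FSI
                     (#x202c . ?↑) (#x2069 . ?↑))) ; PDF, PDI
      ;; Each entry is a one-glyph vector; the glyph carries a face.
      (aset table (car entry)
            (vector (make-glyph-code (cdr entry)
                                     'font-lock-warning-face))))
    ;; Setting this variable makes it buffer-local automatically.
    (setq buffer-display-table table)))
```

Attaching the face to the glyph itself is a design choice of this sketch:
it avoids depending on font-lock for the highlighting.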



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  6:58                                               ` Eli Zaretskii
@ 2021-11-04  8:53                                                 ` Gregory Heytings
  2021-11-04  9:15                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04  8:53 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan


>>>> I must really be missing something, but using buffer-display-table 
>>>> (or standard-display-table) does not need any changes whatsoever 
>>>> either.
>>>
>>> It needs the additional code you presented (or minor mode mentioned by 
>>> Stefan).
>>
>> Well, the glyphless-char-display-control solution also needs additional 
>> code: at least the custom-set-variables line, a few lines to update the 
>> acronyms of several of those characters
>
> No, it only needs the interested user to customize that option and save 
> the customizations.
>

And likewise with a minor mode.  Modes in Emacs are just built-in sets of 
customizations that users could set themselves and save if they had not 
been built-in.  This one is no different.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  7:16                                             ` Eli Zaretskii
@ 2021-11-04  9:06                                               ` Gregory Heytings
  2021-11-04  9:19                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04  9:06 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: juri, cpitclaudel, stefan, emacs-devel, db48x, monnier,
	yuri.v.khan

[-- Attachment #1: Type: text/plain, Size: 1070 bytes --]


>> Here's a screenshot with emacs -Q (current trunk).  I did M-x 
>> highlight-regexp RET y RET hi-green RET M-x highlight-regexp <regexp 
>> with all reordering characters> RET hi-green RET.  There's one 
>> reordering character between each of the "abcdefghij" strings.  Even 
>> after zooming the picture, I see absolutely nothing.
>
> If you configured highlight-regexp to highlight only the formatting 
> controls, how come 'y' is highlighted in green on the image you sent? Is 
> 'y' one of the characters that are supposed to be highlighted?
>

As explained above, I first did a highlight-regexp on 'y' to show what the 
chosen highlighting option does on non-glyphless characters.

>
> If you shift-highlight one of these formatting control characters, and 
> only one such character, don't you see a thin whitespace shown in the 
> background color of the region face?
>

If you look very closely, yes, you see something in that case.  In the 
attached screenshot, there's a one-pixel light gray bar on the left of the 
cursor.  That's not what I'd consider "visible".

[-- Attachment #2: Type: image/png, Size: 1120 bytes --]


* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  7:44                                           ` Eli Zaretskii
@ 2021-11-04  9:14                                             ` Gregory Heytings
  2021-11-04  9:45                                               ` Eli Zaretskii
  2021-11-05  2:23                                             ` Daniel Brooks
  1 sibling, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04  9:14 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, emacs-devel, Daniel Brooks, monnier,
	yuri.v.khan


>
> The mere presence of these characters is NOT the root cause.  These 
> characters are legitimate and helpful when used as intended.  See 
> TUTORIAL.he for a pertinent example.
>

But TUTORIAL.he is not a pertinent example, because it's not a file with 
source code.  It's a pertinent example to show that these characters do 
have legitimate uses, which is obvious.  If you could find an actual 
source code file in an actual project in which these characters are used 
with their intended purpose, it would be a pertinent example.  Otherwise 
it is safe and reasonable to assume (as the Rust developers did) that the 
mere presence of these characters in source code files is a potential 
problem and must be flagged as such.
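
Flagging the mere presence of these characters needs nothing more than a
search.  A minimal sketch follows, assuming the nine characters listed in
the patch earlier in this thread are the ones of interest.  The names
prefixed with my- are hypothetical.

```elisp
(defconst my-bidi-control-regexp "[\u202a-\u202e\u2066-\u2069]"
  "Regexp matching LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI and PDI.")

(defun my-bidi-controls-in-buffer ()
  "Return a list of (POSITION . NAME) for each bidi control in the buffer.
Positions are in logical (buffer) order, not display order."
  (let (found)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward my-bidi-control-regexp nil t)
        ;; Point is now just after the match, so `char-before' is the
        ;; control character that was found.
        (push (cons (match-beginning 0)
                    (get-char-code-property (char-before) 'name))
              found)))
    (nreverse found)))
```

A nil result means the buffer contains none of these controls; a non-nil
result says where to look, without judging whether the use is legitimate.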




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  8:53                                                 ` Gregory Heytings
@ 2021-11-04  9:15                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  9:15 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan

> Date: Thu, 04 Nov 2021 08:53:08 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: yuri.v.khan@gmail.com, db48x@db48x.net, cpitclaudel@gmail.com, 
>     stefan@marxist.se, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> >>>> I must really be missing something, but using buffer-display-table 
> >>>> (or standard-display-table) does not need any changes whatsoever 
> >>>> either.
> >>>
> >>> It needs the additional code you presented (or minor mode mentioned by 
> >>> Stefan).
> >>
> >> Well, the glyphless-char-display-control solution also needs additional 
> >> code: at least the custom-set-variables line, a few lines to update the 
> >> acronyms of several of those characters
> >
> > No, it only needs the interested user to customize that option and save 
> > the customizations.
> 
> And likewise with a minor mode.

Which needs to be added.

> Modes in Emacs are just built-in sets of customizations that users
> could set themselves and save if they had not been built-in.  This
> one is no different.

I see no reason to add a mode for this purpose, when we already have
equivalent features.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  9:06                                               ` Gregory Heytings
@ 2021-11-04  9:19                                                 ` Eli Zaretskii
  2021-11-04  9:48                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  9:19 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: juri, cpitclaudel, stefan, emacs-devel, db48x, monnier,
	yuri.v.khan

> Date: Thu, 04 Nov 2021 09:06:10 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: cpitclaudel@gmail.com, yuri.v.khan@gmail.com, stefan@marxist.se, 
>     emacs-devel@gnu.org, db48x@db48x.net, monnier@iro.umontreal.ca, 
>     juri@linkov.net
> 
> > If you shift-highlight one of these formatting control characters, and 
> > only one such character, don't you see a thin whitespace shown in the 
> > background color of the region face?
> 
> If you look very closely, yes, you see something in that case.  In the 
> attached screenshot, there's a one-pixel light gray bar on the left of the 
> cursor.  That's not what I'd consider "visible".

I don't see even that bar on the image you posted, not sure why.

Anyway, for making these characters stand out, one would need to use a
color that is more prominent than light gray, probably some shade of
red.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  9:14                                             ` Gregory Heytings
@ 2021-11-04  9:45                                               ` Eli Zaretskii
  2021-11-04 10:41                                                 ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  9:45 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan

> Date: Thu, 04 Nov 2021 09:14:42 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Daniel Brooks <db48x@db48x.net>, cpitclaudel@gmail.com, 
>     yuri.v.khan@gmail.com, stefan@marxist.se, monnier@iro.umontreal.ca, 
>     emacs-devel@gnu.org
> 
> > The mere presence of these characters is NOT the root cause.  These 
> > characters are legitimate and helpful when used as intended.  See 
> > TUTORIAL.he for a pertinent example.
> 
> But TUTORIAL.he is not a pertinent example, because it's not a file with 
> source code.  It's a pertinent example to show that these characters do 
> have legitimate uses, which is obvious.

It's a pertinent example, because it shows that these characters have
their use in human-readable text of technical nature (which frequently
mixes RTL characters with LTR letters and punctuation).  That is
exactly what happens in comments and strings which use RTL scripts
within source code.

> If you could find an actual source code file in an actual project in
> which these characters are used with their intended purpose, it
> would be a pertinent example.

Why do you need me to find actual source code which uses those
controls?  Isn't it clear that any human-readable text in comments and
strings in a program's source code can and will use these controls?
How does the tutorial text that explains technical stuff related to a
computer program differ from what a programmer could wish to write in
a comment or a string in his/her program?

Would it be enough if I wrote such source code myself and showed it
to you?  That would be an invented example, but so are the examples in
the paper that brought up this subject, so how is that different?

> Otherwise it is safe and reasonable to assume (as the Rust
> developers did) that the mere presence of these characters in source
> code files is a potential problem and must be flagged as such.

It's easy, that's sure.  Reasonable it isn't.  Neither is it safe,
because any user who does want these characters used legitimately will
quickly turn off that warning for good.

So it works for the Rust developers to tick a checkbox, but it isn't a
solution for the problem.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  9:19                                                 ` Eli Zaretskii
@ 2021-11-04  9:48                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04  9:48 UTC (permalink / raw)
  To: gregory; +Cc: cpitclaudel, yuri.v.khan, stefan, emacs-devel, db48x, monnier,
	juri

> Date: Thu, 04 Nov 2021 11:19:14 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: juri@linkov.net, cpitclaudel@gmail.com, stefan@marxist.se,
>  emacs-devel@gnu.org, db48x@db48x.net, monnier@iro.umontreal.ca,
>  yuri.v.khan@gmail.com
> 
> I don't see even that bar on the image you posted, not sure why.
> 
> Anyway, for making these characters stand out, one would need to use a
> color that is more prominent than light gray, probably some shade of
> red.

I stand corrected: as long as these controls are displayed as a
1-pixel thin space, they indeed cannot show any color, because there's
no area we can fill with the background color on display, it's just a
single thin line.  Only with other methods of displaying glyphless
characters will a different face or color be visible.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  9:45                                               ` Eli Zaretskii
@ 2021-11-04 10:41                                                 ` Gregory Heytings
  2021-11-04 11:03                                                   ` Po Lu
  2021-11-04 11:20                                                   ` Eli Zaretskii
  0 siblings, 2 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04 10:41 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel


>> If you could find an actual source code file in an actual project in 
>> which these characters are used with their intended purpose, it would 
>> be a pertinent example.
>
> Why do you need me to find actual source code which uses those 
> controls?  Isn't it clear that any human-readable text in comments and 
> strings in a program's source code can and will use these controls? How 
> does the tutorial text that explains technical stuff related to a 
> computer program differ from what a programmer could wish to write in a 
> comment or a string in his/her program?
>

From a theoretical point of view, that's correct.  From a practical point 
of view, if these control characters are only found in 0.01% of the files 
that are hosted on, say, GitLab, and given that these controls can have a 
dangerous effect, it is reasonable for an editor to make them stand out. 
Just like Emacs makes no-break spaces stand out for example (although 
AFAIK they are not dangerous in any way), with a thin brown line.

>> Otherwise it is safe and reasonable to assume (as the Rust developers 
>> did) that the mere presence of these characters in source code files is 
>> a potential problem and must be flagged as such.
>
> It's easy, that's sure.  Reasonable it isn't.  Neither is it safe, 
> because any user who does want these characters used legitimately will 
> quickly turn off that warning for good.
>
> So it works for the Rust developers to tick a checkbox, but it isn't a 
> solution for the problem.
>

AFAIU the solutions you propose are:

1. Customize glyphless-char-display-control to display all control 
characters in a different way.  This is a much cruder solution, it would 
also have an effect for example on ZWNJ which might be undesirable, and it 
is also not buffer-local.  Users who want to use these characters 
legitimately are unlikely to use that solution.

2. Improve bidi-find-overridden-directionality to detect such 
non-legitimate cases.  This has to be done.

In comparison, the minor-mode exists, it's a small patch, and it's 
orthogonal to the two solutions you propose.
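
For reference, solution 1 above could look roughly like the following in
an init file.  This is a sketch under the assumption that the running
Emacs puts the bidi controls in the format-control group, as the
discussion here implies; later releases may group them differently.  It
also shows the side effect mentioned above: ZWNJ (U+200C) is a format
control too, so it gets the same treatment, and the setting is global
rather than buffer-local.

```elisp
;; Display all format controls (general category Cf) as boxed hex codes
;; instead of a thin space.  The :set function of the option updates the
;; `glyphless-char-display' char-table.
(customize-set-variable
 'glyphless-char-display-control
 '((format-control . hex-code)
   (no-font . hex-code)))

;; ZWNJ is affected as well, which may be undesirable.
(aref glyphless-char-display #x200c)
```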

Anyway, I think it is time to abandon all hope.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 10:41                                                 ` Gregory Heytings
@ 2021-11-04 11:03                                                   ` Po Lu
  2021-11-04 11:27                                                     ` Gregory Heytings
  2021-11-04 11:20                                                   ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Po Lu @ 2021-11-04 11:03 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: Eli Zaretskii, cpitclaudel, stefan, yuri.v.khan, db48x, monnier,
	emacs-devel

Gregory Heytings <gregory@heytings.org> writes:

>> Why do you need me to find actual source code which uses those
>> controls?  Isn't it clear that any human-readable text in comments
>> and strings in a program's source code can and will use these
>> controls? How does the tutorial text that explains technical stuff
>> related to a computer program differ from what a programmer could
>> wish to write in a comment or a string in his/her program?

> From a theoretical point of view, that's correct.  From a practical
> point of view, if these control characters are only found in 0.01% of
> the files that are hosted on, say, GitLab, and given that these
> controls can have a dangerous effect, it is reasonable for an editor
> to make them stand out. Just like Emacs makes no-break spaces stand
> out for example (although AFAIK they are not dangerous in any way),
> with a thin brown line.

I think the point being made was that TUTORIAL.he demonstrates the
importance of these control characters in documents mixing characters of
different directionality, of which the Emacs tutorial is one, and source
code another.

And as such, that these characters are important for users who speak RTL
languages and wish to comment their code using those languages.

If it is ok for people to comment their code in Chinese, why make it
difficult for speakers of another important language, such as Hebrew or
Arabic?




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 10:41                                                 ` Gregory Heytings
  2021-11-04 11:03                                                   ` Po Lu
@ 2021-11-04 11:20                                                   ` Eli Zaretskii
  2021-11-04 11:34                                                     ` Gregory Heytings
  2021-11-04 19:08                                                     ` Eli Zaretskii
  1 sibling, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04 11:20 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel

> Date: Thu, 04 Nov 2021 10:41:41 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: cpitclaudel@gmail.com, stefan@marxist.se, emacs-devel@gnu.org, 
>     db48x@db48x.net, monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> 
> >> If you could find an actual source code file in an actual project in 
> >> which these characters are used with their intended purpose, it would 
> >> be a pertinent example.
> >
> > Why do you need me to find actual source code which uses those 
> > controls?  Isn't it clear that any human-readable text in comments and 
> > strings in a program's source code can and will use these controls? How 
> > does the tutorial text that explains technical stuff related to a 
> > computer program differ from what a programmer could wish to write in a 
> > comment or a string in his/her program?
> >
> 
> From a theoretical point of view, that's correct.  From a practical point 
> of view, if these control characters are only found in 0.01% of the files 
> that are hosted on, say, GitLab, and given that these controls can have a 
> dangerous effect, it is reasonable for an editor to make them stand out. 

Since when is it OK to flag characters that are used very rarely?
What would be the sense of doing that?  Should we perhaps flag all the
Egyptian hieroglyphs for the same reason?

> Just like Emacs makes no-break spaces stand out for example (although 
> AFAIK they are not dangerous in any way), with a thin brown line.

It isn't "just like", because those no-break spaces are very
frequent.  I see them almost every day in the email messages I receive
and read.

> AFAIU the solutions you propose are:
> 
> 1. Customize glyphless-char-display-control to display all control 
> characters in a different way.  This is a much cruder solution, it would 
> also have an effect for example on ZWNJ which might be undesirable, and it 
> is also not buffer-local.  Users who want to use these characters 
> legitimately are unlikely to use that solution.
> 
> 2. Improve bidi-find-overridden-directionality to detect such 
> non-legitimate cases.  This has to be done.
> 
> In comparison, the minor-mode exists, it's a small patch, and it's 
> orthogonal to the two solutions you propose.

Small doesn't necessarily mean good.

> Anyway, I think it is time to abandon all hope.

It would be a shame if we abandoned all hope to solve this issue in a
good way.  I, for one, don't abandon hope in this matter.  Making
glyphless-char-display-control support buffer-local customizations is
one way to solve the issue better than by displaying arbitrary glyphs
in place of these characters.  bidi-find-overridden-directionality
will be extended soon to find the problematic text in the examples
from that paper.
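
The function mentioned above can already be called from Lisp.  A minimal
sketch follows, assuming an Emacs that provides it with the
three-argument calling convention (FROM TO OBJECT).  The wrapper name
my-first-overridden-position is hypothetical.  As noted above, at the
time of this message the function may not catch every example from the
paper.

```elisp
(defun my-first-overridden-position ()
  "Return the first buffer position whose direction is overridden, or nil.
Thin hypothetical wrapper around `bidi-find-overridden-directionality'."
  ;; A nil OBJECT means the current buffer.
  (bidi-find-overridden-directionality (point-min) (point-max) nil))

(defun my-goto-first-overridden ()
  "Move point to the first overridden character, if there is one."
  (interactive)
  (let ((pos (my-first-overridden-position)))
    (if pos
        (goto-char pos)
      (message "No overridden directionality found"))))
```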




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 11:03                                                   ` Po Lu
@ 2021-11-04 11:27                                                     ` Gregory Heytings
  0 siblings, 0 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04 11:27 UTC (permalink / raw)
  To: Po Lu
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, Eli Zaretskii,
	emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1357 bytes --]


>
> I think the point being made was that TUTORIAL.he demonstrates the 
> importance of these control characters in documents mixing characters of 
> different directionality, of which the Emacs tutorial is one, and source 
> code another.
>
> And as such, that these characters are important for users who speak RTL 
> languages and wish to comment their code using those languages.
>
> If it is ok for people to comment their code in Chinese, why make it 
> difficult for speakers of another important language, such as Hebrew or 
> Arabic?
>

I never ever said or thought that these characters are unimportant, or 
that it would be okay to make it difficult for speakers of Hebrew or 
Arabic to comment their code in their native language.  I only suggested 
making these characters stand out by default, and only in source code 
files, given that they have potential security implications.  Just like we 
make no-break spaces, thin spaces, hair spaces, and so forth, stand out. 
That doesn't make the life of those who would like to comment their code 
in their native language more difficult.  Even more so if it transpires 
that these characters are in fact used rarely, even by those who use RTL 
languages in source code files, which is why I asked to see a real-life 
example of such a use.

See again the attached screenshot of TUTORIAL.he.

[-- Attachment #2: Type: image/png, Size: 119662 bytes --]


* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 11:20                                                   ` Eli Zaretskii
@ 2021-11-04 11:34                                                     ` Gregory Heytings
  2021-11-04 13:25                                                       ` Eli Zaretskii
  2021-11-04 19:08                                                     ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04 11:34 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel


>> From a theoretical point of view, that's correct.  From a practical 
>> point of view, if these controls characters are only found in 0.01% of 
>> the files that are hosted on, say, GitLab, and given that these 
>> controls can have a dangerous effect, it is reasonable for an editor to 
>> make them stand out.
>
> Since when is it OK to flag characters that are used very rarely? What 
> would be the sense of doing that?  Should we perhaps flag all the 
> Egyptian hieroglyphs for the same reason?
>

The answer is above: "given that these controls can have a dangerous 
effect".  There's no reason to put a traffic sign in the middle of a 
forest.

>> Anyway, I think it is time to abandon all hope.
>
> It would be a shame if we abandoned all hope to solve this issue in a 
> good way.
>

That was a sentence for myself.

>
> I, for one, don't abandon hope in this matter.
>

I know ;-)

>
> bidi-find-overridden-directionality will be extended soon to find the 
> problematic text in the examples from that paper.
>

Let's then hope it will be called from prog-mode.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 11:34                                                     ` Gregory Heytings
@ 2021-11-04 13:25                                                       ` Eli Zaretskii
  2021-11-04 14:10                                                         ` Gregory Heytings
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04 13:25 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel

> Date: Thu, 04 Nov 2021 11:34:06 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: cpitclaudel@gmail.com, stefan@marxist.se, emacs-devel@gnu.org, 
>     db48x@db48x.net, monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> 
> >> From a theoretical point of view, that's correct.  From a practical 
> >> point of view, if these controls characters are only found in 0.01% of 
> >> the files that are hosted on, say, GitLab, and given that these 
> >> controls can have a dangerous effect, it is reasonable for an editor to 
> >> make them stand out.
> >
> > Since when is it OK to flag characters that are used very rarely? What 
> > would be the sense of doing that?  Should we perhaps flag all the 
> > Egyptian hieroglyphs for the same reason?
> 
> The answer is above: "given that these controls can have a dangerous 
> effect".

But they don't.  Not more than just using RTL characters within LTR
text would.  Just revisit the example posted by Stefan (which I
slightly modified to be more realistic):

      myfun("שָׁלוֹם" ,"السّلامعليكم");

Which string does this function call pass as the first argument, and
which as the second one?

> There's no reason to put a traffic sign in the middle of a forest.

Exactly.  And flagging those characters when they are used
legitimately is doing precisely that.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 13:25                                                       ` Eli Zaretskii
@ 2021-11-04 14:10                                                         ` Gregory Heytings
  2021-11-04 16:50                                                           ` Eli Zaretskii
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04 14:10 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1173 bytes --]


>>> Since when is it OK to flag characters that are used very rarely? What 
>>> would be the sense of doing that?  Should we perhaps flag all the 
>>> Egyptian hieroglyphs for the same reason?
>>
>> The answer is above: "given that these controls can have a dangerous 
>> effect".
>
> But they don't.  Not more than just using RTL characters within LTR text 
> would.  Just revisit the example posted by Stefan (which I slightly 
> modified to be more realistic):
>
>      myfun("שָׁלוֹם" ,"السّلامعليكم");
>
> Which string does this function call pass as the first argument, and 
> which as the second one?
>

There is no danger in that example, and in particular nothing invisible. 
The programmer must just be aware that compilers read source code files in 
byte order, which might be different from the order in which the string is 
displayed on screen, but is identical to the order in which one 
forward-char's through the string.

There is a danger when, because the source code contains invisible control 
characters, the programmer sees something on their screen, and the 
compiler sees something completely different.
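
The distinction between logical (file) order and display order can be
inspected from Lisp.  A minimal sketch: the hypothetical helper below
lists the characters of a string in the order a compiler would read
them, which is also the order forward-char follows, together with each
character's Unicode bidi-class property.  The property values (L, R, AL,
RLO and so on) come from the Unicode character database shipped with
Emacs.

```elisp
(defun my-logical-order (string)
  "Return a list of (CHARACTER . BIDI-CLASS) for STRING in logical order.
CHARACTER is a one-character string.  Hypothetical helper for
illustration only."
  (mapcar (lambda (ch)
            (cons (string ch)
                  (get-char-code-property ch 'bidi-class)))
          (string-to-list string)))

;; Example call; the Hebrew letter is written as an escape so that the
;; source of this call displays the same in any editor.
(my-logical-order "f(\"\u05d0\", \"b\")")
```

Strong right-to-left characters show up with class R or AL, while the
formatting controls show up under their own classes (for example RLO), so
the output makes both kinds of reordering trigger visible.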


* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 14:10                                                         ` Gregory Heytings
@ 2021-11-04 16:50                                                           ` Eli Zaretskii
  2021-11-04 17:04                                                             ` Gregory Heytings
  2021-11-04 19:16                                                           ` Stefan Monnier
  2021-11-04 19:22                                                           ` Stefan Monnier
  2 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04 16:50 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel

> Date: Thu, 04 Nov 2021 14:10:01 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: cpitclaudel@gmail.com, stefan@marxist.se, emacs-devel@gnu.org, 
>     db48x@db48x.net, monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> 
> >> The answer is above: "given that these controls can have a dangerous 
> >> effect".
> >
> > But they don't.  Not more than just using RTL characters within LTR text 
> > would.  Just revisit the example posted by Stefan (which I slightly 
> > modified to be more realistic):
> >
> >      myfun("שָׁלוֹם" ,"السّلامعليكم");
> >
> > Which string does this function call pass as the first argument, and 
> > which as the second one?
> 
> There is no danger in that example, and in particular nothing invisible. 

Ha-ha, very funny.

> The programmer must just be aware that compilers read source code files in 
> byte order, which might be different from the order in which the string is 
> displayed on screen, but is identical to the order in which one 
> forward-char's through the string.

If we are going to assume users forward-char through every piece of
code they look at, then the examples we were discussing present no
problem, either.

> There is a danger when, because the source code contains invisible control 
> characters, the programmer sees something on their screen, and the 
> compiler sees something completely different.

That's exactly what happens in the above example.  Except that
reordering happens automatically without any invisible characters,
i.e. also "invisibly".
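
To make the point concrete, here is a minimal sketch in Python (not part
of the original exchange).  The two literals are simplified, unpointed
stand-ins for the Hebrew and Arabic strings above, and the helper names
are made up.  It checks which strong bidi classes occur in such a call
and confirms that no directional formatting control is present, so any
reordering on screen comes from the bidi algorithm alone.

```python
import unicodedata

# Directional formatting controls: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

# Simplified stand-ins for the Hebrew and Arabic literals (assumption).
first = "\u05e9\u05dc\u05d5\u05dd"
second = "\u0633\u0644\u0627\u0645"
line = 'myfun("' + first + '" ,"' + second + '");'


def strong_classes(text):
    """Return the set of strong bidi classes (L, R, AL) found in TEXT."""
    return {unicodedata.bidirectional(c) for c in text} & {"L", "R", "AL"}


def has_bidi_controls(text):
    """Return True if TEXT contains any directional formatting control."""
    return any(c in BIDI_CONTROLS for c in text)


print(strong_classes(line), has_bidi_controls(line))
```

The line mixes all three strong classes yet contains none of the
controls, which is the situation described above.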



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 16:50                                                           ` Eli Zaretskii
@ 2021-11-04 17:04                                                             ` Gregory Heytings
  0 siblings, 0 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-04 17:04 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan


>>> But they don't.  Not more than just using RTL characters within LTR 
>>> text would.  Just revisit the example posted by Stefan (which I 
>>> slightly modified to be more realistic):
>>>
>>>      myfun("שָׁלוֹם" ,"السّلامعليكم");
>>>
>>> Which string does this function call pass as the first argument, and 
>>> which as the second one?
>>
>> There is no danger in that example, and in particular nothing 
>> invisible.
>
> Ha-ha, very funny.
>

It wasn't supposed to be funny.

>> The programmer must just be aware that compilers read source code files 
>> in byte order, which might be different from the order in which the 
>> string is displayed on screen, but is identical to the order in which 
>> one forward-char's through the string.
>
> If we are going to assume users forward-char through every piece of code 
> they look at, then the examples we were discussing present no problem, 
> either.
>

I'm not assuming any of this.  There are programmers who read Hebrew and 
Arabic, and those who don't.  Those who do know them know that they are 
entered and read RTL, and don't even need to check the argument order. 
Those who don't may not know this, and can easily check if they have some 
doubt about what string is passed in which argument.

>> There is a danger when, because the source code contains invisible 
>> control characters, the programmer sees something on their screen, and 
>> the compiler sees something completely different.
>
> That's exactly what happens in the above example.  Except that 
> reordering happens automatically without any invisible characters, i.e. 
> also "invisibly".
>

There are no invisible characters doing weird things with the text, no. 
And it's those invisible characters that the "Trojan Source" paper is 
about.  Not potential interpretation problems by those who would discover 
RTL languages.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful
  2021-11-04  6:56                                   ` Eli Zaretskii
@ 2021-11-04 19:04                                     ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04 19:04 UTC (permalink / raw)
  To: manuel, gregory; +Cc: cpitclaudel, stefan, monnier, emacs-devel

> Date: Thu, 04 Nov 2021 08:56:18 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: gregory@heytings.org, emacs-devel@gnu.org, stefan@marxist.se,
>  cpitclaudel@gmail.com, monnier@iro.umontreal.ca
> 
> > Ok. On this one, bidi-find-overridden-directionality returns a position
> > but I get nil on Clément's example:
> > 
> > if access_level != "user‮ ⁦// Check if admin⁩ ⁦" {
> > 
> > which is the kind of overridden directionality we should have a warning
> > on, no?
> 
> Did you read the doc string of bidi-find-overridden-directionality,
> which explains what kind of overrides it looks for?  I already said
> that it will have to be extended to cover the examples from that
> paper.

Now done on master.
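
Outside Emacs, the idea of locating such controls can be sketched in a
few lines of Python.  This is not the algorithm used by
bidi-find-overridden-directionality, only a naive scan; the function
name is made up for illustration, and the sample is the quoted line
rewritten with explicit escapes, assuming its invisible characters are
RLO, LRI, PDI and LRI as in the paper's example.

```python
# Directional formatting controls and their conventional abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(text):
    """Return (index, name) pairs for each formatting control in TEXT."""
    return [(i, BIDI_CONTROLS[c]) for i, c in enumerate(text) if c in BIDI_CONTROLS]


# The quoted example, with the assumed controls spelled out as escapes.
sample = 'if access_level != "user\u202e \u2066// Check if admin\u2069 \u2066" {'
for index, name in find_bidi_controls(sample):
    print(index, name)
```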



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  6:00                                         ` Daniel Brooks
  2021-11-04  7:44                                           ` Eli Zaretskii
@ 2021-11-04 19:05                                           ` Stefan Monnier
  1 sibling, 0 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-04 19:05 UTC (permalink / raw)
  To: Daniel Brooks
  Cc: Eli Zaretskii, cpitclaudel, emacs-devel, stefan, yuri.v.khan

> However, your suggestion of highlighting the text affected by the bidi
> override characters while not actually showing those characters visibly
> is not something that I would care to use. It shows that there may be a
> problem without showing what the cause is. The cause is the presence of
> certain characters, and I must be able to see those characters in order
> to fix the problem, or even to judge whether there is a problem at
> all.

I don't think it's the case.

AFAIK there are 3 steps:
1- Become aware of the presence of something suspicious, i.e. a chunk of
   text that may not mean what you think.
2- Be able to confirm whether this is what it looks like or not.
3- Find the root cause.

Making the special control chars more visible can help at step 3 (tho
not in all cases since the problem can occur without using any of those
chars, as shown in my example code), but it's definitely not necessary
for step 1 (where highlighting the text as Eli suggests might be more
useful) nor for step 2 (where moving the cursor across the text is all
it takes to figure out what it really means).

Really, this is just another case of the "confusables": situations where
different sequences of bytes can result in the exact same display (or
maybe not 100% identical, but sufficiently similar that the untrained
eye won't notice the difference) yet be treated differently by
our tools.

The main problem I see is that the definition of "normal" and "abnormal"
depends on the programming language and potentially even on the human
reading the text as well.

For example, imagine that the uppercase text below is written in
a script&language that's RTL:

My previous example had

    myfun (ARG1, ARG2)

where the rendering displayed ARG2 to the left of ARG1, making it
(presumably) confusing to the reader.  But if the code says:

    days = [MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY]

Which would be more confusing?  To have first element displayed on the
left or to have it displayed on the right?
I think the answer strongly depends on the past experience of the
reader, so there's a human factor at play.


        Stefan




^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 11:20                                                   ` Eli Zaretskii
  2021-11-04 11:34                                                     ` Gregory Heytings
@ 2021-11-04 19:08                                                     ` Eli Zaretskii
  2021-11-04 20:00                                                       ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04 19:08 UTC (permalink / raw)
  To: gregory, cpitclaudel; +Cc: db48x, emacs-devel, stefan, monnier, yuri.v.khan

> Date: Thu, 04 Nov 2021 13:20:43 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: cpitclaudel@gmail.com, stefan@marxist.se, yuri.v.khan@gmail.com,
>  db48x@db48x.net, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > Anyway, I think it is time to abandon all hope.
> 
> It would be a shame if we abandoned all hope to solve this issue in a
> good way.  I, for one, don't abandon hope in this matter.  Making
> glyphless-char-display-control support buffer-local customizations is
> one way of working on solving the issue better than by displaying
> arbitrary glyphs instead of them.  bidi-find-overridden-directionality
> will be extended soon to find the problematic text in the examples
> from that paper.

Now done on master.  I also added a simple command that finds and
highlights all the problematic stretches of text in a buffer.  Testing
is welcome.

I hope someone will be interested enough to write a minor mode that
detects such text automatically, perhaps by registering a function
with jit-lock and looking at the chunk of text to be displayed next.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-04 14:10                                                         ` Gregory Heytings
  2021-11-04 16:50                                                           ` Eli Zaretskii
@ 2021-11-04 19:16                                                           ` Stefan Monnier
  2021-11-05 23:31                                                             ` Gregory Heytings
  2021-11-04 19:22                                                           ` Stefan Monnier
  2 siblings, 1 reply; 172+ messages in thread
From: Stefan Monnier @ 2021-11-04 19:16 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: Eli Zaretskii, cpitclaudel, stefan, emacs-devel, db48x,
	yuri.v.khan

>>      myfun("שָׁלוֹם" ,"السّلامعليكم");
> There is no danger in that example, and in particular nothing invisible.

I'm pretty sure an attacker can use the above confusing arg order to
turn an apparently harmless program into a security hole.

The fact that the args are passed in the other order than "expected" by
the naive reader means that the naive reader (who may very well be an
expert at computer security, reviewing potentially dangerous code, just
one who knows more about proofs&bits&bytes than about Unicode's
intricacies) can be misled about what the code actually does.


        Stefan




^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-04 14:10                                                         ` Gregory Heytings
  2021-11-04 16:50                                                           ` Eli Zaretskii
  2021-11-04 19:16                                                           ` Stefan Monnier
@ 2021-11-04 19:22                                                           ` Stefan Monnier
  2021-11-04 19:55                                                             ` Eli Zaretskii
  2021-11-05 23:32                                                             ` Gregory Heytings
  2 siblings, 2 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-04 19:22 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: Eli Zaretskii, cpitclaudel, stefan, emacs-devel, db48x,
	yuri.v.khan

>>      myfun("שָׁלוֹם" ,"السّلامعليكم");
[...]
> There is a danger when, because the source code contains invisible control
> characters, the programmer sees something on their screen, and the compiler
> sees something completely different.

You mean there is a special kind of danger coming from the invisible
control characters because they can make code render unexpectedly even
though all the rendered chars are "familiar" (e.g. all-ASCII)?

That's a good point.  Maybe a middle ground could be to call the
attention to such overrides when they're used inside a text line where
all the chars are of the exact same directionality, but not if the line
already contains both strong-LTR and strong-RTL characters.
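
A rough sketch of that middle ground, in Python rather than Emacs Lisp.
The function name and the choice to look only at the classes L, R and
AL are assumptions made for illustration, not a proposal for the actual
implementation.

```python
import unicodedata

# Directional formatting controls: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def suspicious_line(line):
    """Flag LINE if it has bidi controls but strong chars of one direction only."""
    if not any(c in BIDI_CONTROLS for c in line):
        return False
    classes = {unicodedata.bidirectional(c) for c in line}
    has_ltr = "L" in classes
    has_rtl = bool(classes & {"R", "AL"})
    # Mixed-direction lines plausibly need the controls; uniform ones do not.
    return not (has_ltr and has_rtl)


# All-ASCII line with controls: flagged.
print(suspicious_line('if x != "user\u202e \u2066// admin\u2069 \u2066" {'))
# Hebrew literal isolated inside LTR code: not flagged.
print(suspicious_line('print("\u2067\u05e9\u05dc\u05d5\u05dd\u2069")'))
```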


        Stefan




^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-04 19:22                                                           ` Stefan Monnier
@ 2021-11-04 19:55                                                             ` Eli Zaretskii
  2021-11-05 23:32                                                             ` Gregory Heytings
  1 sibling, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04 19:55 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, gregory, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Eli Zaretskii <eliz@gnu.org>,  cpitclaudel@gmail.com,
>   stefan@marxist.se,  emacs-devel@gnu.org,  db48x@db48x.net,
>   yuri.v.khan@gmail.com
> Date: Thu, 04 Nov 2021 15:22:41 -0400
> 
> You mean there is a special kind of danger coming from the invisible
> control characters because they can make code render unexpectedly even
> though all the rendered chars are "familiar" (e.g. all-ASCII)?
> 
> That's a good point.  Maybe a middle ground could be to call the
> attention to such overrides when they're used inside a text line where
> all the chars are of the exact same directionality, but not if the line
> already contains both strong-LTR and strong-RTL characters.

I think the code I just installed does that, and more.

(And note that "all the chars are of the exact same directionality" is
a problematic definition, since the only characters whose
directionality cannot be changed except by these formatting controls
are so-called "strong directional" characters.  By contrast, the
examples in the paper which got us excited deliberately reorder
punctuation characters, which have "weak" directionality, and whose
reordering for malicious purposes is much harder to detect, because
many/most legitimate uses of directional formatting controls are
precisely to avoid the "weak" directional characters taking the
"wrong" direction.  And a typical line of source code will always
include both strong and weak directional characters.)
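
The distinction can be checked against the Unicode character database.
The following Python sketch (an illustration, using only the standard
unicodedata module; the helper name is made up) prints the bidi class
of a few characters that appear in ordinary source code.

```python
import unicodedata


def bidi_class(char):
    """Return the Unicode bidi class of CHAR, e.g. 'L', 'R', 'ON', 'CS'."""
    return unicodedata.bidirectional(char)


# A Latin letter, a Hebrew letter, then common punctuation, a digit, a space.
for char in ["a", "\u05d0", '"', "(", ",", "/", "1", " "]:
    print(repr(char), bidi_class(char))
```

Only the letters come out as strong (L or R); the quote, parenthesis,
comma, slash, digit and space are all weak or neutral, so their visual
position depends on context or on formatting controls.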



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 19:08                                                     ` Eli Zaretskii
@ 2021-11-04 20:00                                                       ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-04 20:00 UTC (permalink / raw)
  To: gregory; +Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel

> Date: Thu, 04 Nov 2021 21:08:15 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: db48x@db48x.net, emacs-devel@gnu.org, stefan@marxist.se,
>  monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> 
> Now done on master.  I also added a simple command that finds and
> highlights all the problematic stretches of text in a buffer.  Testing
> is welcome.

Btw, that command immediately proved its utility by flagging 2
instances in TUTORIAL.he that used an embedding of an incorrect
directionality (which happened not to affect the display).



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04  7:44                                           ` Eli Zaretskii
  2021-11-04  9:14                                             ` Gregory Heytings
@ 2021-11-05  2:23                                             ` Daniel Brooks
  2021-11-05  3:52                                               ` Stefan Kangas
                                                                 ` (2 more replies)
  1 sibling, 3 replies; 172+ messages in thread
From: Daniel Brooks @ 2021-11-05  2:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: cpitclaudel, emacs-devel, stefan, monnier, yuri.v.khan

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Daniel Brooks <db48x@db48x.net>

>> Of course it will show them in the comments and strings.
>
> Then this visual noise will get in the way of people's reading those
> comments and strings, and, for strings, will make it very hard to
> understand what will be presented to the user when those strings are
> output in some UI.
>
>> That’s where the problem is.
>
> No, the problem is elsewhere entirely: it's in the punctuation
> characters unrelated to strings and comments whose directionality is
> overridden, and which thus display in places that cause incorrect
> visual interpretation of the program during a casual read.

Look at the examples again. In many of them, all of the bidi override
characters are inside a string or comment. When that is the case, these
characters are only a problem if they cause characters that are inside
the string or comment to appear to be outside of it, by reordering those
characters relative to the syntactic markers for the string or
comment. In other examples these characters are _outside_ the string or
comment.

Unless Emacs has specific knowledge of the language syntax, showing the
characters is the only sure way to know if there is a problem or not.
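
What "showing the characters" could look like, as a sketch in Python:
each control is replaced by a visible escape in the style of the Rust
diagnostic quoted further down.  The function name is made up, the
sample comment is only modelled on the paper's example, and this is not
how Emacs displays glyphless characters.

```python
# Directional formatting controls: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def reveal_bidi_controls(text):
    """Return TEXT with each bidi control replaced by a visible \\u{XXXX} escape."""
    return "".join(
        "\\u{%04x}" % ord(c) if c in BIDI_CONTROLS else c
        for c in text
    )


# Hypothetical comment containing RLO, LRI, PDI and LRI.
print(reveal_bidi_controls("/*\u202e } \u2066if is_admin\u2069 \u2066 begin admins only */"))
```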

> You misunderstand the cause.  The mere presence of these characters is
> NOT the root cause.  These characters are legitimate and helpful when
> used as intended.  See TUTORIAL.he for a pertinent example.

Please don’t presume to tell me what I do or don’t understand. Yes,
there are use cases which are not harmful, but as I have said it must be
up to either the programmer or the compiler to answer that
question. Emacs doesn’t know the syntax of every programming language.

>> Furthermore, I have not suggested that showing the characters needs to
>> preclude any other form of highlighting. If you wish to develop some
>> additional way of warning the developer, please do so.
>
> We are talking about what should be in Emacs.  What you suggest
> shouldn't.

No other suggested feature will be useful to me. This one will. I
suggest to you that you do not know what all users want.

>> However, I suspect that the compilers for most languages currently in
>> active development will develop their own warnings and error messages as
>> well. We have plenty of ways for those messages to show up inside Emacs
>> as highlights.
>
> That's a tangent.  We are discussing what Emacs should do as a
> programmer's editor to flag such suspicious code.  That shouldn't need
> a compiler if we can do the job ourselves.  And we can.

This is not a tangent. Emacs relies heavily on compilers and language
runtimes for many of its features. This is just one more area where
Emacs should not try to be too clever.

>
>> Rust, for example, has already done so. Here’s an example:
>> 
>>     error: unicode codepoint changing visible direction of text present in comment
>>       --> src/pathmap/path.rs:10:5
>>        |
>>     10 |     /* } if is_admin  begin admins only */
>>        |     ^^-^^-^^^^^^^^^^--^^^^^^^^^^^^^^^^^^^^
>>        |     | |  |          ||
>>        |     | |  |          |'\u{2066}'
>>        |     | |  |          '\u{2069}'
>>        |     | |  '\u{2066}'
>>        |     | '\u{202e}'
>>        |     this comment contains invisible unicode text flow control codepoints
>>        |
>>        = note: `#[deny(text_direction_codepoint_in_comment)]` on by default
>>        = note: these kind of unicode codepoints change the way text
>> flows on applications that support them, but can cause confusion
>> because they change the order of characters on the screen
>>        = help: if their presence wasn't intentional, you can remove them
>
> Since the Rust compiler evidently does this when it finds these
> characters inside comments (and probably also inside strings), IMNSHO
> this is a terrible misfeature, because it means code that uses those
> controls in legitimate ways cannot be compiled without tweaking
> non-default options.  That's a cop-out, not the way to flag the
> problematic cases.

Your conclusion here is incorrect. Rust has chosen a fast strategy,
where they implement a broad error today (well, four days ago) knowing
that it does not prevent them from introducing a more refined error or
set of errors later.

Rust also has a very flexible annotation system that allows the
programmer to annotate specific statements and language items. If a use
of these characters is determined to be legitimate, the programmer can
annotate the comment, or the function the comment is in, so that this
error is disabled. In projects with strong review culture, seeing that
annotation while doing a code review will be a very strong signal that
something unusual is going on, and that it needs to be considered
carefully. Annotations are a great feature of Rust that I do not
expect Emacs to take into account.

Instead I think that Emacs should adopt a similar fast
strategy. Anything we do today can be refined later.

>> Naturally that already shows up inside of Emacs just fine; see the
>> attached image.
>
> I think this is terrible.  At best, it only tells you that something
> non-trivial goes on here (but what exactly?).  At worst, it looks like
> corruption of the source.  And while in the malicious case treating
> that as corruption is not such a bad idea, all the valid uses of these
> characters will also look like corruption.  Which means the cure is
> probably worse than the disease, because the malicious cases are a
> tiny fraction of the valid ones.

I cannot believe that you really think this. It shows up with exactly the
same highlighting that your recently–introduced
highlight-confusing-reorderings function uses. It looks nothing like
“corruption of the source”, whatever you may mean by that. The error
message explains _exactly_ what the compiler is guarding against.

Also, thinking about fractions here is irrelevant. The Rust team
examined the source of every Rust crate ever published on
https://crates.io, and found only 5 that even used these
characters. With a sample size that small, percentages don’t mean much.

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

> It's the same kind of "solution" like the airport security after 9/11:
> because there was a bunch of terrorists, we are all now suspect as
> potential terrorists, and for that reason we are constantly delayed
> for hours and humiliated by endless frisking.

Now I think you are being deliberately insulting. I conclude that your
only purpose in this conversation was to troll people or to say no to
any solution you didn’t think of yourself.

Yours doesn’t even work with `next-error`. Useless.

db48x



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  2:23                                             ` Daniel Brooks
@ 2021-11-05  3:52                                               ` Stefan Kangas
  2021-11-05  5:21                                                 ` code annotations Daniel Brooks
                                                                   ` (3 more replies)
  2021-11-05  8:09                                               ` tomas
  2021-11-05  8:31                                               ` Eli Zaretskii
  2 siblings, 4 replies; 172+ messages in thread
From: Stefan Kangas @ 2021-11-05  3:52 UTC (permalink / raw)
  To: Daniel Brooks, Eli Zaretskii
  Cc: cpitclaudel, emacs-devel, monnier, yuri.v.khan

Daniel Brooks <db48x@db48x.net> writes:

> Rust also has a very flexible annotation system that allows the
> programmer to annotate specific statements and language items. If a use
> of these characters is determined to be legitimate, the programmer can
> annotate the comment, or the function the comment is in, so that this
> error is disabled. In projects with strong review culture, seeing that
> annotation while doing a code review will be a very strong signal that
> something unusual is going on, and that it needs to be considered
> carefully. Annotations are a great feature of Rust that I do not
> expect Emacs to take into account.

We already have `ignore-errors', `with-suppressed-warnings', etc.
That sounds as powerful as the annotation system you describe, or am I
missing something?

[Discussing strictly what to do about Emacs Lisp here:]

In any case, the above leads me back to the simple idea to raise
byte-compiler (or even `read'?) warnings for the problematic control
characters unless a specific variable is set to t, or unless the piece
of code using them is wrapped in some `with-suppressed-warnings' call.

Or we do it the other way around: users mark a source code file to say
that "this file will never contain RTL characters" (but RTL scripts in
ELisp code are pretty uncommon, I think).

It doesn't seem too bad, certainly not much worse than having to add
"coding: utf-8" or similar.

Was such a solution rejected already?

> Instead I think that Emacs should adopt a similar fast
> strategy. Anything we do today can be refined later.

FWIW, I tend to agree with this.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* code annotations
  2021-11-05  3:52                                               ` Stefan Kangas
@ 2021-11-05  5:21                                                 ` Daniel Brooks
  2021-11-05  5:53                                                   ` Stefan Kangas
  2021-11-05  5:23                                                 ` Unicode confusables and reordering characters considered harmful, a simple solution Daniel Brooks
                                                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 172+ messages in thread
From: Daniel Brooks @ 2021-11-05  5:21 UTC (permalink / raw)
  To: Stefan Kangas
  Cc: Eli Zaretskii, emacs-devel, cpitclaudel, monnier, yuri.v.khan

Stefan Kangas <stefan@marxist.se> writes:

> Daniel Brooks <db48x@db48x.net> writes:
>
>> Rust also has a very flexible annotation system that allows the
>> programmer to annotate specific statements and language items. If a use
>> of these characters is determined to be legitimate, the programmer can
>> annotate the comment, or the function the comment is in, so that this
>> error is disabled. In projects with strong review culture, seeing that
>> annotation while doing a code review will be a very strong signal that
>> something unusual is going on, and that it needs to be considered
>> carefully. Annotations are a great feature of Rust that I do not
>> expect Emacs to take into account.
>
> We already have `ignore-errors', `with-suppressed-warnings', etc.
> That sounds as powerful as the annotation system you describe, or am I
> missing something?

`ignore-errors' is not similar, because it operates only at run time. I
had forgotten about `with-suppressed-warnings', which can suppress
warnings while byte compiling; that is indeed similar. Does it operate
at read time though?

On the other hand, Rust annotations are used for a few other things as
well, besides enabling or disabling warnings and errors. They are used
for conditional compilation, telling the compiler to do extra work for
you (derive, for example), specifying linking options (static vs
dynamic linking, for example), code generation (inlining, etc). And the
list can be extended by macros.

The reference is here:
https://doc.rust-lang.org/reference/attributes.html

db48x



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-05  3:52                                               ` Stefan Kangas
  2021-11-05  5:21                                                 ` code annotations Daniel Brooks
@ 2021-11-05  5:23                                                 ` Daniel Brooks
  2021-11-05  6:13                                                 ` Po Lu
  2021-11-05  7:37                                                 ` Eli Zaretskii
  3 siblings, 0 replies; 172+ messages in thread
From: Daniel Brooks @ 2021-11-05  5:23 UTC (permalink / raw)
  To: Stefan Kangas
  Cc: Eli Zaretskii, emacs-devel, cpitclaudel, monnier, yuri.v.khan

Stefan Kangas <stefan@marxist.se> writes:

> In any case, the above leads me back to the simple idea to raise
> byte-compiler (or even `read'?) warnings for the problematic control
> characters unless a specific variable is set to t, or unless the piece
> of code using them is wrapped in some `with-suppressed-warnings' call.
>
> Or we do it the other way around: users mark a source code file to say
> that "this file will never contain RTL characters" (but RTL scripts in
> ELisp code are pretty uncommon, I think).
>
> It doesn't seem too bad, certainly not much worse than having to add
> "coding: utf-8" or similar.

I prefer the former to the latter. Marking the whole file as not
containing RTL would close off too many avenues for future development
in that file. Marking a specific expression or form is much nicer.

>> Instead I think that Emacs should adopt a similar fast
>> strategy. Anything we do today can be refined later.
>
> FWIW, I tend to agree with this.

Thank you.

db48x



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: code annotations
  2021-11-05  5:21                                                 ` code annotations Daniel Brooks
@ 2021-11-05  5:53                                                   ` Stefan Kangas
  0 siblings, 0 replies; 172+ messages in thread
From: Stefan Kangas @ 2021-11-05  5:53 UTC (permalink / raw)
  To: Daniel Brooks
  Cc: Eli Zaretskii, yuri.v.khan, cpitclaudel, monnier, emacs-devel

Daniel Brooks <db48x@db48x.net> writes:

> I had forgotten about `with-suppressed-warnings', which can suppress
> warnings while byte compiling; that is indeed similar. Does it operate
> at read time though?

AFAIK, it does not operate at read-time, as it only suppresses
byte-compiler warnings.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  3:52                                               ` Stefan Kangas
  2021-11-05  5:21                                                 ` code annotations Daniel Brooks
  2021-11-05  5:23                                                 ` Unicode confusables and reordering characters considered harmful, a simple solution Daniel Brooks
@ 2021-11-05  6:13                                                 ` Po Lu
  2021-11-05  7:37                                                 ` Eli Zaretskii
  3 siblings, 0 replies; 172+ messages in thread
From: Po Lu @ 2021-11-05  6:13 UTC (permalink / raw)
  To: Stefan Kangas
  Cc: Daniel Brooks, Eli Zaretskii, cpitclaudel, emacs-devel, monnier,
	yuri.v.khan

Stefan Kangas <stefan@marxist.se> writes:

> In any case, the above leads me back to the simple idea to raise
> byte-compiler (or even `read'?) warnings for the problematic control
> characters unless a specific variable is set to t, or unless the piece
> of code using them is wrapped in some `with-suppressed-warnings' call.

Warning inside read would be extremely counterproductive, as read is
used to read many types of data, not just code.

And presumably, the other (machine-generated?) data read by read could
contain such reordering characters.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  3:52                                               ` Stefan Kangas
                                                                   ` (2 preceding siblings ...)
  2021-11-05  6:13                                                 ` Po Lu
@ 2021-11-05  7:37                                                 ` Eli Zaretskii
  2021-11-05  8:00                                                   ` Stefan Kangas
  3 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-05  7:37 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: db48x, cpitclaudel, emacs-devel, monnier, yuri.v.khan

> From: Stefan Kangas <stefan@marxist.se>
> Date: Thu, 4 Nov 2021 20:52:21 -0700
> Cc: cpitclaudel@gmail.com, yuri.v.khan@gmail.com, monnier@iro.umontreal.ca, 
> 	emacs-devel@gnu.org
> 
> In any case, the above leads me back to the simple idea to raise
> byte-compiler (or even `read'?) warnings for the problematic control
> characters unless a specific variable is set to t, or unless the piece
> of code using them is wrapped in some `with-suppressed-warnings' call.

That would flag our own code, because we sometimes wrap strings in
these directional format control characters, to avoid confusing
display.  Those are exactly the valid uses of these characters, ones
against which it makes no sense to issue a warning.

> Or we do it the other way around: users mark a source code file to say
> that "this file will never contain RTL characters" (but RTL scripts in
> ELisp code are pretty uncommon, I think).

So now everyone is suspect unless certified otherwise?  How does this
make sense?

> > Instead I think that Emacs should adopt a similar fast
> > strategy. Anything we do today can be refined later.
> 
> FWIW, I tend to agree with this.

I don't.  FWIW, I should probably say, because it seems my opinions in
these matters aren't worth much here, the years I spent studying them
and coding for them in Emacs notwithstanding.



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  7:37                                                 ` Eli Zaretskii
@ 2021-11-05  8:00                                                   ` Stefan Kangas
  2021-11-05  8:07                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Stefan Kangas @ 2021-11-05  8:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: db48x, cpitclaudel, yuri.v.khan, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> In any case, the above leads me back to the simple idea to raise
>> byte-compiler (or even `read'?) warnings for the problematic control
>> characters unless a specific variable is set to t, or unless the piece
>> of code using them is wrapped in some `with-suppressed-warnings' call.
>
> That would flag our own code, because we sometimes wrap strings in
> these directional format control characters, to avoid confusing
> display.  Those are exactly the valid uses of these characters, ones
> against which it makes no sense to issue a warning.

We would need to mark those uses as okay, of course.

>> Or we do it the other way around: users mark a source code file to say
>> that "this file will never contain RTL characters" (but RTL scripts in
>> ELisp code are pretty uncommon, I think).
>
> So now everyone is suspect unless certified otherwise?  How does this
> make sense?

The idea is to make the programmer explicitly say yes to using these
characters.  (Or at the very least give them a way to say no, but I'd
much prefer the former.)



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  8:00                                                   ` Stefan Kangas
@ 2021-11-05  8:07                                                     ` Eli Zaretskii
  2021-11-05  9:58                                                       ` Stefan Kangas
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-05  8:07 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: db48x, cpitclaudel, yuri.v.khan, monnier, emacs-devel

> From: Stefan Kangas <stefan@marxist.se>
> Date: Fri, 5 Nov 2021 01:00:59 -0700
> Cc: db48x@db48x.net, cpitclaudel@gmail.com, emacs-devel@gnu.org, 
> 	monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > That would flag our own code, because we sometimes wrap strings in
> > these directional format control characters, to avoid confusing
> > display.  Those are exactly the valid uses of these characters, ones
> > against which it makes no sense to issue a warning.
> 
> We would need to mark those uses as okay, of course.
> 
> >> Or we do it the other way around: users mark a source code file to say
> >> that "this file will never contain RTL characters" (but RTL scripts in
> >> ELisp code are pretty uncommon, I think).
> >
> > So now everyone is suspect unless certified otherwise?  How does this
> > make sense?
> 
> The idea is to make the programmer explicitly say yes to using these
> characters.  (Or at the very least give them a way to say no, but I'd
> much prefer the former.)

IMNSHO, that would be a nuisance.  IOW, this cure is much worse than
the disease.



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  2:23                                             ` Daniel Brooks
  2021-11-05  3:52                                               ` Stefan Kangas
@ 2021-11-05  8:09                                               ` tomas
  2021-11-06  1:09                                                 ` Daniel Brooks
  2021-11-05  8:31                                               ` Eli Zaretskii
  2 siblings, 1 reply; 172+ messages in thread
From: tomas @ 2021-11-05  8:09 UTC (permalink / raw)
  To: emacs-devel

On Thu, Nov 04, 2021 at 07:23:08PM -0700, Daniel Brooks wrote:
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: Daniel Brooks <db48x@db48x.net>

FWIW, I'm somewhat shocked by the tone you are taking. You disagree
with Eli, that's fine, but:

[...]

> No other suggested feature will be useful to me. This one will. I
> suggest to you that you do not know what all users want.

So far, so good. Emacs is extensible for a reason. If it doesn't suit
you, you can always extend it.

What you can request from a maintainer is that (s)he makes your task
easier, by providing the necessary knobs and levers.

[...]

> Now I think you are being deliberately insulting. I conclude that your
> only purpose in this conversation was to troll people or to say no to
> any solution you didn’t think of yourself.

But this, for my taste, at least, goes definitely too far. I think
Eli knows much more about Emacs and bidi than most of the people
around here taken together. He may be wrong on whatever thing (we
all are sometimes), and convincing him is sometimes hard work, but
accusing him of trolling is Just Wrong. IMO.

Cheers
 - t

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  2:23                                             ` Daniel Brooks
  2021-11-05  3:52                                               ` Stefan Kangas
  2021-11-05  8:09                                               ` tomas
@ 2021-11-05  8:31                                               ` Eli Zaretskii
  2021-11-05  9:34                                                 ` Juri Linkov
  2 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-05  8:31 UTC (permalink / raw)
  To: Daniel Brooks; +Cc: cpitclaudel, emacs-devel, stefan, monnier, yuri.v.khan

> From: Daniel Brooks <db48x@db48x.net>
> Cc: cpitclaudel@gmail.com,  yuri.v.khan@gmail.com,  stefan@marxist.se,
>   monnier@iro.umontreal.ca,  emacs-devel@gnu.org
> Date: Thu, 04 Nov 2021 19:23:08 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Then this visual noise will get in the way of people's reading those
> > comments and strings, and, for strings, will make it very hard to
> > understand what will be presented to the user when those strings are
> > output in some UI.
> >
> >> That’s where the problem is.
> >
> > No, the problem is elsewhere entirely: it's in the punctuation
> > characters unrelated to strings and comments whose directionality is
> > overridden, and which thus display in places that cause incorrect
> > visual interpretation of the program during a casual read.
> 
> Look at the examples again. In many of them, all of the bidi override
> characters are inside a string or comment.

Not relevant to the point I was trying to make.  (And what about those
cases where the directional controls are outside the comments or
strings?)

> When that is the case, these characters are only a problem if they
> cause characters that are inside the string or comment to appear to
> be outside of it, by reordering those characters relative to the
> syntactic markers for the string or comment. In other examples these
> characters are _outside_ the string or comment.
> 
> Unless Emacs has specific knowledge of the language syntax, showing the
> characters is the only sure way to know if there is a problem or not.

The command I installed achieves this without requiring any knowledge
of the language syntax.  So no, yours is not the only way.

> > You misunderstand the cause.  The mere presence of these characters is
> > NOT the root cause.  These characters are legitimate and helpful when
> > used as intended.  See TUTORIAL.he for a pertinent example.
> 
> Please don’t presume to tell me what I do or don’t understand. Yes,
> there are use cases which are not harmful, but as I have said it must be
> up to either the programmer or the compiler to answer that
> question. Emacs doesn’t know the syntax of every programming language.

Emacs should do a good job of not crying wolf too much, or else the
programmer will turn off these safety nets.  The feature you propose
as THE solution for the issue flags each and every use of these
characters, the absolute majority of which is completely legitimate.
That is bad for safety/security related warnings: if they have too low
signal-to-noise ratio, people will disable them and lose all the
safety.

> >> Furthermore, I have not suggested that showing the characters needs to
> >> preclude any other form of highlighting. If you wish to develop some
> >> additional way of warning the developer, please do so.
> >
> > We are talking about what should be in Emacs.  What you suggest
> > shouldn't.
> 
> No other suggested feature will be useful to me. This one will. I
> suggest to you that you do not know what all users want.

I submit that users who'd want your feature indeed don't know what
they want.  They are perhaps alarmed by the brouhaha around this
issue, whose details they don't understand, but that is all.

> > Since the Rust compiler evidently does this when it finds these
> > characters inside comments (and probably also inside strings), IMNSHO
> > this is a terrible misfeature, because it means code that uses those
> > controls in legitimate ways cannot be compiled without tweaking
> > non-default options.  That's a cop-out, not the way to flag the
> > problematic cases.
> 
> Your conclusion here is incorrect. Rust has chosen a fast strategy,
> where they implement a broad error today (well, four days ago) knowing
> that it does not prevent them from introducing a more refined error or
> set of errors later.

Then let's withdraw our approval of what they did until they do
introduce that more refined set of errors, shall we?  For now, their
cure is worse than the disease, because it will fail completely
legitimate programs out of fear of the illegitimate ones, which might
never come.

> Rust also has a very flexible annotation system that allows the
> programmer to annotate specific statements and language items. If a use
> of these characters is determined to be legitimate, the programmer can
> annotate the comment, or the function the comment is in, so that this
> error is disabled.

IME, programmers don't like to do stuff that doesn't directly help
them, and will do anything to evade that.  Especially in the Free
Software world, where usually there's no boss telling them what to do.

> > I think this is terrible.  At best, it only tells you that something
> > non-trivial goes on here (but what exactly?).  At worst, it looks like
> > corruption of the source.  And while in the malicious case treating
> > that as corruption is not such a bad idea, all the valid uses of these
> > characters will also look like corruption.  Which means the cure is
> > probably worse than the disease, because the malicious cases are a
> > tiny fraction of the valid ones.
> 
> I cannot believe that you really think this. It shows up with exactly the
> same highlighting that your recently introduced
> highlight-confusing-reorderings function uses.

In those few examples, carefully chosen to include only the malicious
reordering, yes.  But try it on legitimate uses of those control
characters, and you will see that highlight-confusing-reorderings
doesn't highlight anything (barring bugs), unlike your proposal that
does.  And that's the main point I'm trying to make: features such as
this one cannot afford crying wolf too much.

> Yours doesn’t even work with `next-error`.

It wasn't supposed to.  It was supposed to be similar to
flyspell-mode, which also "doesn't work" with next-error.  Of course,
if we decide that next-error should be able to find such places, we
can always add that (emacs 29 is still very far from a release, and we
have ample time for that), but I doubt it would be a good idea,
because next-error is about messages emitted by compilers, and this is
not a compiler-based feature.

That said, if the new command doesn't help you, you are free not to
use it, of course.  Hopefully, people who are really interested in
finding the maliciously reordered code will.



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  8:31                                               ` Eli Zaretskii
@ 2021-11-05  9:34                                                 ` Juri Linkov
  0 siblings, 0 replies; 172+ messages in thread
From: Juri Linkov @ 2021-11-05  9:34 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, yuri.v.khan, Daniel Brooks, monnier,
	emacs-devel

>> Yours doesn’t even work with `next-error`.
>
> It wasn't supposed to.  It was supposed to be similar to
> flyspell-mode, which also "doesn't work" with next-error.  Of course,
> if we decide that next-error should be able to find such places, we
> can always add that (emacs 29 is still very far from a release, and we
> have ample time for that), but I doubt it would be a good idea,
> because next-error is about messages emitted by compilers, and this is
> not a compiler-based feature.

markchars.el doesn't support next-error OOTB either,
so this is what I use to add next-error support to it:

  (progn
    (font-lock-ensure)
    (text-property-search-forward 'markchars 'confusable))

For suspiciously reordered text, this should do the same:

  (progn
    (highlight-confusing-reorderings (point-min) (point-max))
    (text-property-search-forward 'face 'confusingly-reordered))



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  8:07                                                     ` Eli Zaretskii
@ 2021-11-05  9:58                                                       ` Stefan Kangas
  2021-11-05 12:12                                                         ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Stefan Kangas @ 2021-11-05  9:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: db48x, cpitclaudel, emacs-devel, monnier, yuri.v.khan

Eli Zaretskii <eliz@gnu.org> writes:

>> The idea is to make the programmer explicitly say yes to using these
>> characters.  (Or at the very least give them a way to say no, but I'd
>> much prefer the former.)
>
> IMNSHO, that would be a nuisance.  IOW, this cure is much worse than
> the disease.

I very much disagree that byte-compiler warnings would be "worse than
the disease".  Why should any user be so very inconvenienced by that?

Security will always be at odds with convenience.  The question is one
of striking a balance between the two.

In this case, I think asking users to add one line of code to those rare
files that need to use these control characters seems like a price worth
paying to improve security in Emacs Lisp as a whole.

Yes, it'll ask more from users that want to write Emacs Lisp with
strings and comments in RTL languages.  But they can also choose to do
nothing and live with the byte-compiler warnings instead.



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  9:58                                                       ` Stefan Kangas
@ 2021-11-05 12:12                                                         ` Eli Zaretskii
  2021-11-05 13:08                                                           ` Stefan Kangas
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-05 12:12 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: db48x, cpitclaudel, emacs-devel, monnier, yuri.v.khan

> From: Stefan Kangas <stefan@marxist.se>
> Date: Fri, 5 Nov 2021 02:58:49 -0700
> Cc: db48x@db48x.net, cpitclaudel@gmail.com, yuri.v.khan@gmail.com, 
> 	monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> The idea is to make the programmer explicitly say yes to using these
> >> characters.  (Or at the very least give them a way to say no, but I'd
> >> much prefer the former.)
> >
> > IMNSHO, that would be a nuisance.  IOW, this cure is much worse than
> > the disease.
> 
> I very much disagree that byte-compiler warnings would be "worse than
> the disease".  Why should any user be so very inconvenienced by that?

Because the way this is being proposed, i.e. issue a warning whenever
any of the directional controls are present, its signal-to-noise ratio
will be too low to be useful.  If the proposal is to teach the
byte-compiler to identify the cases flagged by
bidi-find-overridden-directionality, then I don't mind it
triggering a warning.

> Security will always be at odds with convenience.  The question is one
> of striking a balance between the two.

The right balance is where the percent of false positives is very low.
If we are just going to warn because some codepoints are seen in the
source, the absolute majority of the warnings in Real Life will be
false positives, and that is AFAIU a bad idea for a security feature.

> In this case, I think asking users to add one line of code to those rare
> files that need to use these control characters seems like a price worth
> paying to improve security in Emacs Lisp as a whole.

Adding one line is a nuisance.  If it can be avoided, we should avoid
it.  Since we are capable of detecting the really suspicious uses of
those controls, it is much better to use that, because in that case
users will not have to add anything.

Don't you agree that a feature whose signal-to-noise ratio is high
enough to avoid the need of adding anything to the source is better
than a feature which does require such additions?

> Yes, it'll ask more from users that want to write Emacs Lisp with
> strings and comments in RTL languages.  But they can also choose to do
> nothing and live with the byte-compiler warnings instead.

That is not the stance we should take, because basically it says we
don't care enough about users who use these languages in their
programs.  Especially when we have a means of doing that without
causing any inconvenience.



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05 12:12                                                         ` Eli Zaretskii
@ 2021-11-05 13:08                                                           ` Stefan Kangas
  2021-11-05 14:19                                                             ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Stefan Kangas @ 2021-11-05 13:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: db48x, cpitclaudel, yuri.v.khan, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Because the way this is being proposed, i.e. issue a warning whenever
> any of the directional controls are present, its signal-to-noise ratio
> will be too low to be useful.  If the proposal is to teach the
> byte-compiler to identify the cases flagged by
> bidi-find-overridden-directionality, then I don't mind it
> triggering a warning.

OK, that's a fair point.

I didn't study `bidi-find-overridden-directionality' yet, but the
"Trojan Source" paper writes:

    "By banning all directionality-control characters, users with
    legitimate Bidi-override use cases in comments are penalized.
    Therefore, a better defense might be to ban the use of
    _unterminated_ Bidi override characters within string literals and
    comments.  By ensuring that each override is terminated – that is,
    for example, that every LRI has a matching PDI – it becomes
    impossible to distort legitimate source code outside of string
    literals and comments."  (p. 8, their emphasis)

So, IIUC, the problematic cases are "unterminated Bidi override
characters", and those are the ones worth warning about.  Does that
sound correct to you?
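
To make the quoted proposal concrete, here is a minimal sketch of a
per-line "is every control terminated?" check.  The function name
`my-unterminated-bidi-controls-p' is made up for this illustration and
is not part of Emacs; the check only mirrors the heuristic as the paper
words it, in simplified form (it does not model how isolates and
embeddings interact in the UBA), and says nothing about whether that
heuristic is the right one.

```elisp
(defun my-unterminated-bidi-controls-p (line)
  "Return non-nil if LINE leaves a bidi control unterminated.
Embeddings and overrides are expected to be closed by PDF, and
isolates by PDI, before the end of LINE.  Hypothetical helper."
  (let ((embeddings 0)
        (isolates 0))
    (dolist (ch (string-to-list line))
      (let ((class (get-char-code-property ch 'bidi-class)))
        (cond
         ;; LRE, RLE, LRO and RLO open an embedding or override.
         ((memq class '(LRE RLE LRO RLO))
          (setq embeddings (1+ embeddings)))
         ;; PDF closes the most recent embedding or override, if any.
         ((eq class 'PDF)
          (setq embeddings (max 0 (1- embeddings))))
         ;; LRI, RLI and FSI open an isolate.
         ((memq class '(LRI RLI FSI))
          (setq isolates (1+ isolates)))
         ;; PDI closes the most recent isolate, if any.
         ((eq class 'PDI)
          (setq isolates (max 0 (1- isolates)))))))
    (or (> embeddings 0) (> isolates 0))))
```

With this sketch, a line such as (string ?a #x202E ?b) is reported,
while (string ?a #x2066 ?b #x2069) is not.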

> Adding one line is a nuisance.  If it can be avoided, we should avoid
> it.  Since we are capable of detecting the really suspicious uses of
> those controls, it is much better to use that, because in that case
> users will not have to add anything.

I agree that it does sound better to prefer such an approach if
possible.



* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05 13:08                                                           ` Stefan Kangas
@ 2021-11-05 14:19                                                             ` Eli Zaretskii
  2021-11-05 23:33                                                               ` Gregory Heytings
  2021-11-06 13:58                                                               ` Benjamin Riefenstahl
  0 siblings, 2 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-05 14:19 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: db48x, cpitclaudel, yuri.v.khan, monnier, emacs-devel

> From: Stefan Kangas <stefan@marxist.se>
> Date: Fri, 5 Nov 2021 06:08:42 -0700
> Cc: db48x@db48x.net, cpitclaudel@gmail.com, emacs-devel@gnu.org, 
> 	monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> 
> I didn't study `bidi-find-overridden-directionality' yet, but the
> "Trojan Source" paper writes:
> 
>     "By banning all directionality-control characters, users with
>     legitimate Bidi-override use cases in comments are penalized.
>     Therefore, a better defense might be to ban the use of
>     _unterminated_ Bidi override characters within string literals and
>     comments.  By ensuring that each override is terminated – that is,
>     for example, that every LRI has a matching PDI– it becomes
>     impossible to distort legitimate source code outside of string
>     literals and comments."  (p. 8, their emphasis)
> 
> So, IIUC, the problematic cases are "unterminated Bidi override
> characters", and those are the ones worth warning about.  Does that
> sound correct to you?

No.  What they say is simply wrong: such unterminated overrides and
embeddings are perfectly valid.  The Unicode Bidirectional Algorithm
(UBA) mandates (https://unicode.org/reports/tr9/#X8):

  X8. All explicit directional embeddings, overrides and isolates are
  completely terminated at the end of each paragraph.

      Explicit paragraph separators (bidirectional character type B)
      indicate the end of a paragraph. As such, they are not included in
      any embedding, override or isolate. They are simply assigned the
      paragraph embedding level.

And in https://unicode.org/reports/tr9/#Bidirectional_Character_Types
you can see that newline is one of the characters whose bidi type is
B; compare:

  (get-char-code-property ?\n 'bidi-class) => B

So when the UBA says "at the end of each paragraph", it means in
practice at EOL, since all the other paragraph separators are rarely
if ever used in human-readable text.  (And Emacs, of course,
implements that rule.)
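
The same query can be run over the characters involved here.  This is
only a small sketch; the helper name `my-bidi-classes' is invented for
the example, while `get-char-code-property' and its `bidi-class'
property are the ones used above.

```elisp
(defun my-bidi-classes (chars)
  "Return an alist mapping each character in CHARS to its bidi class."
  (mapcar (lambda (ch)
            (cons ch (get-char-code-property ch 'bidi-class)))
          chars))

;; Newline and PARAGRAPH SEPARATOR, followed by three of the
;; directional controls discussed in this thread.
(my-bidi-classes '(?\n       ; LINE FEED
                   #x2029    ; PARAGRAPH SEPARATOR
                   #x202E    ; RIGHT-TO-LEFT OVERRIDE
                   #x2066    ; LEFT-TO-RIGHT ISOLATE
                   #x2069))  ; POP DIRECTIONAL ISOLATE
```

The first two entries should come back with class B, i.e. both end a
paragraph as far as rule X8 is concerned, while the controls report
their own classes (RLO, LRI, PDI).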

The authors of the paper simply don't understand the bidi stuff well
enough to make useful proposals about this.  They should have brought
this up on the Unicode mailing list, where at least the experts (and I
don't mean myself, I mean the people who wrote the UBA) could set them
straight.

I encourage you to read the comments in the implementation I wrote, to
see which cases I consider "suspicious".  The comments need to be read
with the UBA spec in mind, at least its Xn rules.  I will be happy to
explain or clarify if something is unclear there.  This is a complex
issue, and discussing it rationally could really enhance our
understanding and handling of these cases.

> > Adding one line is a nuisance.  If it can be avoided, we should avoid
> > it.  Since we are capable of detecting the really suspicious uses of
> > those controls, it is much better to use that, because in that case
> > users will not have to add anything.
> 
> I agree that it does sound better to prefer such an approach if
> possible.

Then let's try to implement that.  If there's a need for more
bidi-specific infrastructure, let me know and I will see what I can
do.

Thanks.



* Re: Unicode confusables considered harmful
  2021-11-02 12:57 Unicode confusables and reordering characters considered harmful Vasilij Schneidermann
                   ` (4 preceding siblings ...)
  2021-11-02 14:57 ` Stefan Kangas
@ 2021-11-05 18:53 ` Vasilij Schneidermann
  2021-11-05 20:03   ` Eli Zaretskii
  2021-11-05 21:36   ` Stefan Monnier
  2021-11-10 15:47 ` Unicode confusables and reordering characters " Dmitry Gutov
  6 siblings, 2 replies; 172+ messages in thread
From: Vasilij Schneidermann @ 2021-11-05 18:53 UTC (permalink / raw)
  To: emacs-devel

It's been more than a hundred messages and they all talk about
reordering characters, not Unicode confusables. Which kind of surprises
me because disabling bidi is an easy workaround for 95% of the world
population not knowing RTL languages.

Any thoughts on how the uni-confusables package could be extended and
used to detect suspicious identifiers?

Vasilij

* Re: Unicode confusables considered harmful
  2021-11-05 18:53 ` Unicode confusables " Vasilij Schneidermann
@ 2021-11-05 20:03   ` Eli Zaretskii
  2021-11-06 11:56     ` Vasilij Schneidermann
  2021-11-05 21:36   ` Stefan Monnier
  1 sibling, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-05 20:03 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: emacs-devel

> Date: Fri, 5 Nov 2021 19:53:14 +0100
> From: Vasilij Schneidermann <mail@vasilij.de>
> 
> It's been more than a hundred messages and they all talk about
> reordering characters, not Unicode confusables. Which kind of surprises
> me because disabling bidi is an easy workaround for 95% of the world
> population not knowing RTL languages.

Disabling bidi in Emacs is asking for trouble because one cannot do
that and rely on the display engine to still work correctly in all
cases.  Bidirectional support is nowadays hardwired into the display
engine and cannot be disabled completely.

> Any thoughts on how the uni-confusables package could be extended and
> used to detect suspicious identifiers?

Please try the new command highlight-confusing-reorderings (available
on master), it is supposed to be the way to detect suspicious
reorderings without falsely flagging any legitimate ones.  (I can
easily understand how mentioning it could drown in the sea of the
other messages in this thread; sorry about that.)



* Re: Unicode confusables considered harmful
  2021-11-05 18:53 ` Unicode confusables " Vasilij Schneidermann
  2021-11-05 20:03   ` Eli Zaretskii
@ 2021-11-05 21:36   ` Stefan Monnier
  1 sibling, 0 replies; 172+ messages in thread
From: Stefan Monnier @ 2021-11-05 21:36 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: emacs-devel

> It's been more than a hundred messages and they all talk about
> reordering characters, not Unicode confusables. Which kind of surprises
> me because disabling bidi is an easy workaround for 95% of the world
> population not knowing RTL languages.

Indeed, a package which highlights all the characters with strong RTL
directionality will do the trick for the bidi-illiterate population.
Or the bidi.c code could be easily tweaked to warn whenever it goes into
RTL direction.
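
As a rough sketch of the first option (not an existing package; the
command name `my-next-strong-rtl-char' is invented here), something
along these lines would locate the next character whose bidi class is
one of the two strong right-to-left classes, R or AL:

```elisp
(defun my-next-strong-rtl-char ()
  "Move point just after the next strong right-to-left character.
Return the position of that character, or nil if there is none
after point, in which case point is left where it was."
  (interactive)
  (let ((start (point))
        found)
    (while (and (not found) (not (eobp)))
      ;; R (e.g. Hebrew letters) and AL (Arabic letters) are the
      ;; strong right-to-left bidi classes.
      (when (memq (get-char-code-property (char-after) 'bidi-class)
                  '(R AL))
        (setq found (point)))
      (forward-char 1))
    (unless found
      (goto-char start))
    found))
```

A highlighting package would put a face on the position found rather
than just moving point there, but the test for "strong RTL" would be
the same.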

It's clearly not a satisfactory solution in general, but just like ASCII
was good enough for a significant user population, this would be
sufficient for a non-trivial chunk of users.

> Any thoughts on how the uni-confusables package could be extended and
> used to detect suspicious identifiers?

And indeed, personally I'm more worried about the uni-confusables, and
about de-normalized representations of accented chars (since
I'd expect most compilers don't bother to normalize their unicode
inputs).
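
To illustrate the normalization worry, here is a small sketch using
ucs-normalize.el, which ships with Emacs.  The helper name
`my-same-after-nfc-p' is made up for the example.

```elisp
(require 'ucs-normalize)

(defun my-same-after-nfc-p (a b)
  "Return non-nil if strings A and B are equal once both are NFC-normalized."
  (string= (ucs-normalize-NFC-string a)
           (ucs-normalize-NFC-string b)))

(let ((precomposed (string #x00E9))      ; e-acute as a single codepoint
      (decomposed  (string ?e #x0301)))  ; e followed by COMBINING ACUTE ACCENT
  (list (string= precomposed decomposed)
        (my-same-after-nfc-p precomposed decomposed)))
```

The two strings look the same on screen but compare unequal as-is, and
equal only after normalization; a compiler that skips normalization
would treat them as two different identifiers.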


        Stefan




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 19:16                                                           ` Stefan Monnier
@ 2021-11-05 23:31                                                             ` Gregory Heytings
  2021-11-06  7:25                                                               ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-05 23:31 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, Eli Zaretskii,
	emacs-devel


>>> myfun("שָׁלוֹם" ,"السّلامعليكم");
>>
>> There is no danger in that example, and in particular nothing 
>> invisible.
>
> I'm pretty sure an attacker can use the above confusing arg order to 
> turn an apparently harmless program into a security hole.
>

That's possible indeed, but this is not what the "Trojan Source" paper is 
about.  The example you show is only one instance of the many possible 
reasons why a piece of code can be difficult to interpret, there are many 
others, e.g. misleading indentation in code.  The point made by the 
"Trojan Source" paper is only about invisible reordering control 
characters.

* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-04 19:22                                                           ` Stefan Monnier
  2021-11-04 19:55                                                             ` Eli Zaretskii
@ 2021-11-05 23:32                                                             ` Gregory Heytings
  1 sibling, 0 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-05 23:32 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, Eli Zaretskii,
	emacs-devel


>> There is a danger when, because the source code contains invisible 
>> control characters, the programmer sees something on their screen, and 
>> the compiler sees something completely different.
>
> You mean there is a special kind of danger coming from the invisible 
> control characters because they can make code render unexpectedly even 
> though all the rendered chars are "familiar" (e.g. all-ASCII)?
>
> That's a good point.
>

Indeed, that's what I mean.  Or rather, that's what the authors of the 
"Trojan Source" paper mean.  And given that the legitimate uses of these 
invisible control characters in source code are exceedingly rare (I still 
haven't seen a single real-life case), making them visible by default 
makes sense.  Just like we make no-break spaces visible by default.
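
As a rough illustration of what "making them visible" could amount to
outside Emacs, here is a minimal C sketch.  It scans a UTF-8 buffer for
the nine explicit bidi formatting controls (U+202A..U+202E and
U+2066..U+2069) and prints each one as a visible <U+XXXX> marker.  The
function names are made up for this example, the input is assumed to be
UTF-8, and this is not the patch discussed in this thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Return nonzero if the three bytes at P encode one of the explicit bidi
   formatting controls U+202A..U+202E (E2 80 AA..AE) or
   U+2066..U+2069 (E2 81 A6..A9).  The caller guarantees 3 readable bytes.  */
static int
is_bidi_control (const unsigned char *p)
{
  if (p[0] != 0xE2)
    return 0;
  if (p[1] == 0x80)
    return p[2] >= 0xAA && p[2] <= 0xAE;
  if (p[1] == 0x81)
    return p[2] >= 0xA6 && p[2] <= 0xA9;
  return 0;
}

/* Count the bidi controls in the LEN bytes starting at S.  */
size_t
count_bidi_controls (const char *s, size_t len)
{
  size_t n = 0;
  for (size_t i = 0; i + 2 < len; i++)
    if (is_bidi_control ((const unsigned char *) s + i))
      n++;
  return n;
}

/* Write S to OUT in logical (byte) order, replacing every bidi control
   with a visible "<U+XXXX>" marker.  */
void
print_visible (FILE *out, const char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      const unsigned char *p = (const unsigned char *) s + i;
      if (i + 2 < len && is_bidi_control (p))
        {
          /* Decode the 3-byte UTF-8 sequence into its code point.  */
          unsigned cp = ((p[0] & 0x0Fu) << 12)
                        | ((p[1] & 0x3Fu) << 6)
                        | (p[2] & 0x3Fu);
          fprintf (out, "<U+%04X>", cp);
          i += 3;
        }
      else
        {
          fputc (s[i], out);
          i++;
        }
    }
}
```

Calling print_visible (stdout, line, strlen (line)) on a line shows its
bytes in logical order, that is, in the order the compiler reads them.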




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05 14:19                                                             ` Eli Zaretskii
@ 2021-11-05 23:33                                                               ` Gregory Heytings
  2021-11-06  0:54                                                                 ` Daniel Brooks
  2021-11-06 10:48                                                                 ` Eli Zaretskii
  2021-11-06 13:58                                                               ` Benjamin Riefenstahl
  1 sibling, 2 replies; 172+ messages in thread
From: Gregory Heytings @ 2021-11-05 23:33 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, Stefan Kangas, emacs-devel, db48x, monnier,
	yuri.v.khan

[-- Attachment #1: Type: text/plain, Size: 1360 bytes --]


>
> The right balance is where the percent of false positives is very low.
>

IMO, that's not the right balance: the right balance is where the 
percentage of false negatives is zero.  When security is at stake, I very 
much prefer too many false positives to missing one danger.  In particular 
because such warnings give you the feeling that there is no danger when 
there is no warning.

>
> I encourage you to read the comments in the implementation I wrote, to 
> see which cases I consider "suspicious".
>

This "I consider" is the problem of your approach.  Malevolent actors are 
always more inventive, and will find a way to escape the safety net you 
created.  The cases you consider suspicious are cases where the 
directionality of one or more characters is overridden by reordering 
control characters, but this is not what the "Trojan Source" paper is 
about.  The problem it points to is much broader, it's about using these 
invisible control characters to make the source code appear different to a 
human reader and to a compiler.

In fact, it did not take me much time to create a case that your algorithm 
doesn't detect (and AFAIU cannot detect without also displaying warnings 
about many legitimate uses).  I attach the example code, how that code is 
displayed by Emacs, and how that code would be displayed with the patch I 
proposed.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-csrc; name=bidi-reordering.c; charset=us-ascii, Size: 538 bytes --]

#include <stdio.h>
#include <string.h>

#define is_restricted_user(user)			      \
  !strcmp (user, "root") ? 0 :				      \
  !strcmp (user, "admin") ? 0 :				      \
  !strcmp (user, "superuser‮⁦? 0 : 1⁩ ⁦")

int main () {
  printf ("root: %d\n", is_restricted_user ("root"));
  printf ("admin: %d\n", is_restricted_user ("admin"));
  printf ("superuser: %d\n", is_restricted_user ("superuser"));
  printf ("luser: %d\n", is_restricted_user ("luser"));
  printf ("nobody: %d\n", is_restricted_user ("nobody"));
}

[-- Attachment #3: Type: image/png, Size: 76462 bytes --]

[-- Attachment #4: Type: image/png, Size: 77029 bytes --]


* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05 23:33                                                               ` Gregory Heytings
@ 2021-11-06  0:54                                                                 ` Daniel Brooks
  2021-11-06 10:56                                                                   ` Eli Zaretskii
  2021-11-06 10:48                                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Daniel Brooks @ 2021-11-06  0:54 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, Stefan Kangas, yuri.v.khan, monnier, Eli Zaretskii,
	emacs-devel

Gregory Heytings <gregory@heytings.org> writes:

> This "I consider" is the problem of your approach.  Malevolent actors
> are always more inventive, and will find a way to escape the safety
> net you created.

Absolutely.

> The cases you consider suspicious are cases where the directionality
> of one or more characters is overridden by reordering control
> characters, but this is not what the "Trojan Source" paper is about.
> The problem it points to is much broader, it's about using these
> invisible control characters to make the source code appear different
> to a human reader and to a compiler.

Specifically reordering the source so that something which is inside of
a comment or string appears to be outside of it, or vice versa.

However, as you say arbitrary rearrangement is on the table. The paper
specifically mentions that the line can be treated as an anagram, and
the characters rearranged into an arbitrary order. It would be fun to
find a nice example where one enum variant was substituted for another,
with no string or comment on the line to supply the necessary
characters. It would require enum variants whose names are anagrams…

> In fact, it did not take me much time to create a case that your
> algorithm doesn't detect (and AFAIU cannot detect without also
> displaying warnings about many legitimate uses).  I attach the example
> code, how that code is displayed by Emacs, and how that code would be
> displayed with the patch I proposed.
>
> #define is_restricted_user(user)			      \
>   !strcmp (user, "root") ? 0 :				      \
>   !strcmp (user, "admin") ? 0 :				      \
>   !strcmp (user, "superuser‮⁦? 0 : 1⁩ ⁦")

I love this example.

I think that it can be detected though. As the paper says, we should be
on the lookout for unterminated overrides. This example has a
LEFT-TO-RIGHT ISOLATE that is left unterminated by a POP DIRECTIONAL
ISOLATE; it thus applies long enough to hit the string delimiter.

Personally I don’t mind detecting these sorts of errors, as long as we
recognize that we cannot reliably do so unless we also know the syntax
of the language; not every language terminates a string the same
way. Imagine this were Perl, and we were manipulating not a
double–quoted string but a q{}, a qx{}, or worse: a regex match
(m//). Recall that regex matches can use arbitrary punctuation
characters as delimiters; m[] is just as valid as m//. But perhaps it
would suffice to find isolates which are only terminated by a newline
character.
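
A minimal C sketch of that last idea, under stated assumptions: it knows
nothing about the language's syntax, treats one line as one paragraph,
and only reports whether an embedding, override or isolate is still open
when the line ends.  The function names are hypothetical, and counting
embeddings and isolates separately simplifies what the UBA does (there,
a PDI also closes embeddings opened inside the isolate).  By
construction it can only catch initiators that are left unterminated.

```c
#include <stddef.h>

/* If the bytes at P (REMAINING of them are readable) start one of the
   explicit bidi controls U+202A..U+202E or U+2066..U+2069, return its
   code point; otherwise return 0.  */
static unsigned
bidi_control_at (const unsigned char *p, size_t remaining)
{
  if (remaining < 3 || p[0] != 0xE2
      || (p[1] != 0x80 && p[1] != 0x81)
      || (p[2] & 0xC0) != 0x80)
    return 0;
  unsigned cp = 0x2000u | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
  if ((cp >= 0x202A && cp <= 0x202E) || (cp >= 0x2066 && cp <= 0x2069))
    return cp;
  return 0;
}

/* Return nonzero if LINE (LEN bytes, without its newline) ends while an
   embedding, override or isolate is still open.  */
int
has_unterminated_bidi (const char *line, size_t len)
{
  int isolates = 0, embeddings = 0;
  size_t i = 0;
  while (i < len)
    {
      unsigned cp = bidi_control_at ((const unsigned char *) line + i,
                                     len - i);
      switch (cp)
        {
        case 0x2066: case 0x2067: case 0x2068: /* LRI, RLI, FSI */
          isolates++;
          break;
        case 0x2069:                           /* PDI */
          if (isolates > 0)
            isolates--;
          break;
        case 0x202A: case 0x202B:              /* LRE, RLE */
        case 0x202D: case 0x202E:              /* LRO, RLO */
          embeddings++;
          break;
        case 0x202C:                           /* PDF */
          if (embeddings > 0)
            embeddings--;
          break;
        default:
          break;
        }
      i += cp ? 3 : 1;
    }
  return isolates > 0 || embeddings > 0;
}
```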

db48x




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05  8:09                                               ` tomas
@ 2021-11-06  1:09                                                 ` Daniel Brooks
  0 siblings, 0 replies; 172+ messages in thread
From: Daniel Brooks @ 2021-11-06  1:09 UTC (permalink / raw)
  To: tomas; +Cc: emacs-devel

<tomas@tuxteam.de> writes:

> On Thu, Nov 04, 2021 at 07:23:08PM -0700, Daniel Brooks wrote:
>> No other suggested feature will be useful to me. This one will. I
>> suggest to you that you do not know what all users want.
>
> So far, so good. Emacs is extensible for a reason. If it doesn't suit
> you, you can always extend it.
>
> What you can request from a maintainer is that (s)he makes your task
> easier, by providing the necessary knobs and levers.

Please don’t put words in my mouth; I did not ask him to provide
anything. I have repeatedly tried to convince him that his suggested fix
was inadequate, but offered to do the work myself. I will probably do
that work this weekend, in case anyone else wants to use it.

In fact, I seem to recall that I entered this discussion agreeing with
him that somebody should write something, but offering the suggestion
that we didn’t need to write a whole new mode since whitespace-mode is
right there and very handy. Others have made other similar suggestions,
such as the uni-confusables package from ELPA. Gregory Heytings
suggested using buffer-display-table directly, rather than going through
whitespace-mode.

>
> [...]
>
>> Now I think you are being deliberately insulting. I conclude that your
>> only purpose in this conversation was to troll people or to say no to
>> any solution you didn’t think of yourself.
>
> But this, for my taste, at least, goes definitely too far. I think
> Eli knows much more about Emacs and bidi than most of the people
> around here taken together. He may be wrong on whatever thing (we
> all are sometimes), and convincing him is sometimes hard work, but
> accusing him of trolling is Just Wrong. IMO.

I disagree. You conveniently omitted the part where he compared me to
the TSA. What can that be other than a troll? At least he didn’t call me
a Nazi I guess.

db48x




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05 23:31                                                             ` Gregory Heytings
@ 2021-11-06  7:25                                                               ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-06  7:25 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel

> Date: Fri, 05 Nov 2021 23:31:38 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Eli Zaretskii <eliz@gnu.org>, cpitclaudel@gmail.com, stefan@marxist.se, 
>     emacs-devel@gnu.org, db48x@db48x.net, yuri.v.khan@gmail.com
> 
> That's possible indeed, but this is not what the "Trojan Source" paper is 
> about.  The example you show is only one instance of the many possible 
> reasons why a piece of code can be difficult to interpret; there are many 
> others, e.g. misleading indentation in code.  The point made by the 
> "Trojan Source" paper is only about invisible reordering control 
> characters.

We should have features to flag any potentially confusing text, not
just what that paper was talking about.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05 23:33                                                               ` Gregory Heytings
  2021-11-06  0:54                                                                 ` Daniel Brooks
@ 2021-11-06 10:48                                                                 ` Eli Zaretskii
  2021-11-08 19:58                                                                   ` Gregory Heytings
  1 sibling, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-06 10:48 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, emacs-devel, db48x, monnier, yuri.v.khan

> Date: Fri, 05 Nov 2021 23:33:39 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: Stefan Kangas <stefan@marxist.se>, db48x@db48x.net, cpitclaudel@gmail.com, 
>     yuri.v.khan@gmail.com, monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > The right balance is where the percent of false positives is very low.
> 
> IMO, that's not the right balance: the right balance is where the 
> percentage of false negatives is zero.

If you need zero false negatives, and don't care about the level of
noise (i.e. false positives), you have the features for that already:
customize glyphless-char-display-control to show the control
characters as acronyms or hex codes.  And if you want them to stand
out even more, you can in addition use highlight-regexp to show them
in some prominent background color.

However, this basically means you don't need to display any buffers
with truly bidirectional text as a matter of routine.  The command I
added yesterday is for those who do, for whom the level of noise from
false positives will be too much.

> When security is at stake, I very much prefer too many false
> positives to missing one danger.  In particular because such
> warnings give you the feeling that there is no danger when there is
> no warning.

That's fine.  Then you can use those other facilities.

> > I encourage you to read the comments in the implementation I wrote, to 
> > see which cases I consider "suspicious".
> 
> This "I consider" is the problem of your approach.  Malevolent actors are 
> always more inventive, and will find a way to escape the safety net you 
> created.  The cases you consider suspicious are cases where the 
> directionality of one or more characters is overridden by reordering 
> control characters, but this is not what the "Trojan Source" paper is 
> about.  The problem it points to is much broader, it's about using these 
> invisible control characters to make the source code appear different to a 
> human reader and to a compiler.

The only way to make the source code appear different to a human
reader is to reorder some of the characters, by tweaking their
directionality using those formatting controls.  That is why those
control characters are used in these examples.  So there's no
difference between what I consider suspicious and what that paper
says, we just say it in different words.

> In fact, it did not take me much time to create a case that your algorithm 
> doesn't detect (and AFAIU cannot detect without also displaying warnings 
> about many legitimate uses).  I attach the example code, how that code is 
> displayed by Emacs, and how that code would be displayed with the patch I 
> proposed.

Thanks, I've now enhanced the code which detects suspiciously
reordered source to cover this kind of case as well.  I didn't see
any legitimate uses flagged after the change, but if you can find any
such cases, please show them and I will take a look.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-06  0:54                                                                 ` Daniel Brooks
@ 2021-11-06 10:56                                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-06 10:56 UTC (permalink / raw)
  To: Daniel Brooks
  Cc: cpitclaudel, stefan, yuri.v.khan, gregory, monnier, emacs-devel

> From: Daniel Brooks <db48x@db48x.net>
> Cc: Eli Zaretskii <eliz@gnu.org>,  cpitclaudel@gmail.com,  Stefan Kangas
>  <stefan@marxist.se>,  emacs-devel@gnu.org,  monnier@iro.umontreal.ca,
>   yuri.v.khan@gmail.com
> Date: Fri, 05 Nov 2021 17:54:37 -0700
> 
> > #define is_restricted_user(user)			      \
> >   !strcmp (user, "root") ? 0 :				      \
> >   !strcmp (user, "admin") ? 0 :				      \
> >   !strcmp (user, "superuser‮⁦? 0 : 1⁩ ⁦")
> 
> I love this example.

Well, then maybe you'll also like the solution I just installed.

> I think that it can be detected though. As the paper says, we should be
> on the lookout for unterminated overrides. This example has a
> LEFT-TO-RIGHT ISOLATE that is left unterminated by a POP DIRECTIONAL
> ISOLATE; it thus applies long enough to hit the string delimiter.

No, this example (and others as well) will display the same even if
all the embeddings and isolates are terminated by the corresponding
POP controls.  In fact, the test case I installed does just that.  As
I wrote elsewhere, the UBA says that unterminated embeddings and
overrides are perfectly legitimate.  So the search for "unterminated"
overrides and isolates cannot be the solution, it can only detect the
cases where the malicious parties got sloppy.

> Personally I don’t mind detecting these sorts of errors, as long as we
> recognize that we cannot reliably do so unless we also know the syntax
> of the language; not every language terminates a string the same
> way. Imagine this were Perl, and we were manipulating not a
> double–quoted string but a q{}, a qx{}, or worse: a regex match
> (m//). Recall that regex matches can use arbitrary punctuation
> characters as delimiters; m[] is just as valid as m//.

I don't see how this is relevant, as long as the detection doesn't
care about the syntax, and just looks at the characters whose
bidirectional properties are being tweaked.  The parties that concoct
these malicious code samples do indeed have to consider the syntax of
the language, since they want to dupe human readers and also avoid
the compiler flagging the source as invalid.  But detection doesn't have
to know anything about the syntax, at least not for some class of
detection algorithms.




* Re: Unicode confusables considered harmful
  2021-11-05 20:03   ` Eli Zaretskii
@ 2021-11-06 11:56     ` Vasilij Schneidermann
  2021-11-06 12:20       ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Vasilij Schneidermann @ 2021-11-06 11:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1019 bytes --]

> Disabling bidi in Emacs is asking for trouble because one cannot do
> that and rely on the display engine to still work correctly in all
> cases.  Bidirectional support is nowadays hardwired into the display
> engine and cannot be disabled completely.

If it works correctly in all the cases not using RTL scripts, that's
fine by me. And I'm far from the only one thinking like that. Bonus if
it improves redisplay speed.

> Please try the new command highlight-confusing-reorderings (available
> on master), it is supposed to be the way to detect suspicious
> reorderings without falsely flagging any legitimate ones.  (I can
> easily understand how mentioning it could drown in the sea of the
> other messages in this thread; sorry about that.)

I'm specifically not talking about reordering characters, but
confusables, that is, characters that look visually identical. See
https://unicode.org/reports/tr39/#Confusable_Detection for further
elaboration on the topic. Hence the change of the subject line.
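
To make the idea concrete, here is a minimal C sketch of a
"skeleton"-style comparison, loosely modelled on the approach described
in TR39.  It assumes UTF-8 input and uses a tiny hand-picked table of
six Cyrillic letters that look like ASCII ones; the real algorithm
relies on the full confusables data and on normalization, neither of
which is attempted here.  The names skeleton and looks_alike are
invented for this example and do not come from uni-confusables.

```c
#include <stddef.h>
#include <string.h>

/* A 2-byte UTF-8 sequence and the ASCII letter it resembles.  */
struct confusable { const char *utf8; char ascii; };

/* Illustrative subset only, not the Unicode confusables data.  */
static const struct confusable table[] = {
  { "\xD0\xB0", 'a' },  /* U+0430 CYRILLIC SMALL LETTER A  */
  { "\xD0\xB5", 'e' },  /* U+0435 CYRILLIC SMALL LETTER IE */
  { "\xD0\xBE", 'o' },  /* U+043E CYRILLIC SMALL LETTER O  */
  { "\xD1\x80", 'p' },  /* U+0440 CYRILLIC SMALL LETTER ER */
  { "\xD1\x81", 'c' },  /* U+0441 CYRILLIC SMALL LETTER ES */
  { "\xD1\x85", 'x' },  /* U+0445 CYRILLIC SMALL LETTER HA */
};

/* Write into DST (capacity CAP, including the NUL) a copy of SRC in
   which every table entry is replaced by its ASCII look-alike.
   Return 0 on success, -1 if DST is too small.  */
int
skeleton (const char *src, char *dst, size_t cap)
{
  size_t out = 0;
  size_t n = sizeof table / sizeof table[0];
  if (cap == 0)
    return -1;
  while (*src)
    {
      size_t k;
      for (k = 0; k < n; k++)
        if (strncmp (src, table[k].utf8, 2) == 0)
          break;
      if (out + 1 >= cap)
        return -1;
      if (k < n)
        {
          dst[out++] = table[k].ascii;
          src += 2;
        }
      else
        dst[out++] = *src++;
    }
  dst[out] = '\0';
  return 0;
}

/* Return nonzero if A and B differ byte-wise but have equal skeletons.  */
int
looks_alike (const char *a, const char *b)
{
  char sa[256], sb[256];
  if (skeleton (a, sa, sizeof sa) != 0 || skeleton (b, sb, sizeof sb) != 0)
    return 0;
  return strcmp (a, b) != 0 && strcmp (sa, sb) == 0;
}
```

With this table, an identifier spelled with Cyrillic letters in place
of "c", "o", "p" and "e" would be reported as a look-alike of the ASCII
"scope", while two different ASCII words would not.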

Vasilij



* Re: Unicode confusables considered harmful
  2021-11-06 11:56     ` Vasilij Schneidermann
@ 2021-11-06 12:20       ` Eli Zaretskii
  2021-11-06 13:10         ` Vasilij Schneidermann
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-06 12:20 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: emacs-devel

> Date: Sat, 6 Nov 2021 12:56:29 +0100
> From: Vasilij Schneidermann <mail@vasilij.de>
> Cc: emacs-devel@gnu.org
> 
> > Disabling bidi in Emacs is asking for trouble because one cannot do
> > that and rely on the display engine to still work correctly in all
> > cases.  Bidirectional support is nowadays hardwired into the display
> > engine and cannot be disabled completely.
> 
> If it works correctly in all the cases not using RTL scripts, that's
> fine by me.

That's not something I can say.  It's unreliable because some parts of
the display engine assume that bidi reordering always happens.  I
didn't try to find in which cases the result is OK, and don't intend
investing any time in doing so.

> And I'm far from the only one thinking like that.

If there are volunteers interested in adding such a feature to Emacs,
let them send patches.  The Emacs development team decided long ago to
make the reordering an inherent feature that doesn't need to be turned
off, and the development in the display engine since then didn't
bother to keep 2 separate code paths, one each for every value of
bidi-display-reordering.  And that's what we have now.

> > Please try the new command highlight-confusing-reorderings (available
> > on master), it is supposed to be the way to detect suspicious
> > reorderings without falsely flagging any legitimate ones.  (I can
> > easily understand how mentioning it could drown in the sea of the
> > other messages in this thread; sorry about that.)
> 
> I'm specifically not talking about reordering characters, but
> confusables, that is, characters that look visually identical. See
> https://unicode.org/reports/tr39/#Confusable_Detection for further
> elaboration on the topic. Hence the change of the subject line.

That's supposed to be the subject of uni-confusables in ELPA, I think.
It has nothing to do with bidirectional reordering, AFAIU.  If
uni-confusables doesn't do its job well enough, please submit bug
reports.




* Re: Unicode confusables considered harmful
  2021-11-06 12:20       ` Eli Zaretskii
@ 2021-11-06 13:10         ` Vasilij Schneidermann
  2021-11-06 13:29           ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Vasilij Schneidermann @ 2021-11-06 13:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 506 bytes --]

> That's supposed to be the subject of uni-confusables in ELPA, I think.

That may be, but the package does nothing besides setting up a new
character table (see the initial message).

> If uni-confusables doesn't do its job well enough, please submit bug
> reports.

It wouldn't hurt to have some discussion first about what exactly the package
is supposed to do. Considering there isn't agreement on how the bidi
situation is handled, I doubt there will be an obvious solution for
confusables either.

Vasilij



* Re: Unicode confusables considered harmful
  2021-11-06 13:10         ` Vasilij Schneidermann
@ 2021-11-06 13:29           ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-06 13:29 UTC (permalink / raw)
  To: Vasilij Schneidermann; +Cc: emacs-devel

> Date: Sat, 6 Nov 2021 14:10:15 +0100
> From: Vasilij Schneidermann <mail@vasilij.de>
> Cc: emacs-devel@gnu.org
> 
> > That's supposed to be the subject of uni-confusables in ELPA, I think.
> 
> That may be, but the package does nothing besides setting up a new
> character table (see the initial message).

Hmm... I thought it did more.  So I guess it has to be extended to
provide some facilities based on that table, and perhaps also based on
other character information.

> > If uni-confusables doesn't do its job well enough, please submit bug
> > reports.
> 
> It wouldn't hurt to have some discussion first about what exactly the package
> is supposed to do.

Fine, let's have it.  Perhaps you or someone else would like to
propose a set of APIs for dealing with these issues?  That'd be a good
start, I think.

> Considering there isn't agreement on how the bidi situation is
> handled, I doubt there will be an obvious solution for confusables
> either.

Let's see what we want it to do, before talking about solutions.  I
wouldn't expect it to be hard to reach a consensus on confusables.
Implementation is another matter...




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-05 14:19                                                             ` Eli Zaretskii
  2021-11-05 23:33                                                               ` Gregory Heytings
@ 2021-11-06 13:58                                                               ` Benjamin Riefenstahl
  2021-11-06 15:34                                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 172+ messages in thread
From: Benjamin Riefenstahl @ 2021-11-06 13:58 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii writes:
> The Unicode Bidirectional Algorithm (UBA) mandates
> (https://unicode.org/reports/tr9/#X8):
>
>   X8. All explicit directional embeddings, overrides and isolates are
>   completely terminated at the end of each paragraph.
>
> [...]
>
> So when the UBA says "at the end of each paragraph", it means in
> practice at EOL, since all the other paragraph separators are rarely
> if ever used in human-readable text.  (And Emacs, of course,
> implements that rule.)

Should the end of a comment or string in source code then also qualify
as the end of a paragraph in this sense?




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-06 13:58                                                               ` Benjamin Riefenstahl
@ 2021-11-06 15:34                                                                 ` Eli Zaretskii
  2021-11-06 17:09                                                                   ` Benjamin Riefenstahl
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-06 15:34 UTC (permalink / raw)
  To: Benjamin Riefenstahl; +Cc: emacs-devel

> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Date: Sat, 06 Nov 2021 14:58:31 +0100
> 
> Eli Zaretskii writes:
> > The Unicode Bidirectional Algorithm (UBA) mandates
> > (https://unicode.org/reports/tr9/#X8):
> >
> >   X8. All explicit directional embeddings, overrides and isolates are
> >   completely terminated at the end of each paragraph.
> >
> > [...]
> >
> > So when the UBA says "at the end of each paragraph", it means in
> > practice at EOL, since all the other paragraph separators are rarely
> > if ever used in human-readable text.  (And Emacs, of course,
> > implements that rule.)
> 
> Should the end of a comment or string in source code then also qualify
> as the end of a paragraph in this sense?

It could be, but the way the UBA is implemented in Emacs makes that
very hard to do, if not impossible.  And that's even before you
consider comment styles which make that hard even in principle.  For
example:

  /* This is the beginning of a comment, */
  /* and this is its continuation.      */




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-06 15:34                                                                 ` Eli Zaretskii
@ 2021-11-06 17:09                                                                   ` Benjamin Riefenstahl
  2021-11-06 17:35                                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Benjamin Riefenstahl @ 2021-11-06 17:09 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii writes:
> And that's even before you consider comment styles which make that
> hard even in principle.  For example:
>
>   /* This is the beginning of a comment, */
>   /* and this is its continuation.      */

But if EOL is a paragraph separator for the purposes of bidi, then your
example would be two paragraphs even without the comment syntax, right?
As in

   This is the beginning of a comment,
   and this is its continuation.

So this looks to me to be orthogonal.  Which is not to say that it
would not be nice to have a solution, but it seems a problem further
down the road.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-06 17:09                                                                   ` Benjamin Riefenstahl
@ 2021-11-06 17:35                                                                     ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-06 17:35 UTC (permalink / raw)
  To: Benjamin Riefenstahl; +Cc: emacs-devel

> From: Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net>
> Date: Sat, 06 Nov 2021 18:09:38 +0100
> 
> Eli Zaretskii writes:
> > And that's even before you consider comment styles which make that
> > hard even in principle.  For example:
> >
> >   /* This is the beginning of a comment, */
> >   /* and this is its continuation.      */
> 
> But if EOL is a paragraph separator for the purposes of bidi, then your
> example would be two paragraphs even without the comment syntax, right?

For the purposes of embeddings and overrides, but not for other bidi
aspects.

Anyway, that was a tangent, not directly related to the question you
asked.




* Re: Unicode confusables and reordering characters considered harmful,  a simple solution
  2021-11-06 10:48                                                                 ` Eli Zaretskii
@ 2021-11-08 19:58                                                                   ` Gregory Heytings
  2021-11-08 20:27                                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Gregory Heytings @ 2021-11-08 19:58 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel


>> In fact, it did not take me much time to create a case that your 
>> algorithm doesn't detect (and AFAIU cannot detect without also 
>> displaying warnings about many legitimate uses).  I attach the example 
>> code, how that code is displayed by Emacs, and how that code would be 
>> displayed with the patch I proposed.
>
> Thanks, I've now enhanced the code which detects suspiciously reordered 
> source to cover this kind of case as well.  I didn't see any legitimate 
> uses flagged after the change, but if you can find any such cases, 
> please show them and I will take a look.
>

Clearly, you failed to understand the meaning of my post.  It did *not* 
mean:

Your algorithm could be improved.

It meant:

Your algorithm cannot be trusted.

It took less than 24 hours (after your commit) for a non-malevolent actor 
to find a way to escape the detection algorithm you implemented and which 
you claimed was the proper solution to the problem pointed to by the 
"Trojan Source" paper.  Your slightly improved algorithm will evidently 
not resist longer if an actually malevolent actor tries to find a way to 
escape it (and of course they won't tell you when and how they did it).

So I'll say it one more time:

The only proper solution to that problem is to highlight, by default, 
these control characters in prog-mode and its descendants.  That's the 
only 100% foolproof solution that guarantees that such constructs will 
never be missed, and this is what about 99.99% of Emacs users need.  The 
remaining 0.01% are those who:

1. Use RTL languages in their source code, AND

2. Use these reordering control characters in their source code, AND

3. Would find such highlighted characters annoying.

Those few users can turn that highlighting option off, either globally or 
by turning the minor mode off in this or that buffer.

>>> The right balance is where the percent of false positives is very low.
>>
>> IMO, that's not the right balance: the right balance is where the 
>> percentage of false negatives is zero.
>
> If you need zero false negatives, and don't care about the level of 
> noise (i.e. false positives), you have the features for that already: 
> customize glyphless-char-display-control to show the control characters 
> as acronyms or hex codes.
>

Again you clearly fail to understand what I said.  The problem has nothing 
to do with me, the problem is, as the "Trojan Source" paper rightly 
explains, what the default settings of various available editors are. 
Claiming that asking every Emacs user (except the few users mentioned 
above) to set an obscure configuration option (which is only mentioned 
once, in passing, in the manual) is a solution to that problem is just 
wrong.

Anyway, it's now clear that this problem will remain unfixed in Emacs. 
Given this, I can only applaud the Rust developers for their 
decision to ban these control characters from Rust code files.  If editors 
cannot be trusted to do a proper job on this matter, compilers should do 
it, and I hope that a similar solution will soon be adopted in other 
compilers.

And I leave this discussion with this post.




* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-08 19:58                                                                   ` Gregory Heytings
@ 2021-11-08 20:27                                                                     ` Eli Zaretskii
  2021-11-08 21:59                                                                       ` Stefan Monnier
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-08 20:27 UTC (permalink / raw)
  To: Gregory Heytings
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, monnier, emacs-devel

> Date: Mon, 08 Nov 2021 19:58:56 +0000
> From: Gregory Heytings <gregory@heytings.org>
> cc: cpitclaudel@gmail.com, stefan@marxist.se, emacs-devel@gnu.org, 
>     db48x@db48x.net, monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> 
> So I'll say it one more time:

And you will be wrong one more time.

> Anyway, it's now clear that this problem will remain unfixed in Emacs. 

Nothing could be further from the truth.  Emacs does have solutions for
this, and they include the ones you applaud, but also have superior parts
and aspects.  They are just not the solutions you consider to be the
only ones worth having.  We also have infrastructure that can be
easily extended to provide more automatic detection of the suspicious
segments of text, if we want that.

So I can only conclude that you don't understand the issues well
enough to judge what is or isn't a good solution for them.  Which
isn't surprising: this stuff is very complex and takes years to wrap
one's head around.




* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-08 20:27                                                                     ` Eli Zaretskii
@ 2021-11-08 21:59                                                                       ` Stefan Monnier
  2021-11-09  3:28                                                                         ` Eli Zaretskii
  0 siblings, 1 reply; 172+ messages in thread
From: Stefan Monnier @ 2021-11-08 21:59 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: Gregory Heytings, cpitclaudel, stefan, emacs-devel, db48x,
	yuri.v.khan

> So I can only conclude that you don't understand the issues well
> enough to judge what is or isn't a good solution for them.

From what I've read, I get the feeling that nobody truly
understands it, actually.

> Which isn't surprising: this stuff is very complex and takes years to
> wrap one's head around.

Also it requires understanding the human factor, and that is affected by
that human's past experience and habits, so I suspect that there isn't
a "one size fits all" solution.

Of course, once you start to look for "unnoticed attacks" in source
code, besides bidi and confusables, there's a host of other tricks you
can play.
See https://en.wikipedia.org/wiki/Underhanded_C_Contest for examples.


        Stefan





* Re: Unicode confusables and reordering characters considered harmful, a simple solution
  2021-11-08 21:59                                                                       ` Stefan Monnier
@ 2021-11-09  3:28                                                                         ` Eli Zaretskii
  0 siblings, 0 replies; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-09  3:28 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: cpitclaudel, stefan, yuri.v.khan, db48x, gregory, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Gregory Heytings <gregory@heytings.org>,  cpitclaudel@gmail.com,
>   stefan@marxist.se,  emacs-devel@gnu.org,  db48x@db48x.net,
>   yuri.v.khan@gmail.com
> Date: Mon, 08 Nov 2021 16:59:10 -0500
> 
> > So I can only conclude that you don't understand the issues well
> > enough to judge what is or isn't a good solution for them.
> 
> From what I've read, I get the feeling that nobody truly
> understands it, actually.

Well, "nobody" is an exaggeration, I think.  But yes, not many do.

> See https://en.wikipedia.org/wiki/Underhanded_C_Contest for examples

Right.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-02 12:57 Unicode confusables and reordering characters considered harmful Vasilij Schneidermann
                   ` (5 preceding siblings ...)
  2021-11-05 18:53 ` Unicode confusables " Vasilij Schneidermann
@ 2021-11-10 15:47 ` Dmitry Gutov
  2021-11-10 17:03   ` Eli Zaretskii
  6 siblings, 1 reply; 172+ messages in thread
From: Dmitry Gutov @ 2021-11-10 15:47 UTC (permalink / raw)
  To: Vasilij Schneidermann, emacs-devel

On 02.11.2021 15:57, Vasilij Schneidermann wrote:
> The first issue is about bidirectional reordering characters. If bidi
> text rendering is not needed, it's easy enough to work around with
> `(setq-default bidi-display-reordering nil)`. Some people already make
> use of this to speed up redisplay. Maybe there's a better solution, such
> as automatically detecting whether the user is working with a RTL script
> and only then enable bidi text rendering.
> 
> The second issue is about mixed-script confusable characters. Emacs does
> not appear to have a workaround for that. I've come across the
> uni-confusables package in GNU ELPA, but it merely sets up character
> tables. The only mention of confusables I can find in the Emacs sources
> is for `help-uni-confusables` which contains a much smaller list for
> quotation marks, used in help buffers and elisp buffers. A
> possible solution would be to implement the Unicode confusables
> algorithm and expose it in the uni-confusables package.

Here's also an article from yesterday focusing on _invisible_ characters:

https://certitude.consulting/blog/en/invisible-backdoor/

AFAICS it hasn't been mentioned in this thread yet.




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-10 15:47 ` Unicode confusables and reordering characters " Dmitry Gutov
@ 2021-11-10 17:03   ` Eli Zaretskii
  2021-11-10 17:15     ` Dmitry Gutov
  0 siblings, 1 reply; 172+ messages in thread
From: Eli Zaretskii @ 2021-11-10 17:03 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: emacs-devel, mail

> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Wed, 10 Nov 2021 18:47:12 +0300
> 
> Here's also an article from yesterday focusing on _invisible_ characters:
> 
> https://certitude.consulting/blog/en/invisible-backdoor/

First, that character (U+3164) is not invisible in Emacs: it displays
as a very wide space, so should probably stand out even if not
specifically highlighted.

And second, UTS #39 covers such characters as well.  So if we
implement some of the recommendations there, we will flag this case as
well.

But yes, this is one more case we should be able to handle.
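
To make the U+3164 case concrete, here is a small Python sketch.  It is an illustration only: it does not implement UTS #39, and the set `BLANK_LETTERS` is a hand-picked assumption (the four Hangul filler code points), not data taken from the Unicode security tables.  It shows why a whitespace-based check does not catch such a character: Unicode classifies it as a letter.

```python
import unicodedata

# Hand-picked for illustration only: the four Hangul filler code points.
# They have general category Lo (a letter) yet typically render as blank
# space.  This is NOT the UTS #39 data set.
BLANK_LETTERS = frozenset("\u115f\u1160\u3164\uffa0")


def blank_letters(text):
    """Return (index, code point, name) for each blank-looking letter in TEXT.

    Indexes are 0-based offsets into TEXT.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BLANK_LETTERS
    ]


# U+3164 is a letter as far as Unicode is concerned, not whitespace.
print(unicodedata.category("\u3164"), "\u3164".isspace())
print(blank_letters("const { timeout,\u3164} = req.query;"))
```

A real implementation would take its character list from the UTS #39 data files rather than from a hard-coded set like this one.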




* Re: Unicode confusables and reordering characters considered harmful
  2021-11-10 17:03   ` Eli Zaretskii
@ 2021-11-10 17:15     ` Dmitry Gutov
  0 siblings, 0 replies; 172+ messages in thread
From: Dmitry Gutov @ 2021-11-10 17:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, mail

On 10.11.2021 20:03, Eli Zaretskii wrote:
>> From: Dmitry Gutov <dgutov@yandex.ru>
>> Date: Wed, 10 Nov 2021 18:47:12 +0300
>>
>> Here's also an article from yesterday focusing on _invisible_ characters:
>>
>> https://certitude.consulting/blog/en/invisible-backdoor/
> 
> First, that character (U+3164) is not invisible in Emacs: it displays
> as a very wide space, so should probably stand out even if not
> specifically highlighted.

I have tried out that example in Emacs, and could only notice those 
chars when the cursor is directly over them (displayed as a wide space, yes).

> And second, UTS #39 covers such characters as well.  So if we
> implement some of the recommendations there, we will flag this case as
> well.

Sounds good to me.


