* bug#51733: 27.1; Detect impossible email addresses better
@ 2021-11-10 0:29 積丹尼 Dan Jacobson
2021-11-10 0:42 ` Lars Ingebrigtsen
2022-01-20 8:57 ` Lars Ingebrigtsen
0 siblings, 2 replies; 123+ messages in thread
From: 積丹尼 Dan Jacobson @ 2021-11-10 0:29 UTC (permalink / raw)
To: 51733
Upon sending,
To: Bob_Norbolwits@GCSsafetyACE.com
should trigger a warning:
"You won't get far trying to send mail with ZERO WIDTH SPACE in an address,"
instead of blundering along and sending to "gcssafetyace.xn--com-7m0a"!!
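A check of the kind requested here could look roughly like the sketch below. It is written in Python rather than Emacs Lisp purely for illustration. `find_invisible_chars` is a hypothetical helper, not an existing Emacs function, and flagging every character in Unicode general category Cf (format controls, which includes ZERO WIDTH SPACE) is an assumption about what should count as "impossible" in an address. The position of the hidden character in the sample address is also a guess.

```python
import unicodedata

def find_invisible_chars(address):
    """Return (index, name) pairs for format-control characters in ADDRESS."""
    hits = []
    for index, ch in enumerate(address):
        # Category Cf covers ZERO WIDTH SPACE, joiners, bidi controls, etc.
        if unicodedata.category(ch) == "Cf":
            hits.append((index, unicodedata.name(ch, "UNNAMED")))
    return hits

# Hypothetical input: ZERO WIDTH SPACE (U+200B) hidden at the end of the domain.
suspect = "Bob_Norbolwits@GCSsafetyACE.com\u200b"
for index, name in find_invisible_chars(suspect):
    print(f"warning: {name} at position {index} in {suspect!r}")
```

Warning before sending, rather than silently IDNA-encoding the domain, is the behavior the report asks for.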
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-10 0:29 bug#51733: 27.1; Detect impossible email addresses better 積丹尼 Dan Jacobson
@ 2021-11-10 0:42 ` Lars Ingebrigtsen
2021-11-10 3:34 ` Eli Zaretskii
2022-01-17 17:43 ` 積丹尼 Dan Jacobson
2022-01-20 8:57 ` Lars Ingebrigtsen
1 sibling, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2021-11-10 0:42 UTC (permalink / raw)
To: 積丹尼 Dan Jacobson; +Cc: 51733
積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:
> Upon sending,
> To: Bob_Norbolwits@GCSsafetyACE.com
> should trigger a warning:
> "You won't get far trying to send mail with ZERO WIDTH SPACE in an address,"
> instead of blundering along and sending to "gcssafetyace.xn--com-7m0a"!!
I guess Emacs should run all email addresses through a check for
Unicode confusability and direction markers and all that stuff, too.
(Which got a lot of work lately in a display context.)
Do we have a predicate somewhere that says whether a string is suspicious
based on confusables and r2l markers and stuff?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-10 0:42 ` Lars Ingebrigtsen
@ 2021-11-10 3:34 ` Eli Zaretskii
2021-11-10 4:44 ` Lars Ingebrigtsen
2022-01-17 17:43 ` 積丹尼 Dan Jacobson
1 sibling, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2021-11-10 3:34 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Date: Wed, 10 Nov 2021 01:42:34 +0100
> Cc: 51733@debbugs.gnu.org
>
> Do we have a predicate somewhere that says whether a string is suspicious
> based on confusables and r2l markers and stuff?
No. We have the infrastructure for detecting the reordering, though.
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-10 3:34 ` Eli Zaretskii
@ 2021-11-10 4:44 ` Lars Ingebrigtsen
2021-11-10 13:39 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2021-11-10 4:44 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
>> Do we have a predicate somewhere that says whether a string is suspicious
>> based on confusables and r2l markers and stuff?
>
> No. We have the infrastructure for detecting the reordering, though.
I thought I vaguely remembered you writing something in this area in
conjunction with some URL stuff some years back, but I don't recall what
happened to it.
Hm... and there's uni-confusables in GNU ELPA? Should we have that in
core instead? (Or in addition.)
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-10 4:44 ` Lars Ingebrigtsen
@ 2021-11-10 13:39 ` Eli Zaretskii
2021-11-11 2:52 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2021-11-10 13:39 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: jidanni@jidanni.org, 51733@debbugs.gnu.org
> Date: Wed, 10 Nov 2021 05:44:05 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> Do we have a predicate somewhere that says whether a string is suspicious
> >> based on confusables and r2l markers and stuff?
> >
> > No. We have the infrastructure for detecting the reordering, though.
>
> I thought I vaguely remembered you writing something in this area in
> conjunction with some URL stuff some years back, but I don't recall what
> happened to it.
I did write it, that's bidi-find-overridden-directionality, which we
have since Emacs 25. That is what I meant by "detecting the
reordering". Detecting confusables in general is a much broader
issue, not limited to bidi reordering alone.
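To make the distinction concrete, here is a minimal sketch of the narrower, bidi-only kind of check, written in Python for illustration. It only flags the explicit embedding, override, and isolate control characters; it does not reproduce what bidi-find-overridden-directionality does in Emacs. `find_bidi_controls` and `BIDI_CONTROL_CLASSES` are hypothetical names introduced here.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}

def find_bidi_controls(text):
    """Return (index, name) pairs for explicit bidi control characters."""
    return [
        (index, unicodedata.name(ch))
        for index, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

# U+202E RIGHT-TO-LEFT OVERRIDE makes the text after it display reversed.
print(find_bidi_controls("user@example\u202emoc.com"))
```

A confusables check would have to go further than this, since look-alike letters carry no control characters at all.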
> Hm... and there's uni-confusables in GNU ELPA? Should we have that in
> core instead? (Or in addition.)
We could add that to core, but currently uni-confusables just gives
you a char-table which Lisp programs can use to find out whether a
given character is a potential confusable. We need applications
layers above that, ideally implementing at least part of the
recommendations in Unicode's UTS #39
(https://www.unicode.org/reports/tr39/). We should probably first
discuss what we want to implement from there, though. How about
chiming in to emacs-devel thread "Unicode confusables considered
harmful", where Vasilij Schneidermann already asked what we think
should be done about these cases?
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-10 13:39 ` Eli Zaretskii
@ 2021-11-11 2:52 ` Lars Ingebrigtsen
2021-11-11 7:01 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2021-11-11 2:52 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> I did write it, that's bidi-find-overridden-directionality, which we
> have since Emacs 25. That is what I meant by "detecting the
> reordering".
Ah, right.
> We could add that to core, but currently uni-confusables just gives
> you a char-table which Lisp programs can use to find out whether a
> given character is a potential confusable. We need applications
> layers above that, ideally implementing at least part of the
> recommendations in Unicode's UTS #39
> (https://www.unicode.org/reports/tr39/).
It's great to see that somebody's already done the hard work -- now we
just have to implement it. 😅
> We should probably first discuss what we want to implement from there,
> though. How about chiming in to emacs-devel thread "Unicode
> confusables considered harmful", where Vasilij Schneidermann already
> asked what we think should be done about these cases?
I'm not sure that'd be productive. I think Somebody just has to write a
library that exposes the various levels/profiles as defined by TR39, and
then we should sprinkle libraries that deal with these issues (url.el,
smtpmail.el, message.el) with calls to that library, much along the same
lines as the NSM is consulted about network connections.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-11 2:52 ` Lars Ingebrigtsen
@ 2021-11-11 7:01 ` Eli Zaretskii
2021-11-11 7:31 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2021-11-11 7:01 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: jidanni@jidanni.org, 51733@debbugs.gnu.org
> Date: Thu, 11 Nov 2021 03:52:39 +0100
>
> I think Somebody just has to write a library that exposes the
> various levels/profiles as defined by TR39, and then we should
> sprinkle libraries that deal with these issues (url.el, smtpmail.el,
> message.el) with calls to that library, much along the same lines as
> the NSM is consulted about network connections.
That'd be fine, of course. Is Somebody around? please speak up if you
are.
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-11 7:01 ` Eli Zaretskii
@ 2021-11-11 7:31 ` Lars Ingebrigtsen
2022-01-16 15:47 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2021-11-11 7:31 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> That'd be fine, of course. Is Somebody around? please speak up if you
> are.
Sometimes that Somebody is me.
I think it looks like a fun little project -- it's so refreshing to have
an actual spec to program against. 😸 And I've read most of the TS
now, so it's just a small matter of typing.
But I probably won't have the time this week -- if somebody else wants
to get in on the action, please do go ahead.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-11 7:31 ` Lars Ingebrigtsen
@ 2022-01-16 15:47 ` Lars Ingebrigtsen
2022-01-16 16:03 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-16 15:47 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Lars Ingebrigtsen <larsi@gnus.org> writes:
> I think it looks like a fun little project -- it's so refreshing to have
> an actual spec to program against. 😸 And I've read most of the TS
> now, so it's just a small matter of typing.
>
> But I probably won't have the time this week -- if somebody else wants
> to get in on the action, please do go ahead.
Well, it took longer to find time to start this, but I think now's a
good time.
So we'll be importing a handful of new Unicode data files, and I think
an interface like
(suspicious-email-p "C𝗂𝗋𝖼𝗅𝖾@example.com")
=> "Confusables used in address: 𝗂 (MATHEMATICAL SANS-SERIF SMALL I) confusable with etc etc"
would be nice. But there's also a bunch of lower level functions that
might be nice to expose separately, like
(single-script-p "Сirсlе")
=> nil
but it'd be nice to group these in a single package name. But I'm
coming up blank. I mean, `unicode-suspicious-email-p' would be
nonsensical, because ... it's not really Unicode that's the point here.
For instance, if you have a link text like http://innocent.org but the
link goes to http://evil.com, then it'd be nice to implement something
for that, too, in this same package. Or http://paypaI.com, for that
matter.
So does anybody have an idea for a package name, so I can start typing
away at this? 😀
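The two kinds of checks proposed above can be sketched as follows, in Python for illustration only. All function names here are hypothetical. Python's standard library does not expose the Unicode Script property, so `rough_scripts` approximates it from the first word of each character's name; that is a crude stand-in (it reports MATHEMATICAL for 𝗂, for instance) and not how a real implementation based on Scripts.txt would work.

```python
import unicodedata

def rough_scripts(text):
    """Approximate the scripts in TEXT from the first word of each letter's name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def single_script_p(text):
    """Return True if all letters in TEXT appear to share one script."""
    return len(rough_scripts(text)) <= 1

def compatibility_lookalikes(text):
    """Return the characters that NFKC normalization would replace."""
    return [ch for ch in text if unicodedata.normalize("NFKC", ch) != ch]

mixed = "\u0421ir\u0441l\u0435"      # Cyrillic ES, es and ie mixed with Latin i, r, l
print(sorted(rough_scripts(mixed)), single_script_p(mixed))
print(compatibility_lookalikes("C\U0001D5C2rcle@example.com"))
```

NFKC catches the mathematical sans-serif letters because they have compatibility decompositions, but it does not catch the Cyrillic look-alikes, which is why the script check is needed as well.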
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 15:47 ` Lars Ingebrigtsen
@ 2022-01-16 16:03 ` Eli Zaretskii
2022-01-16 16:09 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-16 16:03 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Sun, 16 Jan 2022 16:47:21 +0100
>
> but it'd be nice to group these in a single package name. But I'm
> coming up blank. I mean, `unicode-suspicious-email-p' would be
> nonsensical, because ... it's not really Unicode that's the point here.
> For instance, if you have a link text like http://innocent.org but the
> link goes to http://evil.com, then it'd be nice to implement something
> for that, too, in this same package. Or http://paypaI.com, for that
> matter.
>
> So does anybody have an idea for a package name, so I can start typing
> away at this? 😀
unicode-security.el? I mean, most of that _is_ based on Unicode
recommendations, right?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 16:03 ` Eli Zaretskii
@ 2022-01-16 16:09 ` Lars Ingebrigtsen
2022-01-16 16:14 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-16 16:09 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> unicode-security.el? I mean, most of that _is_ based on Unicode
> recommendations, right?
Most of it is, but not all. And putting "unicode" in the function names
wouldn't be helpful, because it's not important to the people using
these functions that most of the recommendations come from Unicode.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 16:09 ` Lars Ingebrigtsen
@ 2022-01-16 16:14 ` Eli Zaretskii
2022-01-16 16:33 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-16 16:14 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Sun, 16 Jan 2022 17:09:38 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > unicode-security.el? I mean, most of that _is_ based on Unicode
> > recommendations, right?
>
> Most of it is, but not all. And putting "unicode" in the function names
> wouldn't be helpful, because it's not important to the people using
> these functions that most of the recommendations come from Unicode.
You are a tough customer.
Then what about text-security.el? or textsec.el?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 16:14 ` Eli Zaretskii
@ 2022-01-16 16:33 ` Lars Ingebrigtsen
2022-01-16 16:44 ` Eli Zaretskii
2022-01-16 17:53 ` Achim Gratz
0 siblings, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-16 16:33 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> You are a tough customer.
😀
> Then what about text-security.el? or textsec.el?
Yes, that'd work. Or... string-analysis.el? With functions like
`string-scripts' (lists the different scripts in the string) as well as
the higher-level functions... Hm...
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 16:33 ` Lars Ingebrigtsen
@ 2022-01-16 16:44 ` Eli Zaretskii
2022-01-16 17:03 ` Lars Ingebrigtsen
2022-01-16 17:53 ` Achim Gratz
1 sibling, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-16 16:44 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Sun, 16 Jan 2022 17:33:49 +0100
>
> > Then what about text-security.el? or textsec.el?
>
> Yes, that'd work. Or... string-analysis.el?
Is such an "analysis" useful for any other purposes than the one you
want to use it?
> With functions like `string-scripts' (lists the different scripts in
> the string)
That one should probably be elsewhere. Although even in that case, I
don't really see how it could be useful for anything other than this
particular purpose? I bet most Lisp programmers don't even know what
is a "script" in the Emacs context.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 16:44 ` Eli Zaretskii
@ 2022-01-16 17:03 ` Lars Ingebrigtsen
2022-01-16 17:50 ` Lars Ingebrigtsen
2022-01-16 18:14 ` Eli Zaretskii
0 siblings, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-16 17:03 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> That one should probably be elsewhere. Although even in that case, I
> don't really see how it could be useful for anything other than this
> particular purpose? I bet most Lisp programmers don't even know what
> is a "script" in the Emacs context.
Yeah, probably true...
By the way:
https://www.unicode.org/reports/tr24/tr24-32.html#Scripts_and_Blocks
As a result, using the block names as simplistic substitute for
script identity generally leads to poor results.
It looks like we're doing that, though? And indeed:
(elt char-script-table #xAB65)
=> latin
which is wrong, because that's
GREEK LETTER SMALL CAPITAL OMEGA
So we should be populating char-script-table from
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt instead of
Blocks.txt. So I'll be doing that, too.
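The mismatch can be illustrated outside Emacs with a small Python sketch. The block range is copied by hand and is an assumption of this example (Latin Extended-E taken to be U+AB30..U+AB6F); both helper functions are hypothetical, and the name-based guess is only a stand-in for a real lookup in Scripts.txt.

```python
import unicodedata

# Assumed block range for Latin Extended-E, entered by hand for this example.
LATIN_EXTENDED_E = range(0xAB30, 0xAB70)

def block_based_guess(ch):
    """Guess the script from the block the character sits in."""
    return "latin" if ord(ch) in LATIN_EXTENDED_E else None

def name_based_guess(ch):
    """Guess the script from the first word of the character's name."""
    return unicodedata.name(ch).split()[0].lower()

ch = "\uab65"
print(unicodedata.name(ch))
print(block_based_guess(ch), name_based_guess(ch))
```

The two guesses disagree for this character, which is the kind of deviation a block-derived table produces.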
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 17:03 ` Lars Ingebrigtsen
@ 2022-01-16 17:50 ` Lars Ingebrigtsen
2022-01-16 18:18 ` Eli Zaretskii
2022-01-16 18:14 ` Eli Zaretskii
1 sibling, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-16 17:50 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Lars Ingebrigtsen <larsi@gnus.org> writes:
> So we should be populating char-script-table from
> http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt instead of
> Blocks.txt. So I'll be doing that, too.
Hm, well, that'd be difficult to do in a backwards compatible way -- for
instance, there's stuff in Emacs that depends on things mapping to
`symbol', which isn't really a thing in Scripts.txt.
So I guess the Scripts.txt file will have to be parsed in addition, and
into a new char table.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 16:33 ` Lars Ingebrigtsen
2022-01-16 16:44 ` Eli Zaretskii
@ 2022-01-16 17:53 ` Achim Gratz
2022-01-17 17:13 ` Lars Ingebrigtsen
1 sibling, 1 reply; 123+ messages in thread
From: Achim Gratz @ 2022-01-16 17:53 UTC (permalink / raw)
To: 51733
Lars Ingebrigtsen writes:
> Eli Zaretskii <eliz@gnu.org> writes:
>> Then what about text-security.el? or textsec.el?
>
> Yes, that'd work. Or... string-analysis.el? With functions like
> `string-scripts' (lists the different scripts in the string) as well as
> the higher-level functions... Hm...
Since you're trying to harden against homograph / homoglyph attacks, why
not mention it on the tin? Besides URL and eMail addresses, it would
probably be useful for checking source code (where the language allows
unicode identifiers), in this case it should also (optionally) warn
about non-normalized sequences.
Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+
Wavetables for the Waldorf Blofeld:
http://Synth.Stromeko.net/Downloads.html#BlofeldUserWavetables
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 17:03 ` Lars Ingebrigtsen
2022-01-16 17:50 ` Lars Ingebrigtsen
@ 2022-01-16 18:14 ` Eli Zaretskii
2022-01-16 18:24 ` Eli Zaretskii
1 sibling, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-16 18:14 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Sun, 16 Jan 2022 18:03:23 +0100
>
> https://www.unicode.org/reports/tr24/tr24-32.html#Scripts_and_Blocks
>
> As a result, using the block names as simplistic substitute for
> script identity generally leads to poor results.
>
> It looks like we're doing that, though?
No, not really. We collect various blocks of the same scripts
together.
> And indeed:
>
> (elt char-script-table #xAB65)
> => latin
>
> which is wrong, because that's
>
> GREEK LETTER SMALL CAPITAL OMEGA
>
> So we should be populating char-script-table from
> http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt instead of
> Blocks.txt. So I'll be doing that, too.
Beware: the Unicode Script property is not identical to ours! Before
throwing away what we have, please consider how many deviations we
have in practice, and if they are just a few, let's fix only them
individually. It's easy. You will have to add some manual heuristics
even if you do use the Unicode Scripts.txt as the basis.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 17:50 ` Lars Ingebrigtsen
@ 2022-01-16 18:18 ` Eli Zaretskii
2022-01-17 8:59 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-16 18:18 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Sun, 16 Jan 2022 18:50:27 +0100
>
> So I guess the Scripts.txt file will have to be parsed in addition, and
> into a new char table.
Why can't we use our char-script-table? how different is it from what
Unicode wants?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 18:14 ` Eli Zaretskii
@ 2022-01-16 18:24 ` Eli Zaretskii
2022-01-16 18:34 ` Andreas Schwab
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-16 18:24 UTC (permalink / raw)
To: larsi; +Cc: 51733, jidanni
> Date: Sun, 16 Jan 2022 20:14:08 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
>
> > (elt char-script-table #xAB65)
> > => latin
> >
> > which is wrong, because that's
> >
> > GREEK LETTER SMALL CAPITAL OMEGA
Btw, this is not necessarily an error, because the Latin language did
have the omega letter. It's not an accident this character is in a
Latin block.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 18:24 ` Eli Zaretskii
@ 2022-01-16 18:34 ` Andreas Schwab
2022-01-16 18:44 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Andreas Schwab @ 2022-01-16 18:34 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, larsi, jidanni
On Jan 16 2022, Eli Zaretskii wrote:
>> Date: Sun, 16 Jan 2022 20:14:08 +0200
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
>>
>> > (elt char-script-table #xAB65)
>> > => latin
>> >
>> > which is wrong, because that's
>> >
>> > GREEK LETTER SMALL CAPITAL OMEGA
>
> Btw, this is not necessarily an error, because the Latin language did
> have the omega letter. It's not an accident this character is in a
> Latin block.
The latin omega has its own code points U+A7B6 and U+A7B7 (since Unicode
8.0).
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 18:34 ` Andreas Schwab
@ 2022-01-16 18:44 ` Eli Zaretskii
0 siblings, 0 replies; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-16 18:44 UTC (permalink / raw)
To: Andreas Schwab; +Cc: 51733, larsi, jidanni
> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: larsi@gnus.org, 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Sun, 16 Jan 2022 19:34:29 +0100
>
> On Jan 16 2022, Eli Zaretskii wrote:
>
> >> Date: Sun, 16 Jan 2022 20:14:08 +0200
> >> From: Eli Zaretskii <eliz@gnu.org>
> >> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> >>
> >> > (elt char-script-table #xAB65)
> >> > => latin
> >> >
> >> > which is wrong, because that's
> >> >
> >> > GREEK LETTER SMALL CAPITAL OMEGA
> >
> > Btw, this is not necessarily an error, because the Latin language did
> > have the omega letter. It's not an accident this character is in a
> > Latin block.
>
> The latin omega has its own code points U+A7B6 and U+A7B7 (since Unicode
> 8.0).
Yes, I know. But U+AB65 predates Unicode 8.0.
And it's beside the point, really: since omega was in the Latin
alphabet, it is not a mistake to give it the Latin script.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-16 18:18 ` Eli Zaretskii
@ 2022-01-17 8:59 ` Lars Ingebrigtsen
2022-01-17 10:18 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-17 8:59 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
>> So I guess the Scripts.txt file will have to be parsed in addition, and
>> into a new char table.
>
> Why can't we use our char-script-table? how different is it from what
> Unicode wants?
Well, as the Unicode web page says -- using Blocks to determine the
script is just, well, wrong. (Or "inaccurate", if you want.) So using
it will give both false positives and negatives.
In addition, that table assumes that each character belongs to a single
script, which is also wrong. So I'm making a new table based on
Scripts.txt and ScriptExtensions.txt.
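Why a single script per character is not enough can be shown with a small sketch of the set-intersection idea from UTS #39, in Python for illustration. The table is toy data entered by hand for this example (a real one would be generated from Scripts.txt and ScriptExtensions.txt), the names are hypothetical, and the sketch omits the extra augmentation of CJK script sets that the specification describes.

```python
# Toy data for illustration only: character -> set of scripts it may belong to.
SCRIPT_SETS = {
    "a": {"Latn"},
    "\u3042": {"Hira"},          # HIRAGANA LETTER A
    "\u30a2": {"Kana"},          # KATAKANA LETTER A
    "\u30fc": {"Hira", "Kana"},  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
}

def resolved_script_set(text):
    """Intersect the script sets of all known characters in TEXT.

    Returns None when no character constrained the result, and an empty
    set when the characters share no common script (mixed-script text).
    """
    resolved = None
    for ch in text:
        scripts = SCRIPT_SETS.get(ch)
        if scripts is None:
            continue  # not in the toy table: treated as unconstrained
        resolved = set(scripts) if resolved is None else resolved & scripts
    return resolved

print(resolved_script_set("\u3042\u30fc"))  # hiragana plus the shared mark
print(resolved_script_set("a\u30fc"))       # Latin plus the shared mark
```

With a one-script-per-character table, the shared prolonged sound mark would have to be filed under one of the two scripts and would make legitimate text in the other look mixed.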
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-17 8:59 ` Lars Ingebrigtsen
@ 2022-01-17 10:18 ` Eli Zaretskii
2022-01-17 14:54 ` Lars Ingebrigtsen
2022-01-17 15:22 ` Eli Zaretskii
0 siblings, 2 replies; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-17 10:18 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
On January 17, 2022 10:59:36 AM GMT+02:00, Lars Ingebrigtsen <larsi@gnus.org> wrote:
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> So I guess the Scripts.txt file will have to be parsed in addition, and
> >> into a new char table.
> >
> > Why can't we use our char-script-table? how different is it from what
> > Unicode wants?
>
> Well, as the Unicode web page says -- using Blocks to determine the
> script is just, well, wrong. (Or "inaccurate", if you want.) So using
> it will give both false positives and negatives.
Yes, I understand the general concern, but I'm asking how serious is this in practice. Can you tell?
> In addition, that table assumes that each character belongs to a single
> script, which is also wrong. So I'm making a new table based on
> Scripts.txt and ScriptExtensions.txt.
It is confusing to have 2 separate properties of a character that are subtly incompatible, and for such obscure properties at that. It will be a source of many problems. So I think we should avoid that if it's feasible. Can we please discuss any real problems that would be caused by using the existing char-table?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-17 10:18 ` Eli Zaretskii
@ 2022-01-17 14:54 ` Lars Ingebrigtsen
2022-01-17 16:47 ` Eli Zaretskii
2022-01-17 15:22 ` Eli Zaretskii
1 sibling, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-17 14:54 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> Yes, I understand the general concern, but I'm asking how serious is
> this in practice. Can you tell?
I don't know how to quantify that. We're talking about security
mechanisms, and they should be reliable.
(But, yes, the differences are massive, especially in the Asian parts of
the data.)
>> In addition, that table assumes that each character belongs to a single
>> script, which is also wrong. So I'm making a new table based on
>> Scripts.txt and ScriptExtensions.txt.
>
> It is confusing to have 2 separate properties of a character that are
> subtly incompatible, and for such obscure properties at that. It will
> be a source of many problems. So I think we should avoid that if it's
> feasible. Can we please discuss any real problems that would be
> caused by using the existing char-table?
It's impossible to implement the Unicode security recommendations based
on the Blocks.txt data -- it's that simple.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-17 10:18 ` Eli Zaretskii
2022-01-17 14:54 ` Lars Ingebrigtsen
@ 2022-01-17 15:22 ` Eli Zaretskii
2022-01-17 15:25 ` Lars Ingebrigtsen
2022-01-17 15:53 ` Lars Ingebrigtsen
1 sibling, 2 replies; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-17 15:22 UTC (permalink / raw)
To: larsi; +Cc: 51733
> Date: Mon, 17 Jan 2022 12:18:44 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
>
> It is confusing to have 2 separate properties of a character that are subtly incompatible, and for such obscure properties at that. It will be a source of many problems. So I think we should avoid that if it's feasible. Can we please discuss any real problems that would be caused by using the existing char-table?
I've now written a Lisp program to produce script property according to
Unicode vs what we have in char-script-table, so it's possible to see
all the differences.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-17 15:22 ` Eli Zaretskii
@ 2022-01-17 15:25 ` Lars Ingebrigtsen
2022-01-17 15:53 ` Lars Ingebrigtsen
1 sibling, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-17 15:25 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> I've now written a Lisp program to produce script property according to
> Unicode vs what we have in char-script-table, so it's possible to see
> all the differences.
There's also lisp/international/uni-scripts.el now. 😀
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-17 15:22 ` Eli Zaretskii
2022-01-17 15:25 ` Lars Ingebrigtsen
@ 2022-01-17 15:53 ` Lars Ingebrigtsen
2022-01-17 16:31 ` Lars Ingebrigtsen
2022-01-17 16:52 ` Eli Zaretskii
1 sibling, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-17 15:53 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
I'm now looking at 5.3 Mixed-Number Detection:
U+09EA ( ৪ ) BENGALI DIGIT FOUR can be confused with U+0038 ( 8 )
DIGIT EIGHT.
Right, but they recommend implementing this by looking at the digit
version of the character first... but... Does Emacs have a function to
get the number value of ৪? (Which should be 8. 😀) They then
recommend comparing the value with the zero value of that system, and
I'm pretty sure we don't have that.
I don't quite understand why it's not sufficient to see that we have
numbers from two different numbering systems (which is trivial by
looking at the Nd category and then comparing the scripts).
Does anybody understand why they're doing this in a much more convoluted
manner here? I must be missing something:
https://www.unicode.org/reports/tr39/#Mixed_Number_Detection
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-17 15:53 ` Lars Ingebrigtsen
@ 2022-01-17 16:31 ` Lars Ingebrigtsen
2022-01-17 16:52 ` Eli Zaretskii
1 sibling, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-17 16:31 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Lars Ingebrigtsen <larsi@gnus.org> writes:
> Does anybody understand why they're doing this in a much more convoluted
> manner here? I must be missing something:
>
> https://www.unicode.org/reports/tr39/#Mixed_Number_Detection
I think I understand now -- 0-9 are in `common', so they are (by the
definition used in these documents) "the same" script as BENGALI DIGIT
FOUR.
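The zero-based comparison described in that section of TR39 can be sketched as follows, in Python for illustration; the function names are hypothetical. Each decimal digit is mapped to the code point of the zero of its numbering system (the digit's code point minus its numeric value), and a string is flagged when more than one distinct zero turns up. This works even though ASCII digits are in the Common script and so cannot be told apart from Bengali digits by script alone.

```python
import unicodedata

def decimal_zeros(text):
    """Return the set of zero code points of the decimal systems used in TEXT."""
    zeros = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # Digits of one system are contiguous, so subtracting the
            # digit's value gives the code point of that system's zero.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros

def mixed_numbers_p(text):
    """Return True if TEXT mixes digits from more than one decimal system."""
    return len(decimal_zeros(text)) > 1

sample = "\u09ea8"  # BENGALI DIGIT FOUR followed by DIGIT EIGHT
print([hex(z) for z in sorted(decimal_zeros(sample))], mixed_numbers_p(sample))
```

The numeric value lookup that the earlier message asks about corresponds to `unicodedata.decimal` here; whether Emacs has a direct equivalent is not settled in this thread.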
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 16:47 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Mon, 17 Jan 2022 15:54:58 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > Yes, I understand the general concern, but I'm asking how serious is
> > this in practice. Can you tell?
>
> I don't know how to quantify that. We're talking about security
> mechanisms, and they should be reliable.
Well, now that I know the answer, I don't think it's hard to quantify.
But maybe I'm missing something.
> (But, yes, the differences are massive, especially in the Asian parts of
> the data.)
I don't think I understand what you mean by "the Asian parts". Do you
mean the CJK parts where we lump several scripts together into 'han'
and 'kana'?
> It's impossible to implement the Unicode security recommendations based
> on the Blocks.txt data -- it's that simple.
Can you tell more about why it is impossible? If it's a relatively
simple issue of "translating" the Unicode script names into ours, then
it should be quite simple. Since you say it's impossible, I guess
there's some factor(s) here that I miss?
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 16:52 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Mon, 17 Jan 2022 16:53:49 +0100
>
> I'm now looking at 5.3 Mixed-Number Detection:
>
> d U+09EA ( ৪ ) BENGALI DIGIT FOUR can be confused with U+0038 ( 8 )
> DIGIT EIGHT.
>
> Right, but they recommend implementing this by looking at the digit
> version of the character first... but... Does Emacs have a function to
> get the number value of ৪?
Yes, we do have that:
(get-char-code-property ?৪ 'numeric-value) => 4
> They then recommend comparing the value with the zero value of that
> system, and I'm pretty sure we don't have that.
Why not? what do you need, exactly?
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 16:57 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> Yes, we do have that:
>
> (get-char-code-property ?৪ 'numeric-value) => 4
Cool.
>> They then recommend comparing the value with the zero value of that
>> system, and I'm pretty sure we don't have that.
>
> Why not? what do you need, exactly?
Just a thinko -- I was wondering whether we had a way to find the zero
character, but that's just:
(- ?৪ (get-char-code-property ?৪ 'numeric-value))
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 17:02 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Mon, 17 Jan 2022 17:57:46 +0100
>
> >> They then recommend comparing the value with the zero value of that
> >> system, and I'm pretty sure we don't have that.
> >
> > Why not? what do you need, exactly?
>
> Just a thinko -- I was wondering whether we had a way to find the zero
> character, but that's just:
>
> (- ?৪ (get-char-code-property ?৪ 'numeric-value))
That's just sheer luck, AFAIU (there are some characters with
numeric-value property that are not arranged from zero to 9), but
maybe for this particular purpose it's all that's needed.
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 17:04 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> That's just sheer luck, AFAIU (there are some characters with
> numeric-value property that are not arranged from zero to 9), but
> maybe for this particular purpose it's all that's needed.
We're only doing this check for the characters with Nd, which are
guaranteed to be organised this way. (There's only three of these
number systems, allegedly.)
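A minimal sketch of the check discussed above, not the textsec implementation: compute the zero character of every decimal digit in a string and see whether more than one numbering system shows up. The `my-` names are hypothetical, and the code assumes (as stated in this message) that every Nd character sits in a contiguous 0-9 run.

```elisp
(defun my-digit-zeros (string)
  "Return the distinct zero characters of the decimal digits in STRING.
Assumes each Nd character belongs to a contiguous 0-9 run."
  (let (zeros)
    (dolist (char (string-to-list string))
      ;; Only characters with general category Nd are considered.
      (when (eq (get-char-code-property char 'general-category) 'Nd)
        (let ((zero (- char (get-char-code-property
                             char 'decimal-digit-value))))
          (unless (memq zero zeros)
            (push zero zeros)))))
    (nreverse zeros)))

(defun my-mixed-numbers-p (string)
  "Return non-nil if STRING mixes digits from several numbering systems."
  (> (length (my-digit-zeros string)) 1))
```

For a string holding DIGIT EIGHT followed by BENGALI DIGIT FOUR, `my-digit-zeros` yields the two zero characters U+0030 and U+09E6, so `my-mixed-numbers-p` flags it even though both digits count as the "same" script under the `common' rule.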
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 17:09 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> I don't think I understand what you mean by "the Asian parts". Do you
> mean the CJK parts where we lump several scripts together into 'han'
> and 'kana'?
Possibly -- I haven't looked closely.
>> It's impossible to implement the Unicode security recommendations based
>> on the Blocks.txt data -- it's that simple.
>
> Can you tell more about why it is impossible? If it's a relatively
> simple issue of "translating" the Unicode script names into ours, then
> it should be quite simple. Since you say it's impossible, I guess
> there's some factor(s) here that I miss?
Perhaps there's something I'm missing, because it seems self-evident to
me that the Blocks data can't be used for this.
For instance,
(textsec-single-script-p "ޱ﷽")
=> t
but
(elt char-script-table ?ޱ)
=> thaana
(elt char-script-table ?﷽)
=> arabic
I think the Unicode people have the authoritative say here, so
implementing the recommendations seems like the way to go. And it's
less work in the long run, because we can just import the data files and
not try to fix things up manually (like blocks.awk attempts to do).
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 17:13 UTC (permalink / raw)
To: Achim Gratz; +Cc: 51733
Achim Gratz <Stromeko@nexgo.de> writes:
> Since you're trying to harden against homograph / homoglyph attacks, why
> not mention it on the tin? Besides URL and eMail addresses, it would
> probably be useful for checking source code (where the language allows
> unicode identifiers), in this case it should also (optionally) warn
> about non-normalized sequences.
It's not just about homoglyphs (but mostly about that) -- it's also
about classifying strings as to their applicability as identifiers and
stuff.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 17:19 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Mon, 17 Jan 2022 18:09:19 +0100
>
> I think the Unicode people have the authoritative say here, so
> implementing the recommendations seems like the way to go. And it's
> less work in the long run, because we can just import the data files and
> not try to fix things up manually (like blocks.awk attempts to do).
Let's at least call this something other than "script", to avoid
confusion.
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 17:26 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> Let's at least call this something other than "script", to avoid
> confusion.
Sure. But... what. 🤔 I made a slight attempt at that by calling it
"scripts" instead of "script", since each character belongs to a list of
scripts, but it's probably too subtle.
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 17:38 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
I'm looking at the Confusable section now.
https://www.unicode.org/reports/tr39/#Confusable_Detection
Looks easy enough to implement (and the ELPA package already does the
parsing, so I'll be reusing bits from that).
But... I'm wondering what the higher level interface would be? I mean,
quite a lot of strings are confusable with something else, but which
ones are interesting? The only thing that seems immediately interesting
to check for is whether a string is confusable with ASCII?
That is,
(textsec-confusable-with-ascii-p "C𝗂𝗋𝖼𝗅𝖾")
=> t
Because the ASCII characters are the ones that people rely on when doing
... things, like email and browsing the web.
But I mean, "C𝗂𝗋𝖼𝗅𝖾" is confusable with "СігсӀе" (the latter is
Cyrillic), and if you're writing Russian, that might also be
interesting. So perhaps a
(textsec-confusable-with-script-p "C𝗂𝗋𝖼𝗅𝖾" 'cyrillic)
=> t
? But... I'm not sure in which contexts that would actually be vital
to know. Hm.
Anybody have any thoughts here?
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 17:42 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Mon, 17 Jan 2022 18:26:42 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > Let's at least call this something other than "script", to avoid
> > confusion.
>
> Sure. But... what.
I don't know. script-id? script-class? scriptprop? uniscript?
> I made a slight attempt at that by calling it "scripts" instead of
> "script", since each character belongs to a list of scripts
Does it? UAX#24 says no:
The Script property is an enumerated property of type catalog. Its
values form a full partition of the codespace: every Unicode code
point is assigned a single Script property value. This value is
either the explicit value for a specific script, such as Cyrillic,
or is one of the following three special values:
. Inherited—for characters that may be used with multiple scripts,
and that inherit their script from a preceding base
character. These include nonspacing combining marks and
enclosing combining marks, as well as U+200C ZERO WIDTH
NON-JOINER and U+200D ZERO WIDTH JOINER.
. Common—for other characters that may be used with multiple
scripts.
. Unknown—for unassigned, private-use, noncharacter, and surrogate
code points.
This seems to say that each character has only a single script
property value assigned to it?
* bug#51733: 27.1; Detect impossible email addresses better
From: 積丹尼 Dan Jacobson @ 2022-01-17 17:43 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, Lars Ingebrigtsen
OK, stay safe, beware of Ο,
and unsubscribe me from all these details. Thanks.
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 17:46 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
>> I made a slight attempt at that by calling it "scripts" instead of
>> "script", since each character belongs to a list of scripts
>
> Does it? UAX#24 says no:
At least in this context. See ScriptExtensions.txt and TR39.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 17:48 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Mon, 17 Jan 2022 18:38:48 +0100
>
> I'm looking at the Confusable section now.
>
> https://www.unicode.org/reports/tr39/#Confusable_Detection
>
> Looks easy enough to implement (and the ELPA package already does the
> parsing, so I'll be reusing bits from that).
>
> But... I'm wondering what the higher level interface would be? I mean,
> quite a lot of strings are confusable with something else, but which
> ones are interesting? The only thing that seems immediately interesting
> to check for is whether a string is confusable with ASCII?
>
> That is,
>
> (textsec-confusable-with-ascii-p "C𝗂𝗋𝖼𝗅𝖾")
> => t
>
> Because the ASCII characters are the ones that people rely on when doing
> ... things, like email and browsing the web.
>
> But I mean, "C𝗂𝗋𝖼𝗅𝖾" is confusable with "СігсӀе" (the latter is
> Cyrillic), and if you're writing Russian, that might also be
> interesting. So perhaps a
>
> (textsec-confusable-with-script-p "C𝗂𝗋𝖼𝗅𝖾" 'cyrillic)
> => t
>
> ? But... I'm not sure in which contexts that would actually be vital
> to know. Hm.
I think we should first determine what kinds of applications may need
this, and take it from there. The initial number of "confusability
with" classes can be very small, and we can add more as we discover
interesting use cases. The full number is pretty much infinite, I
think, but I'm not sure Emacs needs to support all of them OOTB. We
could support some of the popular ones, and provide infrastructure for
developing more.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 19:06 UTC (permalink / raw)
To: 積丹尼 Dan Jacobson; +Cc: 51733, larsi
> From: 積丹尼 Dan Jacobson <jidanni@jidanni.org>
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, 51733@debbugs.gnu.org
> Date: Tue, 18 Jan 2022 01:43:45 +0800
>
> and unsubscribe me from all these details. Thanks.
No way! You started this, so now you pay the price.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 19:08 UTC (permalink / raw)
To: larsi; +Cc: 51733
> Date: Mon, 17 Jan 2022 19:48:01 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
>
> I think we should first determine what kinds of applications may need
> this
By that I meant: confusables in URL, confusables in email addresses,
etc.
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 20:22 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
I'm not quite sure I understand this bit here
https://www.unicode.org/reports/tr39/#Confusable_Detection
---
For an input string X, define skeleton(X) to be the following transformation on the string:
1. Convert X to NFD format, as described in [UAX15].
2. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters.
3. Reapply NFD.
---
I mean, that sounds OK in and of itself, but then:
---
X and Y are single-script confusables if and only if they are confusable, and their resolved script sets have at least one element in common.
Examples: “ǉeto” and “ljeto” in Latin (the Croatian word for “summer”), where the first word uses only four codepoints, the first of which is U+01C9 (ǉ) LATIN SMALL LETTER LJ.
---
But:
(ucs-normalize-NFD-string "ǉeto")
=> "ǉeto"
So according to that algo "ǉeto" and "ljeto" are not confusable.
But if we use NFKD instead, they are:
(ucs-normalize-NFKD-string "ǉeto")
=> "ljeto"
It seems unlikely to be a typo in this document, surely? But NFKD seems
to make a whole lot more sense than NFD for this usage. I must be
missing or misreading something.
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-18 8:40 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Lars Ingebrigtsen <larsi@gnus.org> writes:
> I must be missing or misreading something.
Yes, indeed. I missed that the point of the confusable table was to do
the ǉ -> lj mapping. Doh.
(Well, one of the points.)
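A minimal sketch of the skeleton transformation as quoted from TR39 (NFD, map each character to its prototype, NFD again). The two-entry alist is a hypothetical stand-in for the real confusables data, and the `my-` names are assumptions, not the textsec API.

```elisp
(require 'ucs-normalize)

;; Hypothetical stand-in for the confusables table: character -> prototype.
(defvar my-confusable-prototypes
  '((#x01C9 . "lj")   ; LATIN SMALL LETTER LJ
    (#x0430 . "a"))   ; CYRILLIC SMALL LETTER A
  "Alist mapping a character to its prototype string.")

(defun my-skeleton (string)
  "Return a TR39-style skeleton of STRING.
Uses `my-confusable-prototypes' as the prototype data."
  (ucs-normalize-NFD-string
   (mapconcat (lambda (char)
                ;; Fall back to the character itself when it has no prototype.
                (or (cdr (assq char my-confusable-prototypes))
                    (string char)))
              (ucs-normalize-NFD-string string)
              "")))

(defun my-confusable-p (a b)
  "Return non-nil if A and B have the same skeleton."
  (string= (my-skeleton a) (my-skeleton b)))
```

With this table, the four-codepoint "ǉeto" and the five-codepoint "ljeto" get the same skeleton, which is the step that NFD alone does not provide.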
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-18 11:26 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Next stupid question:
---
It must not contain any stateful bidirectional format characters.
That is, no [:bidicontrol:] except for the LRM, RLM, and ALM, since the bidirectional controls could influence the ordering of characters outside the quotes.
---
We don't have the :bidicontrol: regexp class. Do we have another way to
classify bidi control characters? They have class Cf, but so do many
other non-bidi control characters...
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-18 11:37 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Lars Ingebrigtsen <larsi@gnus.org> writes:
> We don't have the :bidicontrol: regexp class. Do we have another way to
> classify bidi control characters? They have class Cf, but so do many
> other non-bidi control characters...
I guess it's
(get-char-code-property ?\N{LEFT-TO-RIGHT ISOLATE} 'bidi-class)
combined with whether it's a control character?
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-18 11:44 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Lars Ingebrigtsen <larsi@gnus.org> writes:
> I guess it's
>
> (get-char-code-property ?\N{LEFT-TO-RIGHT ISOLATE} 'bidi-class)
>
> combined with whether it's a control character?
No, that doesn't really help here:
(get-char-code-property ?\N{LEFT-TO-RIGHT MARK} 'bidi-class)
=> L
Hm...
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-18 12:00 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Lars Ingebrigtsen <larsi@gnus.org> writes:
> No, that doesn't really help here:
>
> (get-char-code-property ?\N{LEFT-TO-RIGHT MARK} 'bidi-class)
> => L
>
> Hm...
OK, there's glyphless--bidi-control-characters, and I could make that
non-private, and add the three missing ones...
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-18 12:47 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
OK, I think the textsec stuff is basically 90% implemented now. (So
according to custom, there's at least 90% left.)
The next step would be to make other packages use this. For instance,
when shr displays a suspicious URL, it could mark it in red (and perhaps
add a warning icon), and have a tooltip that describes in which way it's
suspicious.
I think the places it would make sense to hook this machinery in would
be in:
* shr (displaying URLs and links)
* Gnus/rmail (displaying email addresses)
* Message (when responding to mail; a prompt "do you really?")
* browse-url (prompt)
Feel free to add to the list.
There should probably be a customization point? A user option like
`warn-about-suspicious-identifiers'? (Better name would be nice.) And
then a utility function that would return a propertised string with the
warning, perhaps, so that all the callers don't have to do so much work.
So shr/Gnus/rmail could use
(possibly-add-warning-about-suspiciousness string) to do that, and if
the user has switched the user option off, textsec isn't loaded at all.
(Since it loads so much data, some people might prefer not to.)
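A rough sketch of the customization point and helper proposed above. Every name here is hypothetical, and the checker is passed in as a function so that nothing is called (and no data is loaded) when the option is off.

```elisp
(defcustom my-warn-about-suspicious-identifiers t
  "Non-nil means mark suspicious identifiers when displaying them.
Hypothetical option name, for illustration only."
  :type 'boolean
  :group 'convenience)

(defun my-possibly-add-warning (string checker)
  "Return STRING, propertized with a warning if CHECKER flags it.
CHECKER is called with STRING and should return nil or a string
explaining why STRING is suspicious.  CHECKER is not called at all
when `my-warn-about-suspicious-identifiers' is nil."
  (let ((reason (and my-warn-about-suspicious-identifiers
                     (funcall checker string))))
    (if reason
        ;; The explanation becomes the tooltip text.
        (propertize string 'face 'warning 'help-echo reason)
      string)))
```

A caller such as shr or Gnus would pass its own checker; passing the checker as a symbol for an autoloaded function would keep the data unloaded until the first check is actually made.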
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-18 12:51 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Lars Ingebrigtsen <larsi@gnus.org> writes:
> The next step would be to make other packages use this.
(But I'm taking the rest of the day off, and possibly tomorrow, too, so
if somebody else wants to tinker with this, please do go ahead.)
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-18 14:55 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Tue, 18 Jan 2022 12:26:30 +0100
>
> Next stupid question:
>
> ---
> It must not contain any stateful bidirectional format characters.
>
> That is, no [:bidicontrol:] except for the LRM, RLM, and ALM, since the bidirectional controls could influence the ordering of characters outside the quotes.
> ---
>
> We don't have the :bidicontrol: regexp class. Do we have another way to
> classify bidi control characters? They have class Cf, but so do many
> other non-bidi control characters...
I don't think you need any classification: the offending control
characters are very few, so you could just test for them explicitly.
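A minimal sketch of such an explicit test. The list holds the embedding, override and isolate controls (U+202A..U+202E and U+2066..U+2069) and leaves out the stateless LRM, RLM and ALM; the `my-` names are hypothetical.

```elisp
(require 'seq)

(defconst my-stateful-bidi-controls
  '(#x202A #x202B #x202C #x202D #x202E   ; LRE RLE PDF LRO RLO
    #x2066 #x2067 #x2068 #x2069)         ; LRI RLI FSI PDI
  "Bidi control characters that affect text beyond themselves.")

(defun my-has-stateful-bidi-control-p (string)
  "Return t if STRING contains a stateful bidi control character."
  (and (seq-some (lambda (char)
                   (memq char my-stateful-bidi-controls))
                 string)
       t))
```

A string containing only LEFT-TO-RIGHT MARK passes this test, while one containing LEFT-TO-RIGHT ISOLATE is flagged.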
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-18 14:59 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Tue, 18 Jan 2022 13:00:42 +0100
>
> OK, there's glyphless--bidi-control-characters, and I could make that
> non-private, and add the three missing ones...
I don't think that's what you want, because AFAIU that includes LRM,
RLM, and ALM, which are stateless.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-18 15:05 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Tue, 18 Jan 2022 13:47:35 +0100
>
> I think the places it would make sense to hook this machinery in would
> be in:
>
> * shr (displaying URLs and links)
> * Gnus/rmail (displaying email addresses)
> * Message (when responding to mail; a prompt "do you really?")
> * browse-url (prompt)
Sounds reasonable.
Perhaps also Tramp (host names)?
> There should probably be a customization point? A user option like
> `warn-about-suspicious-identifiers'?
Is this a go/no-go test, or are there levels? If there are levels,
perhaps something similar to NSM would be more appropriate? (And
maybe levels of NSM should determine the default textsec level?)
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-18 18:44 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Tue, 18 Jan 2022 13:51:38 +0100
>
> Lars Ingebrigtsen <larsi@gnus.org> writes:
>
> > The next step would be to make other packages use this.
>
> (But I'm taking the rest of the day off, and possibly tomorrow, too, so
> if somebody else wants to tinker with this, please do go ahead.)
Does textsec-email-suspicious-p expect non-ASCII email addresses to be
RFC 2047 encoded? If so, it will not work in the Rmail display
buffers, where email addresses are shown decoded. For non-ASCII names
the function signals an error.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-18 18:48 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
These two tests seem to reveal a bug in the implementation:
(should (textsec-name-suspicious-p
"\N{LEFT-TO-RIGHT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen"))
(should (textsec-name-suspicious-p
"\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen")))
LRM and RLM are stateless controls, so they shouldn't be flagged as
suspicious, AFAIU.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-18 20:15 UTC (permalink / raw)
To: larsi; +Cc: 51733
> Date: Tue, 18 Jan 2022 20:48:46 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 51733@debbugs.gnu.org
>
> These two tests seem to reveal a bug in the implementation:
>
> (should (textsec-name-suspicious-p
> "\N{LEFT-TO-RIGHT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen"))
> (should (textsec-name-suspicious-p
> "\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen")))
>
> LRM and RLM are stateless controls, so they shouldn't be flagged as
> suspicious, AFAIU.
I think I get it now: it's because of textsec-suspicious-nonspacing-p,
which forbids consecutive nonspacing characters, right? But then I
don't think it's correct to consider Cf characters for that purpose:
UTS#39 explicitly talks about nonspacing _marks_, i.e. Mn and Me
characters. Where did you see Cf and Cc as well?
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-18 20:31 UTC (permalink / raw)
To: larsi; +Cc: 51733
> Date: Tue, 18 Jan 2022 22:15:39 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 51733@debbugs.gnu.org
>
> > LRM and RLM are stateless controls, so they shouldn't be flagged as
> > suspicious, AFAIU.
>
> I think I get it now: it's because of textsec-suspicious-nonspacing-p,
> which forbids consecutive nonspacing characters, right? But then I
> don't think it's correct to consider Cf characters for that purpose:
> UTS#39 explicitly talks about nonspacing _marks_, i.e. Mn and Me
> characters. Where did you see Cf and Cc as well?
Including Cf characters in this suspicious category is also
problematic because the ZERO WIDTH characters (like ZWJ and ZWNJ) are
Cf, and it is not reasonable to limit the use of those, as some
scripts (like Arabic, for example) use them quite a lot.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-18 18:44 ` Eli Zaretskii
@ 2022-01-19 9:21 ` Robert Pluim
2022-01-19 9:26 ` Lars Ingebrigtsen
2022-01-19 11:53 ` Eli Zaretskii
2022-01-19 9:25 ` Lars Ingebrigtsen
1 sibling, 2 replies; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 9:21 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, Lars Ingebrigtsen
>>>>> On Tue, 18 Jan 2022 20:44:51 +0200, Eli Zaretskii <eliz@gnu.org> said:
>> From: Lars Ingebrigtsen <larsi@gnus.org>
>> Cc: 51733@debbugs.gnu.org
>> Date: Tue, 18 Jan 2022 13:51:38 +0100
>>
>> Lars Ingebrigtsen <larsi@gnus.org> writes:
>>
>> > The next step would be to make other packages use this.
>>
>> (But I'm taking the rest of the day off, and possibly tomorrow, too, so
>> if somebody else wants to tinker with this, please do go ahead.)
Eli> Does textsec-email-suspicious-p expect non-ASCII email addresses to be
Eli> RFC 2047 encoded? If so, it will not work in the Rmail display
Eli> buffers, where email addresses are shown decoded. For non-ASCII names
Eli> the function signals an error.
It does? Do you have an example? The following works fine
ELISP> (textsec-email-suspicious-p "rpluimм <rpluimм@gmail.com>")
=> "`rpluimм' isn't restrictive enough"
Although I think that message should say something like
"Mailbox name contains non-ASCII characters"
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-18 18:44 ` Eli Zaretskii
2022-01-19 9:21 ` Robert Pluim
@ 2022-01-19 9:25 ` Lars Ingebrigtsen
2022-01-19 11:51 ` Eli Zaretskii
1 sibling, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 9:25 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> Does textsec-email-suspicious-p expect non-ASCII email addresses to be
> RFC 2047 encoded?
Yes.
> If so, it will not work in the Rmail display buffers, where email
> addresses are shown decoded. For non-ASCII names the function signals
> an error.
Rmail does have access to the encoded header, so it'll just have to call
the textsec function before it decodes it (and displays it).
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 9:21 ` Robert Pluim
@ 2022-01-19 9:26 ` Lars Ingebrigtsen
2022-01-19 10:12 ` Robert Pluim
2022-01-19 11:53 ` Eli Zaretskii
1 sibling, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 9:26 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733
Robert Pluim <rpluim@gmail.com> writes:
> ELISP> (textsec-email-suspicious-p "rpluimм <rpluimм@gmail.com>")
> => "`rpluimм' isn't restrictive enough"
>
> Although I think that message should say something like
>
> "Mailbox name contains non-ASCII characters"
But it's fine for mailbox names to be non-ASCII.
(textsec-email-suspicious-p "rpluimм <м@gmail.com>")
=> nil
It's just various combinations of ... things ... that are suspicious.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 9:26 ` Lars Ingebrigtsen
@ 2022-01-19 10:12 ` Robert Pluim
2022-01-19 10:27 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 10:12 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
>>>>> On Wed, 19 Jan 2022 10:26:42 +0100, Lars Ingebrigtsen <larsi@gnus.org> said:
Lars> Robert Pluim <rpluim@gmail.com> writes:
ELISP> (textsec-email-suspicious-p "rpluimм <rpluimм@gmail.com>")
>> => "`rpluimм' isn't restrictive enough"
>>
>> Although I think that message should say something like
>>
>> "Mailbox name contains non-ASCII characters"
Lars> But it's fine for mailbox names to be non-ASCII.
Lars> (textsec-email-suspicious-p "rpluimм <м@gmail.com>")
Lars> => nil
Lars> It's just various combinations of ... things ... that are suspicious.
OK, but the error message could be better, no?
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 10:12 ` Robert Pluim
@ 2022-01-19 10:27 ` Lars Ingebrigtsen
2022-01-19 10:42 ` Robert Pluim
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 10:27 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733
Robert Pluim <rpluim@gmail.com> writes:
> OK, but the error message could be better, no?
Sure, but what? (And it's not an error message, it's information about
something that looks like it might be odd.)
Summarising Unicode® Technical Standard #39 in one line isn't easy.
We can go all vague, like "Something is wrong", or we can go long, like
"It's not all-ASCII, and it's not single script, and it's not a mixture
of arabic armenian bengali bopomofo devanagari ethiopic georgian
gujarati gurmukhi hangul han hebrew hiragana katakana kannada khmer lao
malayalam myanmar oriya sinhala tamil telugu thaana thai tibetan latin,
and it's not a latin/han/korea/japan mixture". (And I probably forgot
some bits.)
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 10:27 ` Lars Ingebrigtsen
@ 2022-01-19 10:42 ` Robert Pluim
2022-01-19 13:46 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 10:42 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
>>>>> On Wed, 19 Jan 2022 11:27:57 +0100, Lars Ingebrigtsen <larsi@gnus.org> said:
Lars> Robert Pluim <rpluim@gmail.com> writes:
>> OK, but the error message could be better, no?
Lars> Sure, but what? (And it's not an error message, it's information about
Lars> something that looks like it might be odd.)
Lars> Summarising Unicode® Technical Standard #39 in one line isn't easy.
Lars> We can go all vague, like "Something is wrong", or we can go long, like
Lars> "It's not all-ASCII, and it's not single script, and it's not a mixture
Lars> of arabic armenian bengali bopomofo devanagari ethiopic georgian
Lars> gujarati gurmukhi hangul han hebrew hiragana katakana kannada khmer lao
Lars> malayalam myanmar oriya sinhala tamil telugu thaana thai tibetan latin,
Lars> and it's not a latin/han/korea/japan mixture". (And I probably forgot
Lars> some bits.)
How about "Contains suspicious characters or mix of characters"? That
would at least point users in the right direction.
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 9:25 ` Lars Ingebrigtsen
@ 2022-01-19 11:51 ` Eli Zaretskii
2022-01-19 12:54 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 11:51 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 10:25:42 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > Does textsec-email-suspicious-p expect non-ASCII email addresses to be
> > RFC 2047 encoded?
>
> Yes.
>
> > If so, it will not work in the Rmail display buffers, where email
> > addresses are shown decoded. For non-ASCII names the function signals
> > an error.
>
> Rmail does have access to the encoded header, so it'll just have to call
> the textsec function before it decodes it (and displays it).
This is unfortunate. It means, for example, that a simple lazy
discovery of suspicious addresses by scanning the email reading buffer
with regular expressions will not work, and the feature must instead
scan the original mbox buffer.
Why cannot we lift this restriction? mail-header-parse-address is not
the only way to parse email addresses. Or maybe we could encode the
email address if the original one causes an error?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 9:21 ` Robert Pluim
2022-01-19 9:26 ` Lars Ingebrigtsen
@ 2022-01-19 11:53 ` Eli Zaretskii
2022-01-19 12:49 ` Robert Pluim
1 sibling, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 11:53 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733, larsi
> From: Robert Pluim <rpluim@gmail.com>
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 10:21:22 +0100
>
> Eli> Does textsec-email-suspicious-p expect non-ASCII email addresses to be
> Eli> RFC 2047 encoded? If so, it will not work in the Rmail display
> Eli> buffers, where email addresses are shown decoded. For non-ASCII names
> Eli> the function signals an error.
>
> It does? Do you have an example? The following works fine
Here:
(textsec-email-suspicious-p "אבגד <foo@bar.com>")
=> (wrong-type-argument stringp nil)
with this backtrace:
Debugger entered--Lisp error: (wrong-type-argument stringp nil)
string-search("=?" nil)
rfc2047-decode-string(nil)
mail-header-parse-address("אבגד <foo@bar.com>" t)
textsec-email-suspicious-p("אבגד <foo@bar.com>")
(progn (textsec-email-suspicious-p "אבגד <foo@bar.com>"))
eval((progn (textsec-email-suspicious-p "אבגד <foo@bar.com>")) t)
elisp--eval-last-sexp(t)
eval-last-sexp(t)
eval-print-last-sexp(nil)
funcall-interactively(eval-print-last-sexp nil)
call-interactively(eval-print-last-sexp nil nil)
command-execute(eval-print-last-sexp)
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 11:53 ` Eli Zaretskii
@ 2022-01-19 12:49 ` Robert Pluim
2022-01-19 12:56 ` Lars Ingebrigtsen
2022-01-19 12:58 ` Eli Zaretskii
0 siblings, 2 replies; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 12:49 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, larsi
>>>>> On Wed, 19 Jan 2022 13:53:59 +0200, Eli Zaretskii <eliz@gnu.org> said:
Eli> Here:
Eli> (textsec-email-suspicious-p "אבגד <foo@bar.com>")
Eli> => (wrong-type-argument stringp nil)
Eli> with this backtrace:
Eli> Debugger entered--Lisp error: (wrong-type-argument stringp nil)
Eli> string-search("=?" nil)
Eli> rfc2047-decode-string(nil)
Eli> mail-header-parse-address("אבגד <foo@bar.com>" t)
Eli> textsec-email-suspicious-p("אבגד <foo@bar.com>")
Eli> (progn (textsec-email-suspicious-p "אבגד <foo@bar.com>"))
Eli> eval((progn (textsec-email-suspicious-p "אבגד <foo@bar.com>")) t)
Eli> elisp--eval-last-sexp(t)
Eli> eval-last-sexp(t)
Eli> eval-print-last-sexp(nil)
Eli> funcall-interactively(eval-print-last-sexp nil)
Eli> call-interactively(eval-print-last-sexp nil nil)
Eli> command-execute(eval-print-last-sexp)
mail-header-parse-address assumes that the display name or the local
name starts with a (subset of) ASCII. The following doesnʼt signal an
error:
(textsec-email-suspicious-p "דגבאa <foo@bar.com>")
Since itʼs now open season on display names and mailbox names, the
following might be enough. Lars?
diff --git a/lisp/mail/ietf-drums.el b/lisp/mail/ietf-drums.el
index 4a07959189..1885f958ba 100644
--- a/lisp/mail/ietf-drums.el
+++ b/lisp/mail/ietf-drums.el
@@ -217,7 +217,7 @@ ietf-drums-parse-address
(push (buffer-substring
(1+ (point)) (progn (forward-sexp 1) (1- (point))))
display-name))
- ((looking-at (concat "[" ietf-drums-atext-token "@" "]"))
+ ((not (eq c ?<))
(push (buffer-substring (point) (progn (forward-sexp 1) (point)))
display-name))
((eq c ?<)
@@ -240,7 +240,7 @@ ietf-drums-parse-address
(cons
(mapconcat #'identity (nreverse display-name) "")
(ietf-drums-get-comment string)))
- (cons mailbox (if decode
+ (cons mailbox (if (and decode display-string)
(rfc2047-decode-string display-string)
display-string))))))
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-18 15:05 ` Eli Zaretskii
@ 2022-01-19 12:49 ` Michael Albinus
2022-01-19 12:59 ` Eli Zaretskii
2022-01-19 13:35 ` Lars Ingebrigtsen
1 sibling, 1 reply; 123+ messages in thread
From: Michael Albinus @ 2022-01-19 12:49 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, Lars Ingebrigtsen
Eli Zaretskii <eliz@gnu.org> writes:
Hi Eli,
>> I think the places it would make sense to hook this machinery in would
>> be in:
>>
>> * shr (displaying URLs and links)
>> * Gnus/rmail (displaying email addresses)
>> * Message (when responding to mail; a prompt "do you really?")
>> * browse-url (prompt)
>
> Sounds reasonable.
>
> Perhaps also Tramp (host names)?
--8<---------------cut here---------------start------------->8---
(defconst tramp-host-regexp "[[:alnum:]_.%-]+"
"Regexp matching host names.")
;; The following regexp is a bit sloppy. But it shall serve our
;; purposes. It covers also IPv4 mapped IPv6 addresses, like in
;; "::ffff:192.168.0.1".
(defconst tramp-ipv6-regexp "\\(?:[[:alnum:]]*:\\)+[[:alnum:].]+"
"Regexp matching IPv6 addresses.")
--8<---------------cut here---------------end--------------->8---
This should be sufficient, shouldn't it?
Best regards, Michael.
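As a quick illustration of what the host regexp rules out, here is a small
sketch. It assumes the regexp is applied to the whole host string;
`my-host-regexp' is a hypothetical anchored copy made for this example, not a
Tramp variable:

```elisp
;; Hypothetical anchored copy of the host regexp quoted above.
(defconst my-host-regexp "\\`[[:alnum:]_.%-]+\\'"
  "Host regexp from the message above, anchored to the whole string.")

;; A plain ASCII host name matches from position 0.
(string-match-p my-host-regexp "gcssafetyace.com")        ; 0

;; The same name with a trailing ZERO WIDTH SPACE does not match.
(string-match-p my-host-regexp "gcssafetyace.com\u200b")  ; nil
```

This only shows that the zero-width character from the original report is
rejected; it says nothing about look-alike letters from other scripts.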
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 11:51 ` Eli Zaretskii
@ 2022-01-19 12:54 ` Lars Ingebrigtsen
2022-01-19 13:01 ` Eli Zaretskii
2022-01-19 13:36 ` Andreas Schwab
0 siblings, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 12:54 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> This is unfortunate. It means, for example, that a simple lazy
> discovery of suspicious addresses by scanning the email reading buffer
> with regular expressions will not work, and the feature must instead
> scan the original mbox buffer.
There's no scanning -- rmail displays the From header, right? So it
does decoding before displaying the header. It has to do the textsec
stuff first, too.
> Why cannot we lift this restriction? mail-header-parse-address is not
> the only way to parse email addresses. Or maybe we could encode the
> email address if the original one causes an error?
There is no reliable way to parse a decoded mail address, and since this
is a security thing, we don't want to do DWIM and guesses (which is what
you have to do when composing a valid email address from a string like
"Fóo, Jr. <foo@example.com>").
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:49 ` Robert Pluim
@ 2022-01-19 12:56 ` Lars Ingebrigtsen
2022-01-19 13:00 ` Lars Ingebrigtsen
2022-01-19 13:03 ` Eli Zaretskii
2022-01-19 12:58 ` Eli Zaretskii
1 sibling, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 12:56 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733
Robert Pluim <rpluim@gmail.com> writes:
> (textsec-email-suspicious-p "דגבאa <foo@bar.com>")
That is not a valid email address.
> Since itʼs now open season on display names and mailbox names, the
> following might be enough. Lars?
No, that function parses well-formed email addresses, as defined by the
standards. It does not do any kind of DWIM or guesswork, and it
shouldn't.
(textsec-email-suspicious-p "דגבאa <foo@bar.com>")
shouldn't bug out, though -- it should instead say that the string is
suspicious because it's not well-formed as an email address.
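One way to get that behaviour is to trap the parse error and turn it into a
reason string. This is only a sketch under assumptions: `my-email-suspicious-p'
is a hypothetical wrapper, and the checker is passed in as an argument so the
example does not depend on any particular textsec function name:

```elisp
;; Hypothetical wrapper; CHECKER is whatever predicate the caller uses.
(defun my-email-suspicious-p (checker string)
  "Call CHECKER on STRING; report a signalled error as a reason string."
  (condition-case err
      (funcall checker string)
    (error
     (format "`%s' is not a well-formed email address (%s)"
             string (error-message-string err)))))

;; Stand-in for a checker that fails the way the backtrace in this
;; thread does, with (wrong-type-argument stringp nil).
(my-email-suspicious-p
 (lambda (_string) (signal 'wrong-type-argument '(stringp nil)))
 "foo")
```

With a checker that returns normally, the wrapper passes its result through
unchanged, so well-formed addresses are handled as before.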
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:49 ` Robert Pluim
2022-01-19 12:56 ` Lars Ingebrigtsen
@ 2022-01-19 12:58 ` Eli Zaretskii
2022-01-19 13:02 ` Lars Ingebrigtsen
1 sibling, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 12:58 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733, larsi
> From: Robert Pluim <rpluim@gmail.com>
> Cc: larsi@gnus.org, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 13:49:20 +0100
>
> mail-header-parse-address assumes that the display name or the local
> name starts with a (subset of) ASCII.
Is that expectation reasonable? I can show you many email addresses
that violate that.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:49 ` Michael Albinus
@ 2022-01-19 12:59 ` Eli Zaretskii
0 siblings, 0 replies; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 12:59 UTC (permalink / raw)
To: Michael Albinus; +Cc: 51733, larsi
> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 13:49:56 +0100
>
> > Perhaps also Tramp (host names)?
>
> --8<---------------cut here---------------start------------->8---
> (defconst tramp-host-regexp "[[:alnum:]_.%-]+"
> "Regexp matching host names.")
>
> ;; The following regexp is a bit sloppy. But it shall serve our
> ;; purposes. It covers also IPv4 mapped IPv6 addresses, like in
> ;; "::ffff:192.168.0.1".
> (defconst tramp-ipv6-regexp "\\(?:[[:alnum:]]*:\\)+[[:alnum:].]+"
> "Regexp matching IPv6 addresses.")
> --8<---------------cut here---------------end--------------->8---
>
> This should be sufficient, shouldn't it?
If this isn't too restrictive, sure.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:56 ` Lars Ingebrigtsen
@ 2022-01-19 13:00 ` Lars Ingebrigtsen
2022-01-19 13:03 ` Eli Zaretskii
1 sibling, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:00 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733
Lars Ingebrigtsen <larsi@gnus.org> writes:
> No, that function parses well-formed email addresses, as defined by the
> standards. It does not do any kind of DWIM or guesswork, and it
> shouldn't.
(The function that tries to parse a random mail-like string as if it
were a mail address is `mail-header-parse-address-lax'.)
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:54 ` Lars Ingebrigtsen
@ 2022-01-19 13:01 ` Eli Zaretskii
2022-01-19 13:06 ` Lars Ingebrigtsen
2022-01-19 13:36 ` Andreas Schwab
1 sibling, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:01 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 13:54:37 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > This is unfortunate. It means, for example, that a simple lazy
> > discovery of suspicious addresses by scanning the email reading buffer
> > with regular expressions will not work, and the feature must instead
> > scan the original mbox buffer.
>
> There's no scanning -- rmail displays the From header, right? So it
> does decoding before displaying the header. It has to do the textsec
> stuff first, too.
Not if textsec is optional, it doesn't.
And I think your mental model of how Rmail presents the email in the
reading buffer is not accurate.
> > Why cannot we lift this restriction? mail-header-parse-address is not
> > the only way to parse email addresses. Or maybe we could encode the
> > email address if the original one causes an error?
>
> There is no reliable way to parse a decoded mail address, and since this
> is a security thing, we don't want to do DWIM and guesses (which is what
> you have to do when composing a valid email address from a string like
> "Fóo, Jr. <foo@example.com>").
I think Robert just suggested a way?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:58 ` Eli Zaretskii
@ 2022-01-19 13:02 ` Lars Ingebrigtsen
2022-01-19 13:06 ` Eli Zaretskii
2022-01-19 13:39 ` Robert Pluĭm
0 siblings, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:02 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, Robert Pluim
Eli Zaretskii <eliz@gnu.org> writes:
> Is that expectation reasonable? I can show you many email addresses
> that violate that.
It depends on what you mean. There are no valid email addresses that
have non-ASCII name parts -- when we're talking wire format (RFC2047
etc), which is what that function is parsing.
But displayed email addresses may have any characters, of course.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:56 ` Lars Ingebrigtsen
2022-01-19 13:00 ` Lars Ingebrigtsen
@ 2022-01-19 13:03 ` Eli Zaretskii
1 sibling, 0 replies; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:03 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, rpluim
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: Eli Zaretskii <eliz@gnu.org>, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 13:56:33 +0100
>
> Robert Pluim <rpluim@gmail.com> writes:
>
> > (textsec-email-suspicious-p "דגבאa <foo@bar.com>")
>
> That is not a valid email address.
??? My INBOX is full of mail from people with such "invalid"
addresses.
What is not valid about it?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:02 ` Lars Ingebrigtsen
@ 2022-01-19 13:06 ` Eli Zaretskii
2022-01-19 13:10 ` Lars Ingebrigtsen
2022-01-19 13:39 ` Robert Pluĭm
1 sibling, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:06 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, rpluim
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: Robert Pluim <rpluim@gmail.com>, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 14:02:56 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > Is that expectation reasonable? I can show you many email addresses
> > that violate that.
>
> It depends on what you mean. There are no valid email addresses that
> have non-ASCII name parts -- when we're talking wire format (RFC2047
> etc), which is what that function is parsing.
>
> But displayed email addresses may have any characters, of course.
I _am_ talking about the displayed format. It would be better if
textsec supported those as well, because they are ubiquitous in
Emacs. E.g., what if someone sends me a citation from someone else's
email, and I want to textsec-check that citation? Chances are the
citation will not include RFC2047 encoded addresses.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:01 ` Eli Zaretskii
@ 2022-01-19 13:06 ` Lars Ingebrigtsen
2022-01-19 13:11 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:06 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
>> There's no scanning -- rmail displays the From header, right? So it
>> does decoding before displaying the header. It has to do the textsec
>> stuff first, too.
>
> Not if textsec is optional, it doesn't.
I don't understand what you mean here. rmail will call
(decorate-suspicious-email from) and then insert the result into the
buffer. If textsec is switched off, it'll just return `from' as is.
> And I think your mental model of how Rmail presents the email in the
> reading buffer is not accurate.
Here's what it does today:
;; Decode any RFC2047 encoded message headers.
(if rmail-enable-mime
(with-current-buffer rmail-view-buffer
(rfc2047-decode-region
(point-min)
(progn
(search-forward "\n\n" nil 'move)
(point))))))
It'll just have to call
(insert (rfc2047-decode-string (decorate-suspicious-email (substring ...))))
instead.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:06 ` Eli Zaretskii
@ 2022-01-19 13:10 ` Lars Ingebrigtsen
2022-01-19 13:21 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:10 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, rpluim
Eli Zaretskii <eliz@gnu.org> writes:
> I _am_ talking about the displayed format. It would be better if
> textsec supported those as well, because they are ubiquitous in
> Emacs.
And, again, these functions as implemented work on the protocol level,
because that's the interesting thing here. "Is this From: header
suspicious?" That can only be determined reliably if we don't get any
DWIM involved.
> E.g., what if someone sends me a citation from someone else's
> email, and I want to textsec-check that citation? Chances are the
> citation will not include RFC2047 encoded addresses.
You can, of course, add all kinds of things to try to guess whether other
things in other places in Emacs are suspicious or not, but that is not
what these functions I've written do.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:06 ` Lars Ingebrigtsen
@ 2022-01-19 13:11 ` Eli Zaretskii
2022-01-19 13:16 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:11 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 14:06:56 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> There's no scanning -- rmail displays the From header, right? So it
> >> does decoding before displaying the header. It has to do the textsec
> >> stuff first, too.
> >
> > Not if textsec is optional, it doesn't.
>
> I don't understand what you mean here. rmail will call
> (decorate-suspicious-email from) and then insert the result into the
> buffer. If textsec is switched off, it'll just return `from' as is.
But From is not the only place where a suspicious address could hide.
It could also be in the body, or in the quotation parts. We cannot
rely on header decoding alone to do this job well.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:11 ` Eli Zaretskii
@ 2022-01-19 13:16 ` Lars Ingebrigtsen
2022-01-19 13:25 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:16 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> But From is not the only place where a suspicious address could hide.
> It could also be in the body, or in the quotation parts. We cannot
> rely on header decoding alone to do this job well.
The scope of the relevant implemented functions is to determine if the
(on-wire) mail headers are suspicious or not, and do so reliably. We
can add a slew of other functions for other types of DWIM
suspiciousness, of course, but that's outside the remit.
(For instance, if you wish to implement a filter that looks for
suspicious emails, you'd typically find anything that looks like an
email, see whether it can be RFC2047 encoded, encode it, and then call
the -email-suspicious-p function.)
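A rough sketch of such a filter step follows. It assumes that
`rfc2047-encode-string' from rfc2047.el is an acceptable way to get back to
the wire format; `my-check-decoded-address' is a hypothetical helper, and the
checker is again passed in rather than named:

```elisp
(require 'rfc2047)

;; Hypothetical helper; CHECKER is the predicate for encoded addresses.
(defun my-check-decoded-address (checker address)
  "Encode the decoded ADDRESS per RFC 2047, then call CHECKER on it.
Return the symbol `unencodable' if encoding signals an error."
  (let ((encoded (condition-case nil
                     (rfc2047-encode-string address)
                   (error nil))))
    (if encoded
        (funcall checker encoded)
      'unencodable)))

;; `identity' stands in for the real predicate, to show what it would
;; be handed: an encoded display name followed by the ASCII mailbox.
(my-check-decoded-address #'identity "אבגד <foo@bar.com>")
```

The charset chosen for the encoded word depends on the local coding-system
setup, so the exact encoded form is not shown here.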
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:10 ` Lars Ingebrigtsen
@ 2022-01-19 13:21 ` Eli Zaretskii
2022-01-19 13:25 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:21 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, rpluim
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rpluim@gmail.com, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 14:10:11 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > E.g., what if someone sends me a citation from someone else's
> > email, and I want to textsec-check that citation? Chances are the
> > citation will not include RFC2047 encoded addresses.
>
> You can, of course, add all kinds of things to try to guess whether other
> things in other places in Emacs are suspicious or not, but that is not
> what these functions I've written do.
I don't understand this stubborn opposition to provide better, more
general APIs to our users. textsec.el is not an application, it is
infrastructure applications should use to provide user-level features.
So any application-level decisions, like at what level to detect
suspicious addresses, is not textsec's bloody business to make!
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:16 ` Lars Ingebrigtsen
@ 2022-01-19 13:25 ` Eli Zaretskii
2022-01-19 13:31 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:25 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 14:16:48 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > But From is not the only place where a suspicious address could hide.
> > It could also be in the body, or in the quotation parts. We cannot
> > rely on header decoding alone to do this job well.
>
> The scope of the relevant implemented functions is to determine if the
> (on-wire) mail headers are suspicious or not, and do so reliably. We
> can add a slew of other functions for other types of DWIM
> suspiciousness, of course, but that's outside the remit.
I disagree with this narrow definition of the scope. textsec is more
general, and should not limit itself to specific wire protocols.
I'm not asking to _replace_ RFC2047 support, I'm saying that we should
also support email addresses that were already decoded, for the use
cases where that could be more convenient or where the wire level is
unavailable. Why would you object to extending these functions so
that they could support decoded email addresses? What harm could that
possibly do?
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:21 ` Eli Zaretskii
@ 2022-01-19 13:25 ` Lars Ingebrigtsen
2022-01-19 13:28 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:25 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, rpluim
Eli Zaretskii <eliz@gnu.org> writes:
> I don't understand this stubborn opposition to provide better, more
> general APIs to our users.
I'm for making functions with well-defined interfaces.
> textsec.el is not an application, it is infrastructure applications
> should use to provide user-level features. So any application-level
> decisions, like at what level to detect suspicious addresses, is not
> textsec's bloody business to make!
I don't understand what you mean. It's just because the textsec
functions are well-defined that application-level packages can use it
reliably. textsec isn't making any decisions about levels -- that's up
to the callers.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:25 ` Lars Ingebrigtsen
@ 2022-01-19 13:28 ` Eli Zaretskii
0 siblings, 0 replies; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:28 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, rpluim
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rpluim@gmail.com, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 14:25:34 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > I don't understand this stubborn opposition to provide better, more
> > general APIs to our users.
>
> I'm for making functions with well-defined interfaces.
>
> > textsec.el is not an application, it is infrastructure applications
> > should use to provide user-level features. So any application-level
> > decisions, like at what level to detect suspicious addresses, is not
> > textsec's bloody business to make!
>
> I don't understand what you mean. It's just because the textsec
> functions are well-defined that application-level packages can use it
> reliably. textsec isn't making any decisions about levels -- that's up
> to the callers.
You are being unreasonably stubborn here.
I give up.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:25 ` Eli Zaretskii
@ 2022-01-19 13:31 ` Lars Ingebrigtsen
2022-01-19 13:35 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:31 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> I'm not asking to _replace_ RFC2047 support, I'm saying that we should
> also support email addresses that were already decoded, for the use
> cases where that could be more convenient or where the wire level is
> unavailable.
These already exist. The applications can call *-name-suspicious-p
(etc) individually, if they want to.
> Why would you object to extending these functions so that they could
> support decoded email addresses? What harm could that possibly do?
That's the point -- when doing DWIM parsing, the function can't reliably
say whether a string is a suspicious email address, because the attacker
may construct a name part, that when decoded, confuses the address
parser, and thereby escapes domain/local part checking. (Think of
various combinations of names that contain "@" and "," characters.)
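
To make the ordering problem concrete, here is a minimal sketch in Python, using only its standard `email` package rather than the Emacs address parser. The header value, the addresses, and the helper name `decode_words` are made up for illustration. It processes one header in both orders: parse the wire format and then decode the name, versus decode everything and then parse.

```python
from email.header import decode_header
from email.utils import getaddresses

# Hypothetical wire-format header: the encoded name hides "@", "." and ",".
WIRE = "=?utf-8?Q?boss=40bank=2Ecom=2C?= <evil@attacker.example>"


def decode_words(value):
    """Decode RFC 2047 encoded words in VALUE and return plain text."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            data = data.decode(charset or "ascii")
        parts.append(data)
    return "".join(parts)


# Order 1: parse the wire format, then decode only the name part.
wire_first = [(decode_words(name), addr) for name, addr in getaddresses([WIRE])]

# Order 2: decode the whole header, then parse the decoded text.
decoded_first = getaddresses([decode_words(WIRE)])

print(wire_first)
print(decoded_first)
```

In the first order the parser sees a single mailbox, so a domain check looks at the address that replies would actually go to. In the second order the decoded name has turned into a mailbox of its own, and a checker may well inspect that one instead.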
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-18 15:05 ` Eli Zaretskii
2022-01-19 12:49 ` Michael Albinus
@ 2022-01-19 13:35 ` Lars Ingebrigtsen
1 sibling, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:35 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
>> There should probably be a customization point? A user option like
>> `warn-about-suspicious-identifiers'?
>
> Is this a go/no-go test, or are there levels? If there are levels,
> perhaps something similar to NSM would be more appropriate? (And
> maybe levels of NSM should determine the default textsec level?)
I was thinking go/no-go -- I don't immediately see any different levels
of suspiciousness that'd be interesting for the user. But we can tweak
that later, I guess, if necessary.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:31 ` Lars Ingebrigtsen
@ 2022-01-19 13:35 ` Eli Zaretskii
0 siblings, 0 replies; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 13:35 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 14:31:11 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > I'm not asking to _replace_ RFC2047 support, I'm saying that we should
> > also support email addresses that were already decoded, for the use
> > cases where that could be more convenient or where the wire level is
> > unavailable.
>
> These already exist. The applications can call *-name-suspicious-p
> (etc) individually, if they want to.
I don't have a NAME, I have a full email address.
> > Why would you object to extending these functions so that they could
> > support decoded email addresses? What harm could that possibly do?
>
> That's the point -- when doing DWIM parsing
I didn't say DWIM, you did.
> the function can't reliably
> say whether a string is a suspicious email address, because the attacker
> may construct a name part, that when decoded, confuses the address
> parser, and thereby escapes domain/local part checking. (Think of
> various combinations of names that contain "@" and "," characters.)
When the wire format is gone, this is all I have left. You are saying
we should leave this case without a solution. So be it.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 12:54 ` Lars Ingebrigtsen
2022-01-19 13:01 ` Eli Zaretskii
@ 2022-01-19 13:36 ` Andreas Schwab
2022-01-19 13:57 ` Lars Ingebrigtsen
1 sibling, 1 reply; 123+ messages in thread
From: Andreas Schwab @ 2022-01-19 13:36 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
On Jan 19 2022, Lars Ingebrigtsen wrote:
> There's no scanning -- rmail displays the From header, right? So it
> does decoding before displaying the header. It has to do the textsec
> stuff first, too.
I don't think it makes sense to run the textsec check on the encoded
address, since that will always be ASCII-only.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-18 20:15 ` Eli Zaretskii
2022-01-18 20:31 ` Eli Zaretskii
@ 2022-01-19 13:38 ` Lars Ingebrigtsen
1 sibling, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:38 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
> I think I get it now: it's because of textsec-suspicious-nonspacing-p,
> which forbids consecutive nonspacing characters, right?
Yup.
> But then I don't think it's correct to consider Cf characters for that
> purpose: UTS#39 explicitly talks about nonspacing _marks_, i.e. Mn and
> Me characters. Where did you see Cf and Cc as well?
Good catch; I'll amend the code and test.
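
For reference, the distinction can be sketched as follows. This is Python, not the textsec code: the function names are hypothetical, and the two rules (no repeated identical mark, no run of more than four marks) are an assumption based on a reading of UTS#39. Only general categories Mn and Me count as nonspacing marks; Cf and Cc characters are ignored.

```python
import unicodedata


def is_nonspacing_mark(ch):
    """True for general categories Mn and Me only (not Cf or Cc)."""
    return unicodedata.category(ch) in ("Mn", "Me")


def suspicious_nonspacing(text, max_run=4):
    """Flag a repeated identical mark, or a run of more than MAX_RUN marks."""
    run = 0
    previous = None
    for ch in text:
        if not is_nonspacing_mark(ch):
            # Any other character, including Cf ones, ends the current run.
            run, previous = 0, None
            continue
        if ch == previous:
            return True
        run += 1
        if run > max_run:
            return True
        previous = ch
    return False


print(suspicious_nonspacing("e\u0301\u0301"))    # doubled COMBINING ACUTE ACCENT
print(suspicious_nonspacing("a\u200b\u200bb"))   # two ZERO WIDTH SPACE (Cf)
```

With this definition, two consecutive ZERO WIDTH SPACE characters are not reported by the nonspacing check; catching them would be the job of a separate test.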
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:02 ` Lars Ingebrigtsen
2022-01-19 13:06 ` Eli Zaretskii
@ 2022-01-19 13:39 ` Robert Pluĭm
2022-01-19 14:00 ` Lars Ingebrigtsen
1 sibling, 1 reply; 123+ messages in thread
From: Robert Pluĭm @ 2022-01-19 13:39 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
>>>>> On Wed, 19 Jan 2022 14:02:56 +0100, Lars Ingebrigtsen <larsi@gnus.org> said:
Lars> Eli Zaretskii <eliz@gnu.org> writes:
>> Is that expectation reasonable? I can show you many email addresses
>> that violate that.
Lars> It depends on what you mean. There are no valid email addresses that
Lars> have non-ASCII name parts -- when we're talking wire format (RFC2047
Lars> etc), which is what that function is parsing.
I canʼt recall if this is allowed by the standards or not offhand, but
as youʼre probably well aware, the major email providers allow you to
use UTF-8 characters directly in the display name of email addresses,
without using RFC 2047 encoding. In fact, the last time I did any
testing of this, Gmail *replaced* RFC 2047 encoded non-ASCII
characters with their UTF-8 encoding.
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 10:42 ` Robert Pluim
@ 2022-01-19 13:46 ` Lars Ingebrigtsen
2022-01-19 17:18 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:46 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733
Robert Pluim <rpluim@gmail.com> writes:
> How about "Contains suspicious characters or mix of characters"? That
> would at least point users in the right direction.
I'm not 100% sure that it's not misleading in all cases, though. textsec
still doesn't implement "Unicode Identifier and Pattern Syntax":
https://www.unicode.org/reports/tr31/
There's some other stuff in there... But I might be quibbling.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-17 17:48 ` Eli Zaretskii
2022-01-17 19:08 ` Eli Zaretskii
@ 2022-01-19 13:55 ` Lars Ingebrigtsen
2022-01-19 14:14 ` Eli Zaretskii
1 sibling, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:55 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> I think we should first determine what kinds of applications may need
> this, and take it from there. The initial number of "confusability
> with" classes can be very small, and we can add more as we discover
> interesting use cases. The full number is pretty much infinite, I
> think, but I'm not sure Emacs needs to support all of them OOTB. We
> could support some of the popular ones, and provide infrastructure for
> developing more.
Yes.
I was thinking about this bit, which isn't implemented yet (although the
utility functions for it basically are).
----
The process of determining suspect usage of whole-script confusables is
more complicated than simply looking at the scripts of the labels in a
domain name. For example, it can be perfectly legitimate to have
scripts in a SLD (second level domain) not be the same as scripts in a
TLD (top-level domain), such as:
  Cyrillic labels in a domain name with a TLD of .ru or .рф
  Chinese labels in a domain name with a TLD of .com.au or .com
  Cyrillic labels that aren’t confusable with Latin with a TLD of
  .com.au or .com
The following high-level algorithm can be used to determine all scripts
that contain a whole-script confusable with a string X:
  1. Consider Q, the set of all strings confusable with X.
  2. Remove all strings from Q whose resolved script set is ∅ or ALL
     (that is, keep only single-script strings plus those with
     characters only in Common).
  3. Take the union of the resolved script sets of all strings remaining
     in Q.
As usual, this algorithm is intended only as a definition;
implementations should use an optimized routine that produces the same
result.
----
I'm not sure I understand the algorithm they're proposing. I think this
shouldn't be suspicious? But I may be wrong:
(textsec-domain-suspicious-p "Сгсе.рф")
=> nil
But this should be, but isn't currently:
(textsec-domain-suspicious-p "Сгсе.ru")
=> nil
Now,
(textsec-ascii-confusable-p "Сгсе.ru")
=> t
and
(textsec-ascii-confusable-p "Сгсе.рф")
=> nil
Is that what they mean here? I'm finding the logic overly clear here.
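
One simple reading of "confusable with ASCII" is sketched below in Python. It is not how textsec is implemented: the mapping table is a tiny hand-picked subset standing in for confusables.txt, and `ascii_skeleton` and `ascii_confusable` are hypothetical names. A string counts as ASCII-confusable when it is not already plain ASCII and every non-ASCII character in it has an ASCII look-alike.

```python
# Illustrative subset only; the real data lives in Unicode's confusables.txt.
CYRILLIC_TO_ASCII = {
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0433": "r",  # CYRILLIC SMALL LETTER GHE
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def ascii_skeleton(text):
    """Return TEXT with look-alikes mapped to ASCII, or None if one has no mapping."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        elif ch in CYRILLIC_TO_ASCII:
            out.append(CYRILLIC_TO_ASCII[ch])
        else:
            return None
    return "".join(out)


def ascii_confusable(text):
    """True if TEXT is not plain ASCII but could be mistaken for an ASCII string."""
    return not text.isascii() and ascii_skeleton(text) is not None


print(ascii_confusable("\u0421\u0433\u0441\u0435.ru"))            # Cyrillic label, .ru
print(ascii_confusable("\u0421\u0433\u0441\u0435.\u0440\u0444"))  # same label, .рф
```

Under this reading the .рф variant is not confusable, because CYRILLIC SMALL LETTER EF has no ASCII look-alike in the table, while the .ru variant maps cleanly to "Crce.ru".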
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-18 14:59 ` Eli Zaretskii
@ 2022-01-19 13:56 ` Lars Ingebrigtsen
0 siblings, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:56 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
Eli Zaretskii <eliz@gnu.org> writes:
>> OK, there's glyphless--bidi-control-characters, and I could make that
>> non-private, and add the three missing ones...
>
> I don't think that's what you want, because AFAIU that includes LRM,
> RLM, and ALM, which are stateless.
Yes, but then we remove those explicitly in the test, so I think that's
OK...
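
The stateless/stateful split can be illustrated with the Unicode bidi classes. The sketch below is Python and only an approximation of the idea, not the Emacs variable or test being discussed; `STATEFUL` and `stateful_bidi_controls` are hypothetical names. LRM, RLM and ALM carry the ordinary classes L, R and AL, so they fall outside the set automatically.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls
# (and the characters that terminate them).
STATEFUL = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def stateful_bidi_controls(text):
    """Return the characters in TEXT that open or close a directional state."""
    return [ch for ch in text if unicodedata.bidirectional(ch) in STATEFUL]


# LRM (U+200E), RLM (U+200F) and ALM (U+061C) are stateless marks.
print(stateful_bidi_controls("abc\u200e\u200f\u061c"))
# RIGHT-TO-LEFT OVERRIDE (U+202E) and POP DIRECTIONAL FORMATTING (U+202C).
print(stateful_bidi_controls("a\u202eb\u202c"))
```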
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:36 ` Andreas Schwab
@ 2022-01-19 13:57 ` Lars Ingebrigtsen
2022-01-19 14:06 ` Andreas Schwab
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 13:57 UTC (permalink / raw)
To: Andreas Schwab; +Cc: 51733
Andreas Schwab <schwab@linux-m68k.org> writes:
> I don't think it makes sense to run the textsec check on the encoded
> address, since that will always be ASCII-only.
We decode the header after parsing it (and before doing the textsec
tests).
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:39 ` Robert Pluĭm
@ 2022-01-19 14:00 ` Lars Ingebrigtsen
2022-01-19 14:10 ` Robert Pluĭm
2022-01-19 16:08 ` Andreas Schwab
0 siblings, 2 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 14:00 UTC (permalink / raw)
To: Robert Pluĭm; +Cc: 51733
Robert "=?utf-8?Q?Plu=C4=ADm?=" <rpluim@gmail.com> writes:
> I canʼt recall if this is allowed by the standards or not offhand, but
> as youʼre probably well aware, the major email providers allow you to
> use UTF-8 characters directly in the display name of email addresses,
> without using RFC 2047 encoding. In fact, the last time I did any
> testing of this, Gmail *replaced* RFC 2047 encoded non-ASCII
> characters with their UTF-8 encoding.
Gmail expects you to type in characters representing your name -- they
don't expose the wire format. Why should they?
And your address arrived as (wire format):
Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?=
which displays as
Robert =?utf-8?Q?Plu=C4=ADm?=
😀
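
The double encoding is easy to reproduce. The following Python sketch uses only the standard `email.header` module (not Gnus or Gmail code); `decode_once` is a hypothetical helper name. It applies one round of RFC 2047 decoding at a time to the name as it arrived on the wire.

```python
from email.header import decode_header


def decode_once(value):
    """Apply a single round of RFC 2047 decoding to VALUE."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            data = data.decode(charset or "ascii")
        parts.append(data)
    return "".join(parts)


WIRE_NAME = "=?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?="

displayed = decode_once(WIRE_NAME)  # what a conforming reader shows
intended = decode_once(displayed)   # what the sender meant

print(displayed)
print(intended)
```

A reader that decodes once, as the standard requires, shows the inner encoded word literally; only a second, non-standard round of decoding recovers the intended name.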
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:57 ` Lars Ingebrigtsen
@ 2022-01-19 14:06 ` Andreas Schwab
2022-01-19 14:09 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Andreas Schwab @ 2022-01-19 14:06 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
On Jan 19 2022, Lars Ingebrigtsen wrote:
> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>> I don't think it makes sense to run the textsec check on the encoded
>> address, since that will always be ASCII-only.
>
> We decode the header after parsing it (and before doing the textsec
> tests).
Then why not allow running textsec on the decoded header directly? If
you have to encode it first you have to do DWIM parsing anyway.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:06 ` Andreas Schwab
@ 2022-01-19 14:09 ` Lars Ingebrigtsen
2022-01-19 14:13 ` Andreas Schwab
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 14:09 UTC (permalink / raw)
To: Andreas Schwab; +Cc: 51733
Andreas Schwab <schwab@linux-m68k.org> writes:
> Then why not allow running textsec on the decoded header directly?
Consider somebody sending you an email containing @", characters in the
name part, and then you decode the address, and then run the parsing
function. The attacker would then have a wide attack surface to trick
the checker into checking the wrong parts of the address.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:00 ` Lars Ingebrigtsen
@ 2022-01-19 14:10 ` Robert Pluĭm
2022-01-19 14:24 ` Lars Ingebrigtsen
2022-01-19 16:08 ` Andreas Schwab
1 sibling, 1 reply; 123+ messages in thread
From: Robert Pluĭm @ 2022-01-19 14:10 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
>>>>> On Wed, 19 Jan 2022 15:00:04 +0100, Lars Ingebrigtsen <larsi@gnus.org> said:
Lars> Robert "=?utf-8?Q?Plu=C4=ADm?=" <rpluim@gmail.com> writes:
>> I canʼt recall if this is allowed by the standards or not offhand, but
>> as youʼre probably well aware, the major email providers allow you to
>> use UTF-8 characters directly in the display name of email addresses,
>> without using RFC 2047 encoding. In fact, the last time I did any
>> testing of this, Gmail *replaced* RFC 2047 encoded non-ASCII
>> characters with their UTF-8 encoding.
Lars> Gmail expects you to type in characters representing your name -- they
Lars> don't expose the wire format. Why should they?
Lars> And your address arrived as (wire format):
Lars> Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?=
Lars> which displays as
Lars> Robert =?utf-8?Q?Plu=C4=ADm?=
Lars> 😀
Double fun. Iʼd manually rfc2047 encoded that before sending it, so
either gnus or Gmail encoded it again :-)
Iʼve turned off the gnus rfc 2047 for this message, let's see what
happens.
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:09 ` Lars Ingebrigtsen
@ 2022-01-19 14:13 ` Andreas Schwab
2022-01-19 14:33 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Andreas Schwab @ 2022-01-19 14:13 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
On Jan 19 2022, Lars Ingebrigtsen wrote:
> Consider somebody sending you an email containing @", characters in the
> name part, and then you decode the address, and then run the parsing
> function. The attacker would then have a wide attack surface to trick
> the checker into checking the wrong parts of the address.
Isn't that the whole point of textsec?
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:55 ` Lars Ingebrigtsen
@ 2022-01-19 14:14 ` Eli Zaretskii
2022-01-19 14:28 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 14:14 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Wed, 19 Jan 2022 14:55:35 +0100
>
> But this should be, but isn't currently:
>
> (textsec-domain-suspicious-p "Сгсе.ru")
> => nil
Why? .ru is a top-level domain, it doesn't affect what should be
before the dot, I think?
If you replace "Сгсе.ru" with "Cгсе.ru", you do get a warning.
> Is that what they mean here?
I'm not sure I understand the purpose of finding which scripts
"contain a whole-script confusable with a string X". What are we
supposed to do with the resulting list?
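
The difference between the two spellings above (all-Cyrillic label versus Latin "C" followed by Cyrillic letters) is a mixed-script one, which can be sketched as below. This is a rough Python approximation, not the textsec implementation: it guesses a letter's script from the first word of its Unicode character name, whereas UTS#39 works with the Script_Extensions property, and both function names are hypothetical.

```python
import unicodedata


def letter_scripts(label):
    """Approximate the scripts of LABEL's letters by Unicode name prefix."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def mixed_script_labels(domain):
    """Return the dot-separated labels of DOMAIN that mix several scripts."""
    return [label for label in domain.split(".") if len(letter_scripts(label)) > 1]


print(mixed_script_labels("C\u0433\u0441\u0435.ru"))       # Latin C + Cyrillic
print(mixed_script_labels("\u0421\u0433\u0441\u0435.ru"))  # all-Cyrillic label
```

Each label is examined on its own, so a Cyrillic second-level label under a Latin TLD is not reported by this check; only the label that mixes scripts internally is.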
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:10 ` Robert Pluĭm
@ 2022-01-19 14:24 ` Lars Ingebrigtsen
2022-01-19 14:30 ` Robert Pluim
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 14:24 UTC (permalink / raw)
To: Robert Pluĭm; +Cc: 51733
Robert Pluĭm <rpluim@gmail.com> writes:
> Double fun. Iʼd manually rfc2047 encoded that before sending it, so
> either gnus or Gmail encoded it again :-)
Of course it was encoded again -- =?utf-8?Q?Plu=C4=ADm?= is a perfectly
fine, if unusual, name. 😀 I think it's Bobby Tables's brother?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:14 ` Eli Zaretskii
@ 2022-01-19 14:28 ` Lars Ingebrigtsen
2022-01-19 14:57 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 14:28 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> Why? .ru is a top-level domain, it doesn't affect what should be
> before the dot, I think?
>
> If you replace "Сгсе.ru" with "Cгсе.ru", you do get a warning.
Yes. But "Сгсе.ru" is a whole-script confusable with "Crce.ru", and is
therefore suspicious.
>> Is that what they mean here?
>
> I'm not sure I understand the purpose of finding which scripts
> "contain a whole-script confusable with a string X". What are we
> supposed to do with the resulting list?
I think this standard was written by somebody with a PhD in Philosophy,
and not a programmer, so the language is very high falutin'.
So they're not actually suggesting that a list should be made, but the
result should be mathematically equivalent with the result of the
mathematical algorithm described. I just don't understand what he's
saying here.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:24 ` Lars Ingebrigtsen
@ 2022-01-19 14:30 ` Robert Pluim
2022-01-19 14:36 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 14:30 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
>>>>> On Wed, 19 Jan 2022 15:24:54 +0100, Lars Ingebrigtsen <larsi@gnus.org> said:
Lars> Robert Pluĭm <rpluim@gmail.com> writes:
>> Double fun. Iʼd manually rfc2047 encoded that before sending it, so
>> either gnus or Gmail encoded it again :-)
Lars> Of course it was encoded again -- =?utf-8?Q?Plu=C4=ADm?= is a perfectly
Lars> fine, if unusual, name. 😀 I think it's Bobby Tables's brother?
Or his cousin.
Based on the message buffer I have, gnus didnʼt encode it again, so it
must have been gmail. Of course that header was ascii-only, so why did
they encode again?
What did the From: look like on the message you've just replied to?
Gnus should not have encoded it, and it contained utf-8 in the display
name.
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:13 ` Andreas Schwab
@ 2022-01-19 14:33 ` Lars Ingebrigtsen
2022-01-19 14:39 ` Andreas Schwab
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 14:33 UTC (permalink / raw)
To: Andreas Schwab; +Cc: 51733
Andreas Schwab <schwab@linux-m68k.org> writes:
> On Jan 19 2022, Lars Ingebrigtsen wrote:
>
>> Consider somebody sending you an email containing @", characters in the
>> name part, and then you decode the address, and then run the parsing
>> function. The attacker would then have a wide attack surface to trick
>> the checker into checking the wrong parts of the address.
>
> Isn't that the whole point of textsec?
It's perfectly valid to have a
From: "larsi@example.com" <larsi@other.com>
address. It's unambiguous, and the responses will go to
larsi@other.com.
Of course, it's... suspicious... but not on the Unicode level. (I'll
also be adding some non-Unicode bits to textsec, like
<a href="http://foo.bar">http://other.bar</a>.)
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:30 ` Robert Pluim
@ 2022-01-19 14:36 ` Lars Ingebrigtsen
2022-01-19 14:43 ` Robert Pluim
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 14:36 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733
Robert Pluim <rpluim@gmail.com> writes:
> Based on the message buffer I have, gnus didnʼt encode it again, so it
> must have been gmail. Of course that header was ascii-only, so why did
> they encode again?
RFC2047 isn't just about ASCII -- it's about a bunch of other unsafe
characters, like =, which will trigger encoding of (naked) words that
contain those characters.
> What did the From: look like on the message you've just replied to?
> Gnus should not have encoded it, and it contained utf-8 in the display
> name.
I included the From in wire format and displayed format already.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:33 ` Lars Ingebrigtsen
@ 2022-01-19 14:39 ` Andreas Schwab
2022-01-19 14:44 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Andreas Schwab @ 2022-01-19 14:39 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
On Jan 19 2022, Lars Ingebrigtsen wrote:
> Andreas Schwab <schwab@linux-m68k.org> writes:
>
>> On Jan 19 2022, Lars Ingebrigtsen wrote:
>>
>>> Consider somebody sending you an email containing @", characters in the
>>> name part, and then you decode the address, and then run the parsing
>>> function. The attacker would then have a wide attack surface to trick
>>> the checker into checking the wrong parts of the address.
>>
>> Isn't that the whole point of textsec?
>
> It's perfectly valid to have a
>
> From: "larsi@example.com" <larsi@other.com>
>
> address. It's unambiguous, and the responses will go to
> larsi@other.com.
What's your point?
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:36 ` Lars Ingebrigtsen
@ 2022-01-19 14:43 ` Robert Pluim
0 siblings, 0 replies; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 14:43 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
>>>>> On Wed, 19 Jan 2022 15:36:34 +0100, Lars Ingebrigtsen <larsi@gnus.org> said:
Lars> Robert Pluim <rpluim@gmail.com> writes:
>> Based on the message buffer I have, gnus didnʼt encode it again, so it
>> must have been gmail. Of course that header was ascii-only, so why did
>> they encode again?
Lars> RFC2047 isn't just about ASCII -- it's about a bunch of other unsafe
Lars> characters, like =, which will trigger encoding of (naked) words that
Lars> contain those characters.
>> What did the From: look like on the message you've just replied to?
>> Gnus should not have encoded it, and it contained utf-8 in the display
>> name.
Lars> I included the From in wire format and displayed format already.
From the original message where I rfc2047 encoded it myself, yes. The
one after I didnʼt encode manually, and turned off the gnus encoding.
Looking at the bug archive, for Message-ID: <874k5zolz0.fsf@gmail.com>
we have:
From: Robert Pluĭm <rpluim@gmail.com>
with no rfc2047 in sight. Gmail is weird, let's go shopping :-)
Robert
--
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:39 ` Andreas Schwab
@ 2022-01-19 14:44 ` Lars Ingebrigtsen
0 siblings, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 14:44 UTC (permalink / raw)
To: Andreas Schwab; +Cc: 51733
Andreas Schwab <schwab@linux-m68k.org> writes:
> What's your point?
You first.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:28 ` Lars Ingebrigtsen
@ 2022-01-19 14:57 ` Eli Zaretskii
2022-01-19 15:45 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 14:57 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Wed, 19 Jan 2022 15:28:51 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > Why? .ru is a top-level domain, it doesn't affect what should be
> > before the dot, I think?
> >
> > If you replace "Сгсе.ru" with "Cгсе.ru", you do get a warning.
>
> Yes. But "Сгсе.ru" is a whole-script confusable with "Crce.ru", and is
> therefore suspicious.
OK, but why do you think "Сгсе.ru" is confusable? The SLD part is
entirely made of single-script characters, and UTS#39 explicitly
allows that:
[...] it can be perfectly legitimate to have scripts in a SLD
(second level domain) not be the same as scripts in a TLD (top-level
domain), such as:
Cyrillic labels in a domain name with a TLD of .ru or .рф
That's your case, isn't it?
> >> Is that what they mean here?
> >
> > I'm not sure I understand the purpose of finding which scripts
> > "contain a whole-script confusable with a string X". What are we
> > supposed to do with the resulting list?
>
> I think this standard was written by somebody with a PhD in Philosophy,
> and not a programmer, so the language is very high falutin'.
>
> So they're not actually suggesting that a list should be made, but the
> result should be mathematically equivalent with the result of the
> mathematical algorithm described. I just don't understand what he's
> saying here.
Regardless of what they are saying, I don't think the above is
suitable for production. I think it should be enough to see whether
there could be confusion with the corresponding ASCII characters from
confusables.txt.
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:57 ` Eli Zaretskii
@ 2022-01-19 15:45 ` Lars Ingebrigtsen
2022-01-19 16:58 ` Eli Zaretskii
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 15:45 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, jidanni
Eli Zaretskii <eliz@gnu.org> writes:
> OK, but why do you think "Сгсе.ru" is confusable? The SLD part is
> entirely made of single-script characters, and UTS#39 explicitly
> allows that:
>
> [...] it can be perfectly legitimate to have scripts in a SLD
> (second level domain) not be the same as scripts in a TLD (top-level
> domain), such as:
>
> Cyrillic labels in a domain name with a TLD of .ru or .рф
>
> That's your case, isn't it?
Yes, indeed. But:
---
For some applications, it is useful to determine if a given input string has any whole-script confusable. For example, the identifier "ѕсоре" using Cyrillic characters would pass the single-script test described in Section 5.2, Restriction-Level Detection, even though it is likely to be a spoof attempt.
---
So "Сгсе.ru" is suspicious in most contexts.
> Regardless of what they are saying, I don't think the above is
> suitable for production. I think it should be enough to see whether
> there could be confusion with the corresponding ASCII characters from
> confusables.txt.
Yes, so that's what I've done now, but... I'd feel slightly better if I
knew what they were actually getting at. I think they're saying that if
"foo" is confusable with anything in any other scripts, then it's
suspicious? But that sounds unworkable. For instance, "circle.ru" is
confusable with "СігсӀе.ru", and perhaps it's suspicious to a Russian,
but I don't see how to make a workable function from that.
Unless we start bringing in locales, and meh.
So perhaps what I've implemented now is sufficient for domains.
Anyway, I've implemented the user option and implemented this in shr, so
we'll see how that goes. If no problems crop up, I'll announce all this
in NEWS and document it in the lispref manual tomorrow.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 14:00 ` Lars Ingebrigtsen
2022-01-19 14:10 ` Robert Pluĭm
@ 2022-01-19 16:08 ` Andreas Schwab
2022-01-19 16:47 ` Robert Pluim
1 sibling, 1 reply; 123+ messages in thread
From: Andreas Schwab @ 2022-01-19 16:08 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, Robert Pluĭm
On Jan 19 2022, Lars Ingebrigtsen wrote:
> And your address arrived as (wire format):
>
> Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?=
>
> which displays as
>
> Robert =?utf-8?Q?Plu=C4=ADm?=
Looks like there is a bug in gnus-read-ephemeral-emacs-bug-group, as it
shows this header line in the raw article:
From: Robert =?utf-8?Q?Plu=C4=ADm?=
<rpluim@gmail.com>
The debbugs.gnu.org web interface gets it right, including the
downloadable mbox contents.
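
The double encoding quoted above can be reproduced with Python's standard email.header module. This is only an illustrative sketch (the helper name decode_once is made up here, and it says nothing about where in Gnus or debbugs the extra encoding happens): one RFC 2047 decoding pass over the wire format yields another encoded word, and a second pass yields the actual name.

```python
from email.header import decode_header


def decode_once(value):
    """Apply a single RFC 2047 decoding pass to a header value."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unencoded runs come back with charset None; treat as ASCII.
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


wire = "Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?="

first = decode_once(wire)    # still contains an encoded word
second = decode_once(first)  # the name as it was meant to appear

print(first)
print(second)
```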
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 16:08 ` Andreas Schwab
@ 2022-01-19 16:47 ` Robert Pluim
2022-01-19 16:51 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 16:47 UTC (permalink / raw)
To: Andreas Schwab; +Cc: 51733, Lars Ingebrigtsen
>>>>> On Wed, 19 Jan 2022 17:08:13 +0100, Andreas Schwab <schwab@linux-m68k.org> said:
Andreas> On Jan 19 2022, Lars Ingebrigtsen wrote:
>> And your address arrived as (wire format):
>>
>> Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?=
>>
>> which displays as
>>
>> Robert =?utf-8?Q?Plu=C4=ADm?=
Andreas> Looks like there is a bug in gnus-read-ephemeral-emacs-bug-group, as it
Andreas> shows this header line in the raw article:
Andreas> From: Robert =?utf-8?Q?Plu=C4=ADm?=
Andreas> <rpluim@gmail.com>
Andreas> The debbugs.gnu.org web interface gets it right, including the
Andreas> downloadable mbox contents.
The downloaded mbox looks correct, but
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#281
shows
From: Robert =?utf-8?Q?Plu=C4=ADm?=
<rpluim <at> gmail.com>
for me. And
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#305 shows
From: Robert Pluĭm <rpluim <at> gmail.com>
which I thought Lars said was not allowed?
Robert
--
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 16:47 ` Robert Pluim
@ 2022-01-19 16:51 ` Lars Ingebrigtsen
2022-01-19 16:57 ` Robert Pluim
0 siblings, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 16:51 UTC (permalink / raw)
To: Robert Pluim; +Cc: 51733, Andreas Schwab
Robert Pluim <rpluim@gmail.com> writes:
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#305 shows
>
> From: Robert Pluĭm <rpluim <at> gmail.com>
>
> which I thought Lars said was not allowed?
A web page can show whatever it wants, surely?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 16:51 ` Lars Ingebrigtsen
@ 2022-01-19 16:57 ` Robert Pluim
0 siblings, 0 replies; 123+ messages in thread
From: Robert Pluim @ 2022-01-19 16:57 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, Andreas Schwab
>>>>> On Wed, 19 Jan 2022 17:51:32 +0100, Lars Ingebrigtsen <larsi@gnus.org> said:
Lars> Robert Pluim <rpluim@gmail.com> writes:
>> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#305 shows
>>
>> From: Robert Pluĭm <rpluim <at> gmail.com>
>>
>> which I thought Lars said was not allowed?
Lars> A web page can show whatever it wants, surely?
Indeed, and the mbox is correct, so Iʼm going to have to retract my
maligning of gmail :-)
Robert
--
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 15:45 ` Lars Ingebrigtsen
@ 2022-01-19 16:58 ` Eli Zaretskii
2022-01-19 18:25 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 16:58 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, jidanni
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Wed, 19 Jan 2022 16:45:29 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > OK, but why do you think "Сгсе.ru" is confusable? The SLD part is
> > entirely made of single-script characters, and UTS#39 explicitly
> > allows that:
> >
> > [...] it can be perfectly legitimate to have scripts in a SLD
> > (second level domain) not be the same as scripts in a TLD (top-level
> > domain), such as:
> >
> > Cyrillic labels in a domain name with a TLD of .ru or .рф
> >
> > That's your case, isn't it?
>
> Yes, indeed. But:
>
> ---
> For some applications, it is useful to determine if a given input string has any whole-script confusable. For example, the identifier "ѕсоре" using Cyrillic characters would pass the single-script test described in Section 5.2, Restriction-Level Detection, even though it is likely to be a spoof attempt.
> ---
>
> So "Сгсе.ru" is suspicious in most contexts.
Right, but the functions we had back then didn't yet support that
part.
> > Regardless of what they are saying, I don't think the above is
> > suitable for production. I think it should be enough to see whether
> > there could be confusion with the corresponding ASCII characters from
> > confusables.txt.
>
> Yes, so that's what I've done now, but... I'd feel slightly better if I
> knew what they were actually getting at. I think they're saying that if
> "foo" is confusable with anything in any other scripts, then it's
> suspicious?
Yes, that's what they meant.
> But that sounds unworkable.  For instance, "circle.ru" is
> confusable with "СігсӀе.ru", and perhaps it's suspicious to a Russian,
> but I don't see how to make a workable function from that.
They've left that to the implementation...
Anyway, I think confusable to ASCII is good enough for Emacs for now.
> So perhaps what I've implemented now is sufficient for domains.
I think it is, yes. It definitely covers a very large chunk of the
problem.
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 13:46 ` Lars Ingebrigtsen
@ 2022-01-19 17:18 ` Eli Zaretskii
2022-01-20 8:36 ` Lars Ingebrigtsen
0 siblings, 1 reply; 123+ messages in thread
From: Eli Zaretskii @ 2022-01-19 17:18 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733, rpluim
> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: Eli Zaretskii <eliz@gnu.org>, 51733@debbugs.gnu.org
> Date: Wed, 19 Jan 2022 14:46:42 +0100
>
> Robert Pluim <rpluim@gmail.com> writes:
>
> > How about "Contains suspicious characters or mix of characters"? That
> > would at least point users in the right direction.
>
> I'm not 100% sure that it's not misleading in all cases, though.  textsec
> still doesn't implement "Unicode Identifier and Pattern Syntax":
>
> https://www.unicode.org/reports/tr31/
>
> There's some other stuff in there... But I might be quibbling.
My suggestion for the diagnostic in this case is:
%s mixes characters from different scripts in suspicious ways
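
For illustration, a rough Python sketch of how such a diagnostic could be produced. It is an approximation under loud assumptions: Python's standard library has no Script property, so the script is guessed from the first word of each character's Unicode name, whereas UTS#39 works from Script_Extensions and restriction levels. The function names are hypothetical and this is not how textsec does it.

```python
import unicodedata


def scripts_in(text):
    """Guess the set of scripts used by the letters in TEXT.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC") stands in for the script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def mixed_script_warning(text):
    """Return the diagnostic if TEXT mixes scripts, else None."""
    if len(scripts_in(text)) > 1:
        return "%s mixes characters from different scripts in suspicious ways" % text
    return None


# "paypal" with a Cyrillic "a" (U+0430) in second position.
print(mixed_script_warning("p\u0430ypal"))
```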
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 16:58 ` Eli Zaretskii
@ 2022-01-19 18:25 ` Lars Ingebrigtsen
0 siblings, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-19 18:25 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733
[-- Attachment #1: Type: text/plain, Size: 129 bytes --]
I'm not quite sure how noticeable we should be making suspicious things.
With the following test file with `M-x eww-open-file':
[-- Attachment #2: Type: image/png, Size: 24175 bytes --]
[-- Attachment #3: Type: text/plain, Size: 407 bytes --]
The ⚠️ will obviously be customisable, but is this generally the amount
of attention we should be aiming for? The mouseover on the ⚠️s has the
explanation for the suspicion -- we could also output that text into the
buffer, but I think that's overkill.
Anybody have an opinion?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
[-- Attachment #4: sus.html --]
[-- Type: text/html, Size: 241 bytes --]
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-19 17:18 ` Eli Zaretskii
@ 2022-01-20 8:36 ` Lars Ingebrigtsen
0 siblings, 0 replies; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-20 8:36 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 51733, rpluim
Eli Zaretskii <eliz@gnu.org> writes:
> My suggestion for the diagnostic in this case is:
>
> %s mixes characters from different scripts in suspicious ways
Now done.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2021-11-10 0:29 bug#51733: 27.1; Detect impossible email addresses better 積丹尼 Dan Jacobson
2021-11-10 0:42 ` Lars Ingebrigtsen
@ 2022-01-20 8:57 ` Lars Ingebrigtsen
2022-01-20 15:25 ` 積丹尼 Dan Jacobson
1 sibling, 1 reply; 123+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-20 8:57 UTC (permalink / raw)
To: 積丹尼 Dan Jacobson; +Cc: 51733
積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:
> Upon sending,
> To: Bob_Norbolwits@GCSsafetyACE.com
> should trigger a warning:
> "You won't get far trying to send mail with ZERO WIDTH SPACE in an address,"
> instead of blundering along and sending to "gcssafetyace.xn--com-7m0a"!!
This has now been fixed in Emacs 29, and is probably the highest
"line number in fix" to "line numbers in report" ratio ever, with 14K
lines of data added, and about 600 lines of code. Congrats!
(So I'm now closing this bug report.)
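
For readers who want to see the original problem in isolation, here is a small Python sketch (not the Emacs fix itself) that flags invisible format characters such as ZERO WIDTH SPACE in an address. The address used is a made-up example, the function names are hypothetical, and treating every character of general category Cf as suspicious is an assumption of this sketch.

```python
import unicodedata


def invisible_characters(address):
    """Return (index, name) for each format-category (Cf) character."""
    return [
        (i, unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(address)
        if unicodedata.category(ch) == "Cf"
    ]


def check_address(address):
    """Return a warning string for a suspicious address, else None."""
    found = invisible_characters(address)
    if found:
        names = ", ".join(name for _, name in found)
        return f"Suspicious address: contains {names}"
    return None


# Hypothetical address with U+200B hidden just before "com".
print(check_address("user@example.\u200bcom"))
```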
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
2022-01-20 8:57 ` Lars Ingebrigtsen
@ 2022-01-20 15:25 ` 積丹尼 Dan Jacobson
0 siblings, 0 replies; 123+ messages in thread
From: 積丹尼 Dan Jacobson @ 2022-01-20 15:25 UTC (permalink / raw)
To: Lars Ingebrigtsen; +Cc: 51733
>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:
LI> This has now been fixed in Emacs 29, and is probably the highest
LI> "line number in fix" to "line numbers in report" ratio ever, with 14K
LI> lines of data added, and about 600 lines of code. Congrats!
Good. Thanks.
^ permalink raw reply [flat|nested] 123+ messages in thread