* bug#51733: 27.1; Detect impossible email addresses better @ 2021-11-10 0:29 積丹尼 Dan Jacobson 2021-11-10 0:42 ` Lars Ingebrigtsen 2022-01-20 8:57 ` Lars Ingebrigtsen 0 siblings, 2 replies; 123+ messages in thread From: 積丹尼 Dan Jacobson @ 2021-11-10 0:29 UTC (permalink / raw) To: 51733 Upon sending, To: Bob_Norbolwits@GCSsafetyACE.com should trigger a warning: "You won't get far trying to send mail with ZERO WIDTH SPACE in an address," instead of blundering along and sending to "gcssafetyace.xn--com-7m0a"!! ^ permalink raw reply [flat|nested] 123+ messages in thread
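A minimal sketch of the check requested in the report above, written in Python (standard unicodedata module only) rather than Emacs Lisp. The helper name invisible_format_chars and the sample address are hypothetical. The idea is simply to flag characters of general category Cf, which includes ZERO WIDTH SPACE, before an address is handed to the mail transport.

```python
import unicodedata


def invisible_format_chars(address):
    """Return (index, character name) pairs for format (Cf) characters."""
    return [
        (i, unicodedata.name(ch, "U+%04X" % ord(ch)))
        for i, ch in enumerate(address)
        if unicodedata.category(ch) == "Cf"
    ]


# Hypothetical address with a ZERO WIDTH SPACE hidden before "com".
suspect = "bob@example.\u200bcom"
for index, name in invisible_format_chars(suspect):
    print("suspicious character at index %d: %s" % (index, name))
```

A real implementation would probably want a narrower or broader character set than "all of Cf"; this only illustrates where such a warning could come from.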
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-10 0:29 bug#51733: 27.1; Detect impossible email addresses better 積丹尼 Dan Jacobson @ 2021-11-10 0:42 ` Lars Ingebrigtsen 2021-11-10 3:34 ` Eli Zaretskii 2022-01-17 17:43 ` 積丹尼 Dan Jacobson 2022-01-20 8:57 ` Lars Ingebrigtsen 1 sibling, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2021-11-10 0:42 UTC (permalink / raw) To: 積丹尼 Dan Jacobson; +Cc: 51733 積丹尼 Dan Jacobson <jidanni@jidanni.org> writes: > Upon sending, > To: Bob_Norbolwits@GCSsafetyACE.com > should trigger a warning: > "You won't get far trying to send mail with ZERO WIDTH SPACE in an address," > instead of blundering along and sending to "gcssafetyace.xn--com-7m0a"!! I guess Emacs should run all email addresses through a check for Unicode confusability and direction markers and all that stuff, too. (Which got a lot of work lately in a display context.) Do we have a predicate somewhere that says whether a string is suspicious based on confusables and r2l markers and stuff? -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-10 0:42 ` Lars Ingebrigtsen @ 2021-11-10 3:34 ` Eli Zaretskii 2021-11-10 4:44 ` Lars Ingebrigtsen 2022-01-17 17:43 ` 積丹尼 Dan Jacobson 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2021-11-10 3:34 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Date: Wed, 10 Nov 2021 01:42:34 +0100 > Cc: 51733@debbugs.gnu.org > > Do we have a predicate somewhere that says whether a string is suspicious > based on confusables and r2l markers and stuff? No. We have the infrastructure for detecting the reordering, though. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-10 3:34 ` Eli Zaretskii @ 2021-11-10 4:44 ` Lars Ingebrigtsen 2021-11-10 13:39 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2021-11-10 4:44 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: >> Do we have a predicate somewhere that says whether a string is suspicious >> based on confusables and r2l markers and stuff? > > No. We have the infrastructure for detecting the reordering, though. I thought I vaguely remembered you writing something in this area in conjunction with some URL stuff some years back, but I don't recall what happened to it. Hm... and there's uni-confusables in GNU ELPA? Should we have that in core instead? (Or in addition.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-10 4:44 ` Lars Ingebrigtsen @ 2021-11-10 13:39 ` Eli Zaretskii 2021-11-11 2:52 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2021-11-10 13:39 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: jidanni@jidanni.org, 51733@debbugs.gnu.org > Date: Wed, 10 Nov 2021 05:44:05 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> Do we have a predicate somewhere that says whether a string is suspicious > >> based on confusables and r2l markers and stuff? > > > > No. We have the infrastructure for detecting the reordering, though. > > I thought I vaguely remembered you writing something in this area in > conjunction with some URL stuff some years back, but I don't recall what > happened to it. I did write it, that's bidi-find-overridden-directionality, which we have since Emacs 25. That is what I meant by "detecting the reordering". Detecting confusables in general is a much broader issue, not limited to bidi reordering alone. > Hm... and there's uni-confusables in GNU ELPA? Should we have that in > core instead? (Or in addition.) We could add that to core, but currently uni-confusables just gives you a char-table which Lisp programs can use to find out whether a given character is a potential confusable. We need applications layers above that, ideally implementing at least part of the recommendations in Unicode's UTS #39 (https://www.unicode.org/reports/tr39/). We should probably first discuss what we want to implement from there, though. How about chiming in to emacs-devel thread "Unicode confusables considered harmful", where Vasilij Schneidermann already asked what we think should be done about these cases? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-10 13:39 ` Eli Zaretskii @ 2021-11-11 2:52 ` Lars Ingebrigtsen 2021-11-11 7:01 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2021-11-11 2:52 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > I did write it, that's bidi-find-overridden-directionality, which we > have since Emacs 25. That is what I meant by "detecting the > reordering". Ah, right. > We could add that to core, but currently uni-confusables just gives > you a char-table which Lisp programs can use to find out whether a > given character is a potential confusable. We need applications > layers above that, ideally implementing at least part of the > recommendations in Unicode's UTS #39 > (https://www.unicode.org/reports/tr39/). It's great to see that somebody's already done the hard work -- now we just have to implement it. 😅 > We should probably first discuss what we want to implement from there, > though. How about chiming in to emacs-devel thread "Unicode > confusables considered harmful", where Vasilij Schneidermann already > asked what we think should be done about these cases? I'm not sure that'd be productive. I think Somebody just has to write a library that exposes the various levels/profiles as defined by TR39, and then we should sprinkle libraries that deal with these issues (url.el, smtpmail.el, message.el) with calls to that library, much along the same lines as the NSM is consulted about network connections. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-11 2:52 ` Lars Ingebrigtsen @ 2021-11-11 7:01 ` Eli Zaretskii 2021-11-11 7:31 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2021-11-11 7:01 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: jidanni@jidanni.org, 51733@debbugs.gnu.org > Date: Thu, 11 Nov 2021 03:52:39 +0100 > > I think Somebody just has to write a library that exposes the > various levels/profiles as defined by TR39, and then we should > sprinkle libraries that deal with these issues (url.el, smtpmail.el, > message.el) with calls to that library, much along the same lines as > the NSM is consulted about network connections. That'd be fine, of course. Is Somebody around? please speak up if you are. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-11 7:01 ` Eli Zaretskii @ 2021-11-11 7:31 ` Lars Ingebrigtsen 2022-01-16 15:47 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2021-11-11 7:31 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > That'd be fine, of course. Is Somebody around? please speak up if you > are. Sometimes that Somebody is me. I think it looks like a fun little project -- it's so refreshing to have an actual spec to program against. 😸 And I've read most of the TS now, so it's just a small matter of typing. But I probably won't have the time this week -- if somebody else wants to get in on the action, please do go ahead. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2021-11-11 7:31 ` Lars Ingebrigtsen @ 2022-01-16 15:47 ` Lars Ingebrigtsen 2022-01-16 16:03 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-16 15:47 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Lars Ingebrigtsen <larsi@gnus.org> writes: > I think it looks like a fun little project -- it's so refreshing to have > an actual spec to program against. 😸 And I've read most of the TS > now, so it's just a small matter of typing. > > But I probably won't have the time this week -- if somebody else wants > to get in on the action, please do go ahead. Well, it took longer to find time to start this, but I think now's a good time. So we'll be importing a handful of new Unicode data files, and I think an interface like (suspicious-email-p "C𝗂𝗋𝖼𝗅𝖾@example.com") => "Confusables used in address: 𝗂 (MATHEMATICAL SANS-SERIF SMALL I) confusable with etc etc" would be nice. But there's also a bunch of lower level functions that might be nice to expose separately, like (single-script-p "Сirсlе") => nil but it'd be nice to group these in a single package name. But I'm coming up blank. I mean, `unicode-suspicious-email-p' would be nonsensical, because ... it's not really Unicode that's the point here. For instance, if you have a link text like http://innocent.org but the link goes to http://evil.com, then it'd be nice to implement something for that, too, in this same package. Or http://paypaI.com, for that matter. So does anybody have an idea for a package name, so I can start typing away at this? 😀 -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 15:47 ` Lars Ingebrigtsen @ 2022-01-16 16:03 ` Eli Zaretskii 2022-01-16 16:09 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-16 16:03 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Sun, 16 Jan 2022 16:47:21 +0100 > > but it'd be nice to group these in a single package name. But I'm > coming up blank. I mean, `unicode-suspicious-email-p' would be > nonsensical, because ... it's not really Unicode that's the point here. > For instance, if you have a link text like http://innocent.org but the > link goes to http://evil.com, then it'd be nice to implement something > for that, too, in this same package. Or http://paypaI.com, for that > matter. > > So does anybody have an idea for a package name, so I can start typing > away at this? 😀 unicode-security.el? I mean, most of that _is_ based on Unicode recommendations, right? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 16:03 ` Eli Zaretskii @ 2022-01-16 16:09 ` Lars Ingebrigtsen 2022-01-16 16:14 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-16 16:09 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > unicode-security.el? I mean, most of that _is_ based on Unicode > recommendations, right? Most of it is, but not all. And putting "unicode" in the function names wouldn't be helpful, because it's not important to the people using these functions that most of the recommendations come from Unicode. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 16:09 ` Lars Ingebrigtsen @ 2022-01-16 16:14 ` Eli Zaretskii 2022-01-16 16:33 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-16 16:14 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Sun, 16 Jan 2022 17:09:38 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > unicode-security.el? I mean, most of that _is_ based on Unicode > > recommendations, right? > > Most of it is, but not all. And putting "unicode" in the function names > wouldn't be helpful, because it's not important to the people using > these functions that most of the recommendations come from Unicode. You are a tough customer. Then what about text-security.el? or textsec.el? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 16:14 ` Eli Zaretskii @ 2022-01-16 16:33 ` Lars Ingebrigtsen 2022-01-16 16:44 ` Eli Zaretskii 2022-01-16 17:53 ` Achim Gratz 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-16 16:33 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > You are a tough customer. 😀 > Then what about text-security.el? or textsec.el? Yes, that'd work. Or... string-analysis.el? With functions like `string-scripts' (lists the different scripts in the string) as well as the more higher level functions... Hm... -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 16:33 ` Lars Ingebrigtsen @ 2022-01-16 16:44 ` Eli Zaretskii 2022-01-16 17:03 ` Lars Ingebrigtsen 2022-01-16 17:53 ` Achim Gratz 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-16 16:44 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Sun, 16 Jan 2022 17:33:49 +0100 > > > Then what about text-security.el? or textsec.el? > > Yes, that'd work. Or... string-analysis.el? Is such an "analysis" useful for any other purposes than the one you want to use it? > With functions like `string-scripts' (lists the different scripts in > the string) That one should probably be elsewhere. Although even in that case, I don't really see how it could be useful for anything other than this particular purpose? I bet most Lisp programmers don't even know what is a "script" in the Emacs context. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 16:44 ` Eli Zaretskii @ 2022-01-16 17:03 ` Lars Ingebrigtsen 2022-01-16 17:50 ` Lars Ingebrigtsen 2022-01-16 18:14 ` Eli Zaretskii 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-16 17:03 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > That one should probably be elsewhere. Although even in that case, I > don't really see how it could be useful for anything other than this > particular purpose? I bet most Lisp programmers don't even know what > is a "script" in the Emacs context. Yeah, probably true... By the way: https://www.unicode.org/reports/tr24/tr24-32.html#Scripts_and_Blocks As a result, using the block names as simplistic substitute for script identity generally leads to poor results. It looks like we're doing that, though? And indeed: (elt char-script-table #xAB65) => latin which is wrong, because that's GREEK LETTER SMALL CAPITAL OMEGA So we should be populating char-script-table from http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt instead of Blocks.txt. So I'll be doing that, too. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
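The U+AB65 case above can be reproduced outside Emacs. The following is a small Python sketch, not Emacs code: the block range for Latin Extended-E is hard-coded from Blocks.txt as an assumption, and the helper name block_says_latin is hypothetical. It shows a character that sits in a "Latin" block while its Unicode name says it is Greek, which is why block membership is a poor stand-in for script identity.

```python
import unicodedata

# Range of the "Latin Extended-E" block, copied by hand from Blocks.txt
# (an assumption of this sketch, not read from the data file).
LATIN_EXTENDED_E = range(0xAB30, 0xAB70)


def block_says_latin(ch):
    """True if CH falls inside the Latin Extended-E block."""
    return ord(ch) in LATIN_EXTENDED_E


ch = "\uab65"
print(block_says_latin(ch), unicodedata.name(ch))
```

The character name is only a rough hint at the script; the authoritative data is in Scripts.txt and ScriptExtensions.txt, which Python's standard library does not expose.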
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 17:03 ` Lars Ingebrigtsen @ 2022-01-16 17:50 ` Lars Ingebrigtsen 2022-01-16 18:18 ` Eli Zaretskii 2022-01-16 18:14 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-16 17:50 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Lars Ingebrigtsen <larsi@gnus.org> writes: > So we should be populating char-script-table from > http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt instead of > Blocks.txt. So I'll be doing that, too. Hm, well, that'd be difficult to do in a backwards compatible way -- for instance, there's stuff in Emacs that depends on things mapping to `symbol', which isn't really a thing in Scripts.txt. So I guess the Scripts.txt file will have to be parsed in addition, and into a new char table. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 17:50 ` Lars Ingebrigtsen @ 2022-01-16 18:18 ` Eli Zaretskii 2022-01-17 8:59 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-16 18:18 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Sun, 16 Jan 2022 18:50:27 +0100 > > So I guess the Scripts.txt file will have to be parsed in addition, and > into a new char table. Why can't we use our char-script-table? how different is it from what Unicode wants? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 18:18 ` Eli Zaretskii @ 2022-01-17 8:59 ` Lars Ingebrigtsen 2022-01-17 10:18 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 8:59 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: >> So I guess the Scripts.txt file will have to be parsed in addition, and >> into a new char table. > > Why can't we use our char-script-table? how different is it from what > Unicode wants? Well, as the Unicode web page says -- using Blocks to determine the script is just, well, wrong. (Or "inaccurate", if you want.) So using it will give both false positives and negatives. In addition, that table assumes that each character belongs to a single script, which is also wrong. So I'm making a new table based on Scripts.txt and ScriptExtensions.txt. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 8:59 ` Lars Ingebrigtsen @ 2022-01-17 10:18 ` Eli Zaretskii 2022-01-17 14:54 ` Lars Ingebrigtsen 2022-01-17 15:22 ` Eli Zaretskii 0 siblings, 2 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 10:18 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni On January 17, 2022 10:59:36 AM GMT+02:00, Lars Ingebrigtsen <larsi@gnus.org> wrote: > Eli Zaretskii <eliz@gnu.org> writes: > > >> So I guess the Scripts.txt file will have to be parsed in addition, and > >> into a new char table. > > > > Why can't we use our char-script-table? how different is it from what > > Unicode wants? > > Well, as the Unicode web page says -- using Blocks to determine the > script is just, well, wrong. (Or "inaccurate", if you want.) So using > it will give both false positives and negatives. Yes, I understand the general concern, but I'm asking how serious this is in practice. Can you tell? > In addition, that table assumes that each character belongs to a single > script, which is also wrong. So I'm making a new table based on > Scripts.txt and ScriptExtensions.txt. It is confusing to have 2 separate properties of a character that are subtly incompatible, and for such obscure properties at that. It will be a source of many problems. So I think we should avoid that if it's feasible. Can we please discuss any real problems that would be caused by using the existing char-table? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 10:18 ` Eli Zaretskii @ 2022-01-17 14:54 ` Lars Ingebrigtsen 2022-01-17 16:47 ` Eli Zaretskii 2022-01-17 15:22 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 14:54 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > Yes, I understand the general concern, but I'm asking how serious > this is in practice. Can you tell? I don't know how to quantify that. We're talking about security mechanisms, and they should be reliable. (But, yes, the differences are massive, especially in the Asian parts of the data.) >> In addition, that table assumes that each character belongs to a single >> script, which is also wrong. So I'm making a new table based on >> Scripts.txt and ScriptExtensions.txt. > > It is confusing to have 2 separate properties of a character that are > subtly incompatible, and for such obscure properties at that. It will > be a source of many problems. So I think we should avoid that if it's > feasible. Can we please discuss any real problems that would be > caused by using the existing char-table? It's impossible to implement the Unicode security recommendations based on the Blocks.txt data -- it's that simple. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 14:54 ` Lars Ingebrigtsen @ 2022-01-17 16:47 ` Eli Zaretskii 2022-01-17 17:09 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 16:47 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Mon, 17 Jan 2022 15:54:58 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Yes, I understand the general concern, but I'm asking how serious > > this is in practice. Can you tell? > > I don't know how to quantify that. We're talking about security > mechanisms, and they should be reliable. Well, now that I know the answer, I don't think it's hard to quantify. But maybe I'm missing something. > (But, yes, the differences are massive, especially in the Asian parts of > the data.) I don't think I understand what you mean by "the Asian parts". Do you mean the CJK parts where we lump several scripts together into 'han' and 'kana'? > It's impossible to implement the Unicode security recommendations based > on the Blocks.txt data -- it's that simple. Can you tell more about why it is impossible? If it's a relatively simple issue of "translating" the Unicode script names into ours, then it should be quite simple. Since you say it's impossible, I guess there's some factor(s) here that I miss? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 16:47 ` Eli Zaretskii @ 2022-01-17 17:09 ` Lars Ingebrigtsen 2022-01-17 17:19 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 17:09 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > I don't think I understand what you mean by "the Asian parts". Do you > mean the CJK parts where we lump several scripts together into 'han' > and 'kana'? Possibly -- I haven't looked closely. >> It's impossible to implement the Unicode security recommendations based >> on the Blocks.txt data -- it's that simple. > > Can you tell more about why it is impossible? If it's a relatively > simple issue of "translating" the Unicode script names into ours, then > it should be quite simple. Since you say it's impossible, I guess > there's some factor(s) here that I miss? Perhaps there's something I'm missing, because it seems self-evident to me that the Blocks data can't be used for this. For instance, (textsec-single-script-p "ޱ﷽") => t but (elt char-script-table ?ޱ) => thaana (elt char-script-table ?﷽) => arabic I think the Unicode people have the authoritative say here, so implementing the recommendations seems like the way to go. And it's less work in the long run, because we can just import the data files and not try to fix things up manually (like blocks.awk attempts to do). -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
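The single-script test discussed above boils down to intersecting per-character script sets. Here is a Python sketch under stated assumptions: SCRIPT_SETS is a tiny hand-written excerpt (the entry for U+FDFD is assumed to mirror its Script_Extensions value), and the function name single_script_p merely echoes the thread; it is not the textsec implementation. Handling of Common and Inherited characters, which UTS #39 treats as matching every script, is left out.

```python
# Hypothetical excerpt: Script_Extensions where available, else Script.
SCRIPT_SETS = {
    "\u07b1": {"Thaana"},            # THAANA LETTER NAA
    "\ufdfd": {"Arabic", "Thaana"},  # ARABIC LIGATURE BISMILLAH... (assumed)
    "i": {"Latin"},
    "\u0421": {"Cyrillic"},          # CYRILLIC CAPITAL LETTER ES
}


def single_script_p(text):
    """True if all characters of TEXT share at least one script."""
    sets = [SCRIPT_SETS[ch] for ch in text]
    # An empty string has no conflicting scripts.
    return not sets or bool(set.intersection(*sets))


print(single_script_p("\u07b1\ufdfd"))  # Thaana is common to both
print(single_script_p("\u0421i"))       # Cyrillic and Latin do not overlap
```

A table with one script per character, such as one derived from blocks, cannot express the U+FDFD entry, which is the point being made in the message above.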
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:09 ` Lars Ingebrigtsen @ 2022-01-17 17:19 ` Eli Zaretskii 2022-01-17 17:26 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 17:19 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Mon, 17 Jan 2022 18:09:19 +0100 > > I think the Unicode people have the authoritative say here, so > implementing the recommendations seems like the way to go. And it's > less work in the long run, because we can just import the data files and > not try to fix things up manually (like blocks.awk attempts to do). Let's at least call this something other than "script", to avoid confusion. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:19 ` Eli Zaretskii @ 2022-01-17 17:26 ` Lars Ingebrigtsen 2022-01-17 17:38 ` Lars Ingebrigtsen 2022-01-17 17:42 ` Eli Zaretskii 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 17:26 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > Let's at least call this something other than "script", to avoid > confusion. Sure. But... what. 🤔 I made a slight attempt at that by calling it "scripts" instead of "script", since each character belongs to a list of scripts, but it's probably too subtle. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:26 ` Lars Ingebrigtsen @ 2022-01-17 17:38 ` Lars Ingebrigtsen 2022-01-17 17:48 ` Eli Zaretskii 2022-01-17 17:42 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 17:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni I'm looking at the Confusable section now. https://www.unicode.org/reports/tr39/#Confusable_Detection Looks easy enough to implement (and the ELPA package already does the parsing, so I'll be reusing bits from that). But... I'm wondering what the higher level interface would be? I mean, quite a lot of strings are confusable with something else, but which ones are interesting? The only thing that seems immediately interesting to check for is whether a string is confusable with ASCII? That is, (textsec-confusable-with-ascii-p "C𝗂𝗋𝖼𝗅𝖾") => t Because the ASCII characters are the ones that people rely on when doing ... things, like email and browsing the web. But I mean, "C𝗂𝗋𝖼𝗅𝖾" is confusable with "СігсӀе" (the latter is Cyrillic), and if you're writing Russian, that might also be interesting. So perhaps a (textsec-confusable-with-script-p "C𝗂𝗋𝖼𝗅𝖾" 'cyrillic) => t ? But... I'm not sure in which contexts that would actually be vital to know. Hm. Anybody have any thoughts here? -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:38 ` Lars Ingebrigtsen @ 2022-01-17 17:48 ` Eli Zaretskii 2022-01-17 19:08 ` Eli Zaretskii 2022-01-19 13:55 ` Lars Ingebrigtsen 0 siblings, 2 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 17:48 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Mon, 17 Jan 2022 18:38:48 +0100 > > I'm looking at the Confusable section now. > > https://www.unicode.org/reports/tr39/#Confusable_Detection > > Looks easy enough to implement (and the ELPA package already does the > parsing, so I'll be reusing bits from that). > > But... I'm wondering what the higher level interface would be? I mean, > quite a lot of strings are confusable with something else, but which > ones are interesting? The only thing that seems immediately interesting > to check for is whether a string is confusable with ASCII? > > That is, > > (textsec-confusable-with-ascii-p "C𝗂𝗋𝖼𝗅𝖾") > => t > > Because the ASCII characters are the ones that people rely on when doing > ... things, like email and browsing the web. > > But I mean, "C𝗂𝗋𝖼𝗅𝖾" is confusable with "СігсӀе" (the latter is > Cyrillic), and if you're writing Russian, that might also be > interesting. So perhaps a > > (textsec-confusable-with-script-p "C𝗂𝗋𝖼𝗅𝖾" 'cyrillic) > => t > > ? But... I'm not sure in which contexts that would actually be vital > to know. Hm. I think we should first determine what kinds of applications may need this, and take it from there. The initial number of "confusability with" classes can be very small, and we can add more as we discover interesting use cases. The full number is pretty much infinite, I think, but I'm not sure Emacs needs to support all of them OOTB. We could support some of the popular ones, and provide infrastructure for developing more. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:48 ` Eli Zaretskii @ 2022-01-17 19:08 ` Eli Zaretskii 2022-01-17 20:22 ` Lars Ingebrigtsen 2022-01-19 13:55 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 19:08 UTC (permalink / raw) To: larsi; +Cc: 51733 > Date: Mon, 17 Jan 2022 19:48:01 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > > I think we should first determine what kinds of applications may need > this By that I meant: confusables in URL, confusables in email addresses, etc. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 19:08 ` Eli Zaretskii @ 2022-01-17 20:22 ` Lars Ingebrigtsen 2022-01-18 8:40 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 20:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 I'm not quite sure I understand this bit here https://www.unicode.org/reports/tr39/#Confusable_Detection --- For an input string X, define skeleton(X) to be the following transformation on the string: Convert X to NFD format, as described in [UAX15]. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters. Reapply NFD. --- I mean, that sounds OK in and of itself, but then: --- X and Y are single-script confusables if and only if they are confusable, and their resolved script sets have at least one element in common. Examples: “ljeto” and “ljeto” in Latin (the Croatian word for “summer”), where the first word uses only four codepoints, the first of which is U+01C9 (lj) LATIN SMALL LETTER LJ. --- But: (ucs-normalize-NFD-string "ljeto") => "ljeto" So according to that algo "ljeto" and "ljeto" are not confusable. But if we use NFKD instead, they are: (ucs-normalize-NFKD-string "ljeto") => "ljeto" It seems unlikely to be a typo in this document, surely? But NFKD seems to make a whole lot more sense than NFD for this usage. I must be missing or misreading something. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 20:22 ` Lars Ingebrigtsen @ 2022-01-18 8:40 ` Lars Ingebrigtsen 2022-01-18 11:26 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-18 8:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Lars Ingebrigtsen <larsi@gnus.org> writes: > I must be missing or misreading something. Yes, indeed. I missed that the point of the confusable table was to do the ǉ -> lj mapping. Doh. (Well, one of the points.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
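A minimal sketch of the skeleton transformation discussed in the two messages above (NFD, then per-character prototype lookup, then NFD again). The names `my-toy-prototypes' and `my-toy-skeleton' are hypothetical, and the two table entries merely stand in for the much larger confusables data; this is an illustration under those assumptions, not how textsec implements it.

```elisp
(require 'ucs-normalize)

;; Hypothetical two-entry stand-in for the real confusables table.
(defvar my-toy-prototypes
  '((?\N{LATIN SMALL LETTER LJ} . "lj")
    (?\N{CYRILLIC SMALL LETTER A} . "a"))
  "Toy mapping from a character to its prototype string.")

(defun my-toy-skeleton (string)
  "Return a TR39-style skeleton of STRING using `my-toy-prototypes'."
  (ucs-normalize-NFD-string
   ;; Replace each character by its prototype, if the toy table has one.
   (mapconcat (lambda (char)
                (or (cdr (assq char my-toy-prototypes))
                    (string char)))
              (ucs-normalize-NFD-string string)
              "")))

;; Two strings count as confusable when their skeletons are equal.
(string= (my-toy-skeleton "\N{LATIN SMALL LETTER LJ}eto")
         (my-toy-skeleton "ljeto"))
```

With the table doing the U+01C9 -> "lj" step, plain NFD is enough and NFKD is not needed for this example.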
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 8:40 ` Lars Ingebrigtsen @ 2022-01-18 11:26 ` Lars Ingebrigtsen 2022-01-18 11:37 ` Lars Ingebrigtsen 2022-01-18 14:55 ` Eli Zaretskii 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-18 11:26 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Next stupid question: --- It must not contain any stateful bidirectional format characters. That is, no [:bidicontrol:] except for the LRM, RLM, and ALM, since the bidirectional controls could influence the ordering of characters outside the quotes. --- We don't have the :bidicontrol: regexp class. Do we have another way to classify bidi control characters? They have class Cf, but so do many other non-bidi control characters... -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 11:26 ` Lars Ingebrigtsen @ 2022-01-18 11:37 ` Lars Ingebrigtsen 2022-01-18 11:44 ` Lars Ingebrigtsen 2022-01-18 14:55 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-18 11:37 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Lars Ingebrigtsen <larsi@gnus.org> writes: > We don't have the :bidicontrol: regexp class. Do we have another way to > classify bidi control characters? They have class Cf, but so do many > other non-bidi control characters... I guess it's (get-char-code-property ?\N{LEFT-TO-RIGHT ISOLATE} 'bidi-class) combined with whether it's a control character? -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 11:37 ` Lars Ingebrigtsen @ 2022-01-18 11:44 ` Lars Ingebrigtsen 2022-01-18 12:00 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-18 11:44 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Lars Ingebrigtsen <larsi@gnus.org> writes: > I guess it's > > (get-char-code-property ?\N{LEFT-TO-RIGHT ISOLATE} 'bidi-class) > > combined with whether it's a control character? No, that doesn't really help here: (get-char-code-property ?\N{LEFT-TO-RIGHT MARK} 'bidi-class) => L Hm... -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 11:44 ` Lars Ingebrigtsen @ 2022-01-18 12:00 ` Lars Ingebrigtsen 2022-01-18 12:47 ` Lars Ingebrigtsen 2022-01-18 14:59 ` Eli Zaretskii 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-18 12:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Lars Ingebrigtsen <larsi@gnus.org> writes: > No, that doesn't really help here: > > (get-char-code-property ?\N{LEFT-TO-RIGHT MARK} 'bidi-class) > => L > > Hm... OK, there's glyphless--bidi-control-characters, and I could make that non-private, and add the three missing ones... -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
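A rough sketch of the "explicit list" approach this message settles on, assuming the nine stateful members of Unicode's Bidi_Control set (everything except LRM, RLM and ALM). The names `my-stateful-bidi-controls' and `my-stateful-bidi-control-p' are hypothetical and are not the variable Emacs ended up with.

```elisp
(require 'seq)

;; Bidi_Control characters other than LRM (U+200E), RLM (U+200F)
;; and ALM (U+061C), i.e. the "stateful" ones in the TR39 wording.
(defconst my-stateful-bidi-controls
  '(#x202a #x202b #x202c #x202d #x202e  ; LRE RLE PDF LRO RLO
    #x2066 #x2067 #x2068 #x2069)        ; LRI RLI FSI PDI
  "Hypothetical list of stateful bidirectional format characters.")

(defun my-stateful-bidi-control-p (string)
  "Return non-nil if STRING contains a stateful bidi control character."
  (and (seq-some (lambda (char) (memq char my-stateful-bidi-controls))
                 string)
       t))

;; The character properties alone do not single these out: LRM has
;; bidi-class L (as the earlier message shows) and general category Cf,
;; which it shares with many non-bidi format characters.
(get-char-code-property ?\N{LEFT-TO-RIGHT MARK} 'general-category)
;; => Cf
```

For instance, a string containing RIGHT-TO-LEFT OVERRIDE (U+202E) would be flagged, while one containing only LEFT-TO-RIGHT MARK would not.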
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 12:00 ` Lars Ingebrigtsen @ 2022-01-18 12:47 ` Lars Ingebrigtsen 2022-01-18 12:51 ` Lars Ingebrigtsen 2022-01-18 15:05 ` Eli Zaretskii 2022-01-18 14:59 ` Eli Zaretskii 1 sibling, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-18 12:47 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 OK, I think the textsec stuff is basically 90% implemented now. (So according to custom, there's at least 90% left.) The next step would be to make other packages use this. For instance, when shr displays a suspicious URL, it could mark it in red (and perhaps add a warning icon), and have a tooltip that describes in which way it's suspicious. I think the places it would make sense to hook this machinery in would be in: * shr (displaying URLs and links) * Gnus/rmail (displaying email addresses) * Message (when responding to mail; a prompt "do you really?") * browse-url (prompt) Feel free to add to the list. There should probably be a customization point? A user option like `warn-about-suspicious-identifiers'? (Better name would be nice.) And then a utility function that would return a propertised string with the warning, perhaps, so that all the callers don't have to do so much work. So shr/Gnus/rmail could use (possibly-add-warning-about-suspiciousness string) to do that, and if the user has switched the user option off, textsec isn't loaded at all. (Since it loads so much data, some people might prefer not to.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
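One way the proposed customization point and utility function could look, as a sketch only. The option `my-warn-about-suspicious-identifiers' and the function `my-possibly-add-suspicion-warning' are hypothetical names; the only textsec function used is `textsec-email-suspicious-p', which later messages in this thread show returning nil or an explanatory string. The `warning' face and the `help-echo' tooltip are assumptions about presentation.

```elisp
(defcustom my-warn-about-suspicious-identifiers t
  "Non-nil means mark suspicious email addresses when displaying them."
  :type 'boolean
  :group 'mail)

(defun my-possibly-add-suspicion-warning (address)
  "Return ADDRESS, propertized with a warning if it looks suspicious.
textsec is only loaded when the user option is non-nil."
  (let ((reason (and my-warn-about-suspicious-identifiers
                     ;; Lazy load: skipped entirely if the option is off
                     ;; or if this Emacs has no textsec.
                     (require 'textsec nil t)
                     (textsec-email-suspicious-p address))))
    (if reason
        (propertize address
                    'face 'warning       ; visual marker
                    'help-echo reason)   ; tooltip explaining why
      address)))
```

Callers such as shr, Gnus or Rmail would then only need to pass the string through this one function before inserting it.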
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 12:47 ` Lars Ingebrigtsen @ 2022-01-18 12:51 ` Lars Ingebrigtsen 2022-01-18 18:44 ` Eli Zaretskii 2022-01-18 18:48 ` Eli Zaretskii 2022-01-18 15:05 ` Eli Zaretskii 1 sibling, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-18 12:51 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Lars Ingebrigtsen <larsi@gnus.org> writes: > The next step would be to make other packages use this. (But I'm taking the rest of the day off, and possibly tomorrow, too, so if somebody else wants to tinker with this, please do go ahead.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 12:51 ` Lars Ingebrigtsen @ 2022-01-18 18:44 ` Eli Zaretskii 2022-01-19 9:21 ` Robert Pluim 2022-01-19 9:25 ` Lars Ingebrigtsen 2022-01-18 18:48 ` Eli Zaretskii 1 sibling, 2 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-18 18:44 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Tue, 18 Jan 2022 13:51:38 +0100 > > Lars Ingebrigtsen <larsi@gnus.org> writes: > > > The next step would be to make other packages use this. > > (But I'm taking the rest of the day off, and possibly tomorrow, too, so > if somebody else wants to tinker with this, please do go ahead.) Does textsec-email-suspicious-p expect non-ASCII email addresses to be RFC 2047 encoded? If so, it will not work in the Rmail display buffers, where email addresses are shown decoded. For non-ASCII names the function signals an error. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 18:44 ` Eli Zaretskii @ 2022-01-19 9:21 ` Robert Pluim 2022-01-19 9:26 ` Lars Ingebrigtsen 2022-01-19 11:53 ` Eli Zaretskii 2022-01-19 9:25 ` Lars Ingebrigtsen 1 sibling, 2 replies; 123+ messages in thread From: Robert Pluim @ 2022-01-19 9:21 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, Lars Ingebrigtsen >>>>> On Tue, 18 Jan 2022 20:44:51 +0200, Eli Zaretskii <eliz@gnu.org> said: >> From: Lars Ingebrigtsen <larsi@gnus.org> >> Cc: 51733@debbugs.gnu.org >> Date: Tue, 18 Jan 2022 13:51:38 +0100 >> >> Lars Ingebrigtsen <larsi@gnus.org> writes: >> >> > The next step would be to make other packages use this. >> >> (But I'm taking the rest of the day off, and possibly tomorrow, too, so >> if somebody else wants to tinker with this, please do go ahead.) Eli> Does textsec-email-suspicious-p expect non-ASCII email addresses to be Eli> RFC 2047 encoded? If so, it will not work in the Rmail display Eli> buffers, where email addresses are shown decoded. For non-ASCII names Eli> the function signals an error. It does? Do you have an example? The following works fine ELISP> (textsec-email-suspicious-p "rpluimм <rpluimм@gmail.com>") => "`rpluimм' isn't restrictive enough" Although I think that message should say something like "Mailbox name contains non-ASCII characters" Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 9:21 ` Robert Pluim @ 2022-01-19 9:26 ` Lars Ingebrigtsen 2022-01-19 10:12 ` Robert Pluim 2022-01-19 11:53 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 9:26 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733 Robert Pluim <rpluim@gmail.com> writes: > ELISP> (textsec-email-suspicious-p "rpluimм <rpluimм@gmail.com>") > => "`rpluimм' isn't restrictive enough" > > Although I think that message should say something like > > "Mailbox name contains non-ASCII characters" But it's fine for mailbox names to be non-ASCII. (textsec-email-suspicious-p "rpluimм <м@gmail.com>") => nil It's just various combinations of ... things ... that are suspicious. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 9:26 ` Lars Ingebrigtsen @ 2022-01-19 10:12 ` Robert Pluim 2022-01-19 10:27 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Robert Pluim @ 2022-01-19 10:12 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 >>>>> On Wed, 19 Jan 2022 10:26:42 +0100, Lars Ingebrigtsen <larsi@gnus.org> said: Lars> Robert Pluim <rpluim@gmail.com> writes: ELISP> (textsec-email-suspicious-p "rpluimм <rpluimм@gmail.com>") >> => "`rpluimм' isn't restrictive enough" >> >> Although I think that message should say something like >> >> "Mailbox name contains non-ASCII characters" Lars> But it's fine for mailbox names to be non-ASCII. Lars> (textsec-email-suspicious-p "rpluimм <м@gmail.com>") Lars> => nil Lars> It's just various combinations of ... things ... that are suspicious. OK, but the error message could be better, no? Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 10:12 ` Robert Pluim @ 2022-01-19 10:27 ` Lars Ingebrigtsen 2022-01-19 10:42 ` Robert Pluim 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 10:27 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733 Robert Pluim <rpluim@gmail.com> writes: > OK, but the error message could be better, no? Sure, but what? (And it's not an error message, it's information about something that looks like it might be odd.) Summarising Unicode® Technical Standard #39 in one line isn't easy. We can go all vague, like "Something is wrong", or we can go long, like "It's not all-ASCII, and it's not single script, and it's not a mixture of arabic armenian bengali bopomofo devanagari ethiopic georgian gujarati gurmukhi hangul han hebrew hiragana katakana kannada khmer lao malayalam myanmar oriya sinhala tamil telugu thaana thai tibetan latin, and it's not a latin/han/korea/japan mixture". (And I probably forgot some bits.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 10:27 ` Lars Ingebrigtsen @ 2022-01-19 10:42 ` Robert Pluim 2022-01-19 13:46 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Robert Pluim @ 2022-01-19 10:42 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 >>>>> On Wed, 19 Jan 2022 11:27:57 +0100, Lars Ingebrigtsen <larsi@gnus.org> said: Lars> Robert Pluim <rpluim@gmail.com> writes: >> OK, but the error message could be better, no? Lars> Sure, but what? (And it's not an error message, it's information about Lars> something that looks like it might be odd.) Lars> Summarising Unicode® Technical Standard #39 in one line isn't easy. Lars> We can go all vague, like "Something is wrong", or we can go long, like Lars> "It's not all-ASCII, and it's not single script, and it's not a mixture Lars> of arabic armenian bengali bopomofo devanagari ethiopic georgian Lars> gujarati gurmukhi hangul han hebrew hiragana katakana kannada khmer lao Lars> malayalam myanmar oriya sinhala tamil telugu thaana thai tibetan latin, Lars> and it's not a latin/han/korea/japan mixture". (And I probably forgot Lars> some bits.) How about "Contains suspicious characters or mix of characters"? That would at least point users in the right direction. Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 10:42 ` Robert Pluim @ 2022-01-19 13:46 ` Lars Ingebrigtsen 2022-01-19 17:18 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:46 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733 Robert Pluim <rpluim@gmail.com> writes: > How about "Contains suspicious characters or mix of characters"? That > would at least point users in the right direction. I'm not 100% sure that it's not misleading in all cases, though. textsec still doesn't implement "Unicode Identifier and Pattern Syntax": https://www.unicode.org/reports/tr31/ There's some other stuff in there... But I might be quibbling. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:46 ` Lars Ingebrigtsen @ 2022-01-19 17:18 ` Eli Zaretskii 2022-01-20 8:36 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 17:18 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, rpluim > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: Eli Zaretskii <eliz@gnu.org>, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 14:46:42 +0100 > > Robert Pluim <rpluim@gmail.com> writes: > > > How about "Contains suspicious characters or mix of characters"? That > > would at least point users in the right direction. > > I'm not 100% sure that it's not misleading in all cases, though. textsec > still doesn't implement "Unicode Identifier and Pattern Syntax": > > https://www.unicode.org/reports/tr31/ > > There's some other stuff in there... But I might be quibbling. My suggestion for the diagnostic in this case is: %s mixes characters from different scripts in suspicious ways ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 17:18 ` Eli Zaretskii @ 2022-01-20 8:36 ` Lars Ingebrigtsen 0 siblings, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-20 8:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, rpluim Eli Zaretskii <eliz@gnu.org> writes: > My suggestion for the diagnostic in this case is: > > %s mixes characters from different scripts in suspicious ways Now done. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
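To make the wording of the agreed diagnostic concrete, here is a deliberately simplified sketch of a mixed-script check. It uses Emacs's `char-script-table', whose script symbols are Emacs's own classification and not the Unicode Script property, and it ignores the TR39 restriction levels entirely, so it is far cruder than what textsec does. The function names are hypothetical.

```elisp
(require 'seq)

(defun my-string-scripts (string)
  "Return the distinct Emacs script symbols used by characters in STRING."
  (seq-uniq (mapcar (lambda (char) (aref char-script-table char))
                    string)))

(defun my-mixed-script-diagnostic (string)
  "Return a diagnostic if STRING uses more than one script, else nil."
  (when (> (length (my-string-scripts string)) 1)
    (format "`%s' mixes characters from different scripts in suspicious ways"
            string)))

;; The local part from the example earlier in the thread: Latin letters
;; followed by CYRILLIC SMALL LETTER EM.
(my-string-scripts "rpluim\N{CYRILLIC SMALL LETTER EM}")
;; => (latin cyrillic)
```

A real check would be applied per address part and would accept the script combinations that TR39 permits, which is why a one-line summary of the rules is hard to write.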
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 9:21 ` Robert Pluim 2022-01-19 9:26 ` Lars Ingebrigtsen @ 2022-01-19 11:53 ` Eli Zaretskii 2022-01-19 12:49 ` Robert Pluim 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 11:53 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733, larsi > From: Robert Pluim <rpluim@gmail.com> > Cc: Lars Ingebrigtsen <larsi@gnus.org>, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 10:21:22 +0100 > > Eli> Does textsec-email-suspicious-p expect non-ASCII email addresses to be > Eli> RFC 2047 encoded? If so, it will not work in the Rmail display > Eli> buffers, where email addresses are shown decoded. For non-ASCII names > Eli> the function signals an error. > > It does? Do you have an example? The following works fine Here: (textsec-email-suspicious-p "אבגד <foo@bar.com>") => (wrong-type-argument stringp nil) with this backtrace: Debugger entered--Lisp error: (wrong-type-argument stringp nil) string-search("=?" nil) rfc2047-decode-string(nil) mail-header-parse-address("אבגד <foo@bar.com>" t) textsec-email-suspicious-p("אבגד <foo@bar.com>") (progn (textsec-email-suspicious-p "אבגד <foo@bar.com>")) eval((progn (textsec-email-suspicious-p "אבגד <foo@bar.com>")) t) elisp--eval-last-sexp(t) eval-last-sexp(t) eval-print-last-sexp(nil) funcall-interactively(eval-print-last-sexp nil) call-interactively(eval-print-last-sexp nil nil) command-execute(eval-print-last-sexp) ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 11:53 ` Eli Zaretskii @ 2022-01-19 12:49 ` Robert Pluim 2022-01-19 12:56 ` Lars Ingebrigtsen 2022-01-19 12:58 ` Eli Zaretskii 0 siblings, 2 replies; 123+ messages in thread From: Robert Pluim @ 2022-01-19 12:49 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, larsi >>>>> On Wed, 19 Jan 2022 13:53:59 +0200, Eli Zaretskii <eliz@gnu.org> said: Eli> Here: Eli> (textsec-email-suspicious-p "אבגד <foo@bar.com>") Eli> => (wrong-type-argument stringp nil) Eli> with this backtrace: Eli> Debugger entered--Lisp error: (wrong-type-argument stringp nil) Eli> string-search("=?" nil) Eli> rfc2047-decode-string(nil) Eli> mail-header-parse-address("אבגד <foo@bar.com>" t) Eli> textsec-email-suspicious-p("אבגד <foo@bar.com>") Eli> (progn (textsec-email-suspicious-p "אבגד <foo@bar.com>")) Eli> eval((progn (textsec-email-suspicious-p "אבגד <foo@bar.com>")) t) Eli> elisp--eval-last-sexp(t) Eli> eval-last-sexp(t) Eli> eval-print-last-sexp(nil) Eli> funcall-interactively(eval-print-last-sexp nil) Eli> call-interactively(eval-print-last-sexp nil nil) Eli> command-execute(eval-print-last-sexp) mail-header-parse-address assumes that the display name or the local name starts with a (subset of) ASCII. The following doesnʼt signal an error: (textsec-email-suspicious-p "דגבאa <foo@bar.com>") Since itʼs now open season on display names and mailbox names, the following might be enough. Lars? 
diff --git a/lisp/mail/ietf-drums.el b/lisp/mail/ietf-drums.el index 4a07959189..1885f958ba 100644 --- a/lisp/mail/ietf-drums.el +++ b/lisp/mail/ietf-drums.el @@ -217,7 +217,7 @@ ietf-drums-parse-address (push (buffer-substring (1+ (point)) (progn (forward-sexp 1) (1- (point)))) display-name)) - ((looking-at (concat "[" ietf-drums-atext-token "@" "]")) + ((not (eq c ?<)) (push (buffer-substring (point) (progn (forward-sexp 1) (point))) display-name)) ((eq c ?<) @@ -240,7 +240,7 @@ ietf-drums-parse-address (cons (mapconcat #'identity (nreverse display-name) "") (ietf-drums-get-comment string))) - (cons mailbox (if decode + (cons mailbox (if (and decode display-string) (rfc2047-decode-string display-string) display-string)))))) Robert -- ^ permalink raw reply related [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:49 ` Robert Pluim @ 2022-01-19 12:56 ` Lars Ingebrigtsen 2022-01-19 13:00 ` Lars Ingebrigtsen 2022-01-19 13:03 ` Eli Zaretskii 2022-01-19 12:58 ` Eli Zaretskii 1 sibling, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 12:56 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733 Robert Pluim <rpluim@gmail.com> writes: > (textsec-email-suspicious-p "דגבאa <foo@bar.com>") That is not a valid email address. > Since itʼs now open season on display names and mailbox names, the > following might be enough. Lars? No, that function parses well-formed email addresses, as defined by the standards. It does not do any kind of DWIM or guesswork, and it shouldn't. (textsec-email-suspicious-p "דגבאa <foo@bar.com>") shouldn't bug out, though -- it should instead say that the string is suspicious because it's not well-formed as an email address. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
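The behaviour asked for here (report a malformed address as suspicious instead of signalling an error) could be approximated on the caller's side with a wrapper like the one below. `my-email-suspicious-p' is a hypothetical name and the wording of the fallback message is made up; only `textsec-email-suspicious-p' comes from the thread.

```elisp
;; Load textsec if this Emacs has it; harmless otherwise.
(require 'textsec nil t)

(defun my-email-suspicious-p (address)
  "Like `textsec-email-suspicious-p', but never signal an error.
If ADDRESS cannot be parsed, return a string saying so, which callers
can treat the same way as any other suspicion report."
  (condition-case nil
      (textsec-email-suspicious-p address)
    (error
     (format "`%s' is not a well-formed email address" address))))
```

This keeps the strict wire-format parser unchanged while giving display code a single nil-or-string result to act on.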
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:56 ` Lars Ingebrigtsen @ 2022-01-19 13:00 ` Lars Ingebrigtsen 2022-01-19 13:03 ` Eli Zaretskii 1 sibling, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:00 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733 Lars Ingebrigtsen <larsi@gnus.org> writes: > No, that function parses well-formed email addresses, as defined by the > standards. It does not do any kind of DWIM or guesswork, and it > shouldn't. (The function that tries to parse a random mail-like string as if it were a mail address is `mail-header-parse-address-lax'.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:56 ` Lars Ingebrigtsen 2022-01-19 13:00 ` Lars Ingebrigtsen @ 2022-01-19 13:03 ` Eli Zaretskii 1 sibling, 0 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:03 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, rpluim > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: Eli Zaretskii <eliz@gnu.org>, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 13:56:33 +0100 > > Robert Pluim <rpluim@gmail.com> writes: > > > (textsec-email-suspicious-p "דגבאa <foo@bar.com>") > > That is not a valid email address. ??? My INBOX is full of mail from people with such "invalid" addresses. What is not valid about it? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:49 ` Robert Pluim 2022-01-19 12:56 ` Lars Ingebrigtsen @ 2022-01-19 12:58 ` Eli Zaretskii 2022-01-19 13:02 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 12:58 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733, larsi > From: Robert Pluim <rpluim@gmail.com> > Cc: larsi@gnus.org, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 13:49:20 +0100 > > mail-header-parse-address assumes that the display name or the local > name starts with a (subset of) ASCII. Is that expectation reasonable? I can show you many email addresses that violate that. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:58 ` Eli Zaretskii @ 2022-01-19 13:02 ` Lars Ingebrigtsen 2022-01-19 13:06 ` Eli Zaretskii 2022-01-19 13:39 ` Robert Pluĭm 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:02 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, Robert Pluim Eli Zaretskii <eliz@gnu.org> writes: > Is that expectation reasonable? I can show you many email addresses > that violate that. It depends on what you mean. There are no valid email addresses that have non-ASCII name parts -- when we're talking wire format (RFC2047 etc), which is what that function is parsing. But displayed email addresses may have any characters, of course. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:02 ` Lars Ingebrigtsen @ 2022-01-19 13:06 ` Eli Zaretskii 2022-01-19 13:10 ` Lars Ingebrigtsen 2022-01-19 13:39 ` Robert Pluĭm 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:06 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, rpluim > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: Robert Pluim <rpluim@gmail.com>, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 14:02:56 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Is that expectation reasonable? I can show you many email addresses > > that violate that. > > It depends on what you mean. There are no valid email addresses that > have non-ASCII name parts -- when we're talking wire format (RFC2047 > etc), which is what that function is parsing. > > But displayed email addresses may have any characters, of course. I _am_ talking about the displayed format. It would be better if textsec supported those as well, because they are ubiquitous in Emacs. E.g., what if someone sends me a citation from someone else's email, and I want to textsec-check that citation? Chances are the citation will not include RFC2047 encoded addresses. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:06 ` Eli Zaretskii @ 2022-01-19 13:10 ` Lars Ingebrigtsen 2022-01-19 13:21 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:10 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, rpluim Eli Zaretskii <eliz@gnu.org> writes: > I _am_ talking about the displayed format. It would be better if > textsec supported those as well, because they are ubiquitous in > Emacs. And, again, these functions as implemented work on the protocol level, because that's the interesting thing here. "Is this From: header suspicious?" That can only be determined reliably if we don't get any DWIM involved. > E.g., what if someone sends me a citation from someone else's > email, and I want to textsec-check that citation? Chances are the > citation will not include RFC2047 encoded addresses. You can, of course, add all kinds of things to try to guess whether other things in other places in Emacs are suspicious or not, but that is not what these functions I've written do. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:10 ` Lars Ingebrigtsen @ 2022-01-19 13:21 ` Eli Zaretskii 2022-01-19 13:25 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:21 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, rpluim > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: rpluim@gmail.com, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 14:10:11 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > E.g., what if someone sends me a citation from someone else's > > email, and I want to textsec-check that citation? Chances are the > > citation will not include RFC2047 encoded addresses. > > You can, of course, add all kinds of things to try to guess whether other > things in other places in Emacs are suspicious or not, but that is not > what these functions I've written do. I don't understand this stubborn opposition to provide better, more general APIs to our users. textsec.el is not an application, it is infrastructure applications should use to provide user-level features. So any application-level decisions, like at what level to detect suspicious addresses, is not textsec's bloody business to make! ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:21 ` Eli Zaretskii @ 2022-01-19 13:25 ` Lars Ingebrigtsen 2022-01-19 13:28 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, rpluim Eli Zaretskii <eliz@gnu.org> writes: > I don't understand this stubborn opposition to provide better, more > general APIs to our users. I'm for making functions with well-defined interfaces. > textsec.el is not an application, it is infrastructure applications > should use to provide user-level features. So any application-level > decisions, like at what level to detect suspicious addresses, is not > textsec's bloody business to make! I don't understand what you mean. It's just because the textsec functions are well-defined that application-level packages can use it reliably. textsec isn't making any decisions about levels -- that's up to the callers. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:25 ` Lars Ingebrigtsen @ 2022-01-19 13:28 ` Eli Zaretskii 0 siblings, 0 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:28 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, rpluim > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: rpluim@gmail.com, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 14:25:34 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > I don't understand this stubborn opposition to provide better, more > > general APIs to our users. > > I'm for making functions with well-defined interfaces. > > > textsec.el is not an application, it is infrastructure applications > > should use to provide user-level features. So any application-level > > decisions, like at what level to detect suspicious addresses, is not > > textsec's bloody business to make! > > I don't understand what you mean. It's just because the textsec > functions are well-defined that application-level packages can use it > reliably. textsec isn't making any decisions about levels -- that's up > to the callers. You are being unreasonably stubborn here. I give up. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:02 ` Lars Ingebrigtsen 2022-01-19 13:06 ` Eli Zaretskii @ 2022-01-19 13:39 ` Robert Pluĭm 2022-01-19 14:00 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Robert Pluĭm @ 2022-01-19 13:39 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 >>>>> On Wed, 19 Jan 2022 14:02:56 +0100, Lars Ingebrigtsen <larsi@gnus.org> said: Lars> Eli Zaretskii <eliz@gnu.org> writes: >> Is that expectation reasonable? I can show you many email addresses >> that violate that. Lars> It depends on what you mean. There are no valid email addresses that Lars> have non-ASCII name parts -- when we're talking wire format (RFC2047 Lars> etc), which is what that function is parsing. I canʼt recall if this is allowed by the standards or not offhand, but as youʼre probably well aware, the major email providers allow you to use UTF-8 characters directly in the display name of email addresses, without using RFC 2047 encoding. In fact, the last time I did any testing of this, Gmail *replaced* RFC 2047 encoded non-ASCII characters with their UTF-8 encoding. Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:39 ` Robert Pluĭm @ 2022-01-19 14:00 ` Lars Ingebrigtsen 2022-01-19 14:10 ` Robert Pluĭm 2022-01-19 16:08 ` Andreas Schwab 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 14:00 UTC (permalink / raw) To: Robert Pluĭm; +Cc: 51733 Robert "=?utf-8?Q?Plu=C4=ADm?=" <rpluim@gmail.com> writes: > I canʼt recall if this is allowed by the standards or not offhand, but > as youʼre probably well aware, the major email providers allow you to > use UTF-8 characters directly in the display name of email addresses, > without using RFC 2047 encoding. In fact, the last time I did any > testing of this, Gmail *replaced* RFC 2047 encoded non-ASCII > characters with their UTF-8 encoding. Gmail expects you to type in characters representing your name -- they don't expose the wire format. Why should they? And your address arrived as (wire format): Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?= which displays as Robert =?utf-8?Q?Plu=C4=ADm?= 😀 -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
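The difference between wire format and displayed format in this message can be tried out with `rfc2047-decode-string', the same function that appears in the backtrace earlier in the thread. This is only a small illustration; the variable names are hypothetical.

```elisp
(require 'rfc2047)

;; A display name encoded once, as it would normally travel on the wire.
(defvar my-encoded-once "Robert =?utf-8?Q?Plu=C4=ADm?=")

;; The doubly encoded form quoted in the message: the inner encoded
;; word was itself treated as plain text and Q-encoded again.
(defvar my-encoded-twice
  "Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?=")

(rfc2047-decode-string my-encoded-once)
;; => "Robert Pluĭm"

;; Decoding the doubly encoded form once peels off only the outer
;; layer, which per the message is what ended up being displayed.
(rfc2047-decode-string my-encoded-twice)
```

Bytes C4 AD are the UTF-8 encoding of U+012D (ĭ), which is why a single layer of decoding yields the intended name.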
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:00 ` Lars Ingebrigtsen @ 2022-01-19 14:10 ` Robert Pluĭm 2022-01-19 14:24 ` Lars Ingebrigtsen 2022-01-19 16:08 ` Andreas Schwab 1 sibling, 1 reply; 123+ messages in thread From: Robert Pluĭm @ 2022-01-19 14:10 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 >>>>> On Wed, 19 Jan 2022 15:00:04 +0100, Lars Ingebrigtsen <larsi@gnus.org> said: Lars> Robert "=?utf-8?Q?Plu=C4=ADm?=" <rpluim@gmail.com> writes: >> I canʼt recall if this is allowed by the standards or not offhand, but >> as youʼre probably well aware, the major email providers allow you to >> use UTF-8 characters directly in the display name of email addresses, >> without using RFC 2047 encoding. In fact, the last time I did any >> testing of this, Gmail *replaced* RFC 2047 encoded non-ASCII >> characters with their UTF-8 encoding. Lars> Gmail expects you to type in characters representing your name -- they Lars> don't expose the wire format. Why should they? Lars> And your address arrived as (wire format): Lars> Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?= Lars> which displays as Lars> Robert =?utf-8?Q?Plu=C4=ADm?= Lars> 😀 Double fun. Iʼd manually rfc2047 encoded that before sending it, so either gnus or Gmail encoded it again :-) Iʼve turned off the gnus rfc 2047 encoding for this message, let's see what happens. Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:10 ` Robert Pluĭm @ 2022-01-19 14:24 ` Lars Ingebrigtsen 2022-01-19 14:30 ` Robert Pluim 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 14:24 UTC (permalink / raw) To: Robert Pluĭm; +Cc: 51733 Robert Pluĭm <rpluim@gmail.com> writes: > Double fun. Iʼd manually rfc2047 encoded that before sending it, so > either gnus or Gmail encoded it again :-) Of course it was encoded again -- =?utf-8?Q?Plu=C4=ADm?= is a perfectly fine, if unusual, name. 😀 I think it's Bobby Table's brother? -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:24 ` Lars Ingebrigtsen @ 2022-01-19 14:30 ` Robert Pluim 2022-01-19 14:36 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Robert Pluim @ 2022-01-19 14:30 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 >>>>> On Wed, 19 Jan 2022 15:24:54 +0100, Lars Ingebrigtsen <larsi@gnus.org> said: Lars> Robert Pluĭm <rpluim@gmail.com> writes: >> Double fun. Iʼd manually rfc2047 encoded that before sending it, so >> either gnus or Gmail encoded it again :-) Lars> Of course it was encoded again -- =?utf-8?Q?Plu=C4=ADm?= is a perfectly Lars> fine, if unusual, name. 😀 I think it's Bobby Table's brother? Or his cousin. Based on the message buffer I have, gnus didnʼt encode it again, so it must have been gmail. Of course that header was ascii-only, so why did they encode again? What did the From: look like on the message you've just replied to? Gnus should not have encoded it, and it contained utf-8 in the display name. Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:30 ` Robert Pluim @ 2022-01-19 14:36 ` Lars Ingebrigtsen 2022-01-19 14:43 ` Robert Pluim 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 14:36 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733 Robert Pluim <rpluim@gmail.com> writes: > Based on the message buffer I have, gnus didnʼt encode it again, so it > must have been gmail. Of course that header was ascii-only, so why did > they encode again? RFC2047 isn't just about ASCII -- it's about a bunch of other unsafe characters, like =, which will trigger encoding of (naked) words that contain those characters. > What did the From: look like on the message you've just replied to? > Gnus should not have encoded it, and it contained utf-8 in the display > name. I included the From in wire format and displayed format already. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:36 ` Lars Ingebrigtsen @ 2022-01-19 14:43 ` Robert Pluim 0 siblings, 0 replies; 123+ messages in thread From: Robert Pluim @ 2022-01-19 14:43 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 >>>>> On Wed, 19 Jan 2022 15:36:34 +0100, Lars Ingebrigtsen <larsi@gnus.org> said: Lars> Robert Pluim <rpluim@gmail.com> writes: >> Based on the message buffer I have, gnus didnʼt encode it again, so it >> must have been gmail. Of course that header was ascii-only, so why did >> they encode again? Lars> RFC2047 isn't just about ASCII -- it's about a bunch of other unsafe Lars> characters, like =, which will trigger encoding of (naked) words that Lars> contain those characters. >> What did the From: look like on the message you've just replied to? >> Gnus should not have encoded it, and it contained utf-8 in the display >> name. Lars> I included the From in wire format and displayed format already. From the original message where I rfc2047 encoded it myself, yes. The one after I didnʼt encode manually, and turned off the gnus encoding. Looking at the bug archive, for Message-ID: <874k5zolz0.fsf@gmail.com> we have: From: Robert Pluĭm <rpluim@gmail.com> with no rfc2047 in sight. Gmail is weird, let's go shopping :-) Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:00 ` Lars Ingebrigtsen 2022-01-19 14:10 ` Robert Pluĭm @ 2022-01-19 16:08 ` Andreas Schwab 2022-01-19 16:47 ` Robert Pluim 1 sibling, 1 reply; 123+ messages in thread From: Andreas Schwab @ 2022-01-19 16:08 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, Robert Pluĭm On Jan 19 2022, Lars Ingebrigtsen wrote: > And your address arrived as (wire format): > > Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?= > > which displays as > > Robert =?utf-8?Q?Plu=C4=ADm?= Looks like there is a bug in gnus-read-ephemeral-emacs-bug-group, as it shows this header line in the raw article: From: Robert =?utf-8?Q?Plu=C4=ADm?= <rpluim@gmail.com> The debbugs.gnu.org web interface gets it right, including the downloadable mbox contents. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 16:08 ` Andreas Schwab @ 2022-01-19 16:47 ` Robert Pluim 2022-01-19 16:51 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Robert Pluim @ 2022-01-19 16:47 UTC (permalink / raw) To: Andreas Schwab; +Cc: 51733, Lars Ingebrigtsen >>>>> On Wed, 19 Jan 2022 17:08:13 +0100, Andreas Schwab <schwab@linux-m68k.org> said: Andreas> On Jan 19 2022, Lars Ingebrigtsen wrote: >> And your address arrived as (wire format): >> >> Robert =?us-ascii?Q?=3D=3Futf-8=3FQ=3FPlu=3DC4=3DADm=3F=3D?= >> >> which displays as >> >> Robert =?utf-8?Q?Plu=C4=ADm?= Andreas> Looks like there is a bug in gnus-read-ephemeral-emacs-bug-group, as it Andreas> shows this header line in the raw article: Andreas> From: Robert =?utf-8?Q?Plu=C4=ADm?= Andreas> <rpluim@gmail.com> Andreas> The debbugs.gnu.org web interface gets it right, including the Andreas> downloadable mbox contents. The downloaded mbox looks correct, but https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#281 shows From: Robert =?utf-8?Q?Plu=C4=ADm?= <rpluim <at> gmail.com> for me. And https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#305 shows From: Robert Pluĭm <rpluim <at> gmail.com> which I thought Lars said was not allowed? Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 16:47 ` Robert Pluim @ 2022-01-19 16:51 ` Lars Ingebrigtsen 2022-01-19 16:57 ` Robert Pluim 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 16:51 UTC (permalink / raw) To: Robert Pluim; +Cc: 51733, Andreas Schwab Robert Pluim <rpluim@gmail.com> writes: > https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#305 shows > > From: Robert Pluĭm <rpluim <at> gmail.com> > > which I thought Lars said was not allowed? A web page can show whatever it wants, surely? -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 16:51 ` Lars Ingebrigtsen @ 2022-01-19 16:57 ` Robert Pluim 0 siblings, 0 replies; 123+ messages in thread From: Robert Pluim @ 2022-01-19 16:57 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, Andreas Schwab >>>>> On Wed, 19 Jan 2022 17:51:32 +0100, Lars Ingebrigtsen <larsi@gnus.org> said: Lars> Robert Pluim <rpluim@gmail.com> writes: >> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51733#305 shows >> >> From: Robert Pluĭm <rpluim <at> gmail.com> >> >> which I thought Lars said was not allowed? Lars> A web page can show whatever it wants, surely? Indeed, and the mbox is correct, so Iʼm going to have to retract my maligning of gmail :-) Robert -- ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 18:44 ` Eli Zaretskii 2022-01-19 9:21 ` Robert Pluim @ 2022-01-19 9:25 ` Lars Ingebrigtsen 2022-01-19 11:51 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 9:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > Does textsec-email-suspicious-p expect non-ASCII email addresses to be > RFC 2047 encoded? Yes. > If so, it will not work in the Rmail display buffers, where email > addresses are shown decoded. For non-ASCII names the function signals > an error. Rmail does have access to the encoded header, so it'll just have to call the textsec function before it decodes it (and displays it). -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 9:25 ` Lars Ingebrigtsen @ 2022-01-19 11:51 ` Eli Zaretskii 2022-01-19 12:54 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 11:51 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 10:25:42 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Does textsec-email-suspicious-p expect non-ASCII email addresses to be > > RFC 2047 encoded? > > Yes. > > > If so, it will not work in the Rmail display buffers, where email > > addresses are shown decoded. For non-ASCII names the function signals > > an error. > > Rmail does have access to the encoded header, so it'll just have to call > the textsec function before it decodes it (and displays it). This is unfortunate. It means, for example, that a simple lazy discovery of suspicious addresses by scanning the email reading buffer with regular expressions will not work, and the feature must instead scan the original mbox buffer. Why cannot we lift this restriction? mail-header-parse-address is not the only way to parse email addresses. Or maybe we could encode the email address if the original one causes an error? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 11:51 ` Eli Zaretskii @ 2022-01-19 12:54 ` Lars Ingebrigtsen 2022-01-19 13:01 ` Eli Zaretskii 2022-01-19 13:36 ` Andreas Schwab 0 siblings, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 12:54 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > This is unfortunate. It means, for example, that a simple lazy > discovery of suspicious addresses by scanning the email reading buffer > with regular expressions will not work, and the feature must instead > scan the original mbox buffer. There's no scanning -- rmail displays the From header, right? So it does decoding before displaying the header. It has to do the textsec stuff first, too. > Why cannot we lift this restriction? mail-header-parse-address is not > the only way to parse email addresses. Or maybe we could encode the > email address if the original one causes an error? There is no reliable way to parse a decoded mail address, and since this is a security thing, we don't want to do DWIM and guesses (which is what you have to do when composing a valid email address from a string like "Fóo, Jr. <foo@example.com>"). -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:54 ` Lars Ingebrigtsen @ 2022-01-19 13:01 ` Eli Zaretskii 2022-01-19 13:06 ` Lars Ingebrigtsen 2022-01-19 13:36 ` Andreas Schwab 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:01 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 13:54:37 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > This is unfortunate. It means, for example, that a simple lazy > > discovery of suspicious addresses by scanning the email reading buffer > > with regular expressions will not work, and the feature must instead > > scan the original mbox buffer. > > There's no scanning -- rmail displays the From header, right? So it > does decoding before displaying the header. It has to do the textsec > stuff first, too. Not if textsec is optional, it doesn't. And I think your mental model of how Rmail presents the email in the reading buffer is not accurate. > > Why cannot we lift this restriction? mail-header-parse-address is not > > the only way to parse email addresses. Or maybe we could encode the > > email address if the original one causes an error? > > There is no reliable way to parse a decoded mail address, and since this > is a security thing, we don't want to do DWIM and guesses (which is what > you have to do when composing a valid email address from a string like > "Fóo, Jr. <foo@example.com>"). I think Robert just suggested a way? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:01 ` Eli Zaretskii @ 2022-01-19 13:06 ` Lars Ingebrigtsen 2022-01-19 13:11 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:06 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: >> There's no scanning -- rmail displays the From header, right? So it >> does decoding before displaying the header. It has to do the textsec >> stuff first, too. > > Not if textsec is optional, it doesn't. I don't understand what you mean here. rmail will call (decorate-suspicious-email from) and then insert the result into the buffer. If textsec is switched off, it'll just return `from' as is. > And I think your mental model of how Rmail presents the email in the > reading buffer is not accurate. Here's what it does today: ;; Decode any RFC2047 encoded message headers. (if rmail-enable-mime (with-current-buffer rmail-view-buffer (rfc2047-decode-region (point-min) (progn (search-forward "\n\n" nil 'move) (point)))))) It'll just have to call (insert (rfc2047-decode-string (decorate-suspicious-email (substring ...)))) instead. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:06 ` Lars Ingebrigtsen @ 2022-01-19 13:11 ` Eli Zaretskii 2022-01-19 13:16 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:11 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 14:06:56 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> There's no scanning -- rmail displays the From header, right? So it > >> does decoding before displaying the header. It has to do the textsec > >> stuff first, too. > > > > Not if textsec is optional, it doesn't. > > I don't understand what you mean here. rmail will call > (decorate-suspicious-email from) and then insert the result into the > buffer. If textsec is switched off, it'll just return `from' as is. But From is not the only place where a suspicious address could hide. It could also be in the body, or in the quotation parts. We cannot rely on header decoding alone to do this job well. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:11 ` Eli Zaretskii @ 2022-01-19 13:16 ` Lars Ingebrigtsen 2022-01-19 13:25 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:16 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > But From is not the only place where a suspicious address could hide. > It could also be in the body, or in the quotation parts. We cannot > rely on header decoding alone to do this job well. The scope of the relevant implemented functions is to determine whether the (on-wire) mail headers are suspicious or not, and do so reliably. We can add a slew of other functions for other types of DWIM suspiciousness, of course, but that's outside the remit. (For instance, if you wish to implement a filter that looks for suspicious emails, you'd typically find anything that looks like an email, see whether it can be RFC2047 encoded, encode it, and then call the -email-suspicious-p function.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:16 ` Lars Ingebrigtsen @ 2022-01-19 13:25 ` Eli Zaretskii 2022-01-19 13:31 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:25 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 14:16:48 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > But From is not the only place where a suspicious address could hide. > > It could also be in the body, or in the quotation parts. We cannot > > rely on header decoding alone to do this job well. > > The scope of the relevant implemented functions is to determine whether the > (on-wire) mail headers are suspicious or not, and do so reliably. We > can add a slew of other functions for other types of DWIM > suspiciousness, of course, but that's outside the remit. I disagree with this narrow definition of the scope. textsec is more general, and should not limit itself to specific wire protocols. I'm not asking to _replace_ RFC2047 support, I'm saying that we should also support email addresses that were already decoded, for the use cases where that could be more convenient or where the wire level is unavailable. Why would you object to extending these functions so that they could support decoded email addresses? What harm could that possibly do? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:25 ` Eli Zaretskii @ 2022-01-19 13:31 ` Lars Ingebrigtsen 2022-01-19 13:35 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:31 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > I'm not asking to _replace_ RFC2047 support, I'm saying that we should > also support email addresses that were already decoded, for the use > cases where that could be more convenient or where the wire level is > unavailable. These already exist. The applications can call *-name-suspicious-p (etc) individually, if they want to. > Why would you object to extending these functions so that they could > support decoded email addresses? What harm could that possibly do? That's the point -- when doing DWIM parsing, the function can't reliably say whether a string is a suspicious email address, because the attacker may construct a name part, that when decoded, confuses the address parser, and thereby escapes domain/local part checking. (Think of various combinations of names that contain "@" and "," characters.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
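[Editorial sketch] The ordering problem described in the message above (an attacker-controlled name part containing "@" and "," confusing a parser that runs after decoding) can be illustrated outside Emacs. This is a hedged Python sketch using the standard-library `email` package, not `mail-header-parse-address` or the textsec code; the header value, the example domains, and the helper `decode_words` are all invented for illustration.

```python
from email.header import decode_header
from email.utils import getaddresses


def decode_words(value):
    """Decode the RFC 2047 encoded words in VALUE exactly once."""
    pieces = []
    for part, charset in decode_header(value):
        if isinstance(part, bytes):
            # Unencoded runs come back with charset None; treat them as ASCII.
            part = part.decode(charset or "ascii")
        pieces.append(part)
    return "".join(pieces)


# Hypothetical wire-format header: the encoded display name hides "@" and ",".
wire = "=?utf-8?Q?a=40good.example=2C_Bob?= <bob@evil.example>"

# Parse first, decode second: the structure comes from the wire format,
# so there is exactly one mailbox and its address is unambiguous.
parsed_first = [(decode_words(name), addr)
                for name, addr in getaddresses([wire])]

# Decode first, parse second: the decoded "," splits the header and the
# decoded "@" makes the attacker's text look like a mailbox of its own.
decoded_first = getaddresses([decode_words(wire)])

print(parsed_first)
print(decoded_first)
```

In the second ordering a checker would inspect `a@good.example`, a string the sender chose freely, while the address that actually receives replies is `bob@evil.example`.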
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:31 ` Lars Ingebrigtsen @ 2022-01-19 13:35 ` Eli Zaretskii 0 siblings, 0 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 13:35 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 14:31:11 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > I'm not asking to _replace_ RFC2047 support, I'm saying that we should > > also support email addresses that were already decoded, for the use > > cases where that could be more convenient or where the wire level is > > unavailable. > > These already exist. The applications can call *-name-suspicious-p > (etc) individually, if they want to. I don't have a NAME, I have a full email address. > > Why would you object to extending these functions so that they could > > support decoded email addresses? What harm could that possibly do? > > That's the point -- when doing DWIM parsing I didn't say DWIM, you did. > the function can't reliably > say whether a string is a suspicious email address, because the attacker > may construct a name part, that when decoded, confuses the address > parser, and thereby escapes domain/local part checking. (Think of > various combinations of names that contain "@" and "," characters.) When the wire format is gone, this is all I have left. You are saying we should leave this case without a solution. So be it. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:54 ` Lars Ingebrigtsen 2022-01-19 13:01 ` Eli Zaretskii @ 2022-01-19 13:36 ` Andreas Schwab 2022-01-19 13:57 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Andreas Schwab @ 2022-01-19 13:36 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 On Jan 19 2022, Lars Ingebrigtsen wrote: > There's no scanning -- rmail displays the From header, right? So it > does decoding before displaying the header. It has to do the textsec > stuff first, too. I don't think it makes sense to run the textsec check on the encoded address, since that will always be ASCII-only. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:36 ` Andreas Schwab @ 2022-01-19 13:57 ` Lars Ingebrigtsen 2022-01-19 14:06 ` Andreas Schwab 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:57 UTC (permalink / raw) To: Andreas Schwab; +Cc: 51733 Andreas Schwab <schwab@linux-m68k.org> writes: > I don't think it makes sense to run the textsec check on the encoded > address, since that will always be ASCII-only. We decode the header after parsing it (and before doing the textsec tests). -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:57 ` Lars Ingebrigtsen @ 2022-01-19 14:06 ` Andreas Schwab 2022-01-19 14:09 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Andreas Schwab @ 2022-01-19 14:06 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 On Jan 19 2022, Lars Ingebrigtsen wrote: > Andreas Schwab <schwab@linux-m68k.org> writes: > >> I don't think it makes sense to run the textsec check on the encoded >> address, since that will always be ASCII-only. > > We decode the header after parsing it (and before doing the textsec > tests). Then why not allow running textsec on the decoded header directly? If you have to encode it first you have to do DWIM parsing anyway. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:06 ` Andreas Schwab @ 2022-01-19 14:09 ` Lars Ingebrigtsen 2022-01-19 14:13 ` Andreas Schwab 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 14:09 UTC (permalink / raw) To: Andreas Schwab; +Cc: 51733 Andreas Schwab <schwab@linux-m68k.org> writes: > Then why not allow running textsec on the decoded header directly? Consider somebody sending you an email containing @", characters in the name part, and then you decode the address, and then run the parsing function. The attacker would then have a wide attack surface to trick the checker into checking the wrong parts of the address. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:09 ` Lars Ingebrigtsen @ 2022-01-19 14:13 ` Andreas Schwab 2022-01-19 14:33 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Andreas Schwab @ 2022-01-19 14:13 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 On Jan 19 2022, Lars Ingebrigtsen wrote: > Consider somebody sending you an email containing @", characters in the > name part, and then you decode the address, and then run the parsing > function. The attacker would then have a wide attack surface to trick > the checker into checking the wrong parts of the address. Isn't that the whole point of textsec? -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:13 ` Andreas Schwab @ 2022-01-19 14:33 ` Lars Ingebrigtsen 2022-01-19 14:39 ` Andreas Schwab 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 14:33 UTC (permalink / raw) To: Andreas Schwab; +Cc: 51733 Andreas Schwab <schwab@linux-m68k.org> writes: > On Jan 19 2022, Lars Ingebrigtsen wrote: > >> Consider somebody sending you an email containing @", characters in the >> name part, and then you decode the address, and then run the parsing >> function. The attacker would then have a wide attack surface to trick >> the checker into checking the wrong parts of the address. > > Isn't that the whole point of textsec? It's perfectly valid to have a From: "larsi@example.com" <larsi@other.com> address. It's unambiguous, and the responses will go to larsi@other.com. Of course, it's... suspicious... but not on the Unicode level. (I'll also be adding some non-Unicode bits to textsec, like <a href="http://foo.bar">http://other.bar</a>.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
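[Editorial sketch] The non-Unicode kind of suspiciousness mentioned in the message above (a display name that is itself an address different from the real one) could be checked along the following lines. This is only a Python sketch under stated assumptions: the function name `display_name_masks_address` and the loose `ADDRESS_LIKE` pattern are hypothetical, and nothing here is taken from the textsec implementation.

```python
import re
from email.utils import parseaddr

# Deliberately loose pattern for "something that looks like an address".
ADDRESS_LIKE = re.compile(r"[^\s@<>]+@[^\s@<>]+")


def display_name_masks_address(header_value):
    """Return True if the display name shows a different address
    from the one mail would actually be sent to."""
    name, addr = parseaddr(header_value)
    claimed = ADDRESS_LIKE.search(name)
    return bool(claimed) and claimed.group(0).lower() != addr.lower()


print(display_name_masks_address('"larsi@example.com" <larsi@other.com>'))
print(display_name_masks_address('"Lars Ingebrigtsen" <larsi@other.com>'))
```

The header is syntactically valid in both cases; the check flags only the mismatch between what the name claims and where replies go.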
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:33 ` Lars Ingebrigtsen @ 2022-01-19 14:39 ` Andreas Schwab 2022-01-19 14:44 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Andreas Schwab @ 2022-01-19 14:39 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 On Jan 19 2022, Lars Ingebrigtsen wrote: > Andreas Schwab <schwab@linux-m68k.org> writes: > >> On Jan 19 2022, Lars Ingebrigtsen wrote: >> >>> Consider somebody sending you an email containing @", characters in the >>> name part, and then you decode the address, and then run the parsing >>> function. The attacker would then have a wide attack surface to trick >>> the checker into checking the wrong parts of the address. >> >> Isn't that the whole point of textsec? > > It's perfectly valid to have a > > From: "larsi@example.com" <larsi@other.com> > > address. It's unambiguous, and the responses will go to > larsi@other.com. What's your point? -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:39 ` Andreas Schwab @ 2022-01-19 14:44 ` Lars Ingebrigtsen 0 siblings, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 14:44 UTC (permalink / raw) To: Andreas Schwab; +Cc: 51733 Andreas Schwab <schwab@linux-m68k.org> writes: > What's your point? You first. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 12:51 ` Lars Ingebrigtsen 2022-01-18 18:44 ` Eli Zaretskii @ 2022-01-18 18:48 ` Eli Zaretskii 2022-01-18 20:15 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-18 18:48 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 These two tests seem to reveal a bug in the implementation: (should (textsec-name-suspicious-p "\N{LEFT-TO-RIGHT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen")) (should (textsec-name-suspicious-p "\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen"))) LRM and RLM are stateless controls, so they shouldn't be flagged as suspicious, AFAIU. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 18:48 ` Eli Zaretskii @ 2022-01-18 20:15 ` Eli Zaretskii 2022-01-18 20:31 ` Eli Zaretskii 2022-01-19 13:38 ` Lars Ingebrigtsen 0 siblings, 2 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-18 20:15 UTC (permalink / raw) To: larsi; +Cc: 51733 > Date: Tue, 18 Jan 2022 20:48:46 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 51733@debbugs.gnu.org > > These two tests seem to reveal a bug in the implementation: > > (should (textsec-name-suspicious-p > "\N{LEFT-TO-RIGHT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen")) > (should (textsec-name-suspicious-p > "\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}\N{RIGHT-TO-LEFT MARK}\N{LEFT-TO-RIGHT MARK}Lars Ingebrigtsen"))) > > LRM and RLM are stateless controls, so they shouldn't be flagged as > suspicious, AFAIU. I think I get it now: it's because of textsec-suspicious-nonspacing-p, which forbids consecutive nonspacing characters, right? But then I don't think it's correct to consider Cf characters for that purpose: UTS#39 explicitly talks about nonspacing _marks_, i.e. Mn and Me characters. Where did you see Cf and Cc as well? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 20:15 ` Eli Zaretskii @ 2022-01-18 20:31 ` Eli Zaretskii 2022-01-19 13:38 ` Lars Ingebrigtsen 1 sibling, 0 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-18 20:31 UTC (permalink / raw) To: larsi; +Cc: 51733 > Date: Tue, 18 Jan 2022 22:15:39 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 51733@debbugs.gnu.org > > > LRM and RLM are stateless controls, so they shouldn't be flagged as > > suspicious, AFAIU. > > I think I get it now: it's because of textsec-suspicious-nonspacing-p, > which forbids consecutive nonspacing characters, right? But then I > don't think it's correct to consider Cf characters for that purpose: > UTS#39 explicitly talks about nonspacing _marks_, i.e. Mn and Me > characters. Where did you see Cf and Cc as well? Including Cf characters in this suspicious category is also problematic because the ZERO WIDTH characters (like ZWJ and ZWNJ) are Cf, and it is not reasonable to limit the use of those, as some scripts (Arabic, for example) use them quite a lot. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 20:15 ` Eli Zaretskii 2022-01-18 20:31 ` Eli Zaretskii @ 2022-01-19 13:38 ` Lars Ingebrigtsen 1 sibling, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > I think I get it now: it's because of textsec-suspicious-nonspacing-p, > which forbids consecutive nonspacing characters, right? Yup. > But then I don't think it's correct to consider Cf characters for that > purpose: UTS#39 explicitly talks about nonspacing _marks_, i.e. Mn and > Me characters. Where did you see Cf and Cc as well? Good catch; I'll amend the code and test. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
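The distinction agreed on in this subthread (only nonspacing marks, i.e. general categories Mn and Me, should count, while Cf format controls such as LRM, RLM, ZWJ and ZWNJ should not) can be sketched as follows. This is only an illustration, not the textsec code: the `sketch-` function names are hypothetical, and the only thing assumed from Emacs is the built-in `get-char-code-property`.

```elisp
;;; -*- lexical-binding: t -*-

(defun sketch-nonspacing-mark-p (char)
  "Return t if CHAR is a nonspacing or enclosing mark (Mn or Me).
Format controls (Cf) such as LRM or ZWJ deliberately do not qualify."
  (and (memq (get-char-code-property char 'general-category) '(Mn Me))
       t))

(defun sketch-repeated-nonspacing-mark-p (string)
  "Return t if STRING has the same nonspacing mark twice in a row.
Hypothetical helper; it only illustrates the Mn/Me restriction."
  (let ((prev nil)
        (found nil))
    (dolist (char (append string nil) found)
      ;; Only an identical mark directly after itself counts.
      (when (and (eq char prev) (sketch-nonspacing-mark-p char))
        (setq found t))
      (setq prev char))))
```

With this restriction, a name prefixed by two LEFT-TO-RIGHT MARKs is no longer flagged by the nonspacing check, whereas a doubled COMBINING ACUTE ACCENT still is.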
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 12:47 ` Lars Ingebrigtsen 2022-01-18 12:51 ` Lars Ingebrigtsen @ 2022-01-18 15:05 ` Eli Zaretskii 2022-01-19 12:49 ` Michael Albinus 2022-01-19 13:35 ` Lars Ingebrigtsen 1 sibling, 2 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-18 15:05 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Tue, 18 Jan 2022 13:47:35 +0100 > > I think the places it would make sense to hook this machinery in would > be in: > > * shr (displaying URLs and links) > * Gnus/rmail (displaying email addresses) > * Message (when responding to mail; a prompt "do you really?") > * browse-url (prompt) Sounds reasonable. Perhaps also Tramp (host names)? > There should probably be a customization point? A user option like > `warn-about-suspicious-identifiers'? Is this a go/no-go test, or are there levels? If there are levels, perhaps something similar to NSM would be more appropriate? (And maybe levels of NSM should determine the default textsec level?) ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 15:05 ` Eli Zaretskii @ 2022-01-19 12:49 ` Michael Albinus 2022-01-19 12:59 ` Eli Zaretskii 2022-01-19 13:35 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Michael Albinus @ 2022-01-19 12:49 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, Lars Ingebrigtsen Eli Zaretskii <eliz@gnu.org> writes: Hi Eli, >> I think the places it would make sense to hook this machinery in would >> be in: >> >> * shr (displaying URLs and links) >> * Gnus/rmail (displaying email addresses) >> * Message (when responding to mail; a prompt "do you really?") >> * browse-url (prompt) > > Sounds reasonable. > > Perhaps also Tramp (host names)? --8<---------------cut here---------------start------------->8--- (defconst tramp-host-regexp "[[:alnum:]_.%-]+" "Regexp matching host names.") ;; The following regexp is a bit sloppy. But it shall serve our ;; purposes. It covers also IPv4 mapped IPv6 addresses, like in ;; "::ffff:192.168.0.1". (defconst tramp-ipv6-regexp "\\(?:[[:alnum:]]*:\\)+[[:alnum:].]+" "Regexp matching IPv6 addresses.") --8<---------------cut here---------------end--------------->8--- This should be sufficient, shouldn't it? Best regards, Michael. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 12:49 ` Michael Albinus @ 2022-01-19 12:59 ` Eli Zaretskii 0 siblings, 0 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 12:59 UTC (permalink / raw) To: Michael Albinus; +Cc: 51733, larsi > From: Michael Albinus <michael.albinus@gmx.de> > Cc: Lars Ingebrigtsen <larsi@gnus.org>, 51733@debbugs.gnu.org > Date: Wed, 19 Jan 2022 13:49:56 +0100 > > > Perhaps also Tramp (host names)? > > --8<---------------cut here---------------start------------->8--- > (defconst tramp-host-regexp "[[:alnum:]_.%-]+" > "Regexp matching host names.") > > ;; The following regexp is a bit sloppy. But it shall serve our > ;; purposes. It covers also IPv4 mapped IPv6 addresses, like in > ;; "::ffff:192.168.0.1". > (defconst tramp-ipv6-regexp "\\(?:[[:alnum:]]*:\\)+[[:alnum:].]+" > "Regexp matching IPv6 addresses.") > --8<---------------cut here---------------end--------------->8--- > > This should be sufficient, shouldn't it? If this isn't too restrictive, sure. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 15:05 ` Eli Zaretskii 2022-01-19 12:49 ` Michael Albinus @ 2022-01-19 13:35 ` Lars Ingebrigtsen 1 sibling, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:35 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: >> There should probably be a customization point? A user option like >> `warn-about-suspicious-identifiers'? > > Is this a go/no-go test, or are there levels? If there are levels, > perhaps something similar to NSM would be more appropriate? (And > maybe levels of NSM should determine the default textsec level?) I was thinking go/no-go -- I don't immediately see any different levels of suspiciousness that'd be interesting for the user. But we can tweak that later, I guess, if necessary. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 12:00 ` Lars Ingebrigtsen 2022-01-18 12:47 ` Lars Ingebrigtsen @ 2022-01-18 14:59 ` Eli Zaretskii 2022-01-19 13:56 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-18 14:59 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Tue, 18 Jan 2022 13:00:42 +0100 > > OK, there's glyphless--bidi-control-characters, and I could make that > non-private, and add the three missing ones... I don't think that's what you want, because AFAIU that includes LRM, RLM, and ALM, which are stateless. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 14:59 ` Eli Zaretskii @ 2022-01-19 13:56 ` Lars Ingebrigtsen 0 siblings, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:56 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: >> OK, there's glyphless--bidi-control-characters, and I could make that >> non-private, and add the three missing ones... > > I don't think that's what you want, because AFAIU that includes LRM, > RLM, and ALM, which are stateless. Yes, but then we remove those explicitly in the test, so I think that's OK... -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-18 11:26 ` Lars Ingebrigtsen 2022-01-18 11:37 ` Lars Ingebrigtsen @ 2022-01-18 14:55 ` Eli Zaretskii 1 sibling, 0 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-18 14:55 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Tue, 18 Jan 2022 12:26:30 +0100 > > Next stupid question: > > --- > It must not contain any stateful bidirectional format characters. > > That is, no [:bidicontrol:] except for the LRM, RLM, and ALM, since the bidirectional controls could influence the ordering of characters outside the quotes. > --- > > We don't have the :bidicontrol: regexp class. Do we have another way to > classify bidi control characters? They have class Cf, but so do many > other non-bidi control characters... I don't think you need any classification: the offending control characters are very few, so you could just test for them explicitly. ^ permalink raw reply [flat|nested] 123+ messages in thread
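The "just test for them explicitly" suggestion could look roughly like the sketch below. It lists the nine Unicode bidi controls that carry state (embeddings, overrides, isolates and their terminators) and leaves out the stateless LRM, RLM and ALM. The names `sketch-stateful-bidi-controls` and `sketch-stateful-bidi-control-p` are hypothetical and are not claimed to match what textsec ended up using.

```elisp
;;; -*- lexical-binding: t -*-

(defconst sketch-stateful-bidi-controls
  '(#x202A   ; LEFT-TO-RIGHT EMBEDDING
    #x202B   ; RIGHT-TO-LEFT EMBEDDING
    #x202C   ; POP DIRECTIONAL FORMATTING
    #x202D   ; LEFT-TO-RIGHT OVERRIDE
    #x202E   ; RIGHT-TO-LEFT OVERRIDE
    #x2066   ; LEFT-TO-RIGHT ISOLATE
    #x2067   ; RIGHT-TO-LEFT ISOLATE
    #x2068   ; FIRST STRONG ISOLATE
    #x2069)  ; POP DIRECTIONAL ISOLATE
  "Bidi controls that affect the ordering of following text.
LRM (U+200E), RLM (U+200F) and ALM (U+061C) are stateless and omitted.")

(defun sketch-stateful-bidi-control-p (string)
  "Return t if STRING contains any stateful bidi control character."
  (let ((found nil))
    (dolist (char (append string nil) found)
      (when (memq char sketch-stateful-bidi-controls)
        (setq found t)))))
```

An explicit list keeps the check independent of the general category, so other Cf characters are unaffected.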
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:48 ` Eli Zaretskii 2022-01-17 19:08 ` Eli Zaretskii @ 2022-01-19 13:55 ` Lars Ingebrigtsen 2022-01-19 14:14 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 13:55 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > I think we should first determine what kinds of applications may need > this, and take it from there. The initial number of "confusability > with" classes can be very small, and we can add more as we discover > interesting use cases. The full number is pretty much infinite, I > think, but I'm not sure Emacs needs to support all of them OOTB. We > could support some of the popular ones, and provide infrastructure for > developing more. Yes. I was thinking about this bit, which isn't implemented yet (although the utility functions for it basically are). ---- The process of determining suspect usage of whole-script confusables is more complicated than simply looking at the scripts of the labels in a domain name. For example, it can be perfectly legitimate to have scripts in a SLD (second level domain) not be the same as scripts in a TLD (top-level domain), such as: Cyrillic labels in a domain name with a TLD of .ru or .рф Chinese labels in a domain name with a TLD of .com.au or .com Cyrillic labels that aren’t confusable with Latin with a TLD of .com.au or .com The following high-level algorithm can be used to determine all scripts that contain a whole-script confusable with a string X: Consider Q, the set of all strings confusable with X. Remove all strings from Q whose resolved script set is ∅ or ALL (that is, keep only single-script strings plus those with characters only in Common). Take the union of the resolved script sets of all strings remaining in Q. 
As usual, this algorithm is intended only as a definition; implementations should use an optimized routine that produces the same result. ---- I'm not sure I understand the algorithm they're proposing. I think this shouldn't be suspicious? But I may be wrong: (textsec-domain-suspicious-p "Сгсе.рф") => nil This one should be, but isn't currently: (textsec-domain-suspicious-p "Сгсе.ru") => nil Now, (textsec-ascii-confusable-p "Сгсе.ru") => t and (textsec-ascii-confusable-p "Сгсе.рф") => nil Is that what they mean here? I'm not finding the logic overly clear here. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 13:55 ` Lars Ingebrigtsen @ 2022-01-19 14:14 ` Eli Zaretskii 2022-01-19 14:28 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 14:14 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Wed, 19 Jan 2022 14:55:35 +0100 > > But this should be, but isn't currently: > > (textsec-domain-suspicious-p "Сгсе.ru") > => nil Why? .ru is a top-level domain, it doesn't affect what should be before the dot, I think? If you replace "Сгсе.ru" with "Cгсе.ru", you do get a warning. > Is that what they mean here? I'm not sure I understand the purpose of finding which scripts "contain a whole-script confusable with a string X". What are we supposed to do with the resulting list? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:14 ` Eli Zaretskii @ 2022-01-19 14:28 ` Lars Ingebrigtsen 2022-01-19 14:57 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 14:28 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > Why? .ru is a top-level domain, it doesn't affect what should be > before the dot, I think? > > If you replace "Сгсе.ru" with "Cгсе.ru", you do get a warning. Yes. But "Сгсе.ru" is a whole-script confusable with "Crce.ru", and is therefore suspicious. >> Is that what they mean here? > > I'm not sure I understand the purpose of finding which scripts > "contain a whole-script confusable with a string X". What are we > supposed to do with the resulting list? I think this standard was written by somebody with a PhD in Philosophy, and not a programmer, so the language is very high falutin'. So they're not actually suggesting that a list should be made, but the result should be mathematically equivalent with the result of the mathematical algorithm described. I just don't understand what he's saying here. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:28 ` Lars Ingebrigtsen @ 2022-01-19 14:57 ` Eli Zaretskii 2022-01-19 15:45 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 14:57 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Wed, 19 Jan 2022 15:28:51 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Why? .ru is a top-level domain, it doesn't affect what should be > > before the dot, I think? > > > > If you replace "Сгсе.ru" with "Cгсе.ru", you do get a warning. > > Yes. But "Сгсе.ru" is a whole-script confusable with "Crce.ru", and is > therefore suspicious. OK, but why do you think "Сгсе.ru" is confusable? The SLD part is entirely made of single-script characters, and UTS#39 explicitly allows that: [...] it can be perfectly legitimate to have scripts in a SLD (second level domain) not be the same as scripts in a TLD (top-level domain), such as: Cyrillic labels in a domain name with a TLD of .ru or .рф That's your case, isn't it? > >> Is that what they mean here? > > > > I'm not sure I understand the purpose of finding which scripts > > "contain a whole-script confusable with a string X". What are we > > supposed to do with the resulting list? > > I think this standard was written by somebody with a PhD in Philosophy, > and not a programmer, so the language is very high falutin'. > > So they're not actually suggesting that a list should be made, but the > result should be mathematically equivalent with the result of the > mathematical algorithm described. I just don't understand what he's > saying here. Regardless of what they are saying, I don't think the above is suitable for production. I think it should be enough to see whether there could be confusion with the corresponding ASCII characters from confusables.txt. 
^ permalink raw reply [flat|nested] 123+ messages in thread
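To make the difference between the two examples above concrete, here is a rough sketch that collects the scripts used in a domain label. It uses `char-script-table` as a stand-in, which is an assumption for illustration only: the thread itself notes that this table differs from the Unicode Script and Script_Extensions data. The function name `sketch-label-scripts` is hypothetical.

```elisp
;;; -*- lexical-binding: t -*-

(defun sketch-label-scripts (label)
  "Return the distinct `char-script-table' scripts used in LABEL.
Scripts are listed in order of first appearance."
  (let ((scripts nil))
    (dolist (char (append label nil))
      (let ((script (aref char-script-table char)))
        (unless (memq script scripts)
          (push script scripts))))
    (nreverse scripts)))

;; All-Cyrillic label: a single script, so a mixed-script test passes.
(sketch-label-scripts "\u0421\u0433\u0441\u0435")

;; Same label with a Latin capital C first: two scripts are reported.
(sketch-label-scripts "C\u0433\u0441\u0435")
```

A single-script result is exactly why the all-Cyrillic label needs a separate confusability check: a per-label script test alone cannot see that it resembles "Crce".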
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 14:57 ` Eli Zaretskii @ 2022-01-19 15:45 ` Lars Ingebrigtsen 2022-01-19 16:58 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 15:45 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: > OK, but why do you think "Сгсе.ru" is confusable? The SLD part is > entirely made of single-script characters, and UTS#39 explicitly > allows that: > > [...] it can be perfectly legitimate to have scripts in a SLD > (second level domain) not be the same as scripts in a TLD (top-level > domain), such as: > > Cyrillic labels in a domain name with a TLD of .ru or .рф > > That's your case, isn't it? Yes, indeed. But: --- For some applications, it is useful to determine if a given input string has any whole-script confusable. For example, the identifier "ѕсоре" using Cyrillic characters would pass the single-script test described in Section 5.2, Restriction-Level Detection, even though it is likely to be a spoof attempt. --- So "Сгсе.ru" is suspicious in most contexts. > Regardless of what they are saying, I don't think the above is > suitable for production. I think it should be enough to see whether > there could be confusion with the corresponding ASCII characters from > confusables.txt. Yes, so that's what I've done now, but... I'd feel slightly better if I knew what they were actually getting at. I think they're saying that if "foo" is confusable with anything in any other scripts, then it's suspicious? But that sounds unworkable. For instance, "circle.ru" is confusable with "СігсӀе.ru", and perhaps it's suspicious to a Russian, but I don't see how to make a workable function from that. Unless we start bringing in locales, and meh. So perhaps what I've implemented now is sufficient for domains. Anyway, I've implemented the user option and implemented this in shr, so we'll see how that goes. 
If no problems crop up, I'll announce all this in NEWS and document it in the lispref manual tomorrow. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 15:45 ` Lars Ingebrigtsen @ 2022-01-19 16:58 ` Eli Zaretskii 2022-01-19 18:25 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-19 16:58 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Wed, 19 Jan 2022 16:45:29 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > OK, but why do you think "Сгсе.ru" is confusable? The SLD part is > > entirely made of single-script characters, and UTS#39 explicitly > > allows that: > > > > [...] it can be perfectly legitimate to have scripts in a SLD > > (second level domain) not be the same as scripts in a TLD (top-level > > domain), such as: > > > > Cyrillic labels in a domain name with a TLD of .ru or .рф > > > > That's your case, isn't it? > > Yes, indeed. But: > > --- > For some applications, it is useful to determine if a given input string has any whole-script confusable. For example, the identifier "ѕсоре" using Cyrillic characters would pass the single-script test described in Section 5.2, Restriction-Level Detection, even though it is likely to be a spoof attempt. > --- > > So "Сгсе.ru" is suspicious in most contexts. Right, but the functions we had back then didn't yet support that part. > > Regardless of what they are saying, I don't think the above is > > suitable for production. I think it should be enough to see whether > > there could be confusion with the corresponding ASCII characters from > > confusables.txt. > > Yes, so that's what I've done now, but... I'd feel slightly better if I > knew what they were actually getting at. I think they're saying that if > "foo" is confusable with anything in any other scripts, then it's > suspicious? Yes, that's what they meant. > But that sounds unworkable. 
For instance, "circle.ru" is > confusable with "СігсӀе.ru", and perhaps it's suspicious to a Russian, > but I don't see how to make a workable function from that. They've left that to the implementation... Anyway, I think confusable to ASCII is good enough for Emacs for now. > So perhaps what I've implemented now is sufficient for domains. I think it is, yes. It definitely covers a very large chunk of the problem. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-19 16:58 ` Eli Zaretskii @ 2022-01-19 18:25 ` Lars Ingebrigtsen 0 siblings, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-19 18:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 [-- Attachment #1: Type: text/plain, Size: 129 bytes --] I'm not quite sure how noticeable we should be making suspicious things. With the following test file with `M-x eww-open-file': [-- Attachment #2: Type: image/png, Size: 24175 bytes --] [-- Attachment #3: Type: text/plain, Size: 407 bytes --] The ⚠️ will obviously be customisable, but is this generally the amount of attention we should be aiming for? The mouseover on the ⚠️s has the explanation for the suspicion -- we could also output that text into the buffer, but I think that's overkill. Anybody have an opinion? -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no [-- Attachment #4: sus.html --] [-- Type: text/html, Size: 241 bytes --] ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:26 ` Lars Ingebrigtsen 2022-01-17 17:38 ` Lars Ingebrigtsen @ 2022-01-17 17:42 ` Eli Zaretskii 2022-01-17 17:46 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 17:42 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Mon, 17 Jan 2022 18:26:42 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Let's at least call this something other than "script", to avoid > > confusion. > > Sure. But... what. I don't know. script-id? script-class? scriptprop? uniscript? > I made a slight attempt at that by calling it "scripts" instead of > "script", since each character belongs to a list of scripts Does it? UAX#24 says no: The Script property is an enumerated property of type catalog. Its values form a full partition of the codespace: every Unicode code point is assigned a single Script property value. This value is either the explicit value for a specific script, such as Cyrillic, or is one of the following three special values: . Inherited—for characters that may be used with multiple scripts, and that inherit their script from a preceding base character. These include nonspacing combining marks and enclosing combining marks, as well as U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER. . Common—for other characters that may be used with multiple scripts. . Unknown—for unassigned, private-use, noncharacter, and surrogate code points. This seems to say that each character has only a single script property value assigned to it? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:42 ` Eli Zaretskii @ 2022-01-17 17:46 ` Lars Ingebrigtsen 0 siblings, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 17:46 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733, jidanni Eli Zaretskii <eliz@gnu.org> writes: >> I made a slight attempt at that by calling it "scripts" instead of >> "script", since each character belongs to a list of scripts > > Does it? UAX#24 says no: At least in this context. See ScriptExtensions.txt and TR39. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 10:18 ` Eli Zaretskii 2022-01-17 14:54 ` Lars Ingebrigtsen @ 2022-01-17 15:22 ` Eli Zaretskii 2022-01-17 15:25 ` Lars Ingebrigtsen 2022-01-17 15:53 ` Lars Ingebrigtsen 1 sibling, 2 replies; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 15:22 UTC (permalink / raw) To: larsi; +Cc: 51733 > Date: Mon, 17 Jan 2022 12:18:44 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > > It is confusing to have 2 separate properties of a character that are subtly incompatible, and for such obscure properties at that. It will be a source of many problems. So I think we should avoid that if it's feasible. Can we please discuss any real problems that would be caused by using the existing char-table? I've now written a Lisp program to produce the script property according to Unicode vs what we have in char-script-table, so it's possible to see all the differences. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 15:22 ` Eli Zaretskii @ 2022-01-17 15:25 ` Lars Ingebrigtsen 2022-01-17 15:53 ` Lars Ingebrigtsen 1 sibling, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 15:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > I've now written a Lisp program to produce the script property according to > Unicode vs what we have in char-script-table, so it's possible to see > all the differences. There's also lisp/international/uni-scripts.el now. 😀 -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 15:22 ` Eli Zaretskii 2022-01-17 15:25 ` Lars Ingebrigtsen @ 2022-01-17 15:53 ` Lars Ingebrigtsen 2022-01-17 16:31 ` Lars Ingebrigtsen 2022-01-17 16:52 ` Eli Zaretskii 1 sibling, 2 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 15:53 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 I'm now looking at 5.3 Mixed-Number Detection: d U+09EA ( ৪ ) BENGALI DIGIT FOUR can be confused with U+0038 ( 8 ) DIGIT EIGHT. Right, but they recommend implementing this by looking at the digit version of the character first... but... Does Emacs have a function to get the number value of ৪? (Which should be 8. 😀) They then recommend comparing the value with the zero value of that system, and I'm pretty sure we don't have that. I don't quite understand why it's not sufficient to see that we have numbers from two different numbering systems (which is trivial by looking at the Nd category and then comparing the scripts). Does anybody understand why they're doing this in a much more convoluted manner here? I must be missing something: https://www.unicode.org/reports/tr39/#Mixed_Number_Detection -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 15:53 ` Lars Ingebrigtsen @ 2022-01-17 16:31 ` Lars Ingebrigtsen 2022-01-17 16:52 ` Eli Zaretskii 1 sibling, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 16:31 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Lars Ingebrigtsen <larsi@gnus.org> writes: > Does anybody understand why they're doing this in a much more convoluted > manner here? I must be missing something: > > https://www.unicode.org/reports/tr39/#Mixed_Number_Detection I think I understand now -- 0-9 are in `common', so they are (by the definition used in these documents) "the same" script as BENGALI DIGIT FOUR. -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 15:53 ` Lars Ingebrigtsen 2022-01-17 16:31 ` Lars Ingebrigtsen @ 2022-01-17 16:52 ` Eli Zaretskii 2022-01-17 16:57 ` Lars Ingebrigtsen 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 16:52 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Mon, 17 Jan 2022 16:53:49 +0100 > > I'm now looking at 5.3 Mixed-Number Detection: > > d U+09EA ( ৪ ) BENGALI DIGIT FOUR can be confused with U+0038 ( 8 ) > DIGIT EIGHT. > > Right, but they recommend implementing this by looking at the digit > version of the character first... but... Does Emacs have a function to > get the number value of ৪? Yes, we do have that: (get-char-code-property ?৪ 'numeric-value) => 4 > They then recommend comparing the value with the zero value of that > system, and I'm pretty sure we don't have that. Why not? what do you need, exactly? ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 16:52 ` Eli Zaretskii @ 2022-01-17 16:57 ` Lars Ingebrigtsen 2022-01-17 17:02 ` Eli Zaretskii 0 siblings, 1 reply; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 16:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > Yes, we do have that: > > (get-char-code-property ?৪ 'numeric-value) => 4 Cool. >> They then recommend comparing the value with the zero value of that >> system, and I'm pretty sure we don't have that. > > Why not? what do you need, exactly? Just a thinko -- I was wondering whether we had a way to find the zero character, but that's just: (- ?৪ (get-char-code-property ?৪ 'numeric-value)) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 16:57 ` Lars Ingebrigtsen @ 2022-01-17 17:02 ` Eli Zaretskii 2022-01-17 17:04 ` Lars Ingebrigtsen 0 siblings, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-17 17:02 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733 > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org > Date: Mon, 17 Jan 2022 17:57:46 +0100 > > >> They then recommend comparing the value with the zero value of that > >> system, and I'm pretty sure we don't have that. > > > > Why not? what do you need, exactly? > > Just a thinko -- I was wondering whether we had a way to find the zero > character, but that's just: > > (- ?৪ (get-char-code-property ?৪ 'numeric-value)) That's just sheer luck, AFAIU (there are some characters with numeric-value property that are not arranged from zero to 9), but maybe for this particular purpose it's all that's needed. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-17 17:02 ` Eli Zaretskii @ 2022-01-17 17:04 ` Lars Ingebrigtsen 0 siblings, 0 replies; 123+ messages in thread From: Lars Ingebrigtsen @ 2022-01-17 17:04 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 51733 Eli Zaretskii <eliz@gnu.org> writes: > That's just sheer luck, AFAIU (there are some characters with > numeric-value property that are not arranged from zero to 9), but > maybe for this particular purpose it's all that's needed. We're only doing this check for the characters with Nd, which are guaranteed to be organised this way. (There's only three of these number systems, allegedly.) -- (domestic pets only, the antidote for overdose, milk.) bloggy blog: http://lars.ingebrigtsen.no ^ permalink raw reply [flat|nested] 123+ messages in thread
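The mixed-number idea worked out in this subthread (for every Nd character, subtract its digit value from its code point to get the zero of its number system, then see whether more than one zero turns up) can be sketched like this. The `sketch-` names are hypothetical; the sketch relies on the built-in `get-char-code-property` with the `general-category` and `decimal-digit-value` properties, and on decimal digits being encoded contiguously from zero to nine, as stated above for Nd characters.

```elisp
;;; -*- lexical-binding: t -*-

(defun sketch-digit-zeros (string)
  "Return the distinct zero characters of the decimal digits in STRING.
Only characters with general category Nd are considered."
  (let ((zeros nil))
    (dolist (char (append string nil))
      (when (eq (get-char-code-property char 'general-category) 'Nd)
        ;; Nd digits are contiguous, so CHAR minus its value is the zero.
        (let ((zero (- char (get-char-code-property
                             char 'decimal-digit-value))))
          (unless (memq zero zeros)
            (push zero zeros)))))
    (nreverse zeros)))

(defun sketch-mixed-numbers-p (string)
  "Return t if STRING mixes digits from more than one number system."
  (> (length (sketch-digit-zeros string)) 1))
```

Comparing zeros rather than scripts sidesteps the issue noted earlier that ASCII digits are in the Common script and so count as compatible with every other script.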
* bug#51733: 27.1; Detect impossible email addresses better 2022-01-16 17:03 ` Lars Ingebrigtsen 2022-01-16 17:50 ` Lars Ingebrigtsen @ 2022-01-16 18:14 ` Eli Zaretskii 2022-01-16 18:24 ` Eli Zaretskii 1 sibling, 1 reply; 123+ messages in thread From: Eli Zaretskii @ 2022-01-16 18:14 UTC (permalink / raw) To: Lars Ingebrigtsen; +Cc: 51733, jidanni > From: Lars Ingebrigtsen <larsi@gnus.org> > Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org > Date: Sun, 16 Jan 2022 18:03:23 +0100 > > https://www.unicode.org/reports/tr24/tr24-32.html#Scripts_and_Blocks > > As a result, using the block names as simplistic substitute for > script identity generally leads to poor results. > > It looks like we're doing that, though? No, not really. We collect various blocks of the same scripts together. > And indeed: > > (elt char-script-table #xAB65) > => latin > > which is wrong, because that's > > GREEK LETTER SMALL CAPITAL OMEGA > > So we should be populating char-script-table from > http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt instead of > Blocks.txt. So I'll be doing that, too. Beware: the Unicode Script property is not identical to ours! Before throwing away what we have, please consider how many deviations we have in practice, and if they are just a few, let's fix only them individually. It's easy. You will have to add some manual heuristics even if you do use the Unicode Scripts.txt as the basis. ^ permalink raw reply [flat|nested] 123+ messages in thread
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-16 18:24 UTC
To: larsi; +Cc: 51733, jidanni

> Date: Sun, 16 Jan 2022 20:14:08 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
>
> > (elt char-script-table #xAB65)
> > => latin
> >
> > which is wrong, because that's
> >
> > GREEK LETTER SMALL CAPITAL OMEGA

Btw, this is not necessarily an error, because the Latin language did
have the omega letter.  It's not an accident this character is in a
Latin block.
* bug#51733: 27.1; Detect impossible email addresses better
From: Andreas Schwab @ 2022-01-16 18:34 UTC
To: Eli Zaretskii; +Cc: 51733, larsi, jidanni

On Jan 16 2022, Eli Zaretskii wrote:

>> Date: Sun, 16 Jan 2022 20:14:08 +0200
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
>>
>> > (elt char-script-table #xAB65)
>> > => latin
>> >
>> > which is wrong, because that's
>> >
>> > GREEK LETTER SMALL CAPITAL OMEGA
>
> Btw, this is not necessarily an error, because the Latin language did
> have the omega letter.  It's not an accident this character is in a
> Latin block.

The latin omega has its own code points U+A7B6 and U+A7B7 (since Unicode
8.0).

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-16 18:44 UTC
To: Andreas Schwab; +Cc: 51733, larsi, jidanni

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: larsi@gnus.org, 51733@debbugs.gnu.org, jidanni@jidanni.org
> Date: Sun, 16 Jan 2022 19:34:29 +0100
>
> On Jan 16 2022, Eli Zaretskii wrote:
>
> >> Date: Sun, 16 Jan 2022 20:14:08 +0200
> >> From: Eli Zaretskii <eliz@gnu.org>
> >> Cc: 51733@debbugs.gnu.org, jidanni@jidanni.org
> >>
> >> > (elt char-script-table #xAB65)
> >> > => latin
> >> >
> >> > which is wrong, because that's
> >> >
> >> > GREEK LETTER SMALL CAPITAL OMEGA
> >
> > Btw, this is not necessarily an error, because the Latin language did
> > have the omega letter.  It's not an accident this character is in a
> > Latin block.
>
> The latin omega has its own code points U+A7B6 and U+A7B7 (since Unicode
> 8.0).

Yes, I know.  But U+AB65 predates Unicode 8.0.  And it's beside the
point, really: since omega was in the Latin alphabet, it is not a
mistake to give it the Latin script.
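For readers who want to look at the three code points from this exchange, here is a small Python sketch (not Emacs code). Python's standard `unicodedata` module does not expose the Unicode Script property, so the hypothetical helper `name_prefix` uses the first word of the character name as a rough stand-in. That is an assumption made for illustration only; it is neither the Script property nor what `char-script-table` stores.

```python
import unicodedata


def name_prefix(code_point: int) -> str:
    """Return the first word of the character's Unicode name."""
    return unicodedata.name(chr(code_point)).split()[0]


# U+AB65 is the character queried in the thread; U+A7B6 and U+A7B7 are
# the Latin omega code points mentioned in the reply.
for cp in (0xAB65, 0xA7B6, 0xA7B7):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} -> {name_prefix(cp)}")
```

The name prefix of U+AB65 disagrees with the `latin` value quoted above, which is the discrepancy the messages are debating.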
* bug#51733: 27.1; Detect impossible email addresses better
From: Achim Gratz @ 2022-01-16 17:53 UTC
To: 51733

Lars Ingebrigtsen writes:
> Eli Zaretskii <eliz@gnu.org> writes:
>> Then what about text-security.el? or textsec.el?
>
> Yes, that'd work.  Or... string-analysis.el?  With functions like
> `string-scripts' (lists the different scripts in the string) as well as
> the more higher level functions...  Hm...

Since you're trying to harden against homograph / homoglyph attacks, why
not mention it on the tin?  Besides URL and eMail addresses, it would
probably be useful for checking source code (where the language allows
unicode identifiers), in this case it should also (optionally) warn
about non-normalized sequences.

Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Wavetables for the Waldorf Blofeld:
http://Synth.Stromeko.net/Downloads.html#BlofeldUserWavetables
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-17 17:13 UTC
To: Achim Gratz; +Cc: 51733

Achim Gratz <Stromeko@nexgo.de> writes:

> Since you're trying to harden against homograph / homoglyph attacks, why
> not mention it on the tin?  Besides URL and eMail addresses, it would
> probably be useful for checking source code (where the language allows
> unicode identifiers), in this case it should also (optionally) warn
> about non-normalized sequences.

It's not just about homoglyphs (but mostly about that) -- it's also
about classifying strings as to their applicability as identifiers and
stuff.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
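The suggestion to warn about non-normalized sequences can be sketched in a few lines. The example is in Python, not Emacs Lisp, and the thread does not say which normalization form was meant; NFC is assumed here, and `is_nfc` is a hypothetical helper name.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True when `text` is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

# Both strings render the same, but only the first is NFC-normalized.
print(is_nfc(composed), is_nfc(decomposed))
```

An identifier checker could flag any name for which `is_nfc` is false, since two visually identical identifiers would otherwise compare as different strings.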
* bug#51733: 27.1; Detect impossible email addresses better
From: 積丹尼 Dan Jacobson @ 2022-01-17 17:43 UTC
To: Eli Zaretskii; +Cc: 51733, Lars Ingebrigtsen

OK, stay safe, beware of Ο, and unsubscribe me from all these details.
Thanks.
* bug#51733: 27.1; Detect impossible email addresses better
From: Eli Zaretskii @ 2022-01-17 19:06 UTC
To: 積丹尼 Dan Jacobson; +Cc: 51733, larsi

> From: 積丹尼 Dan Jacobson <jidanni@jidanni.org>
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, 51733@debbugs.gnu.org
> Date: Tue, 18 Jan 2022 01:43:45 +0800
>
> and unsubscribe me from all these details. Thanks.

No way!  You started this, so now you pay the price.
* bug#51733: 27.1; Detect impossible email addresses better
From: Lars Ingebrigtsen @ 2022-01-20 8:57 UTC
To: 積丹尼 Dan Jacobson; +Cc: 51733

積丹尼 Dan Jacobson <jidanni@jidanni.org> writes:

> Upon sending,
> To: Bob_Norbolwits@GCSsafetyACE.com
> should trigger a warning:
> "You won't get far trying to send mail with ZERO WIDTH SPACE in an address,"
> instead of blundering along and sending to "gcssafetyace.xn--com-7m0a"!!

This has now been fixed in Emacs 29, and may probably be the highest
"line number in fix" to "line numbers in report" ratio ever, with 14K
lines of data added, and about 600 lines of code.  Congrats!

(So I'm now closing this bug report.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
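The kind of warning requested in the original report can be approximated with a check for invisible format characters. The sketch below is Python, not the Emacs 29 code, and does not claim to mirror how that fix works; the function name `invisible_characters` and the example address are hypothetical. It assumes that any character with general category Cf (which includes ZERO WIDTH SPACE and the bidirectional controls) is unwanted in an address.

```python
import unicodedata


def invisible_characters(address: str) -> list:
    """List (index, name) for every format character (category Cf)."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(address)
        if unicodedata.category(ch) == "Cf"
    ]


# Hypothetical address; a ZERO WIDTH SPACE sits right after the last dot.
suspect = "bob@example.\u200bcom"
for index, name in invisible_characters(suspect):
    print(f"warning: {name} at position {index}")
```

Checking before the domain is converted to its `xn--` form matters, because after conversion the invisible character is folded into an ordinary-looking ASCII label.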
* bug#51733: 27.1; Detect impossible email addresses better
From: 積丹尼 Dan Jacobson @ 2022-01-20 15:25 UTC
To: Lars Ingebrigtsen; +Cc: 51733

>>>>> "LI" == Lars Ingebrigtsen <larsi@gnus.org> writes:

LI> This has now been fixed in Emacs 29, and may probably be the highest
LI> "line number in fix" to "line numbers in report" ratio ever, with 14K
LI> lines of data added, and about 600 lines of code.  Congrats!

Good. Thanks.