From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Bidirectional text and URLs
Date: Sat, 29 Nov 2014 10:22:45 +0200
Message-ID: <83r3wml8kq.fsf@gnu.org>
References: <87a93cngwv.fsf@uwakimon.sk.tsukuba.ac.jp> <837fyfml31.fsf@gnu.org> <874mtio7wh.fsf@uwakimon.sk.tsukuba.ac.jp>
In-reply-to: <874mtio7wh.fsf@uwakimon.sk.tsukuba.ac.jp>
To: "Stephen J. Turnbull"
Cc: larsi@gnus.org, emacs-devel@gnu.org

> From: "Stephen J. Turnbull"
> Cc: larsi@gnus.org,
>     emacs-devel@gnu.org
> Date: Sat, 29 Nov 2014 15:09:02 +0900
>
> > > but I would say that given that the UAX#9 bidi algorithm does what's
> > > wanted 99.44% of the time, it makes sense to mark text reordered by
> > > RTL markers with a warning face
> >
> > That might be considered an annoyance by users of bidi scripts.
> > There's any number of perfectly valid URLs that use the same
> > formatting control characters.
>
> Why?  Because many displays don't implement UAX#9?  Or is it because
> UAX#9 defines segments in a way that would reorder the components of a
> domain name or path?  That is, the logical URL
>
>     http://www.example.com/ABC/DEF/
>
> is expected by a bidi reader to appear as
>
>     http://www.example.com/CBA/FED/
>
> but UAX#9 would display it as
>
>     http://www.example.com/FED/CBA/

Yes.  And there are worse examples (e.g., try an HTML link which
includes both a URL and a link text).  The problem here is that all
those /, :, <, and > characters are neutrals, so they take the
direction of surrounding text, i.e. are reversed for display when the
surrounding text is RTL.  In addition, < and > are mirrored in that
case.
That can make quite a jumble.  (Unicode 6.3 added special handling
for "paired-bracket" characters, which makes the situation with < and >
somewhat better, but we only support that on master; Emacs 24.4
doesn't.)

> Whatever the reason, I'd have to say that's too bad for users of bidi
> languages, because that means *any* bidi URL is ambiguous, and
> therefore subject to being deliberately obfuscated by reflection
> and/or jumbling, regardless of the presence of directional controls.

I agree, but the issue discussed here is different: it's AFAIU about
users of LTR scripts that can fall victim to use of directional
controls that are by default (almost) invisible on Emacs display.  I
think we would like to have at least that situation "handled" in some
way.  My point above was that the way we handle that should not unduly
punish users of bidi scripts, i.e. legitimate uses of these controls.

> > What you suggest might be TRT when left-to-right text is enclosed
> > within directional override controls (which is what Lars did in his
> > example).  These controls assign right-to-left directionality to all
> > the enclosed characters, which is indeed highly suspicious in URLs.
>
> This isn't hard to detect.  But there is also the case where you have
> a word which is a different word when reflected.

If we have a dictionary, we can detect that, too.  If we don't, then
detecting only the enclosed-LTR case is better than nothing, I think.

Another possibility is to modify the way these control characters are
displayed by manipulating their entries in the glyphless-char-display
char-table.  It should probably be enough to display them as hex-code
in a box, to make the user aware of the possible problem.  This should
be done by applications that display URLs, like eww, Gnus, Rmail,
etc.; not globally.
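
For illustration only, here is a minimal Python sketch (not Emacs code)
of the two ideas above: flagging a right-to-left override that encloses
only left-to-right and neutral characters, and making the formatting
controls visible as hex codes.  The function names, the [U+XXXX]
notation and the sample URL are assumptions made up for this sketch;
nested controls are deliberately not handled.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Bidi formatting controls: LRM, RLM, LRE/RLE/PDF/LRO/RLO, LRI/RLI/FSI/PDI.
BIDI_CONTROLS = (
    {"\u200e", "\u200f"}
    | {chr(c) for c in range(0x202A, 0x202F)}
    | {chr(c) for c in range(0x2066, 0x206A)}
)


def _only_ltr_and_neutrals(run):
    """True if RUN has a strong LTR character and no strong RTL one."""
    classes = {unicodedata.bidirectional(ch) for ch in run}
    return "L" in classes and not classes & {"R", "AL"}


def suspicious_override_spans(text):
    """Return (start, end) index pairs of RLO spans enclosing LTR-only text.

    Hypothetical helper: a span runs from an RLO to the next PDF, or to
    the end of TEXT if the override is never closed.
    """
    spans = []
    start = None
    for i, ch in enumerate(text):
        if ch == RLO and start is None:
            start = i
        elif ch == PDF and start is not None:
            if _only_ltr_and_neutrals(text[start + 1:i]):
                spans.append((start, i + 1))
            start = None
    if start is not None and _only_ltr_and_neutrals(text[start + 1:]):
        spans.append((start, len(text)))
    return spans


def show_controls(text):
    """Replace each bidi formatting control by a visible [U+XXXX] tag."""
    return "".join(
        f"[U+{ord(ch):04X}]" if ch in BIDI_CONTROLS else ch for ch in text
    )


if __name__ == "__main__":
    url = "http://example.com/" + RLO + "fdp.exe" + PDF
    print(suspicious_override_spans(url))
    print(show_controls(url))
```

An override around Hebrew or Arabic text is not flagged by this check,
since the enclosed run then contains strong RTL characters.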
> I assume that this is the case in bidi languages as well

Yes, but that would require RTL text embedded in a left-to-right
overriding embedding, which is easily detectable, like the opposite
case that started this thread.

> and of course any jumble is possible as a domain or path component
> which is an abbreviation.  And any useful jumble can probably be
> registered as a domain, and certainly incorporated in a path.

I doubt that a domain like this could be registered, as using such
characters in a domain name is AFAIU against the regulations, see
RFC3987.

> > In addition to using a special face, another possibility is to present
> > the directional overrides in these cases in percent-hex notation,
> > which will disable their effect on the enclosed text.  Of course, this
> > should be only done when the enclosed text is entirely made of LTR
> > characters and neutrals.
>
> Well, no.  I assume that bidi readers are as vulnerable to phishing
> and other frauds as non-bidi readers (hard as that may be to believe
> for you bidi readers).

That is not yet clear.  The easy cases with RTL text, as mentioned
above, should also be easily detectable, and I agree they should get
the same treatment.

> > > You do need a way to turn it off, or to make it reasonably smart, in
> > > the case of ASCII which is often mixed with other charsets.
> >
> > Not sure what you mean here.
>
> As above, where the domain name is ASCII and the path is RTL.  Or the
> path (or the domain) might be mixed.
>
> > "Turn off" how?
>
> "We need to decide what we want to do, and then look for a mechanism."

OK, let me rephrase: what effect will "turning off" have on display?

> > And how do you do that without unduly punishing perfectly valid
> > URLs that need these controls to avoid visual "jumbles"?
>
> I hate to tell you, but the phishers have *already* started punishing
> those perfectly valid URLs.  You have a choice of punishment, that's
> all: "jumbled display" vs. "defrauded users".

I very much hope we will find a sane middle ground, possibly subject
to user control.  I'd hate to see Emacs become another case of the TSA
disaster.

> Except that as I say above, apparently all bidi URLs must now be
> considered to offer suspicious display under some circumstances, so
> maybe you have no choice about the defrauded users.  In that case I
> suppose avoiding jumbles does take precedence.

Once we decide which cases we want to avoid or flag, we could be smart
there, by comparing the original and reordered strings, perhaps aided
by some dictionary lookup.  The infrastructure is either already there
or easy to add.  It's "just" a matter of deciding what to do and when.
Someone(TM) should present a list of well-thought-out requirements,
and we can take it from there.
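
As a rough illustration of comparing the original and reordered
strings, the Python sketch below covers only the simplest case: an
un-nested right-to-left override inside otherwise left-to-right text in
a left-to-right paragraph.  There the override forces every enclosed
character to RTL, so the run is displayed reversed and bracket-like
characters are mirrored.  The function names and the small mirroring
table are assumptions of this sketch, not an implementation of UAX#9; a
real check would take the reordered string from a full bidi
implementation.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Tiny, incomplete mirroring table (assumption: ASCII brackets only).
MIRROR = str.maketrans("<>()[]{}", "><)(][}{")


def visual_order(text):
    """Approximate the on-screen order of TEXT, left to right.

    Handles only un-nested RLO...PDF runs within LTR text; an RLO with
    no closing PDF is taken to extend to the end of TEXT.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == RLO:
            end = text.find(PDF, i + 1)
            if end == -1:
                end = len(text)
            run = text[i + 1:end]
            # Overridden run: displayed reversed, with mirrored glyphs.
            out.append(run[::-1].translate(MIRROR))
            i = end + 1
        elif ch == PDF:
            i += 1  # stray PDF: invisible, contributes nothing
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def strip_controls(text):
    """Logical text with the two controls removed."""
    return text.replace(RLO, "").replace(PDF, "")


def display_differs(text):
    """True if what is shown differs from what is stored."""
    return visual_order(text) != strip_controls(text)


if __name__ == "__main__":
    url = "http://example.com/" + RLO + "exe.cod" + PDF
    print(strip_controls(url))
    print(visual_order(url))
    print(display_differs(url))
```

A mismatch between the two strings only says that the display was
reordered; whether that deserves a warning is the policy question
raised above.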