From: Itai Berli
Newsgroups: gmane.emacs.bugs
Subject: bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator
Date: Tue, 4 Jul 2017 18:57:33 +0300
To: 27526@debbugs.gnu.org
In-Reply-To: <83inj8nt0h.fsf@gnu.org>

> As I already explained, the behavior of GEdit is unacceptable for
> Emacs, because most modes derived from Text mode tend to deal with
> buffers where lines are broken by newlines, so potentially switching
> paragraph direction just because a newline happens to be there would
> have devastating effect on the text as displayed.

How about letting the user decide what's best for them? Would it be
possible to add an option to Emacs that a user can set, say, in their
.emacs file, which will determine whether the bidi implementation
treats the newline character or an empty line as the paragraph
separator?
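To make the two options in the question above concrete, here is a minimal Python sketch (it is not Emacs code and says nothing about how Emacs implements this). It assumes a simplified reading of rules P2 and P3: the base direction of a paragraph is taken from its first strong character (bidi class L, R, or AL) and defaults to LTR; isolates and explicit embedding controls are ignored. The names base_direction, split_on_newlines, and split_on_blank_lines are hypothetical and exist only for this illustration.

```python
import re
import unicodedata
from typing import List


def base_direction(paragraph: str) -> str:
    """Simplified P2/P3: direction of the first strong character, else LTR.

    Assumption: isolates and explicit embedding controls are ignored.
    """
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return "LTR"  # P3 default when no strong character is found


def split_on_newlines(text: str) -> List[str]:
    """Option 1: every newline ends a paragraph."""
    return [line for line in text.split("\n") if line]


def split_on_blank_lines(text: str) -> List[str]:
    """Option 2: only empty lines separate paragraphs."""
    return [p for p in re.split(r"\n[ \t]*\n", text) if p.strip()]


# A hard-wrapped Hebrew paragraph whose second line happens to start
# with a Latin word.
text = (
    "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd\n"
    "Emacs \u05d4\u05d5\u05d0 \u05e2\u05d5\u05e8\u05da"
)

per_line = [base_direction(p) for p in split_on_newlines(text)]
per_block = [base_direction(p) for p in split_on_blank_lines(text)]
print(per_line)   # ['RTL', 'LTR']: the second line flips direction
print(per_block)  # ['RTL']: the whole block keeps one direction
```

With newline-delimited paragraphs, the second line of the same hard-wrapped block gets a different base direction from the first, which is the effect the quoted objection describes. With blank-line-delimited paragraphs the block is treated as one unit.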

On Tue, Jul 4, 2017 at 6:03 PM, Eli Zaretskii wrote:

> > From: Itai Berli
> > Date: Tue, 4 Jul 2017 13:42:19 +0300
> >
> > I'd like to add another reason why this behavior is problematic: it
> > breaks interoperability with other plain text editors, since the
> > text will not be displayed the same way. Consider, for instance,
> > the very same plain text file
> > in GEdit: http://imgur.com/Iw4yrdQ
> > in Emacs: http://imgur.com/7kfWseE
>
> As I already explained, the behavior of GEdit is unacceptable for
> Emacs, because most modes derived from Text mode tend to deal with
> buffers where lines are broken by newlines, so potentially switching
> paragraph direction just because a newline happens to be there would
> have devastating effect on the text as displayed.  This is perhaps in
> contrast with other editors and word-processors which mostly deal with
> long lines without hard newlines.  That's why the notion of paragraph
> in Emacs's UBA implementation was chosen to fit the traditional Emacs
> definition of paragraph in text-mode and its derivatives.
>
> > Finally, the question of whether Emacs behavior is consistent with
> > the UBA specifications is debatable, since when UBA section 3 states
> > "Paragraphs may also be determined by higher-level protocols" the
> > question is what exactly the "also" means: is it that the
> > higher-level protocols (HLP) can decide that a newline character is
> > not a paragraph boundary, as Emacs does, or is it that the HLP can
> > only declare paragraph boundaries in addition to paragraph separator
> > characters?
>
> It is clear from the context and the example following the above
> sentence that "also" doesn't mean "in addition".
>
> However, the main issue is not the paragraph boundary, the main issue
> is how the base direction of the paragraph is determined.  Because no
> matter where the paragraph boundary is, if the base direction is not
> recalculated there, then the fact that the boundary is there doesn't
> matter.
>
> From Section 4.3 Higher-Level Protocols of the UAX#9:
>
>   HL1. Override P3, and set the paragraph embedding level
>        explicitly. This does not apply when deciding how to treat FSI
>        in rule X5c.
>
>        . A higher-level protocol may set any paragraph level. This can
>          be done on the basis of the context, such as on a table cell,
>          paragraph, document, or system level. (P2 may be skipped if
>          P3 is overridden). [...]
>        . A higher-level protocol may apply rules equivalent to P2 and
>          P3 but default to level 1 (RTL) rather than 0 (LTR) to match
>          overall RTL context.
>        . A higher-level protocol may use an entirely different
>          algorithm that heuristically auto-detects the paragraph
>          embedding level based on the paragraph text and its
>          context. For example, it could base it on whether there are
>          more RTL characters in the text than LTR. As another example,
>          when the paragraph contains no strong characters, its
>          direction could be determined by the levels of the paragraphs
>          before and after.
>
> And Section 3.3.1, which describes the P1, P2, and P3 paragraph-level
> rules, says:
>
>   Whenever a higher-level protocol specifies the paragraph level,
>   rules P2 and P3 may be overridden: see HL1.
>
> So an application is allowed to override _all_ of the paragraph-level
> rules, and do what suits it best.  And based on some non-negligible
> experience with bidi-aware applications, I submit that an application
> that does _not_ employ some higher-level protocol for base paragraph
> direction will violate user expectations when working with plain text.
> E.g., try reading in MS Outlook an unformatted text message which has
> a lot of RTL text mixed with LTR.  It's unreadable; I always
> copy/paste it into Emacs, and only then I'm able to read it.
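
For illustration only, the two heuristics named in the third HL1 bullet quoted above (counting RTL against LTR characters, and letting a paragraph with no strong characters follow its neighbors) could be sketched in Python roughly as follows. This is a sketch under stated assumptions, not what Emacs or any other application actually does: the function names are hypothetical, ties go to LTR, and a neutral paragraph takes the direction of the paragraph before it, falling back to the next resolvable paragraph after it and then to a default.

```python
import unicodedata
from typing import List, Optional


def majority_direction(paragraph: str) -> Optional[str]:
    """Return "RTL" or "LTR" by counting strong characters, or None if
    the paragraph contains no strong characters at all."""
    rtl = ltr = 0
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            rtl += 1
        elif bidi_class == "L":
            ltr += 1
    if rtl == 0 and ltr == 0:
        return None
    return "RTL" if rtl > ltr else "LTR"  # assumption: ties go to LTR


def resolve_directions(paragraphs: List[str], default: str = "LTR") -> List[str]:
    """Assign a direction to every paragraph.

    Assumption: a neutral paragraph inherits from the paragraph before
    it, else from the nearest resolvable paragraph after it, else it
    gets the default.
    """
    guesses = [majority_direction(p) for p in paragraphs]
    resolved: List[str] = []
    for i, guess in enumerate(guesses):
        if guess is None:
            previous = resolved[i - 1] if i > 0 else None
            following = next((g for g in guesses[i + 1:] if g is not None), None)
            guess = previous or following or default
        resolved.append(guess)
    return resolved


paragraphs = [
    "12345 ---",                                            # no strong characters
    "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd Emacs",  # mostly Hebrew
    "***",                                                  # no strong characters
    "plain English text",
]
print(resolve_directions(paragraphs))  # ['RTL', 'RTL', 'RTL', 'LTR']
```

Note that the second paragraph resolves to RTL even though it contains a Latin word, and that both neutral paragraphs pick up the direction of the surrounding RTL text instead of falling back to the P3 default.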