bug#41506: 28.0.50; RTL problem

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#41506: 28.0.50; RTL problem
@ 2020-05-24 13:05 Pip Cet
  2020-05-24 14:46 ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Pip Cet @ 2020-05-24 13:05 UTC (permalink / raw)
  To: 41506

[-- Attachment #1: Type: text/plain, Size: 563 bytes --]

Hi,
I'm surprised by the way vanilla Emacs behaves when given RTL input:

Recipe:
emacs -Q hebrew.txt

hebrew.txt contains (or should contain!) two newlines followed by the
Hebrew word Ivri, punctuated, followed by another single newline.

Expected result:
right-aligned text

Actual result:
left-aligned text

Is that a bug, or is there something I don't understand? It only
appears to be left-aligned when there are precisely two newlines at
the beginning of the file.

I'm attaching a screenshot since I don't know whether it's a
font-specific issue.

Thanks!

[-- Attachment #2: hebrew.txt --]
[-- Type: text/plain, Size: 17 bytes --]

עִבְרִי

[-- Attachment #3: ivri.jpg --]
[-- Type: image/jpeg, Size: 3878 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-05-24 13:05 bug#41506: 28.0.50; RTL problem Pip Cet
@ 2020-05-24 14:46 ` Eli Zaretskii
  2020-06-02 10:17   ` Pip Cet
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2020-05-24 14:46 UTC (permalink / raw)
  To: Pip Cet; +Cc: 41506

> From: Pip Cet <pipcet@gmail.com>
> Date: Sun, 24 May 2020 13:05:17 +0000
> 
> I'm surprised by the way vanilla Emacs behaves when given RTL input:
> 
> Recipe:
> emacs -Q hebrew.txt
> 
> hebrew.txt contains (or should contain!) two newlines followed by the
> Hebrew word Ivri, punctuated, followed by another single newline.
> 
> Expected result:
> right-aligned text
> 
> Actual result:
> left-aligned text
> 
> Is that a bug, or is there something I don't understand?

It's a bug, but one that's very tricky to fix, AFAIR.  If you insert
or delete a single character, the display becomes correct.






^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-05-24 14:46 ` Eli Zaretskii
@ 2020-06-02 10:17   ` Pip Cet
  2020-06-02 16:34     ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Pip Cet @ 2020-06-02 10:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41506

[-- Attachment #1: Type: text/plain, Size: 1217 bytes --]

On Sun, May 24, 2020 at 2:46 PM Eli Zaretskii <eliz@gnu.org> wrote:
> > From: Pip Cet <pipcet@gmail.com>
> > Date: Sun, 24 May 2020 13:05:17 +0000
> >
> > I'm surprised by the way vanilla Emacs behaves when given RTL input:
> >
> > Recipe:
> > emacs -Q hebrew.txt
> >
> > hebrew.txt contains (or should contain!) two newlines followed by the
> > Hebrew word Ivri, punctuated, followed by another single newline.
> >
> > Expected result:
> > right-aligned text
> >
> > Actual result:
> > left-aligned text
> >
> > Is that a bug, or is there something I don't understand?
>
> It's a bug, but one that's very tricky to fix, AFAIR.  If you insert
> or delete a single character, the display becomes correct.

The attached patch seems to avoid the problem, but I'm sure I'm
missing something.

The comment says "don't do that at BEGV since then we are potentially
in a new paragraph that doesn't yet exist". I'm failing to make sense
of that, and the commit (5e65aec01a9bc5a147e492f11dd0115c98bedef4)
isn't too helpful either: "Fix bidi iteration near BEGV and ZV."

I suspect what might have been meant is that narrowing an LTR
paragraph to a line containing STRONG_R text shouldn't result in RTL
display, but it does...

[-- Attachment #2: 0001-Bidi-patch.patch --]
[-- Type: text/x-patch, Size: 802 bytes --]

diff --git a/src/bidi.c b/src/bidi.c
index 1017bd2d52..e3a5fe7de6 100644
--- a/src/bidi.c
+++ b/src/bidi.c
@@ -1707,15 +1707,12 @@ bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p)
 	return;
 
       /* If we are on a newline, get past it to where the next
-	 paragraph might start.  But don't do that at BEGV since then
-	 we are potentially in a new paragraph that doesn't yet
-	 exist.  */
+	 paragraph might start.  */
       pos = bidi_it->charpos;
       s = (STRINGP (bidi_it->string.lstring)
 	   ? SDATA (bidi_it->string.lstring)
 	   : bidi_it->string.s);
-      if (bytepos > begbyte
-	  && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n')
+      if (bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n')
 	{
 	  bytepos++;
 	  pos++;

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-06-02 10:17   ` Pip Cet
@ 2020-06-02 16:34     ` Eli Zaretskii
  2020-06-02 19:07       ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2020-06-02 16:34 UTC (permalink / raw)
  To: Pip Cet; +Cc: 41506

> From: Pip Cet <pipcet@gmail.com>
> Date: Tue, 2 Jun 2020 10:17:55 +0000
> Cc: 41506@debbugs.gnu.org
> 
> > It's a bug, but one that's very tricky to fix, AFAIR.  If you insert
> > or delete a single character, the display becomes correct.
> 
> The attached patch seems to avoid the problem, but I'm sure I'm
> missing something.

This condition was added 11 years ago, when I just started integrating
bidi.c with Emacs.  The commit log message and the comment both say I
had some real problem on my hands that this condition fixed.  However,
I have now thrown several use cases on the patched code, and could see
no problem.

So I guess whatever issues I had back then were meanwhile solved "by
other means", and you should install this patch.  If there is indeed
some subtlety here, it will present itself sooner or later (like, in
another 11 years).

> I suspect what might have been meant is that narrowing an LTR
> paragraph to a line containing STRONG_R text shouldn't result in RTL
> display, but it does...

No, this works as designed: the Emacs display engine always behaves as
if there's nothing before beginning of the narrowed region.

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-06-02 16:34     ` Eli Zaretskii
@ 2020-06-02 19:07       ` Eli Zaretskii
  2020-06-06  7:58         ` Pip Cet
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2020-06-02 19:07 UTC (permalink / raw)
  To: pipcet; +Cc: 41506

> Date: Tue, 02 Jun 2020 19:34:31 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 41506@debbugs.gnu.org
> 
> So I guess whatever issues I had back then were meanwhile solved "by
> other means", and you should install this patch.  If there is indeed
> some subtlety here, it will present itself sooner or later (like, in
> another 11 years).

Btw, please note that some residual problem remains: after the patch,
if a buffer begins with 2 newlines and an RTL letter, the first screen
line is rendered right-to-left, which is wrong.  You can see that it's
wrong if you insert more newlines at BOB: then only the single empty
line immediately before the RTL letter is rendered right-to-left.

Of course, this is better than the original problem.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-06-02 19:07       ` Eli Zaretskii
@ 2020-06-06  7:58         ` Pip Cet
  2020-06-06  8:35           ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Pip Cet @ 2020-06-06  7:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41506

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Tue, 02 Jun 2020 19:34:31 +0300
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: 41506@debbugs.gnu.org
>> 
>> So I guess whatever issues I had back then were meanwhile solved "by
>> other means", and you should install this patch.  If there is indeed
>> some subtlety here, it will present itself sooner or later (like, in
>> another 11 years).
>
> Btw, please note that some residual problem remains: after the patch,
> if a buffer begins with 2 newlines and an RTL letter, the first screen
> line is rendered right-to-left, which is wrong.  You can see that it's
> wrong if you insert more newlines at BOB: then only the single empty
> line immediately before the RTL letter is rendered right-to-left.
>
> Of course, this is better than the original problem.

I decided to investigate further, and finally came up with this patch,
which appears to work. I'm not a hundred percent sure it's the right
thing to do, because when we're called with bidi_it->first_elt = true,
it's possible we shouldn't touch bidi_it->new_paragraph at all...


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Handle-buffers-containing-two-newlines-followed-by-a.patch --]
[-- Type: text/x-diff, Size: 1254 bytes --]

From b5302d2e89710166cc8540c8fc08a7eaabc341f4 Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet@gmail.com>
Date: Sat, 6 Jun 2020 07:52:13 +0000
Subject: [PATCH] Handle buffers containing two newlines followed by an RTL
 char

* src/bidi.c (bidi_paragraph_init): Correct handling of initial
newlines.  (Bug#41506)
---
 src/bidi.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/bidi.c b/src/bidi.c
index 1017bd2d52..8d2d3c1f07 100644
--- a/src/bidi.c
+++ b/src/bidi.c
@@ -1707,14 +1707,14 @@ bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p)
 	return;
 
       /* If we are on a newline, get past it to where the next
-	 paragraph might start.  But don't do that at BEGV since then
-	 we are potentially in a new paragraph that doesn't yet
-	 exist.  */
+	 paragraph might start.  But don't do that for the first
+	 element since this function will be called twice in that
+	 case.  */
       pos = bidi_it->charpos;
       s = (STRINGP (bidi_it->string.lstring)
 	   ? SDATA (bidi_it->string.lstring)
 	   : bidi_it->string.s);
-      if (bytepos > begbyte
+      if (!bidi_it->first_elt
 	  && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n')
 	{
 	  bytepos++;
-- 
2.27.0.rc0


[-- Attachment #3: Type: text/plain, Size: 122 bytes --]


If you have any further comments, I'd be glad to amend the comments in
bidi.c to reflect what we actually do understand.

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-06-06  7:58         ` Pip Cet
@ 2020-06-06  8:35           ` Eli Zaretskii
  2020-06-06 13:05             ` Pip Cet
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2020-06-06  8:35 UTC (permalink / raw)
  To: Pip Cet; +Cc: 41506

> From: Pip Cet <pipcet@gmail.com>
> Cc: 41506@debbugs.gnu.org
> Date: Sat, 06 Jun 2020 07:58:24 +0000
> 
> when we're called with bidi_it->first_elt = true, it's possible we
> shouldn't touch bidi_it->new_paragraph at all...

Can you elaborate on why you think that?

first_elt can be set when we are at the beginning of a paragraph or
when we are in the middle of it, so its meaning is different from that
of new_paragraph.

> +	 paragraph might start.  But don't do that for the first
> +	 element since this function will be called twice in that
> +	 case.  */

Which code causes the two calls, and why is that significant in this
case?





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-06-06  8:35           ` Eli Zaretskii
@ 2020-06-06 13:05             ` Pip Cet
  2020-06-06 13:45               ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Pip Cet @ 2020-06-06 13:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 41506

[-- Attachment #1: Type: text/plain, Size: 2355 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Pip Cet <pipcet@gmail.com>
>> Cc: 41506@debbugs.gnu.org
>> Date: Sat, 06 Jun 2020 07:58:24 +0000
>> 
>> when we're called with bidi_it->first_elt = true, it's possible we
>> shouldn't touch bidi_it->new_paragraph at all...
>
> Can you elaborate on why you think that?

Sorry, I shouldn't have said "touch" there. I meant "set", though I no
longer think so.

> first_elt can be set when we are at the beginning of a paragraph or
> when we are in the middle of it, so its meaning is different from that
> of new_paragraph.

Indeed.

>> +	 paragraph might start.  But don't do that for the first
>> +	 element since this function will be called twice in that
>> +	 case.  */
>
> Which code causes the two calls, and why is that significant in this
> case?

Maybe this code would be clearer:

      if (!bidi_it->first_elt)
	{
	  bytepos++;
	  pos++;
	}

We always look at the paragraph containing the next character to be
loaded by bidi_level_of_next_char. If first_elt is set, that is the
current character; otherwise, it's the one after that.

In the "\n\nש" case, this happens:

1. bidi_paragraph_init is called with first_elt = 1 at buffer position 1
2. new_paragraph is cleared to false
3. bidi_at_paragraph_end is called for buffer position 2. That looks
like a line ending a paragraph, though it's actually a line starting the
next paragraph. Still, it returns true.
4. new_paragraph is set again
5. bidi_paragraph_init is called with first_elt = 0 at buffer position 1

So everything happens to work in this case, even though several of the
assumptions in the bidi code are violated.  The code is written to
assume paragraphs contain at least two characters: that assumption means
it's valid for bidi_paragraph_init to clear new_paragraph. In this case,
it's not, but the next line we're looking at, while not actually ending
a paragraph, looks like it is...

What I'm not sure about is "\n \nש". It could be either a single
two-line paragraph followed by ש, or a single-character paragraph
followed by another paragraph whose first line happens to contain only a
space character; in the first case, paragraph orientation would default
to L2R, in the second case, it would be R2L. Do you happen to know what
Unicode says for this case?

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Handle-buffers-containing-two-newlines-followed-by-a.patch --]
[-- Type: text/x-diff, Size: 1123 bytes --]

From c5232df875d62ead326d5e90f122ab9ac9798e59 Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet@gmail.com>
Date: Sat, 6 Jun 2020 13:02:55 +0000
Subject: [PATCH] Handle buffers containing two newlines followed by an RTL
 char

* src/bidi.c (bidi_paragraph_init): Correct handling of initial
newlines.  (Bug#41506)
---
 src/bidi.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/bidi.c b/src/bidi.c
index 1017bd2d52..8aa325fe6d 100644
--- a/src/bidi.c
+++ b/src/bidi.c
@@ -1714,8 +1714,12 @@ bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p)
       s = (STRINGP (bidi_it->string.lstring)
 	   ? SDATA (bidi_it->string.lstring)
 	   : bidi_it->string.s);
-      if (bytepos > begbyte
-	  && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n')
+      /* We always look at the paragraph containing the next character
+	 to be loaded by bidi_level_of_next_char.
+
+	 This code happens to work for a buffer containing two
+	 newlines followed by an RTL character (Bug#41506).  */
+      if (!bidi_it->first_elt)
 	{
 	  bytepos++;
 	  pos++;
-- 
2.27.0.rc0

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* bug#41506: 28.0.50; RTL problem
  2020-06-06 13:05             ` Pip Cet
@ 2020-06-06 13:45               ` Eli Zaretskii
  0 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2020-06-06 13:45 UTC (permalink / raw)
  To: Pip Cet; +Cc: 41506

> From: Pip Cet <pipcet@gmail.com>
> Cc: 41506@debbugs.gnu.org
> Date: Sat, 06 Jun 2020 13:05:43 +0000
> 
> >> +	 paragraph might start.  But don't do that for the first
> >> +	 element since this function will be called twice in that
> >> +	 case.  */
> >
> > Which code causes the two calls, and why is that significant in this
> > case?
> 
> Maybe this code would be clearer:
> 
>       if (!bidi_it->first_elt)
> 	{
> 	  bytepos++;
> 	  pos++;
> 	}

Could be, let's see what is the conclusion of this discussion.

> In the "\n\nש" case, this happens:
> 
> 1. bidi_paragraph_init is called with first_elt = 1 at buffer position 1
> 2. new_paragraph is cleared to false
> 3. bidi_at_paragraph_end is called for buffer position 2. That looks
> like a line ending a paragraph, though it's actually a line starting the
> next paragraph. Still, it returns true.
> 4. new_paragraph is set again
> 5. bidi_paragraph_init is called with first_elt = 0 at buffer position 1

I minor correction to item 3: the second newline in this example is
handled as belonging to the previous paragraph.  You can see that by
examining the behavior of RIGHT and LEFT arrow keys: they behave
differently in R2L and L2R paragraphs.

> What I'm not sure about is "\n \nש". It could be either a single
> two-line paragraph followed by ש, or a single-character paragraph
> followed by another paragraph whose first line happens to contain only a
> space character; in the first case, paragraph orientation would default
> to L2R, in the second case, it would be R2L. Do you happen to know what
> Unicode says for this case?

It's not Unicode in this case, it's Emacs.  If UAX#9 is read and
followed strictly, then each \n ends a paragraph and begins a new one.
IOW, every physical line is a separate paragraph.  This is a direct
consequence of Newline's bidi class being B (paragraph separator):

  (get-char-code-property #x0a 'bidi-class) => B

(as mandated by 3.2 in UAX#9), and of rules P1--P3 in UAX#9.

However, since in Emacs the usual case is that hard newlines are used
to fill text, the default UAX#9 behavior would make no sense, as a
line that happens to start with a R2L character would be rendered
right-to-left, even if the previous line wasn't.  It would produce a
randomly jagged display of paragraphs that mix L2R and R2L characters
just because a line was broken in a different place by filling.

So we use the "higher-level protocols" fire escape (see 4.3 in UAX#9)
and define a "paragraph" differently, for the purposes of base
paragraph direction: we by default require that paragraphs be
separated by empty lines, see bidi-paragraph-separate-re.  Thus, the
above example by default treats the " \n" line as a paragraph
separator, and the ש after it as the start of a new paragraph.

(For completeness, we do support the strict interpretation of UAX#9:
if you set both bidi-paragraph-start-re and bidi-paragraph-separate-re
to "^", you get that.  Any code changes we come up with here must
therefore be tested at least with those settings as well.)

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-06-06 13:45 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-24 13:05 bug#41506: 28.0.50; RTL problem Pip Cet
2020-05-24 14:46 ` Eli Zaretskii
2020-06-02 10:17   ` Pip Cet
2020-06-02 16:34     ` Eli Zaretskii
2020-06-02 19:07       ` Eli Zaretskii
2020-06-06  7:58         ` Pip Cet
2020-06-06  8:35           ` Eli Zaretskii
2020-06-06 13:05             ` Pip Cet
2020-06-06 13:45               ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).