* bug#41506: 28.0.50; RTL problem @ 2020-05-24 13:05 Pip Cet 2020-05-24 14:46 ` Eli Zaretskii 0 siblings, 1 reply; 9+ messages in thread From: Pip Cet @ 2020-05-24 13:05 UTC (permalink / raw) To: 41506 [-- Attachment #1: Type: text/plain, Size: 563 bytes --] Hi, I'm surprised by the way vanilla Emacs behaves when given RTL input: Recipe: emacs -Q hebrew.txt hebrew.txt contains (or should contain!) two newlines followed by the Hebrew word Ivri, punctuated, followed by another single newline. Expected result: right-aligned text Actual result: left-aligned text Is that a bug, or is there something I don't understand? It only appears to be left-aligned when there are precisely two newlines at the beginning of the file. I'm attaching a screenshot since I don't know whether it's a font-specific issue. Thanks! [-- Attachment #2: hebrew.txt --] [-- Type: text/plain, Size: 17 bytes --] עִבְרִי [-- Attachment #3: ivri.jpg --] [-- Type: image/jpeg, Size: 3878 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-05-24 13:05 bug#41506: 28.0.50; RTL problem Pip Cet @ 2020-05-24 14:46 ` Eli Zaretskii 2020-06-02 10:17 ` Pip Cet 0 siblings, 1 reply; 9+ messages in thread From: Eli Zaretskii @ 2020-05-24 14:46 UTC (permalink / raw) To: Pip Cet; +Cc: 41506 > From: Pip Cet <pipcet@gmail.com> > Date: Sun, 24 May 2020 13:05:17 +0000 > > I'm surprised by the way vanilla Emacs behaves when given RTL input: > > Recipe: > emacs -Q hebrew.txt > > hebrew.txt contains (or should contain!) two newlines followed by the > Hebrew word Ivri, punctuated, followed by another single newline. > > Expected result: > right-aligned text > > Actual result: > left-aligned text > > Is that a bug, or is there something I don't understand? It's a bug, but one that's very tricky to fix, AFAIR. If you insert or delete a single character, the display becomes correct. ^ permalink raw reply [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-05-24 14:46 ` Eli Zaretskii @ 2020-06-02 10:17 ` Pip Cet 2020-06-02 16:34 ` Eli Zaretskii 0 siblings, 1 reply; 9+ messages in thread From: Pip Cet @ 2020-06-02 10:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 41506 [-- Attachment #1: Type: text/plain, Size: 1217 bytes --] On Sun, May 24, 2020 at 2:46 PM Eli Zaretskii <eliz@gnu.org> wrote: > > From: Pip Cet <pipcet@gmail.com> > > Date: Sun, 24 May 2020 13:05:17 +0000 > > > > I'm surprised by the way vanilla Emacs behaves when given RTL input: > > > > Recipe: > > emacs -Q hebrew.txt > > > > hebrew.txt contains (or should contain!) two newlines followed by the > > Hebrew word Ivri, punctuated, followed by another single newline. > > > > Expected result: > > right-aligned text > > > > Actual result: > > left-aligned text > > > > Is that a bug, or is there something I don't understand? > > It's a bug, but one that's very tricky to fix, AFAIR. If you insert > or delete a single character, the display becomes correct. The attached patch seems to avoid the problem, but I'm sure I'm missing something. The comment says "don't do that at BEGV since then we are potentially in a new paragraph that doesn't yet exist". I'm failing to make sense of that, and the commit (5e65aec01a9bc5a147e492f11dd0115c98bedef4) isn't too helpful either: "Fix bidi iteration near BEGV and ZV." I suspect what might have been meant is that narrowing an LTR paragraph to a line containing STRONG_R text shouldn't result in RTL display, but it does... [-- Attachment #2: 0001-Bidi-patch.patch --] [-- Type: text/x-patch, Size: 802 bytes --] diff --git a/src/bidi.c b/src/bidi.c index 1017bd2d52..e3a5fe7de6 100644 --- a/src/bidi.c +++ b/src/bidi.c @@ -1707,15 +1707,12 @@ bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p) return; /* If we are on a newline, get past it to where the next - paragraph might start. But don't do that at BEGV since then - we are potentially in a new paragraph that doesn't yet - exist. */ + paragraph might start. */ pos = bidi_it->charpos; s = (STRINGP (bidi_it->string.lstring) ? SDATA (bidi_it->string.lstring) : bidi_it->string.s); - if (bytepos > begbyte - && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n') + if (bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n') { bytepos++; pos++; ^ permalink raw reply related [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-06-02 10:17 ` Pip Cet @ 2020-06-02 16:34 ` Eli Zaretskii 2020-06-02 19:07 ` Eli Zaretskii 0 siblings, 1 reply; 9+ messages in thread From: Eli Zaretskii @ 2020-06-02 16:34 UTC (permalink / raw) To: Pip Cet; +Cc: 41506 > From: Pip Cet <pipcet@gmail.com> > Date: Tue, 2 Jun 2020 10:17:55 +0000 > Cc: 41506@debbugs.gnu.org > > > It's a bug, but one that's very tricky to fix, AFAIR. If you insert > > or delete a single character, the display becomes correct. > > The attached patch seems to avoid the problem, but I'm sure I'm > missing something. This condition was added 11 years ago, when I just started integrating bidi.c with Emacs. The commit log message and the comment both say I had some real problem on my hands that this condition fixed. However, I have now thrown several use cases on the patched code, and could see no problem. So I guess whatever issues I had back then were meanwhile solved "by other means", and you should install this patch. If there is indeed some subtlety here, it will present itself sooner or later (like, in another 11 years). > I suspect what might have been meant is that narrowing an LTR > paragraph to a line containing STRONG_R text shouldn't result in RTL > display, but it does... No, this works as designed: the Emacs display engine always behaves as if there's nothing before beginning of the narrowed region. Thanks. ^ permalink raw reply [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-06-02 16:34 ` Eli Zaretskii @ 2020-06-02 19:07 ` Eli Zaretskii 2020-06-06 7:58 ` Pip Cet 0 siblings, 1 reply; 9+ messages in thread From: Eli Zaretskii @ 2020-06-02 19:07 UTC (permalink / raw) To: pipcet; +Cc: 41506 > Date: Tue, 02 Jun 2020 19:34:31 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 41506@debbugs.gnu.org > > So I guess whatever issues I had back then were meanwhile solved "by > other means", and you should install this patch. If there is indeed > some subtlety here, it will present itself sooner or later (like, in > another 11 years). Btw, please note that some residual problem remains: after the patch, if a buffer begins with 2 newlines and an RTL letter, the first screen line is rendered right-to-left, which is wrong. You can see that it's wrong if you insert more newlines at BOB: then only the single empty line immediately before the RTL letter is rendered right-to-left. Of course, this is better than the original problem. ^ permalink raw reply [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-06-02 19:07 ` Eli Zaretskii @ 2020-06-06 7:58 ` Pip Cet 2020-06-06 8:35 ` Eli Zaretskii 0 siblings, 1 reply; 9+ messages in thread From: Pip Cet @ 2020-06-06 7:58 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 41506 [-- Attachment #1: Type: text/plain, Size: 1083 bytes --] Eli Zaretskii <eliz@gnu.org> writes: >> Date: Tue, 02 Jun 2020 19:34:31 +0300 >> From: Eli Zaretskii <eliz@gnu.org> >> Cc: 41506@debbugs.gnu.org >> >> So I guess whatever issues I had back then were meanwhile solved "by >> other means", and you should install this patch. If there is indeed >> some subtlety here, it will present itself sooner or later (like, in >> another 11 years). > > Btw, please note that some residual problem remains: after the patch, > if a buffer begins with 2 newlines and an RTL letter, the first screen > line is rendered right-to-left, which is wrong. You can see that it's > wrong if you insert more newlines at BOB: then only the single empty > line immediately before the RTL letter is rendered right-to-left. > > Of course, this is better than the original problem. I decided to investigate further, and finally came up with this patch, which appears to work. I'm not a hundred percent sure it's the right thing to do, because when we're called with bidi_it->first_elt = true, it's possible we shouldn't touch bidi_it->new_paragraph at all... [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Handle-buffers-containing-two-newlines-followed-by-a.patch --] [-- Type: text/x-diff, Size: 1254 bytes --] From b5302d2e89710166cc8540c8fc08a7eaabc341f4 Mon Sep 17 00:00:00 2001 From: Pip Cet <pipcet@gmail.com> Date: Sat, 6 Jun 2020 07:52:13 +0000 Subject: [PATCH] Handle buffers containing two newlines followed by an RTL char * src/bidi.c (bidi_paragraph_init): Correct handling of initial newlines. (Bug#41506) --- src/bidi.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/bidi.c b/src/bidi.c index 1017bd2d52..8d2d3c1f07 100644 --- a/src/bidi.c +++ b/src/bidi.c @@ -1707,14 +1707,14 @@ bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p) return; /* If we are on a newline, get past it to where the next - paragraph might start. But don't do that at BEGV since then - we are potentially in a new paragraph that doesn't yet - exist. */ + paragraph might start. But don't do that for the first + element since this function will be called twice in that + case. */ pos = bidi_it->charpos; s = (STRINGP (bidi_it->string.lstring) ? SDATA (bidi_it->string.lstring) : bidi_it->string.s); - if (bytepos > begbyte + if (!bidi_it->first_elt && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n') { bytepos++; -- 2.27.0.rc0 [-- Attachment #3: Type: text/plain, Size: 122 bytes --] If you have any further comments, I'd be glad to amend the comments in bidi.c to reflect what we actually do understand. ^ permalink raw reply related [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-06-06 7:58 ` Pip Cet @ 2020-06-06 8:35 ` Eli Zaretskii 2020-06-06 13:05 ` Pip Cet 0 siblings, 1 reply; 9+ messages in thread From: Eli Zaretskii @ 2020-06-06 8:35 UTC (permalink / raw) To: Pip Cet; +Cc: 41506 > From: Pip Cet <pipcet@gmail.com> > Cc: 41506@debbugs.gnu.org > Date: Sat, 06 Jun 2020 07:58:24 +0000 > > when we're called with bidi_it->first_elt = true, it's possible we > shouldn't touch bidi_it->new_paragraph at all... Can you elaborate on why you think that? first_elt can be set when we are at the beginning of a paragraph or when we are in the middle of it, so its meaning is different from that of new_paragraph. > + paragraph might start. But don't do that for the first > + element since this function will be called twice in that > + case. */ Which code causes the two calls, and why is that significant in this case? ^ permalink raw reply [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-06-06 8:35 ` Eli Zaretskii @ 2020-06-06 13:05 ` Pip Cet 2020-06-06 13:45 ` Eli Zaretskii 0 siblings, 1 reply; 9+ messages in thread From: Pip Cet @ 2020-06-06 13:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 41506 [-- Attachment #1: Type: text/plain, Size: 2355 bytes --] Eli Zaretskii <eliz@gnu.org> writes: >> From: Pip Cet <pipcet@gmail.com> >> Cc: 41506@debbugs.gnu.org >> Date: Sat, 06 Jun 2020 07:58:24 +0000 >> >> when we're called with bidi_it->first_elt = true, it's possible we >> shouldn't touch bidi_it->new_paragraph at all... > > Can you elaborate on why you think that? Sorry, I shouldn't have said "touch" there. I meant "set", though I no longer think so. > first_elt can be set when we are at the beginning of a paragraph or > when we are in the middle of it, so its meaning is different from that > of new_paragraph. Indeed. >> + paragraph might start. But don't do that for the first >> + element since this function will be called twice in that >> + case. */ > > Which code causes the two calls, and why is that significant in this > case? Maybe this code would be clearer: if (!bidi_it->first_elt) { bytepos++; pos++; } We always look at the paragraph containing the next character to be loaded by bidi_level_of_next_char. If first_elt is set, that is the current character; otherwise, it's the one after that. In the "\n\nש" case, this happens: 1. bidi_paragraph_init is called with first_elt = 1 at buffer position 1 2. new_paragraph is cleared to false 3. bidi_at_paragraph_end is called for buffer position 2. That looks like a line ending a paragraph, though it's actually a line starting the next paragraph. Still, it returns true. 4. new_paragraph is set again 5. bidi_paragraph_init is called with first_elt = 0 at buffer position 1 So everything happens to work in this case, even though several of the assumptions in the bidi code are violated. The code is written to assume paragraphs contain at least two characters: that assumption means it's valid for bidi_paragraph_init to clear new_paragraph. In this case, it's not, but the next line we're looking at, while not actually ending a paragraph, looks like it is... What I'm not sure about is "\n \nש". It could be either a single two-line paragraph followed by ש, or a single-character paragraph followed by another paragraph whose first line happens to contain only a space character; in the first case, paragraph orientation would default to L2R, in the second case, it would be R2L. Do you happen to know what Unicode says for this case? [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-Handle-buffers-containing-two-newlines-followed-by-a.patch --] [-- Type: text/x-diff, Size: 1123 bytes --] From c5232df875d62ead326d5e90f122ab9ac9798e59 Mon Sep 17 00:00:00 2001 From: Pip Cet <pipcet@gmail.com> Date: Sat, 6 Jun 2020 13:02:55 +0000 Subject: [PATCH] Handle buffers containing two newlines followed by an RTL char * src/bidi.c (bidi_paragraph_init): Correct handling of initial newlines. (Bug#41506) --- src/bidi.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/src/bidi.c b/src/bidi.c index 1017bd2d52..8aa325fe6d 100644 --- a/src/bidi.c +++ b/src/bidi.c @@ -1714,8 +1714,12 @@ bidi_paragraph_init (bidi_dir_t dir, struct bidi_it *bidi_it, bool no_default_p) s = (STRINGP (bidi_it->string.lstring) ? SDATA (bidi_it->string.lstring) : bidi_it->string.s); - if (bytepos > begbyte - && bidi_char_at_pos (bytepos, s, bidi_it->string.unibyte) == '\n') + /* We always look at the paragraph containing the next character + to be loaded by bidi_level_of_next_char. + + This code happens to work for a buffer containing two + newlines followed by an RTL character (Bug#41506). */ + if (!bidi_it->first_elt) { bytepos++; pos++; -- 2.27.0.rc0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* bug#41506: 28.0.50; RTL problem 2020-06-06 13:05 ` Pip Cet @ 2020-06-06 13:45 ` Eli Zaretskii 0 siblings, 0 replies; 9+ messages in thread From: Eli Zaretskii @ 2020-06-06 13:45 UTC (permalink / raw) To: Pip Cet; +Cc: 41506 > From: Pip Cet <pipcet@gmail.com> > Cc: 41506@debbugs.gnu.org > Date: Sat, 06 Jun 2020 13:05:43 +0000 > > >> + paragraph might start. But don't do that for the first > >> + element since this function will be called twice in that > >> + case. */ > > > > Which code causes the two calls, and why is that significant in this > > case? > > Maybe this code would be clearer: > > if (!bidi_it->first_elt) > { > bytepos++; > pos++; > } Could be, let's see what is the conclusion of this discussion. > In the "\n\nש" case, this happens: > > 1. bidi_paragraph_init is called with first_elt = 1 at buffer position 1 > 2. new_paragraph is cleared to false > 3. bidi_at_paragraph_end is called for buffer position 2. That looks > like a line ending a paragraph, though it's actually a line starting the > next paragraph. Still, it returns true. > 4. new_paragraph is set again > 5. bidi_paragraph_init is called with first_elt = 0 at buffer position 1 I minor correction to item 3: the second newline in this example is handled as belonging to the previous paragraph. You can see that by examining the behavior of RIGHT and LEFT arrow keys: they behave differently in R2L and L2R paragraphs. > What I'm not sure about is "\n \nש". It could be either a single > two-line paragraph followed by ש, or a single-character paragraph > followed by another paragraph whose first line happens to contain only a > space character; in the first case, paragraph orientation would default > to L2R, in the second case, it would be R2L. Do you happen to know what > Unicode says for this case? It's not Unicode in this case, it's Emacs. If UAX#9 is read and followed strictly, then each \n ends a paragraph and begins a new one. IOW, every physical line is a separate paragraph. This is a direct consequence of Newline's bidi class being B (paragraph separator): (get-char-code-property #x0a 'bidi-class) => B (as mandated by 3.2 in UAX#9), and of rules P1--P3 in UAX#9. However, since in Emacs the usual case is that hard newlines are used to fill text, the default UAX#9 behavior would make no sense, as a line that happens to start with a R2L character would be rendered right-to-left, even if the previous line wasn't. It would produce a randomly jagged display of paragraphs that mix L2R and R2L characters just because a line was broken in a different place by filling. So we use the "higher-level protocols" fire escape (see 4.3 in UAX#9) and define a "paragraph" differently, for the purposes of base paragraph direction: we by default require that paragraphs be separated by empty lines, see bidi-paragraph-separate-re. Thus, the above example by default treats the " \n" line as a paragraph separator, and the ש after it as the start of a new paragraph. (For completeness, we do support the strict interpretation of UAX#9: if you set both bidi-paragraph-start-re and bidi-paragraph-separate-re to "^", you get that. Any code changes we come up with here must therefore be tested at least with those settings as well.) ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-06-06 13:45 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-05-24 13:05 bug#41506: 28.0.50; RTL problem Pip Cet 2020-05-24 14:46 ` Eli Zaretskii 2020-06-02 10:17 ` Pip Cet 2020-06-02 16:34 ` Eli Zaretskii 2020-06-02 19:07 ` Eli Zaretskii 2020-06-06 7:58 ` Pip Cet 2020-06-06 8:35 ` Eli Zaretskii 2020-06-06 13:05 ` Pip Cet 2020-06-06 13:45 ` Eli Zaretskii
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).