From: Richard Wordingham
Newsgroups: gmane.emacs.bugs
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Sat, 21 Mar 2015 17:58:18 +0000
Message-ID: <20150321175818.1b125eba@JRWUBU2>
References: <20150318222040.4066e6e9@JRWUBU2> <87pp8292cy.fsf@gnu.org>
In-Reply-To: <87pp8292cy.fsf@gnu.org>
To: handa@gnu.org (K. Handa)
Cc: 20140@debbugs.gnu.org

On Sat, 21 Mar 2015 17:33:17 +0900
handa@gnu.org (K. Handa) wrote:

> In article <20150318222040.4066e6e9@JRWUBU2>, Richard Wordingham
> writes:
[...]
> > I extract and analyse what was rendered as shaped ('accepted') and
> > what was not ('rejected'), quoting the monitoring output.  I
> > suspect the problem is the strict testing of the from and to fields
> > in Lisp function font-shape-gstring, which is defined in file
> > font.c.
> [...]
> > The shaping of the following, with vowels or MEDIAL RA that should
> > be rendered before the consonant, was rejected:
>
> > mflt_run( 1A3E 1A6E 1A6C 1A65) produced ( 1A6E>872:1:1 1A3E>810:0:3
> 1A6C>869:0:3 1A65>862:0:3)
>
> If U+1A6E is displayed before U+1A3E, and they are in
> different grapheme cluster, when you move point forward one
> step by one, the cursor must move back and forth as below
> (cursor is indicated by dashes):
>
> display:  SPC 1A6E 1A3E+1A6C+1A65 SPC
> step 1:   ---
> step 2:            --------------
> step 3:       ----
> step 4:                           ---
>
> Is that what you want?

It gives me more control for editing in Emacs.  Another implementation
could choose to move in visual order.  The policing function could
choose to merge the 'out of order' clusters - that is what new HarfBuzz
does, though I think that should only be done if the client requests it.

What I ought to want is SIL's split cursor scheme, which indicated the
next ('point') and previous characters, even in bidirectional text.
Unfortunately, that's not compatible with m17n, which seems to assume
that cursor position will be a single number.

The Emacs functions forward-char-intrusive and backward-char-intrusive
provided a pleasant, more intuitive, alternative, and I am sad to hear
they are gone.  Perhaps I'll have to start using
toggle-auto-composition.
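
To make the 'merge the out of order clusters' option concrete, here is
a minimal Python sketch.  It assumes that each item of the monitoring
output reads as CHAR>GLYPH:FROM:TO, with FROM and TO being character
indices into the input; that reading is inferred from the 'from and to
fields' mentioned above.  The function names are hypothetical, and this
is neither the code in font.c nor HarfBuzz's implementation, only an
illustration of the idea.

```python
from typing import List, Tuple

# (character code, glyph id, from index, to index)
Glyph = Tuple[int, int, int, int]


def parse_glyphs(trace: str) -> List[Glyph]:
    """Parse items such as '1A6E>872:1:1' (assumed CHAR>GLYPH:FROM:TO)."""
    glyphs = []
    for item in trace.split():
        code, rest = item.split(">")
        glyph_id, start, end = rest.split(":")
        glyphs.append((int(code, 16), int(glyph_id), int(start), int(end)))
    return glyphs


def in_logical_order(glyphs: List[Glyph]) -> bool:
    """True if the FROM fields never decrease along the glyph run."""
    starts = [g[2] for g in glyphs]
    return all(a <= b for a, b in zip(starts, starts[1:]))


def merge_out_of_order(glyphs: List[Glyph]) -> List[Tuple[int, int]]:
    """Merge character ranges until the clusters are strictly increasing.

    Glyphs are visited in visual order.  Whenever a glyph's range does
    not start after the previous cluster ends, the two are merged.
    """
    clusters: List[Tuple[int, int]] = []
    for _, _, start, end in glyphs:
        while clusters and start <= clusters[-1][1]:
            prev_start, prev_end = clusters.pop()
            start, end = min(start, prev_start), max(end, prev_end)
        clusters.append((start, end))
    return clusters


trace = "1A6E>872:1:1 1A3E>810:0:3 1A6C>869:0:3 1A65>862:0:3"
glyphs = parse_glyphs(trace)
print(in_logical_order(glyphs))    # False: FROM goes 1, 0, 0, 0
print(merge_out_of_order(glyphs))  # [(0, 3)]: one cluster for all four
```

With merging, the rejected example collapses into a single cluster
covering characters 0 to 3, so the cursor no longer has to jump back
and forth, at the cost of the finer editing control described above.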

The one consolation in Emacs is that delete-forward-char deletes a
single character, rather than a whole cluster.  That greatly reduces
the disadvantage of having clusters.  Also, search still works by
characters rather than by clusters.  If I want to search for a
character in LibreOffice, I have to go into the special regular
expression find and replace menu.  That is unpleasant.

> At least, the support for all Indic scripts (they have
> characters in logical order as your example of Tai Tham
> text) treats re-ordered glyphs as one grapheme cluster.
> That is not only Emacs but also gtk (pango) applications.

That's a nasty fault with HarfBuzz.

> Please try to move cursor over this Devanagri text "हिंदी" on
> Emacs, gedit, and, for instance, firefox.  They all treat
> that text as 2 grapheme clusters "हिं" and "दी".  The first
> one corresponds to character the sequence U+935 U+93F, and
> U+93F (vowel I) is displayed before U+935 (base cosonant).

Note that those clusters are only 3 and 2 characters long.  Retyping
them is tolerable.  Now consider the Sanskrit Devanagari text स्त्री,
which contains two consonant-combining viramas.  Emacs moves across it
in 1 step, but Claws e-mail (GTK-based, I believe) and LibreOffice
(HarfBuzz-based, at least for Linux) both take 3 steps to move across
it.  Claws and LibreOffice use different algorithms to position the
cursor.  That of LibreOffice seems more reasonable, but that of Claws
works better!  The reason is that Unicode did not declare virama as
forming grapheme clusters.

> [...]
>
> > There does appear to be a work around, which is to have m17n declare
> > the orthographic syllables it receives to be 'grapheme clusters'.
>
> I think that's the right solution; i.e. make all combined
> and out-of-ordered glyphs as one cluster.
>
> > It solves at least some of the problems above.
>
> Which one is not solved by it?

It seems to have solved all of them.  When I reported the bug, I was
having problems with my font because libotf was silently ignoring half
the lookups in my font.

I thought I might have problems with U+1A58 TAI THAM SIGN MAI KANG LAI,
which in Lao visually groups (usually) with the following base
consonant and in Tai Khuen groups with the preceding base consonant.
My clustering in Emacs follows the Tai Khuen scheme.  (I compose two
orthographic clusters together in Emacs, but declare two grapheme
clusters in the FLT processing.)  However, my font follows a major
Northern Thai dictionary and places it on the following base consonant
if there is nothing above it, but otherwise places it on the preceding
base consonant.  However, my implementation is too dirty to cause
problems - the second cluster is not reported as deriving from the mai
kang lai character.

I wonder, though, what will happen if I manage to implement the
Universal Shaping Engine's (USE) rphf feature.  The author of a
Lao-style Tai Tham font wanted this feature in HarfBuzz.  The desired
effect seems easy to achieve in m17n-flt, but placing it under font
control is more difficult.  I'm studying MLM2-OTF.flt to see how to do
it.

> > However, it then makes editing of the 'clusters' more
> > difficult.  Note that there are examples above with 5
> > characters in a cluster, and this is by no means the
> > limit.
>
> But, it seems that the current behavior is accepted, at
> least, by Indic people.

Who do you mean by 'Indic people'?  I can see at least three groups:

1) Indian speakers of Indic languages who use Indic scripts, thus
including users of Hindi, Gujarati and Bengali.  See my comments above.

2) Indian users of Indic scripts, thus also including speakers of
Malayalam and Tamil.  In Tamil, a phonetically CVCCV word will normally
naturally split into clusters as CV.C+virama.CV.
I must admit I am surprised that they have accepted CV.CCV - or do
Tamils not use Emacs for Tamil?  Tamils are notorious for regarding
their writing system as a syllabary rather than as an abugida.  I
haven't studied the Malayalam script - that does seem a fairly
complicated Indian script, as one might expect when Dravidians use a
script tailored to Middle Indic and stretched to cover Old Indic.

3) Users of Indic scripts, thus also including the Burmese, Thai,
Cambodians and Lao as well as the users of the Tai Tham script.
Rebellion is rampant.

The original Unicode encoding of Thai followed the phonetic order
(allegedly - it was probably the collation order instead).  This was
rapidly thrown out as incompatible with the current, working encoding.
Unicode responded with the derogatory property of 'logical order
exception'.  Around Unicode 5.1, the preposed vowels of Thai and Lao
were suddenly included in grapheme clusters with the base consonant.
As the consequences started to appear in applications, there were howls
of rage from Thais, and the characters were restored to their original
status as fully independent characters.

It doesn't seem so long ago that the Cambodian government imposed
Unicode on Cambodia.  You'd have thought that access to applications
would have made Unicode the obvious choice.

New Tai Lue is an interesting case.  Microsoft delayed support for this
simple Indic script for so long that most apparently Unicode-encoded
New Tai Lue text was actually encoded in visual order.  With Unicode
8.0, New Tai Lue is changing from phonetic order to visual order, and
it will no longer need any clusters at all!  Emacs 23.3 (which is what
is in long-term support Ubuntu 12.04) offers no support for New Tai
Lue, so I am not sure that there is yet a New Tai Lue view on
composition in Emacs.

Richard.
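
For reference, the step counts quoted above for the two Devanagari
examples can be reproduced with a much simplified rule.  The sketch
below is an assumption-laden approximation, not the full Unicode
segmentation algorithm and not what any of the named applications
actually run: it starts a new cluster at every character that is not a
combining mark (general category Mn, Mc or Me), and then, as a separate
step, joins a cluster onto a preceding one that ends in the Devanagari
virama U+094D to imitate orthographic syllables.  Both function names
are hypothetical.

```python
import unicodedata
from typing import List

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def simple_clusters(text: str) -> List[str]:
    """Attach combining marks to the preceding character.

    A rough stand-in for grapheme clusters in which a virama does not
    join the following consonant.
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def virama_joined_clusters(text: str) -> List[str]:
    """Additionally join a cluster to a preceding one ending in virama."""
    joined: List[str] = []
    for cluster in simple_clusters(text):
        if joined and joined[-1].endswith(VIRAMA):
            joined[-1] += cluster
        else:
            joined.append(cluster)
    return joined


hindi = "\u0939\u093F\u0902\u0926\u0940"         # the Hindi example
stri = "\u0938\u094D\u0924\u094D\u0930\u0940"    # the Sanskrit example

print(len(simple_clusters(hindi)))         # 2 clusters
print(len(simple_clusters(stri)))          # 3 clusters, as in Claws
print(len(virama_joined_clusters(stri)))   # 1 cluster, as in Emacs
```

The difference between the last two counts is exactly the virama
joining step, which is the part Unicode did not include in its
grapheme cluster definition at the time.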