From: Richard Wordingham via "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Newsgroups: gmane.emacs.bugs
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Sun, 6 Feb 2022 22:09:58 +0000
Message-ID: <20220206220958.5a4d8ffe@JRWUBU2>
References: <20150318222040.4066e6e9@JRWUBU2> <87r18jk5nr.fsf@gnus.org> <83v8xv2icg.fsf@gnu.org> <20220205225251.08a0faab@JRWUBU2> <83y22oza77.fsf@gnu.org>
In-Reply-To: <83y22oza77.fsf@gnu.org>
Reply-To: Richard Wordingham
To: Eli Zaretskii
Cc: 20140@debbugs.gnu.org, Kenichi Handa, larsi@gnus.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8

On Sun, 06 Feb 2022 10:11:08 +0200
Eli Zaretskii wrote:

> > Date: Sat, 5 Feb 2022 22:52:51 +0000
> > From: Richard Wordingham
> > Cc: Lars Ingebrigtsen, 20140@debbugs.gnu.org
> >
> > I'm currently using the vanilla emacs on Ubuntu Focal, which is
> > described as 'GNU Emacs 26.3 (build 2, x86_64-pc-linux-gnu, GTK+
> > Version 3.24.14) of 2020-03-26, modified by Debian'. The key good
> > news is that the commands forward-char-intrusive and
> > backward-char-intrusive are now standard, so I can position the
> > cursor by dead-reckoning. You can reasonably mark the issue as
> > solved.
>
> I don't see the commands forward-char-intrusive and
> backward-char-intrusive anywhere in Emacs, so I guess they are your
> local changes, based on the code posted by Handa-san in this
> discussion?
>
> > > The most important change is that we now use HarfBuzz by default.
> > >
> >
> > Isn't that only true for Emacs 27.1 and above?
>
> That's true, but Emacs 26 is ancient history; Emacs 28.1 is about to
> be released. So from our perspective, HarfBuzz is the default shaping
> engine, and since it's available on all the supported platforms we
> care about, we are phasing out m17n-flt shapers.
>
> > > Richard didn't contribute the Tai Tham composition rules to us
> > > (AFAIR), so I cannot test what happens now in Emacs with HarfBuzz.
> > > Maybe we should revisit this issue, but first I hope Richard could
> > > tell whether the issue still exists, and if so, what composition
> > > rules he uses or suggests to use for Tai Tham.
> >
> > Sad to see that Khaled Hosny's suggestion not to use composition
> > rules seems not to have been taken.
>
> You mean, to pass all the text via HarfBuzz instead? That makes the
> Emacs redisplay painfully slow, and would require a complete redesign
> of how we render text to be bearable. So as long as such a redesign
> is not available, we cannot use that advice.

Except for Malayalam! (Subexpression XX* in indian.el at the moment.)

> > You're welcome to include my composition rules.
>
> Thanks.
>
> > They're complicated by the facts that the 'regular expressions' are
> > not interpreted as regular expressions and they are not interpreted
> > as closed under canonical equivalence. I therefore calculate the
> > regular expression.
>
> I'm not sure I understand the issue: what you do seems to be very
> similar to what we do for the Indic scripts in indian.el, so what kind
> of complications are you talking about here?

Well, those rules themselves are a bit odd.
Why are you composing single clusters? Why are you breaking clusters
where Microsoft imitators are likely to insert dotted circles? The
basic structure for most Indic scripts is R*C(M|HC)*(M|H)* where R is
miscellaneous prefixed forms (e.g. dot reph, visarga variants), C is
consonants (and things that can act like them), H is the conjoining
operator, and M is miscellaneous marks, including ZWJ and ZWNJ.
"(M|H)*" accounts for explicit viramas and isolated half-forms.
Jackboots are then applied on the ground that spell checkers cannot be
relied upon.

The first problem for Tai Tham is that marks with a canonical combining
class (ccc) greater than 9 (note that script-specific nuktas generally
have ccc=7) do not mix with conjoining operators with ccc=9; the
conjoining operator (as opposed to visible virama) should not be
separated from the following consonant. Mark Davis ignored this
requirement from the proposals, so unless your 'regular' expression is
acting on traces under canonical equivalence rather than mere strings,
one has to complicate the expressions to cope.

The second issue is that the behaviour of U+1A58 TAI THAM SIGN MAI KANG
LAI is a stylistic variable. It can act like a dot reph or a
phonetic-syllable-final mark. My composition rules therefore have to
treat it as gluing orthographic syllables together.

The third issue, which is less visible, is that I had a problem with
back-tracking.

> Also, your rules seem to follow the description in the "Structuring
> Tai Tham Unicode" document (Revision 7), a.k.a. "L2/19-365", dated Oct
> 2019, is that right? Is that document the latest word on shaping Tai
> Tham, or are there any additional sources?

No, the document's a crime. I tried to minimise its destructiveness,
which is why I got an acknowledgement in it. I advocate sticking to
phonetic order, as in Khmer and Brahmi. That scheme needs a couple of
formally unproposed characters to make some distinctions.
The best sources are the regular expressions in the proposals, but they
missed out the combination of tone mark and final consonant signs.

What do you mean by 'shaping'? For Tai Tham, the only positive service
provided by rendering engines is the movement of preposed vowels and
MEDIAL RA to the start of the glyph sequence; all the other resequencing
has to be done by the fonts themselves.

> > There are some deficiencies; I've a feeling there may be a problem
> > with adding ZWNJ and CGJ as marks; ZWJ should also be added for
> > completeness.
>
> These are barely mentioned in the L2/19-365 document, and not
> mentioned at all in the Tai Tham section of the Unicode Standard.
> Does it mean they are not very important in contemporary Tai Tham
> texts?

The Tai Tham section is based on information before grammar nazification
disabled Tai Tham texts, or at least, those that were to be rendered
using restrictive shapers based on alleged knowledge of the languages.

ZWNJ is a standard mechanism for disabling ligatures in non-cursive
scripts, though I'm not sure of the balance of ZWJ and ZWNJ in Fraktur,
e.g. the different renderings of the two meanings of Antiqua German
Wachstube.

CGJ is needed where there is no other character to mark the boundary of
two chained syllables and concatenating the vowel and tone marks of the
two together violates the ordering rules for a single syllable. It
would also be needed to mark other differences relevant to collation,
e.g. if syllable-initial BA were sorted according to its pronunciation,
as in one major dictionary. Automating an inconsistent hand-sort is
hard, slow work, especially as the CLDR tools choke on an easy Lao
sort. (By contrast, the official Thai sort is very machine-friendly.)
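
To make the canonical-ordering point concrete, here is a minimal sketch
in Python (standard library only; it is not Emacs code and not my
composition rules). It shows how canonical reordering moves a mark with
ccc=230 behind the conjoining operator (ccc=9), so that the operator is
separated from the following consonant, and how CGJ (ccc=0) blocks that
reordering. The character sequences and the helper names ccc_list and
nfd are chosen purely for illustration; the strings are not claimed to
be real words.

```python
import unicodedata

# Characters chosen purely for illustration; the strings built from them
# below are not claimed to be real Tai Tham words.
HIGH_KA = "\u1A20"  # TAI THAM LETTER HIGH KA (ccc=0)
SAKOT = "\u1A60"    # TAI THAM SIGN SAKOT, the conjoining operator (ccc=9)
TONE_1 = "\u1A75"   # TAI THAM SIGN TONE-1 (ccc=230)
CGJ = "\u034F"      # COMBINING GRAPHEME JOINER (ccc=0)


def ccc_list(text):
    """Return the canonical combining class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]


def nfd(text):
    """Canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", text)


# A tone mark followed directly by SAKOT + consonant of a chained syllable.
chained = HIGH_KA + TONE_1 + SAKOT + HIGH_KA
# The same sequence with CGJ marking the boundary before SAKOT.
guarded = HIGH_KA + TONE_1 + CGJ + SAKOT + HIGH_KA

if __name__ == "__main__":
    for label, text in (("chained", chained), ("guarded", guarded)):
        print(label, ccc_list(text), "->", ccc_list(nfd(text)))
```

In the unguarded string the classes go from 0, 230, 9, 0 to 0, 9, 230,
0 under normalisation, so a rule written against mere strings sees the
tone mark sitting between SAKOT and its consonant. The guarded string
is left unchanged.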

> > I need ZWNJ to write 4-column ᨴᩣᩴᨶ᩠ᩅ‌ᩣ᩠ᨿ as opposed to
> > 3-column ᨴᩣᩴᨶ᩠ᩅᩣ᩠ᨿ, and even with my font, HarfBuzz will need CGJ
> > for the suppression of jack-booted dotted circles. Additionally, for
> > didactic text, what can I do for U+25CC for explicit display of
> > marks and their equivalents on a dotted circle, and for that
> > matter, for display on NBSP?

This, the main use of ZWNJ, was unknown to the authors of the Tai Tham
proposals. In Lao texts of the 1930s, non-ligation seems to mark an
enthusiasm for the spelling reforms, which one normally thinks of as
only applying to the Lao script.

Having looked at indian.el, it seems that it will be easy to add these
controls (CGJ, ZWJ and ZWNJ) to the composition tables.

> At least for the dotted circle case, Emacs has a general composition
> rule; see compose-gstring-for-dotted-circle and the corresponding rule
> in composite.c. So I'm not sure we need anything specific to Tai Tham
> there.

Does the 3-character Khmer sequence "◌្ក" work in Version 28? It
doesn't in Version 26.3. It should look like a dotted circle with the
lower part of ក្ក below it. In Version 26.3, I don't even get the
consonant U+1780 subscripted!

With HarfBuzz, if you don't compose U+25CC with the following mark, you
are very likely to get two dotted circles - are you deliberately
deleting one? Doing so wouldn't be a reliable process.

Possibly I could fix the rendering problem by also composing sequences
starting with marks - to be investigated. If it works, it might work
with NBSP, though it wouldn't help with my plan for to render as just
the spacing mark.

> Can you recommend good fonts for Tai Tham? Are they free fonts?

Almost all Tai Tham fonts have problems.
Probably the best is the one used for the New Testament, which relies on
the SIL Graphite renderer. I'll dig into that one.

The nicest OTL shaper-based one for most words is Lamphun, which is
based on Hariphunchai. Unfortunately, not even Lamphun distinguished
subscript HIGH RATHA from the subscript , and it is rather limited for
interacting marks - Hariphunchai lacks mark-to-mark positioning. The
commoner combinations of marks are handled by glyph substitution, and
Lamphun has made a start on mark-to-mark positioning. Hariphunchai and
Lamphun are available under the SIL Open Font licence.

For Lao and Pali, Khottabun is a nice font, but there are some
idiosyncrasies in its encoding of words. (Unicode appears only to
define character encoding, and is largely silent on the encoding of Tai
Tham words.) It is available under the SIL Open Font licence, so I can
and perhaps ought to add it to my renderer
(https://wrdingham.co.uk/renderer_test.htm) and font
(https://wrdingham.co.uk/font_test.htm) tests. Unfortunately, it only
supports characters used for Lao or Pali. It appears to evade the
jackboots of the HarfBuzz implementation of the Universal Shaping Engine
(USE) by not having a glyph for U+25CC - cunning! I don't know whether
this trick works with the Windows renderers.

There's a clutch of Tai Khuen fonts released under the SIL Open Font
licence that are aesthetically satisfying, but have a tendency to rely
on Tai Khuen orthographic rules to avoid clashing glyphs, and don't
extend to supporting somewhat exceptional words like Pali _indriya_.
The fonts are:

A Tai Tham KH
A Tai Tham KH New
A Tai Tham KH New V3

They are unlikely to work with Uniscribe or DirectWrite, as they rely on
the ccmp or liga feature being enabled for the default script; I'm not
sure whether that's a problem for those using emacs on Windows.

If you don't mind the reactionary square nature of the glyphs, there is
also my Da Lekh family, with full coverage of the encoded character set,
and some support for language-specific glyphs that are very different
between the languages. (Generally the glyphs aim to be an
'international' compromise.) Features may be used instead of language
environment - I don't set out to punish Windows victims. There are four
fonts:

Da Lekh
Da Lekh Si
Da Lekh Seri
Da Lekh Si Seri

The ones with Seri in the name have the same freedoms as the Deja Vu
fonts and none of the restrictions. (I drew all their glyphs.) The
others have the same freedoms and, necessarily, restrictions as Deja Vu
Sans. The Seri (meaning 'untrammelled') fonts were created for
unconstrained use by the Unicode Consortium and deliberately have no
defence against the jackboots of the Universal Shaping Engine. They
should work fine with the M17n renderer. Unfortunately, for its Latin
glyphs, one only gets what one pays for.

The ones with 'Si' colour conjoined syllables red so that one can see
how words are spelt. This capability was added for use with spell
checkers, and I use it successfully for spell-checking in Firefox and
LibreOffice.

Richard.