From: Richard Wordingham via "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Newsgroups: gmane.emacs.bugs
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Mon, 7 Feb 2022 23:38:08 +0000
Message-ID: <20220207233808.1a84d8ec@JRWUBU2>
To: Eli Zaretskii
Cc: 20140@debbugs.gnu.org, handa@gnu.org, larsi@gnus.org
In-Reply-To: <83czjyydqk.fsf@gnu.org>

On Mon, 07 Feb 2022 16:04:35 +0200
Eli Zaretskii wrote:

> > Date: Sun, 6 Feb 2022 22:09:58 +0000
> > From: Richard Wordingham
> > Cc: larsi@gnus.org, 20140@debbugs.gnu.org, Kenichi Handa
> >
> > > > Sad to see that Khaled Hosny's suggestion not to use composition
> > > > rules seems not to have been taken.
> > >
> > > You mean, to pass all the text via HarfBuzz instead?  That makes
> > > the Emacs redisplay painfully slow, and would require a complete
> > > redesign of how we render text to be bearable.
> > > So as long as such a redesign is not available, we cannot use
> > > that advice.
> >
> > Except for Malayalam!  (Subexpression XX* in indian.el at the
> > moment.)
>
> (That was changed lately.  But it is a tangent.)
>
> > > > They're complicated by the facts that the 'regular expressions'
> > > > are not interpreted as regular expressions and they are not
> > > > interpreted as closed under canonical equivalence.  I therefore
> > > > calculate the regular expression.
> > >
> > > I'm not sure I understand the issue: what you do seems to be very
> > > similar to what we do for the Indic scripts in indian.el, so what
> > > kind of complications are you talking about here?
> >
> > Well, those rules themselves are a bit odd.  Why are you composing
> > single clusters?  Why are you breaking clusters where Microsoft
> > imitators are likely to insert dotted circles?
>
> I'm not sure this is what I asked.  I asked why you think this way of
> defining patterns for composition rules is in any way exceptional.  It
> seems pretty much boilerplate to me.

Your 'boilerplate' rules look like a straightforward derivation from
the DirectWrite rules for valid subsequences - I haven't checked for
repair work.  That seems unlikely to handle prohibited dittograms
nicely.  It also wouldn't work well when 'well-formed' adjacent
clusters need to interact, as with virama-terminated clusters in
Kharoshthi and some styles of Brahmi.  I haven't hunted for their
definitions - I should probably download a recent tarball.

The exceptional features were the calculation of the regular
expression, especially the expression

  (replace-regexp-in-string "X" basic_syllable regexp t t))

> > The best sources are the regular expressions in the proposals, but
> > they missed out the combination of tone mark and final consonant
> > signs.
>
> Can you be more specific about those proposals?  Any specific
> pointers?
>
> Also, does this mean there's currently no widely accepted agreement
> regarding Tai Tham shaping?  What do native readers of that script
> expect?
>
> > What do you mean by 'shaping'?
>
> Whatever is needed to produce correct display from a sequence of
> codepoints in a given script.

The main shaper writers refused to maintain such a service for Tai
Tham, though HarfBuzz did briefly provide such a service with its South
East Asian Shaper.  Windows still confesses its inability to render the
full range of orthographic syllables.  To work, fonts have to engage in
dotted circle removal by some means or other.

It seems that native readers expect a font encoding, where the key
sequence for a mark (or subscript consonant) specifies its position and
shape.  I was badly shocked when I found the backing store for the Tai
Tham Northern Thai New Testament.  I found examples of marks above
entered in the reverse-order to what Unicode-savvy people would expect,
and the complete opposite to what one would type for Thai, for which
input systems generally enforce the rule of typing from base character
outwards.

The general pointer would be to look at the English Wikipedia entry for
_(Unicode_block).  In this case, that becomes
https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf Section 13.
The codepoints have changed since then, but the names (apart from the
script name) and representative glyphs have been pretty stable.  The
relationship between the outermost subexpression and syllables needs
updating, and 'H' needs to be updated to include other subscript
consonants, but formally the expression as a whole still stands.

> > > At least for the dotted circle case, Emacs has a general
> > > composition rule; see compose-gstring-for-dotted-circle and the
> > > corresponding rule in composite.c.  So I'm not sure we need
> > > anything specific to Tai Tham there.
> >
> > Does the 3-character Khmer sequence "◌្ក"
> > work in Version 28?
> > It doesn't in Version 26.3.  It should look
> > like a dotted circle with the lower part of ក្ក below it.  In
> > Version 26.3, I don't even get the consonant U+1780 subscripted!
>
> No, it doesn't produce what you want (though the 2nd and the 3rd
> characters do combine), but that's not surprising: the general rules
> for U+25CC that we have cover only a single combining mark after it:
>
>   (aset composition-function-table #x25CC
>         `([,(purecopy ".\\c^") 0 compose-gstring-for-dotted-circle]))
>
> So a sequence of more than one character after U+25CC needs an
> explicit rule to work.  What is the rule in this case?  (And what does
> Khmer have to do with the question I asked, which is about Tai Tham?)

You asked if there were any Tai Tham specific requirements.  The
requirement is general, but the need for Khmer is the most obvious.

The rule for Brahmi, Kharoshthi and their descendants is fairly close
to 'take any existing composition, and substitute dotted circle for the
first letter (Lo)'.  For the important cases, it is:

(i) Dotted circle plus any sequence of marks (Let the shaper worry
about validity);

(ii) Dotted circle, conjoining operator, consonant, VS?; and

(iii) Dotted circle, conjoining operator, consonant, VS?, any sequence
of marks.

(iv) (i)-(iii) preceded by anything repha-like.

'Conjoining operator' is a virama or pure stacker optionally preceded
or followed by ZWJ or ZWNJ.  VS is a variation selector.  'Repha-like'
includes U+0D4E MALAYALAM LETTER DOT REPH, the Mymr script kinzi
sequences, and the prototypical .

The entire sequence would be best handled in the renderer, though you
may have problems with selecting the font and script.

Richard.
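
Illustration only: a minimal sketch of cases (i)-(iii) above, in Python
rather than Emacs Lisp so that it can be run on its own.  It is not
Emacs code and not a composition rule.  The names CONJOINERS,
match_conjoiner and cluster_end are hypothetical, the set of conjoining
operators is a small sample rather than a complete list, 'consonant' is
approximated by general category Lo, and case (iv) (repha-like
prefixes) is left out because it needs script-specific lists.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
ZWNJ, ZWJ = "\u200C", "\u200D"

# Assumption: a sample of conjoining operators, not an exhaustive list.
CONJOINERS = {
    "\u17D2",  # KHMER SIGN COENG
    "\u1A60",  # TAI THAM SIGN SAKOT
    "\u094D",  # DEVANAGARI SIGN VIRAMA
    "\u1039",  # MYANMAR SIGN VIRAMA
}


def is_mark(ch):
    """True for general categories Mn, Mc and Me."""
    return unicodedata.category(ch).startswith("M")


def is_vs(ch):
    """True for VS1-VS16 and VS17-VS256 (other selectors are ignored here)."""
    return "\uFE00" <= ch <= "\uFE0F" or "\U000E0100" <= ch <= "\U000E01EF"


def match_conjoiner(s, i):
    """Return the index just past a conjoining operator at s[i], else None.

    The operator may have one ZWJ or ZWNJ before it and one after it.
    """
    j = i
    if j < len(s) and s[j] in (ZWJ, ZWNJ):
        j += 1
    if j < len(s) and s[j] in CONJOINERS:
        j += 1
        if j < len(s) and s[j] in (ZWJ, ZWNJ):
            j += 1
        return j
    return None


def cluster_end(s, i=0):
    """Return the end index of a dotted-circle cluster starting at s[i].

    Returns None if s[i] is not U+25CC or nothing attaches to it.
    """
    if i >= len(s) or s[i] != DOTTED_CIRCLE:
        return None
    j = i + 1
    # Cases (ii) and (iii): conjoining operator, consonant, optional VS.
    k = match_conjoiner(s, j)
    if k is not None and k < len(s) and unicodedata.category(s[k]) == "Lo":
        j = k + 1
        if j < len(s) and is_vs(s[j]):
            j += 1
    # Cases (i) and (iii): any run of marks; validity is left to the shaper.
    while j < len(s) and is_mark(s[j]):
        j += 1
    return j if j > i + 1 else None


print(cluster_end("\u25CC\u17D2\u1780"))  # the Khmer example above: 3
```

On the three-character Khmer sequence quoted earlier, the whole
sequence is taken as one cluster, which is what a rule limited to a
single combining mark after U+25CC cannot do.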