From: Richard Wordingham via "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Newsgroups: gmane.emacs.bugs
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Sun, 13 Feb 2022 20:53:10 +0000
Message-ID: <20220213205310.0b8a715c@JRWUBU2>
In-Reply-To: <831r06rbwk.fsf@gnu.org>
References: <20150318222040.4066e6e9@JRWUBU2> <87r18jk5nr.fsf@gnus.org>
 <83v8xv2icg.fsf@gnu.org> <20220205225251.08a0faab@JRWUBU2>
 <831r06rbwk.fsf@gnu.org>
To: Eli Zaretskii
Cc: 20140@debbugs.gnu.org, larsi@gnus.org

On Sun, 13 Feb 2022 18:04:11 +0200
Eli Zaretskii wrote:

> > Date: Sat, 5 Feb 2022 22:52:51 +0000
> > From: Richard Wordingham
> > Cc: Lars Ingebrigtsen, 20140@debbugs.gnu.org
> >
> > You're welcome to include my composition rules.
>
> Thanks.  I started with your code:
>
> > (defvar tai-tham-composable-pattern
> >   (let ((table
> >          ;; C is letters, independent vowels, digits, punctuation and symbols.
> >          '(("C" . "[\u1A20-\u1A54\u1A80-\u1A89\u1A90-\u1A99\u1AA0-\u1AAD]")
> >            ("M" . "[\u1A55-\u1A57\u1A59-\u1A5E\u1A61-\u1A7C\u1A7F]"); Mark
> >            ("H" . "\u1A60") ; sakot
> >            ("S" . "[\u1A75-\u1A7C]") ; Marks commuting with sakot
> >            ("N" . "\u1A58"))) ; mai kang lai
> >         (basic_syllable "C\\(N*\\(M\\|HS*C\\)\\)*")
> >         (regexp "X\\(N\\(X\\)?\\)*H?")) ; X is basic syllable
> >     (let ((case-fold-search nil))
> >       (setq regexp (replace-regexp-in-string "X" basic_syllable regexp t t))
> >       (dolist (elt table)
> >         (setq regexp (replace-regexp-in-string (car elt) (cdr elt) regexp t t))))
> >     regexp))
> >
> > (let ((elt (list (vector tai-tham-composable-pattern 0 'font-shape-gstring)
> >                  (vector "." 0 'font-shape-gstring)
> >                  )))
> >   (set-char-table-range composition-function-table '(#x1A20 . #x1AAD) elt))
>
> But that didn't seem to work well enough: e.g., some marks in your
> "sample text" didn't combine with letters, as I think they should.

Which ones?  Are you sure they didn't combine at the Emacs level?

I did suspect the problem was writing '\u1A7C' instead of '\u1a7c', but
I'm no longer so sure.  (The 'C' might get expanded, but I'm beginning
to think not.)

> Then I tried this simplistic setting:
>
>   (set-char-table-range composition-function-table
>                         '(#x1a20 . #x1aaf)
>                         (list (vector "[\u1a20-\u1aaf]+" 0 'font-shape-gstring)))
>
> and it worked much better, including passing a small number of the
> tests from your renderer test page that I threw on Emacs.  This is on
> MS-Windows with Emacs 29 and HarfBuzz 2.4.0 (which is not even the
> latest release of HarfBuzz), and with the A Tai Tham KH New V3 font.
> Any reason not to use the above simple setup for Tai Tham text
> composition?

Mostly only that you would have to edit the text with "autocomposition
at point disabled" or mark word boundaries, e.g. with U+200B ZERO WIDTH
SPACE.  The Tai languages that use Tai Tham use scriptio continua.
While modern Pali does separate words with visible white space, its
words tend to be polysyllabic; with discerning composition, it would be
about as tolerable as editing Hindi in Devanagari with autocomposition
enabled.  (Quite a few people edit Devanagari in transliteration to
Latin!)

You should also add CGJ and ZWNJ, and some people may appreciate ZWJ -
the Khottabun font has ligatures involving ZWJ, though it may just be
an experimental feature - and ultimately WJ, for when someone writes a
Tai Tham word breaker.  Oh, and Thai and Lao mai t(r)i and mai
chat(t)awa and U+0324 COMBINING DIAERESIS BELOW turn up occasionally -
U+0324 is supported in Thep's Khottabun font, and my Da Lekh series
supports Thai mai tri and mai chattawa.  These characters seem to work
with HarfBuzz.

If using the native Windows renderer is an option with Emacs, then 'A
Tai Tham KH New' works better than 'A Tai Tham KH New V3'.

I've created https://wrdingham.co.uk/lanna/font_test.htm to do _font_
comparisons.  I'd delayed because I've only recently satisfied myself
that it is lawful, at least under English law.  (The qualms were with
the samples taken from books.)  It's still very much a work in
progress.

> I needed a couple more additions to Emacs to make Tai Tham support
> work OOTB: for example, script-representative-chars lacked an entry
> for Tai Tham, and the default fontset needed an addition.  (And on
> MS-Windows, one needs to run the w32-find-non-USB-fonts magic once, to
> notice the newly installed Tai Tham font.)
>
> Other than that, assuming the above setting of
> composition-function-table is okay, we are ready to officially add Tai
> Tham to scripts supported by Emacs.
>
> Btw, is there a way to get all the examples from your
> https://wrdingham.co.uk/lanna/renderer_test.htm as a UTF-8 encoded
> text file?  I'd like to test the Emacs rendering with all of the
> examples, but copy-pasting each example separately from the browser is
> not my idea of useful time investment.  So if you could provide the
> examples as a downloadable text file, I'd appreciate.

As buried (you're not the only one to have overlooked it) in the
penultimate paragraph of the 'Content and Layout' section, "The test
words may, in principle, be extracted quite simply from this web page.
Each test 'word' is the content of the first cell in each row whose
class is tst1.  For convenience*, I have extracted the first two cells
in such rows, along with titles, to a CSV file."  The file is rt.csv in
the same directory.  I included the meaning and pronunciation as those
who don't know the script may find it easier to refer to the words by
translation or transcription.  You may prefer to use the file more or
less as it is, but one can easily knock up an Emacs macro sequence to
delete the first comma and the rest of the line.  I left the section
titles in for easier navigation to the renderer test file.

*Some people claim to find XML files easy to use; they should then be
able to analyse a file conforming to HTML4 syntax.

Dodgy spellings go in pink rows whose class is 'tst2'.  The alternative
encodings demanded by the USE go in orange rows whose class is 'tst3'.
I have not extracted these.

Richard.
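
P.S.  The clean-up mentioned above (deleting the first comma and the
rest of the line) might be sketched as a command rather than a keyboard
macro.  This is only an illustration: the name
my/rt-csv-keep-first-field is made up, and it assumes that title lines
contain no comma and that the first field is never CSV-quoted, neither
of which is stated above.

```elisp
;; Hypothetical helper: in the current buffer, cut every line that
;; contains a comma down to the text before its first comma.
;; Lines without a comma (assumed to be section titles) are left alone.
(defun my/rt-csv-keep-first-field ()
  "Delete the first comma and the rest of the line, on every line."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "^\\([^,\n]*\\),.*$" nil t)
      (replace-match "\\1"))))
```

Run it with M-x my/rt-csv-keep-first-field in a buffer visiting a copy
of the file; what remains is one test word (or title) per line.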