unofficial mirror of bug-gnu-emacs@gnu.org 
From: Richard Wordingham via "Bug reports for GNU Emacs, the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 20140@debbugs.gnu.org, handa@gnu.org, larsi@gnus.org
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Mon, 7 Feb 2022 23:38:08 +0000
Message-ID: <20220207233808.1a84d8ec@JRWUBU2>
In-Reply-To: <83czjyydqk.fsf@gnu.org>

On Mon, 07 Feb 2022 16:04:35 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sun, 6 Feb 2022 22:09:58 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> > Cc: larsi@gnus.org, 20140@debbugs.gnu.org, Kenichi Handa
> > <handa@gnu.org> 
> > > > Sad to see that Khaled Hosny's suggestion not to use composition
> > > > rules seems not to have been taken.    
> > > 
> > > You mean, to pass all the text via HarfBuzz instead?  That makes
> > > the Emacs redisplay painfully slow, and would require a complete
> > > redesign of how we render text to be bearable.  So as long as
> > > such a redesign is not available, we cannot use that advice.  
> > 
> > Except for Malayalam!  (Subexpression XX* in indian.el at the
> > moment.)  
> 
> (That was changed lately.  But it is a tangent.)
> 
> > > > They're complicated by the facts that the 'regular expressions'
> > > > are not interpreted as regular expressions and they are not
> > > > interpreted as closed under canonical equivalence.  I therefore
> > > > calculate the regular expression.    
> > > 
> > > I'm not sure I understand the issue: what you do seems to be very
> > > similar to what we do for the Indic scripts in indian.el, so what
> > > kind of complications are you talking about here?  
> > 
> > Well, those rules themselves are a bit odd.  Why are you composing
> > single clusters?  Why are you breaking clusters where Microsoft
> > imitators are likely to insert dotted circles?  
> 
> I'm not sure this is what I asked.  I asked why you think this way of
> defining patterns for composition rules is in any way exceptional.  It
> seems pretty much boilerplate to me.

Your 'boilerplate' rules look like a straightforward derivation from
the DirectWrite rules for valid subsequences - I haven't checked for
repair work.  That seems unlikely to handle prohibited dittograms
nicely.  It also wouldn't work well when 'well-formed' adjacent
clusters need to interact, as with virama-terminated clusters in
Kharoshthi and some styles of Brahmi.  I haven't hunted for their
definitions - I should probably download a recent tarball.

The exceptional features were the calculation of the regular
expression, especially the expression

(replace-regexp-in-string "X" basic_syllable regexp t t)
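
As an illustration only, the substitution step can be sketched as
follows.  The function name, the template and the sub-expression are
hypothetical stand-ins, not the actual Tai Tham rules; the point is
merely that "X" is a placeholder that gets replaced verbatim by another
regexp.

```elisp
;; Sketch only: the function name, template, and sub-expression are
;; hypothetical stand-ins, not the actual Tai Tham rules.
(defun my/expand-syllable-template (template basic-syllable)
  "Return TEMPLATE with each \"X\" replaced by the regexp BASIC-SYLLABLE."
  ;; Bind `case-fold-search' so that only an upper-case X is a placeholder.
  (let ((case-fold-search nil))
    ;; FIXEDCASE t: do not adapt the case of the replacement.
    ;; LITERAL t: insert BASIC-SYLLABLE verbatim, so its backslashes are
    ;; not read as replacement-text escapes.
    (replace-regexp-in-string "X" basic-syllable template t t)))

;; A toy template: one syllable, then any number of "-syllable" repeats.
(defconst my/example-regexp
  (my/expand-syllable-template "X\\(?:-X\\)*" "\\(?:ab\\|c\\)"))
```

The LITERAL argument matters here because the inserted sub-expression
is itself a regexp full of backslashes, which would otherwise be
treated as replacement-text escapes rather than copied as they stand.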

> > The best sources are the regular expressions in the proposals, but
> > they missed out the combination of tone mark and final consonant
> > signs.  
> 
> Can you be more specific about those proposals?  Any specific
> pointers?
> 
> Also, does this mean there's currently no widely accepted agreement
> regarding Tai Tham shaping?  What do native readers of that script
> expect?
> 
> > What do you mean by 'shaping'?  
> 
> Whatever is needed to produce correct display from a sequence of
> codepoints in a given script.

The main shaper writers refused to maintain such a service for Tai Tham,
though HarfBuzz did briefly provide such a service with its South East
Asian Shaper.  Windows still confesses its inability to render the full
range of orthographic syllables.  To work, fonts have to engage in
dotted circle removal by some means or other.

It seems that native readers expect a font encoding, where the key
sequence for a mark (or subscript consonant) specifies its position and
shape.  I was badly shocked when I found the backing store for the Tai
Tham Northern Thai New Testament.  I found examples of marks above
entered in the reverse order to what Unicode-savvy people would expect,
and the complete opposite to what one would type for Thai, for which
input systems generally enforce the rule of typing from base
character outwards.

The general pointer would be to look at the English Wikipedia entry for
<block_name>_(Unicode_block) and follow the proposal documents it
lists.  In this case, that leads to Section 13 of

https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf

The codepoints have changed since then, but the names (apart from the
script name) and representative glyphs have been pretty stable.  The
relationship between the outermost subexpression and syllables needs
updating, and 'H' needs to be updated to include other subscript
consonants, but formally the expression as a whole still stands.

> > > At least for the dotted circle case, Emacs has a general
> > > composition rule; see compose-gstring-for-dotted-circle and the
> > > corresponding rule in composite.c.  So I'm not sure we need
> > > anything specific to Tai Tham there.  
> > 
> > Does the 3-character Khmer sequence "◌្ក" <U+25CC, U+17D2, U+1780>
> > work in Version 28?  It doesn't in Version 26.3.  It should look
> > like a dotted circle with the lower part of ក្ក below it.  In
> > Version 26.3, I don't even get the consonant U+1780 subscripted!  
> 
> No, it doesn't produce what you want (though the 2nd and the 3rd
> characters do combine), but that's not surprising: the general rules
> for U+25CC that we have cover only a single combining mark after it:
> 
>   (aset composition-function-table #x25CC
> 	`([,(purecopy ".\\c^") 0 compose-gstring-for-dotted-circle]))
> 
> So a sequence of more than one character after U+25CC needs an
> explicit rule to work.  What is the rule in this case?  (And what does
> Khmer have to do with the question I asked, which is about Tai Tham?)

You asked if there were any Tai Tham specific requirements.  The
requirement is general, but the need for Khmer is the most obvious.  The
rule for Brahmi, Kharoshthi and their descendants is fairly close to
'take any existing composition, and substitute dotted circle for the
first letter (Lo)'.  For the important cases, it is:

(i) Dotted circle plus any sequence of marks (let the shaper worry
about validity);
(ii) Dotted circle, conjoining operator, consonant, VS?;
(iii) Dotted circle, conjoining operator, consonant, VS?, any sequence
of marks; and
(iv) Any of (i)-(iii) preceded by anything repha-like.

'Conjoining operator' is a virama or pure stacker optionally preceded or
followed by ZWJ or ZWNJ.  VS is a variation selector.

'Repha-like' includes U+0D4E MALAYALAM LETTER DOT REPH, the Mymr script
kinzi sequences, and the prototypical <U+0930 DEVANAGARI LETTER RA,
U+094D DEVANAGARI SIGN VIRAMA>.
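
A minimal sketch of cases (i)-(iii) as an Emacs regexp, using Khmer as
the example script.  All of the my/... names are hypothetical, the
character ranges are a simplified reading of the Khmer block, and case
(iv) is left out.  The regexp is only tested against strings here;
whether compose-gstring-for-dotted-circle copes with such longer
sequences is not something this sketch establishes.

```elisp
;; Sketch only: all `my/...' names are hypothetical, and the character
;; ranges are a simplified reading of the Khmer block.
(defconst my/dotted-circle "\u25CC")
(defconst my/joiner "[\u200C\u200D]?")           ; optional ZWNJ or ZWJ
(defconst my/conjoiner                           ; COENG with optional joiners
  (concat my/joiner "\u17D2" my/joiner))
(defconst my/consonant "[\u1780-\u17A2]")
(defconst my/vs "[\uFE00-\uFE0F]?")              ; optional variation selector
(defconst my/mark "[\u17B6-\u17D1\u17D3\u17DD]") ; vowel signs and other marks

(defconst my/dotted-circle-cluster
  (concat my/dotted-circle
          "\\(?:"
          my/mark "+"                                  ; case (i)
          "\\|"
          my/conjoiner my/consonant my/vs my/mark "*"  ; cases (ii) and (iii)
          "\\)"))

(defun my/dotted-circle-cluster-p (string)
  "Return non-nil if STRING is exactly one dotted-circle cluster."
  (string-match-p (concat "\\`" my/dotted-circle-cluster "\\'") string))
```

With these definitions the three-character sequence <U+25CC, U+17D2,
U+1780> discussed above is accepted as a single cluster, while a bare
dotted circle is not.  The sketch allows a joiner on both sides of the
conjoining operator at once, which is slightly more permissive than the
wording of the rule.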

The entire sequence would be best handled in the renderer, though you
may have problems with selecting the font and script.

Richard.




