bug#20140: 24.4; M17n shaper output rejected - Richard Wordingham via Bug reports for GNU Emacs, the Swiss army knife of text editors

From: Richard Wordingham via "Bug reports for GNU Emacs, the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 20140@debbugs.gnu.org, Kenichi Handa <handa@gnu.org>, larsi@gnus.org
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Sun, 6 Feb 2022 22:09:58 +0000
Message-ID: <20220206220958.5a4d8ffe@JRWUBU2>
In-Reply-To: <83y22oza77.fsf@gnu.org>

On Sun, 06 Feb 2022 10:11:08 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sat, 5 Feb 2022 22:52:51 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> > Cc: Lars Ingebrigtsen <larsi@gnus.org>, 20140@debbugs.gnu.org
> > 
> > I'm currently using the vanilla emacs on Ubuntu Focal, which is
> > described as 'GNU Emacs 26.3 (build 2, x86_64-pc-linux-gnu, GTK+
> > Version 3.24.14) of 2020-03-26, modified by Debian'.  The key good
> > news is that the commands forward-char-intrusive and
> > backward-char-intrusive are now standard, so I can position the
> > cursor by dead-reckoning.  You can reasonably mark the issue as
> > solved.  
> 
> I don't see the commands forward-char-intrusive and
> backward-char-intrusive anywhere in Emacs, so I guess they are your
> local changes, based on the code posted by Handa-san in this
> discussion?
> 
> > > The most important change is that we now use HarfBuzz by default.
> > >  
> > 
> > Isn't that only true for Emacs 27.1 and above?  
> 
> That's true, but Emacs 26 is ancient history; Emacs 28.1 is about to
> be released.  So from our perspective, HarfBuzz is the default shaping
> engine, and since it's available on all the supported platforms we
> care about, we are phasing out m17n-flt shapers.
> 
> > > Richard didn't contribute the Tai Tham composition rules to us
> > > (AFAIR), so I cannot test what happens now in Emacs with HarfBuzz.
> > > Maybe we should revisit this issue, but first I hope Richard could
> > > tell whether the issue still exists, and if so, what composition
> > > rules he uses or suggests to use for Tai Tham.  
> > 
> > Sad to see that Khaled Hosny's suggestion not to use composition
> > rules seems not to have been taken.  
> 
> You mean, to pass all the text via HarfBuzz instead?  That makes the
> Emacs redisplay painfully slow, and would require a complete redesign
> of how we render text to be bearable.  So as long as such a redesign
> is not available, we cannot use that advice.

Except for Malayalam!  (Subexpression XX* in indian.el at the moment.)

> > You're welcome to include my composition rules.  
> 
> Thanks.
> 
> > They're complicated by the facts that the 'regular expressions' are
> > not interpreted as regular expressions and they are not interpreted
> > as closed under canonical equivalence.  I therefore calculate the
> > regular expression.  
> 
> I'm not sure I understand the issue: what you do seems to be very
> similar to what we do for the Indic scripts in indian.el, so what kind
> of complications are you talking about here?

Well, those rules themselves are a bit odd.  Why are you composing
single clusters?  Why are you breaking clusters where Microsoft
imitators are likely to insert dotted circles?

The basic structure for most Indic scripts is R*C(M|HC)*(M|H)* where R
is miscellaneous prefixed forms (e.g. dot reph, visarga variants), C is
consonants (and things that can act like them), H is the conjoining
operator, and M is miscellaneous marks, including ZWJ and ZWNJ.
"(M|H)*" accounts for explicit viramas and isolated half-forms.
Jackboots are then applied on the ground that spell checkers cannot be
relied upon.
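
A minimal sketch of how the R*C(M|HC)*(M|H)* structure can be applied,
assuming a deliberately tiny Devanagari class table chosen only for
illustration (it is not taken from indian.el or from any shaper, and
the names char_class and split_clusters are hypothetical).  Each
character is mapped to a class letter, and the pattern is matched
against the resulting class string.  In this toy table no character
falls in class R, so the R* part never fires.

```python
import re

# The cluster structure from the text, applied to a string of class letters.
CLUSTER = re.compile(r"R*C(?:M|HC)*(?:M|H)*")

def char_class(ch):
    """Map a character to a class letter (toy Devanagari table, an assumption)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:          # consonants KA..HA
        return "C"
    if cp == 0x094D:                    # virama, used here as conjoining operator
        return "H"
    if 0x093E <= cp <= 0x094C or cp in (0x0902, 0x0903, 0x200C, 0x200D):
        return "M"                      # vowel signs, anusvara, visarga, ZWNJ, ZWJ
    return "X"                          # anything else stands alone

def split_clusters(text):
    """Split text into clusters; unmatched characters become 1-char clusters."""
    classes = "".join(char_class(ch) for ch in text)
    clusters, i = [], 0
    while i < len(text):
        m = CLUSTER.match(classes, i)
        end = m.end() if m else i + 1
        clusters.append(text[i:end])
        i = end
    return clusters
```

With this table, a consonant followed by a bare virama still forms one
cluster, which is the case that "(M|H)*" is there to cover.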

The first problem for Tai Tham is that marks with canonical
combining class (ccc) greater than 9 (note that script-specific nuktas
generally have ccc=7) do not mix with conjoining operators with ccc=9;
the conjoining operator (as opposed to visible virama) should not be
separated from the following consonant. Mark Davis ignored this
requirement from the proposals, so unless your 'regular' expression is
acting on traces under canonical equivalence rather than mere strings,
one has to complicate the expressions to cope.
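
The reordering can be observed with Python's unicodedata module.  This
is only a sketch: the choice of the two consonants (U+1A36 and U+1A45)
and of U+1A75 as the mark is an assumption made for illustration, and
the variable names are hypothetical.  It shows that normalization moves
a mark with ccc=230 to the far side of U+1A60 (ccc=9), so a pattern
working on plain strings has to accept both orders.

```python
import unicodedata

SAKOT = "\u1a60"      # Tai Tham conjoining operator, ccc=9
MARK = "\u1a75"       # a Tai Tham tone mark, ccc=230 (illustrative choice)
BASE = "\u1a36"       # a consonant letter, ccc=0 (illustrative choice)
SUBJOINED = "\u1a45"  # the consonant to be subjoined, ccc=0

def ccc_list(text):
    """Return the canonical combining class of each character."""
    return [unicodedata.combining(ch) for ch in text]

# Order in which the conjoining operator stays next to its consonant.
typed = BASE + MARK + SAKOT + SUBJOINED

# Canonical ordering sorts adjacent non-starters by ccc, so the operator
# (9) is moved in front of the mark (230) and away from its consonant.
normalized = unicodedata.normalize("NFD", typed)

print(ccc_list(typed))       # classes in typed order
print(ccc_list(normalized))  # classes after canonical reordering
```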

The second issue is that the behaviour of U+1A58 TAI THAM SIGN MAI KANG
LAI is a stylistic variable.  It can act like a dot reph or a
phonetic-syllable-final mark. My composition rules therefore have to
treat it as gluing orthographic syllables together.

The third issue, which is less visible, is that I had a problem with
back-tracking.

> Also, your rules seem to follow the description in the "Structuring
> Tai Tham Unicode" document (Revision 7), a.k.a. "L2/19-365", dated Oct
> 2019, is that right?  Is that document the latest word on shaping Tai
> Tham, or are there any additional sources?

No, the document's a crime.  I tried to minimise its destructiveness,
which is why I got an acknowledgement in it.  I advocate sticking to
phonetic order, as in Khmer and Brahmi.  That scheme needs a couple of
formally unproposed characters to make some distinctions.

The best sources are the regular expressions in the proposals, but they
missed out the combination of tone mark and final consonant signs.
What do you mean by 'shaping'?  For Tai Tham, the only positive service
provided by rendering engines is the movement of preposed vowels and
MEDIAL RA to the start of the glyph sequence; all the other
resequencing has to be done by the fonts themselves.

> > There are some deficiencies; I've a feeling there may be a problem
> > with adding ZWNJ and CGJ as marks; ZWJ should also be added for
> > completeness.  
> 
> These are barely mentioned in the L2/19-365 document, and not
> mentioned at all in the Tai Tham section of the Unicode Standard.
> Does it mean they are not very important in contemporary Tai Tham
> texts?

The Tai Tham section is based on information before grammar
nazification disabled Tai Tham texts, or at least, those that were to
be rendered using restrictive shapers based on alleged knowledge of the
languages.  ZWNJ is a standard mechanism for disabling ligatures in
non-cursive scripts, though I'm not sure of the balance of ZWJ and ZWNJ
in Fraktur, e.g. the different renderings of the two meanings
of Antiqua German Wachstube.

CGJ is needed where there is no other character to mark the boundary of
two chained syllables and concatenating the vowel and tone marks of the
two together violates the ordering rules for a single syllable.  It
would also be needed to mark other differences relevant to collation,
e.g. if syllable-initial BA were sorted according to its pronunciation,
as in one major dictionary.  Automating an inconsistent hand-sort is
hard, slow work, especially as the CLDR tools choke on an easy Lao
sort.  (By contrast, the official Thai sort is very machine-friendly.)

> > I need ZWNJ to write 4-column ᨴᩣᩴᨶ᩠ᩅ‌ᩣ᩠ᨿ as opposed to
> > 3-column ᨴᩣᩴᨶ᩠ᩅᩣ᩠ᨿ, and even with my font, HarfBuzz will need CGJ
> > for the suppression of jack-booted dotted circles. Additionally, for
> > didactic text, what can I do for U+25CC for explicit display of
> > marks and their equivalents on a dotted circle, and for that
> > matter, for display on NBSP?  

This, the main use of ZWNJ, was unknown to the authors of the Tai Tham
proposals.  In Lao texts of the 1930s, non-ligation seems to mark an
enthusiasm for the spelling reforms, which one normally thinks of as
only applying to the Lao script.

Having looked at indian.el, it seems that it will be easy to add these
controls (CGJ, ZWJ and ZWNJ) to the composition tables.

> At least for the dotted circle case, Emacs has a general composition
> rule; see compose-gstring-for-dotted-circle and the corresponding rule
> in composite.c.  So I'm not sure we need anything specific to Tai Tham
> there.

Does the 3-character Khmer sequence "◌្ក" <U+25CC, U+17D2, U+1780> work
in Version 28?  It doesn't in Version 26.3.  It should look like a
dotted circle with the lower part of ក្ក below it.  In Version 26.3, I
don't even get the consonant U+1780 subscripted!
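
For anyone wanting to reproduce the test, the sequence can be built
from its code points rather than copied from the mail.  This is a small
sketch using only the standard unicodedata module; the helper name
describe is hypothetical.

```python
import unicodedata

# The 3-character Khmer test sequence <U+25CC, U+17D2, U+1780>.
sequence = "\u25cc\u17d2\u1780"

def describe(text):
    """List each character as 'U+XXXX NAME' to confirm what was built."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for line in describe(sequence):
    print(line)
```

The resulting string can then be pasted into a buffer to see how a
given Emacs version renders it.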

With HarfBuzz, if you don't compose U+25CC with the following mark, you
are very likely to get two dotted circles - are you deliberately
deleting one?  Doing so wouldn't be a reliable process.

Possibly I could fix the rendering problem by also composing sequences
starting with marks - to be investigated.  If it works, it might work
with NBSP, though it wouldn't help with my plan for <NBSP, ZWJ, spacing
mark> to render as just the spacing mark. 

> Can you recommend good fonts for Tai Tham?  Are they free fonts?

Almost all Tai Tham fonts have problems.  Probably the best is the
one used for the New Testament, which relies on the SIL Graphite
renderer. I'll dig into that one.

The nicest OTL shaper-based one for most words is Lamphun, which is
based on Hariphunchai.  Unfortunately, not even Lamphun distinguishes
subscript HIGH RATHA from the subscript <HIGH RATA, SIGN HIGH RATHA OR
LOW PA>, and it is rather limited for interacting marks - Hariphunchai
lacks mark-to-mark positioning.  The commoner combinations of marks are
handled by glyph substitution, and Lamphun has made a start on
mark-to-mark positioning.  Hariphunchai and Lamphun are available under
the SIL Open Font licence.

For Lao and Pali, Khottabun is a nice font, but there are some
idiosyncrasies in its encoding of words.  (Unicode appears only to
define character encoding, and is largely silent on the encoding of Tai
Tham words.)  It is available under the SIL Open Font licence, so I
can and perhaps ought to add it to my renderer
(https://wrdingham.co.uk/renderer_test.htm) and font
(https://wrdingham.co.uk/font_test.htm) tests.  Unfortunately, it only
supports characters used for Lao or Pali.  It appears to evade the
jackboots of the HarfBuzz implementation of the Universal Shaping Engine
(USE) by not having a glyph for U+25CC - cunning!  I don't know whether
this trick works with the Windows renderers.

There's a clutch of Tai Khuen fonts released under the SIL Open Font
licence that are aesthetically satisfying, but have a tendency to rely
on Tai Khuen orthographic rules to avoid clashing glyphs, and don't
extend to supporting somewhat exceptional words like Pali _indriya_.
The fonts are:

A Tai Tham KH
A Tai Tham KH New
A Tai Tham KH New V3

They are unlikely to work with Uniscribe or DirectWrite, as they rely
on the ccmp or liga feature being enabled for the default script; I'm
not sure whether that's a problem for those using emacs on Windows.

If you don't mind the reactionary square nature of the glyphs, there
is also my Da Lekh family, with full coverage of the encoded character
set, and some support for language-specific glyphs that are very
different between the languages.  (Generally the glyphs aim to be an
'international' compromise.)  Features may be used instead of
language environment - I don't set out to punish Windows victims.  There
are four fonts:

Da Lekh
Da Lekh Si
Da Lekh Seri
Da Lekh Si Seri

The ones with Seri in the name have the same freedoms as the Deja Vu
fonts and none of the restrictions.  (I drew all their glyphs.) The
others have the same freedoms and, necessarily, restrictions as Deja
Vu Sans.  The Seri (meaning 'untrammelled') fonts were created for
unconstrained use by the Unicode Consortium and deliberately have no
defence against the jackboots of the Universal Shaping Engine.  They
should work fine with the M17n renderer.  Unfortunately, for its Latin
glyphs, one only gets what one pays for. The ones with 'Si' colour
conjoined syllables red so that one can see how words are spelt.  This
capability was added for use with spell checkers, and I use it
successfully for spell-checking in Firefox and LibreOffice.

Richard.
