unofficial mirror of help-gnu-emacs@gnu.org
From: Richard Wordingham <richard.wordingham@ntlworld.com>
To: help-gnu-emacs@gnu.org
Subject: Re: Composed Sequences
Date: Sat, 26 Feb 2022 19:46:16 +0000
Message-ID: <20220226194616.4c6e0330@JRWUBU2>
In-Reply-To: <83fso5prnp.fsf@gnu.org>

On Sat, 26 Feb 2022 17:35:22 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sat, 26 Feb 2022 15:11:44 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> >   
> > > > Different renderers give different clusters, and thus, by
> > > > default, different cursor motion!    
> >   
> > > Not "different renderers", but "different fonts".  
> > 
> > I experimented with the Tai Tham composition-function-table entry
> > 
> > (list (vector "[\u1a20-\u1aad]+" 0 'font-shape-gstring))
> > 
> > In GNU Emacs 23.4.1 (i386-mingw-nt6.2.9200) using Uniscribe, the
> > glyph string for the word ᨠᩣ᩠ᨿ <1A20 HIGH KA, 1A63 AA, 1A60 SAKOT,
> > 1A3F LOW YA> in Version 0.8 of my font Da Lekh is divided into two
> > clusters, as identified by the 'glyph' values [0 1 6688...] [0 1
> > 6688...] [2 3 6752...] and confirmed by ordinary cursor motion.
> > While this division into <1A20, 1A63> and <1A60, 1A3F> is not the
> > Unicode division into grapheme clusters, it accords with what are
> > natively namable clusters.
> > 
> > For GNU Emacs 27.1 (build1 i686-w64-mingw32) of 2020-08-21, which
> > uses HarfBuzz, the same word is one indivisible cluster (at least
> > with Version 0.13 of the same font).  I think this is a change in
> > the behaviour of HarfBuzz.  
> 
> If you must have the last word in this.  (It's quite clear that in
> gray areas, such as Tai Tham, and where a shaping engine has a bug or
> a misfeature, the results will also depend on the shaping engine.  But
> that is not the main lesson to be taken home from the original issue,
> which btw was with Arabic, not Tai Tham.)

The original query was how the cursor could wind up being displayed
inside a cluster as defined by the composition rules.  The answer is
that it is always allowed at the boundary of graphemes, as defined
below.
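
To see where those boundaries fall for a particular composition, one
can look at the FROM and TO fields of each glyph in the glyph string,
which are the first two numbers in values such as [0 1 6688...].  What
follows is only a minimal Emacs Lisp sketch: the two `my-...' function
names are hypothetical, and it assumes that `find-composition' with a
non-nil DETAIL-P returns the glyph string as the third list element for
an automatic composition.

```elisp
(defun my-gstring-grapheme-ranges (gstring)
  "Return the distinct (FROM . TO) character ranges of the glyphs in GSTRING.
Glyphs sharing the same FROM and TO belong to the same grapheme."
  (let (ranges)
    (dotimes (i (lgstring-glyph-len gstring))
      (let ((g (lgstring-glyph gstring i)))
        ;; Unused trailing glyph slots are nil; skip them.
        (when g
          (let ((range (cons (lglyph-from g) (lglyph-to g))))
            (unless (member range ranges)
              (push range ranges))))))
    (nreverse ranges)))

(defun my-show-grapheme-ranges ()
  "Echo the grapheme ranges of the automatic composition at point, if any."
  (interactive)
  (let* ((comp (find-composition (point) nil nil t))
         (gstring (nth 2 comp)))
    ;; Assumption: a glyph string is a vector whose first element (the
    ;; header) is itself a vector; anything else is not treated as one.
    (if (and (vectorp gstring)
             (> (length gstring) 2)
             (vectorp (aref gstring 0)))
        (message "Grapheme ranges: %S"
                 (my-gstring-grapheme-ranges gstring))
      (message "No automatic composition at point"))))
```

For the three glyphs quoted above, [0 1 6688...] [0 1 6688...] [2 3
6752...], the first function returns ((0 . 1) (2 . 3)), i.e. two
graphemes within the one cluster.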

It does, unfortunately, seem that the Uniscribe behaviour results from
oppressive coding, rather than any desire to support default grapheme
clusters (Unicode) or the like.

> > > Emacs
> > > obeys the decisions of the font designers.  

> > Unless they recorded the positions of the boundaries between the
> > parts of a ligature!  

> I don't understand what you mean by that.

The GDEF table of an OpenType font records the boundary between the
components of a ligature glyph, via the 'ligature caret list' table
therein. These data, if they exist, are amongst the 'decisions of the
font designers'.

Annoyingly, the font designers may be overridden by the rendering
engine designers.  A font designer can merge 'graphemes', but seemingly
not split 'graphemes'.

Glossary:

cluster  - sequence of coded characters presented to the shaping engine
           to be shaped.

grapheme - sequence of coded characters which the shaping engine
           treats as a unit for the purpose of 'hit detection'.

(Perhaps this glossary has been published somewhere.)

In principle, a glyph may be shared between two graphemes, but I doubt
that Emacs has a mechanism to support that.

> Emacs behaves according to what the shaping engine tells us about the
> number of graphemes in the cluster.  Each grapheme is (by default) a
> single unit for the purposes of cursor motion: Emacs will not let you
> "enter" the grapheme, even if it is made out of several glyphs.  But
> there's nothing in particular that Emacs expects from the number and
> order of the graphemes in a cluster, we just use what the shaping
> engine hands back to us.  And the cursor motion in Emacs is by default
> in logical order, i.e. in the increasing order of buffer positions of
> the original codepoints.

I hope you mean "several characters", not "several glyphs".  The
exception is related to disable-point-adjustment and its relatives, and
I think also to undisplayed buffers.
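
As a hedged illustration of that exception: a command that leaves
`disable-point-adjustment' non-nil when it finishes asks the command
loop not to move point out of a composition afterwards.  The command
name below is hypothetical, and whether the cursor is then actually
drawn inside the cluster is a separate matter for the display code.

```elisp
(defun my-forward-char-unadjusted (&optional n)
  "Move point forward N characters and suppress point adjustment.
Point may then rest between the characters of a single grapheme."
  (interactive "p")
  (forward-char (or n 1))
  ;; The command loop resets this variable before each command; leaving
  ;; it non-nil here suppresses the adjustment for this command only.
  (setq disable-point-adjustment t))
```

The variable `global-disable-point-adjustment' has the same effect for
every command, so it is probably best kept to experiments.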

Richard.



