Manually parsing char-tables

unofficial mirror of help-gnu-emacs@gnu.org

* Manually parsing char-tables
@ 2022-02-20 11:09 Richard Wordingham
  2022-02-20 12:50 ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Wordingham @ 2022-02-20 11:09 UTC (permalink / raw)
  To: help-gnu-emacs

I am trying to understand how Arabic script rendering works in Emacs
28.0.90, as it seems to be using a different mechanism to that used for
Indic or European scripts.  (There seems to be more to it than just the
asymmetries between right-to-left and left-to-right.)  To that end, I
am trying to understand the contents of the variable
composition-function-table.

When I use command describe-variable, the value shown starts out:

#^[nil nil nil nil

  #^^[1 0

    #^^[2 0 nil nil nil nil nil nil

      #^^[3 768 #1=(["\\c.\\c^+" 1 compose-gstring-for-graphic] 

                    [nil 0 compose-gstring-for-graphic])

        #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1#
        #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1#
        #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1#
        #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1#
        #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1#
        #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1#
        #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# #1# nil
        nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil]

      nil nil 

      #^^[3 1152 nil nil nil #1# #1# #1# #1# #1# #1# #1# nil nil nil
      nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
      nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
      nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
      nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
      nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
      nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
      nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
      nil nil nil]

(I've converted lines to paragraphs and abbreviated leading white
space.)

 I'm guessing that #1# is a macro invocation; when I invoke (print
 composition-function-table), I get something similar, but with #1#
 expanded and the '#1=' in the apparent macro definition omitted.

Where is this syntax explained?  I've looked in the elisp manual, but
not found it, though I may simply have failed to guess where such a
description was.

Richard.


* Re: Manually parsing char-tables
  2022-02-20 11:09 Manually parsing char-tables Richard Wordingham
@ 2022-02-20 12:50 ` Eli Zaretskii
  2022-02-21  1:39   ` Richard Wordingham
  2022-02-26  0:28   ` Composed Sequences (was: Manually parsing char-tables) Richard Wordingham
  0 siblings, 2 replies; 9+ messages in thread
From: Eli Zaretskii @ 2022-02-20 12:50 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Sun, 20 Feb 2022 11:09:26 +0000
> From: Richard Wordingham <richard.wordingham@ntlworld.com>
> 
> I am trying to understand how Arabic script rendering works in Emacs
> 28.0.90, as it seems to be using a different mechanism to that used for
> Indic or European scripts.  (There seems to be more to it than just the
> asymmetries between right-to-left and left-to-right.)  To that end, I
> am trying to understand the contents of the variable
> composition-function-table.

I think it is easier to just look at how the Arabic part of this table
is populated.  See lisp/language/misc-lang.el starting from line 105.

>       #^^[3 1152 nil nil nil #1# #1# #1# #1# #1# #1# #1# nil nil nil
>       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>       nil nil nil]
> 
> (I've converted lines to paragraphs and abbreviated leading white
> space.)
> 
>  I'm guessing that #1# is a macro invocation; when I invoke (print
>  composition-function-table), I get something similar, but with #1#
>  expanded and the '#1=' in the apparent macro definition omitted.

#1# is a backreference to the value indicated by #1=.

> Where is this syntax explained?  I've looked in the elisp manual, but
> not found it, though I may simply have failed to guess where such a
> description was.

See the node "Circular Objects" there.
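
A minimal sketch of what that node describes, using a made-up list
rather than the real table (the function name my-shared-structure-demo
is hypothetical): '#1=' labels an object as it is read, and each later
'#1#' stands for that same object, not a copy.  The printer writes the
labels only when print-circle is non-nil, which would be consistent
with 'print' showing the expanded form.

```elisp
;; Illustrative only: the data is invented, not taken from
;; composition-function-table.
(defun my-shared-structure-demo ()
  "Return (SAME-P WITH-LABELS WITHOUT-LABELS) for a small shared list."
  (let* ((rules (read "(#1=(a b) #1# #1#)"))
         (one (nth 0 rules))
         (two (nth 1 rules)))
    (list
     ;; `#1#' is the very object labelled by `#1=', so `eq' holds.
     (eq one two)
     ;; With `print-circle' non-nil the printer shows the sharing.
     (let ((print-circle t)) (prin1-to-string rules))
     ;; With it nil the shared object is written out each time.
     (let ((print-circle nil)) (prin1-to-string rules)))))
```

Evaluating (my-shared-structure-demo) should give
(t "(#1=(a b) #1# #1#)" "((a b) (a b) (a b))").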

(Btw, 28.0.90 is not the latest pretest of Emacs 28, there's 28.0.91.)




* Re: Manually parsing char-tables
  2022-02-20 12:50 ` Eli Zaretskii
@ 2022-02-21  1:39   ` Richard Wordingham
  2022-02-26  0:28   ` Composed Sequences (was: Manually parsing char-tables) Richard Wordingham
  1 sibling, 0 replies; 9+ messages in thread
From: Richard Wordingham @ 2022-02-21  1:39 UTC (permalink / raw)
  To: help-gnu-emacs

On Sun, 20 Feb 2022 14:50:54 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sun, 20 Feb 2022 11:09:26 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> > 
> > I am trying to understand how Arabic script rendering works in Emacs
> > 28.0.90, as it seems to be using a different mechanism to that used
> > for Indic or European scripts.  (There seems to be more to it than
> > just the asymmetries between right-to-left and left-to-right.)  To
> > that end, I am trying to understand the contents of the variable
> > composition-function-table.  
> 
> I think it is easier to just look at how the Arabic part of this table
> is populated.  See lisp/language/misc-lang.el starting from line 105.

I first wanted to check that it was overwritten somewhere else.

> >       #^^[3 1152 nil nil nil #1# #1# #1# #1# #1# #1# #1# nil nil nil
> >       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> >       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> >       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> >       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> >       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> >       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> >       nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> >       nil nil nil]
> > 
> > (I've converted lines to paragraphs and abbreviated leading white
> > space.)
> > 
> >  I'm guessing that #1# is a macro invocation; when I invoke (print
> >  composition-function-table), I get something similar, but with #1#
> >  expanded and the '#1=' in the apparent macro definition omitted.  
> 
> #1# is a backreference to the value indicated by #1=.
> 
> > Where is this syntax explained?  I've looked in the elisp manual,
> > but not found it, though I may simply have failed to guess where
> > such a description was.  
> 
> See the node "Circular Objects" there.

That was reassuring - but I'm wondering why it was not familiar.  Had I
forgotten it?  Perhaps it's later than Emacs 19, when I last came close
to reading the lisp reference manual cover to cover.

Even the read syntax of a char-table is poorly documented. Using the
hint of an unexpanded reference to a 'sub-char-table', I've discovered
that the first key to understanding it is in lisp.h, and I may have to
delve into the .c files for the finer details.  It looks full of tricks
to reduce the storage requirement, which are reflected in the read
syntax. Perhaps it's not been documented because someone hopes it will
be cleaned up, but it is a useful syntax for dumping the table if
someone suspects the structure has been corrupted.  I will now present
my analysis in the hope that someone will find it useful.

Basically the data is stored in 64 blocks (of 'depth' 1) each for 2^16
characters, which in turn are composed of 16 blocks (of 'depth' 2) each
for 2^12 characters, which in turn are composed of 32 blocks (of 'depth'
3) each for 128 characters.  These blocks are the 'sub-char-tables', and
are introduced as a vector with two prepended items - the depth and the
first character code.  If all the data in a block is the same, that
same value replaces its sub-char-table.  (That happens with the
Unicode Arabic Block, which is covered by two sub-char-tables.)  This
structure is, eminently sensibly, hidden from the lisp interfaces.  The
sub-char-tables' syntax is basically

#^^[depth min_char ...]

where the ellipsis is the values at the lower level.
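
As a rough sketch of that layout (the function names are hypothetical,
and the block sizes are simply the ones given above, not read from the
C sources), the position of a character at each level can be computed
by integer division:

```elisp
;; Block sizes as described above: 2^16, 2^12 and 2^7 characters.
(defun my-char-table-path (char)
  "Return (TOP D1 D2 D3), the index of CHAR at each level.
TOP picks one of the 64 depth-1 blocks, D1 one of the 16 depth-2
blocks inside it, D2 one of the 32 depth-3 blocks inside that, and
D3 the slot for CHAR within its 128-character block."
  (list (/ char 65536)
        (/ (% char 65536) 4096)
        (/ (% char 4096) 128)
        (% char 128)))

(defun my-sub-char-table-min-char (char)
  "Return the first character of the depth-3 block holding CHAR."
  (* 128 (/ char 128)))
```

For example, (my-char-table-path #x300) gives (0 0 6 0), which agrees
with the six nils that precede '#^^[3 768' in the dump, and
(my-sub-char-table-min-char #x483) gives 1152, the min_char of the
second depth-3 block shown.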

I suspect that the char-table syntax is basically

#^[default parent purpose ascii_block ...]

but I haven't verified the order of those first four values, and indeed
I may have them wrong.
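
Those leading values can also be read through the ordinary Lisp
accessors instead of the read syntax.  A small sketch (the function
names are hypothetical; the accessors themselves are the standard
char-table functions):

```elisp
(defun my-char-table-header (table)
  "Return a plist of TABLE's default value, parent and purpose."
  (list :default (char-table-range table nil) ; nil means the default
        :parent  (char-table-parent table)
        :purpose (char-table-subtype table)))

(defun my-composition-rules-for (char)
  "Return the `composition-function-table' entry for CHAR, if any."
  ;; `aref' on a char-table hides the sub-char-table structure.
  (aref composition-function-table char))
```

Calling (my-char-table-header composition-function-table) and
(my-composition-rules-for #x300) lets one compare the results with the
start of the dump without parsing it by hand.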

(In case anyone is wondering, the Emacs code space consists of 64
planes, rather than Unicode's 'measly' 17.)

Richard.


* Composed Sequences (was: Manually parsing char-tables)
  2022-02-20 12:50 ` Eli Zaretskii
  2022-02-21  1:39   ` Richard Wordingham
@ 2022-02-26  0:28   ` Richard Wordingham
  2022-02-26  6:33     ` Eli Zaretskii
  1 sibling, 1 reply; 9+ messages in thread
From: Richard Wordingham @ 2022-02-26  0:28 UTC (permalink / raw)
  To: help-gnu-emacs

On Sun, 20 Feb 2022 14:50:54 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sun, 20 Feb 2022 11:09:26 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> > 
> > I am trying to understand how Arabic script rendering works in Emacs
> > 28.0.90, as it seems to be using a different mechanism to that used
> > for Indic or European scripts.  (There seems to be more to it than
> > just the asymmetries between right-to-left and left-to-right.)  To
> > that end, I am trying to understand the contents of the variable
> > composition-function-table.  

(The version above should actually read 28.0.91.)

> I think it is easier to just look at how the Arabic part of this table
> is populated.  See lisp/language/misc-lang.el starting from line 105.

I still haven't found the code where the difference occurs, but I now
have a better idea of what is going on.  It seems that runs with the
same value of the composition property ('composed sequences') are
sequences of clusters for the font that match a regular expression
given in composition-function-table.  Different renderers give
different clusters, and thus, by default, different cursor motion!
The insight was given by GNU Emacs 23.4.1 (i386-mingw-nt6.2.9200)
running on Windows 10.

The reason Arabic seemed different is that when lam+hah appears to
ligate, what is happening (at least with Amiri) is that substitutions
are made which give the effect of a ligature, while remaining two
distinct glyphs.

Richard.


* Re: Composed Sequences (was: Manually parsing char-tables)
  2022-02-26  0:28   ` Composed Sequences (was: Manually parsing char-tables) Richard Wordingham
@ 2022-02-26  6:33     ` Eli Zaretskii
  2022-02-26 15:11       ` Composed Sequences Richard Wordingham
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-02-26  6:33 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Sat, 26 Feb 2022 00:28:37 +0000
> From: Richard Wordingham <richard.wordingham@ntlworld.com>
> 
> I still haven't found the code where the difference occurs, but I now
> have a better idea of what is going on.  It seems that runs with the
> same value of the composition property ('composed sequences') are
> sequences of clusters for the font that match a regular expression
> given in composition-function-table.

(Please don't use "composition property" in this context, because it's
confusing: the 'composition' text property does exist in Emacs (it's
an old and now deprecated way of composing characters), but it is not
relevant to this discussion, which instead focuses on what is known in
Emacs as "automatic composition".)

> Different renderers give different clusters, and thus, by default,
> different cursor motion!

Not "different renderers", but "different fonts".

And yes, the grapheme clusters produced by the text shaping engine
(HarfBuzz etc.) and displayed by the Emacs display code indeed
crucially depend on the font.

> The reason Arabic seemed different is that when lam+hah appears to
> ligate, what is happening (at least with Amiri) is that substitutions
> are made which give the effect of a ligature, while remaining two
> distinct glyphs.

Yes, I see that as well.  "C-u C-x =" should tell you whether ligation
happened or not.  What you see is normal, I think: Emacs obeys the
decisions of the font designers.




* Re: Composed Sequences
  2022-02-26  6:33     ` Eli Zaretskii
@ 2022-02-26 15:11       ` Richard Wordingham
  2022-02-26 15:35         ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Wordingham @ 2022-02-26 15:11 UTC (permalink / raw)
  To: help-gnu-emacs

On Sat, 26 Feb 2022 08:33:35 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sat, 26 Feb 2022 00:28:37 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> > 
> > I still haven't found the code where the difference occurs, but I
> > now have a better idea of what is going on.  It seems that runs
> > with the same value of the composition property ('composed
> > sequences') are sequences of clusters for the font that match a
> > regular expression given in composition-function-table.  
> 
> (Please don't use "composition property" in this context, because it's
> confusing: the 'composition' text property does exist in Emacs (it's
> an old and now deprecated way of composing characters), but it is not
> relevant to this discussion, which instead focuses on what is known in
> Emacs as "automatic composition".)

Ah, I've misinterpreted some of the code.

> > Different renderers give different clusters, and thus, by default,
> > different cursor motion!  

> Not "different renderers", but "different fonts".

I experimented with the Tai Tham composition-function-table entry

(list (vector "[\u1a20-\u1aad]+" 0 'font-shape-gstring))
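
For anyone wanting to repeat the experiment, a sketch of one way such
an entry can be installed (the function name
my-install-composition-rule is hypothetical, and this is an assumption
about the procedure, not a record of what was actually evaluated):

```elisp
(defun my-install-composition-rule (table from to pattern)
  "In TABLE, make characters FROM..TO compose via PATTERN.
The rule has the same shape as the Tai Tham entry above:
a regexp, a look-back count of 0, and `font-shape-gstring'."
  (set-char-table-range
   table (cons from to)
   (list (vector pattern 0 'font-shape-gstring))))

;; Typical use, on the real table:
;; (my-install-composition-rule composition-function-table
;;                              #x1A20 #x1AAD "[\u1a20-\u1aad]+")
```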

In GNU Emacs 23.4.1 (i386-mingw-nt6.2.9200) using Uniscribe, for the word
ᨠᩣ᩠ᨿ <1A20 HIGH KA, 1A63 AA, 1A60 SAKOT, 1A3F LOW YA>, the glyph string
for Version 0.8 of my font Da Lekh is divided into two
clusters as identified by the 'glyph' values [0 1 6688...] [0 1
6688...] [2 3 6752...] and confirmed by ordinary cursor motion.  While
this division into <1A20, 1A63> and <1A60, 1A3F> is not the Unicode
division into grapheme clusters, it accords with what are natively
namable clusters.
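
To make the reading of those 'glyph' values explicit, here is a small
sketch (the function name is hypothetical) that assumes each glyph
vector starts with the first and last character index it covers,
followed by the character code, and groups glyphs that share the same
pair of indices:

```elisp
(defun my-group-glyphs-by-cluster (glyphs)
  "Group GLYPHS, vectors starting [FROM TO CHAR ...], by (FROM . TO).
Return an alist of ((FROM . TO) . COUNT) in order of first appearance."
  (let (result)
    (dolist (g glyphs)
      (let* ((key (cons (aref g 0) (aref g 1)))
             (cell (assoc key result)))   ; `assoc' compares with `equal'
        (if cell
            (setcdr cell (1+ (cdr cell)))
          (push (cons key 1) result))))
    (nreverse result)))
```

With the values quoted above,
(my-group-glyphs-by-cluster '([0 1 6688] [0 1 6688] [2 3 6752]))
gives (((0 . 1) . 2) ((2 . 3) . 1)): two clusters, the first drawn
with two glyphs and the second with one.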

For GNU Emacs 27.1 (build1 i686-w64-mingw32) of 2020-08-21, which uses
HarfBuzz, the same word is one indivisible cluster (at least with
Version 0.13 of the same font).  I think this is a change in the
behaviour of HarfBuzz.

So the result should also depend on the clustering by the rendering engine.

> > The reason Arabic seemed different is that when lam+hah appears to
> > ligate, what is happening (at least with Amiri) is that
> > substitutions are made which give the effect of a ligature, while
> > remaining two distinct glyphs.  

> Yes, I see that as well.  "C-u C-x =" should tell you whether ligation
> happened or not.  What you see is normal, I think: Emacs obeys the
> decisions of the font designers.

Unless they recorded the positions of the boundaries between the parts
of a ligature!  (There is such a facility in the GDEF table, but it is
very widely ignored, and so a consumer would have to check its quality.)

Richard.


* Re: Composed Sequences
  2022-02-26 15:11       ` Composed Sequences Richard Wordingham
@ 2022-02-26 15:35         ` Eli Zaretskii
  2022-02-26 19:46           ` Richard Wordingham
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-02-26 15:35 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Sat, 26 Feb 2022 15:11:44 +0000
> From: Richard Wordingham <richard.wordingham@ntlworld.com>
> 
> > > Different renderers give different clusters, and thus, by default,
> > > different cursor motion!  
> 
> > Not "different renderers", but "different fonts".
> 
> I experimented with the Tai Tham composition-function-table entry
> 
> (list (vector "[\u1a20-\u1aad]+" 0 'font-shape-gstring))
> 
> For GNU Emacs 23.4.1 (i386-mingw-nt6.2.9200) using Uniscribe, the word
> ᨠᩣ᩠ᨿ <1A20 HIGH KA, 1A63 AA, 1A60 SAKOT, 1A3F LOW YA>, the glyph string
> for Version 0.8 of my font Da Lekh is divided into two
> clusters as identified by the 'glyph' values [0 1 6688...] [0 1
> 6688...] [2 3 6752...] and confirmed by ordinary cursor motion.  While
> this division into <1A20, 1A63> and <1A60, 1A3F> is not the Unicode
> division into grapheme clusters, it accords with what are natively
> namable clusters.
> 
> For GNU Emacs 27.1 (build1 i686-w64-mingw32) of 2020-08-21, which uses
> HarfBuzz, the same word is one indivisible cluster (at least with
> Version 0.13 of the same font).  I think this is a change in the
> behaviour of HarfBuzz.

If you must have the last word in this.  (It's quite clear that in
gray areas, such as Tai Tham, and where a shaping engine has a bug or
a misfeature, the results will also depend on the shaping engine.  But
that is not the main lesson to be taken home from the original issue,
which btw was with Arabic, not Tai Tham.)

> > > The reason Arabic seemed different is that when lam+hah appears to
> > > ligate, what is happening (at least with Amiri) is that
> > > substitutions are made which give the effect of a ligature, while
> > > remaining two distinct glyphs.  
> 
> > Yes, I see that as well.  "C-u C-x =" should tell you whether ligation
> > happened or not.  What you see is normal, I think: Emacs obeys the
> > decisions of the font designers.
> 
> Unless they recorded the positions of the boundaries between the parts
> of a ligature!

I don't understand what you mean by that.

Emacs behaves according to what the shaping engine tells us about the
number of graphemes in the cluster.  Each grapheme is (by default) a
single unit for the purposes of cursor motion: Emacs will not let you
"enter" the grapheme, even if it is made out of several glyphs.  But
there's nothing in particular that Emacs expects from the number and
order of the graphemes in a cluster, we just use what the shaping
engine hands back to us.  And the cursor motion in Emacs is by default
in logical order, i.e. in the increasing order of buffer positions of
the original codepoints.


* Re: Composed Sequences
  2022-02-26 15:35         ` Eli Zaretskii
@ 2022-02-26 19:46           ` Richard Wordingham
  2022-02-26 20:02             ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Wordingham @ 2022-02-26 19:46 UTC (permalink / raw)
  To: help-gnu-emacs

On Sat, 26 Feb 2022 17:35:22 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sat, 26 Feb 2022 15:11:44 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> >   
> > > > Different renderers give different clusters, and thus, by
> > > > default, different cursor motion!    
> >   
> > > Not "different renderers", but "different fonts".  
> > 
> > I experimented with the Tai Tham composition-function-table entry
> > 
> > (list (vector "[\u1a20-\u1aad]+" 0 'font-shape-gstring))
> > 
> > For GNU Emacs 23.4.1 (i386-mingw-nt6.2.9200) using Uniscribe, the
> > word ᨠᩣ᩠ᨿ <1A20 HIGH KA, 1A63 AA, 1A60 SAKOT, 1A3F LOW YA>, the
> > glyph string for Version 0.8 of my font Da Lekh is divided into two
> > clusters as identified by the 'glyph' values [0 1 6688...] [0 1
> > 6688...] [2 3 6752...] and confirmed by ordinary cursor motion.
> > While this division into <1A20, 1A63> and <1A60, 1A3F> is not the
> > Unicode division into grapheme clusters, it accords with what are
> > natively namable clusters.
> > 
> > For GNU Emacs 27.1 (build1 i686-w64-mingw32) of 2020-08-21, which
> > uses HarfBuzz, the same word is one indivisible cluster (at least
> > with Version 0.13 of the same font).  I think this is a change in
> > the behaviour of HarfBuzz.  
> 
> If you must have the last word in this.  (It's quite clear that in
> gray areas, such as Tai Tham, and where a shaping engine has a bug or
> a misfeature, the results will also depend on the shaping engine.  But
> that is not the main lesson to be taken home from the original issue,
> which btw was with Arabic, not Tai Tham.)

The original query was how the cursor could wind up being displayed
inside a cluster as defined by the composition rules.  The answer is
that it is always allowed at the boundary of graphemes, as defined
below.

It does, unfortunately, seem that the Uniscribe behaviour results from
oppressive coding, rather than any desire to support default grapheme
clusters (Unicode) or the like.

> > > Emacs
> > > obeys the decisions of the font designers.  

> > Unless they recorded the positions of the boundaries between the
> > parts of a ligature!  

> I don't understand what you mean by that.

The GDEF table of an OpenType font records the boundary between the
components of a ligature glyph, via the 'ligature caret list' table
therein. These data, if they exist, are amongst the 'decisions of the
font designers'.

Annoyingly, the font designers may be overridden by the rendering
engine designers.  A font designer can merge 'graphemes', but seemingly
not split 'graphemes'.

Glossary:

cluster  - sequence of coded characters presented to the shaping engine
           to be shaped.

grapheme - A sequence of coded characters which the shaping engine
           treats as a unit for the purpose of 'hit detection'.

(Perhaps this glossary has been published somewhere.)

In principle, a glyph may be shared between two graphemes, but I doubt
that Emacs has a mechanism to support that.

> Emacs behaves according to what the shaping engine tells us about the
> number of graphemes in the cluster.  Each grapheme is (by default) a
> single unit for the purposes of cursor motion: Emacs will not let you
> "enter" the grapheme, even if it is made out of several glyphs.  But
> there's nothing in particular that Emacs expects from the number and
> order of the graphemes in a cluster, we just use what the shaping
> engine hands back to us.  And the cursor motion in Emacs is by default
> in logical order, i.e. in the increasing order of buffer positions of
> the original codepoints.

I hope you mean "several characters", not "several glyphs".  The
exception is related to disable-point-adjustment and its relatives, and
I think also to undisplayed buffers.

Richard.




* Re: Composed Sequences
  2022-02-26 19:46           ` Richard Wordingham
@ 2022-02-26 20:02             ` Eli Zaretskii
  0 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2022-02-26 20:02 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Sat, 26 Feb 2022 19:46:16 +0000
> From: Richard Wordingham <richard.wordingham@ntlworld.com>
> 
> > > > Emacs
> > > > obeys the decisions of the font designers.  
> 
> > > Unless they recorded the positions of the boundaries between the
> > > parts of a ligature!  
> 
> > I don't understand what you mean by that.
> 
> The GDEF table of an OpenType font records the boundary between the
> components of a ligature glyph, via the 'ligature caret list' table
> therein. These data, if they exist, are amongst the 'decisions of the
> font designers'.

Emacs doesn't (yet) use that information, so it cannot (yet) let you
move "inside" the ligature, if the ligature is a single grapheme.

> Glossary:
> 
> cluster  - sequence of coded characters presented to the shaping engine
>            to be shaped.

That's not the terminology we use in Emacs.  A grapheme cluster is the
output of shaping, not the input.  The input is just a match for a
regular expression that expresses our idea of the shortest sequence of
characters the shaper needs to see to do its job correctly.

> In principle, a glyph may be shared between two graphemes, but I doubt
> that Emacs has a mechanism to support that.

It doesn't, and I don't think HarfBuzz can produce such results (IIUC
what you mean).

> > Emacs behaves according to what the shaping engine tells us about the
> > number of graphemes in the cluster.  Each grapheme is (by default) a
> > single unit for the purposes of cursor motion: Emacs will not let you
> > "enter" the grapheme, even if it is made out of several glyphs.  But
> > there's nothing in particular that Emacs expects from the number and
> > order of the graphemes in a cluster, we just use what the shaping
> > engine hands back to us.  And the cursor motion in Emacs is by default
> > in logical order, i.e. in the increasing order of buffer positions of
> > the original codepoints.
> 
> I hope you mean "several characters", not "several glyphs".

I mean both: in general, the shaping engine can take N codepoints
(a.k.a. "characters") and return M glyphs, arranged as K graphemes, to
display those N codepoints.

> The exception is related to disable-point-adjustment and its
> relatives, and I think also to undisplayed buffers.

In Emacs 29, there's now a variable to allow cursor movement inside a
composed sequence even when disable-point-adjustment is nil (as it is
by default).



