bug#36923: Combining Diacritical Marks are not Latin only

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#36923: Combining Diacritical Marks are not Latin only
@ 2019-08-04 20:40 Juri Linkov
  2019-08-05 16:08 ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Juri Linkov @ 2019-08-04 20:40 UTC
  To: 36923

The generated file lisp/international/charscript.el
assigns the block “Combining Diacritical Marks” to the ‘latin’ script
on the assumption that these characters are used only in Latin.

But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
the acute accent marks the stressed vowel of a word in several languages
with alphabets based on the Latin, Cyrillic, and Greek scripts.
In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
mentions how characters from other blocks are used in Cyrillic script.
Moreover, the Combining Diacritical Marks block also
contains several characters from the Greek script:
COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS,
COMBINING GREEK DIALYTIKA TONOS, and COMBINING GREEK YPOGEGRAMMENI.

I noticed this problem recently while helping to develop char-fold where
GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.

Of course, it's possible to add exceptions for characters in this block
in markchars-mode.  But before doing this, I'm asking for confirmation
whether the Unicode data should be fixed in ‘char-script-table’, so e.g.

  (aref char-script-table ?\N{COMBINING ACUTE ACCENT})

could return

  (latin greek cyrillic)

instead of the current

  latin


* bug#36923: Combining Diacritical Marks are not Latin only
  2019-08-04 20:40 bug#36923: Combining Diacritical Marks are not Latin only Juri Linkov
@ 2019-08-05 16:08 ` Eli Zaretskii
  2019-08-05 19:41   ` Juri Linkov
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2019-08-05 16:08 UTC
  To: Juri Linkov; +Cc: 36923

> From: Juri Linkov <juri@linkov.net>
> Date: Sun, 04 Aug 2019 23:40:38 +0300
> 
> The generated file lisp/international/charscript.el
> assigns the block “Combining Diacritical Marks” to the ‘latin’ script
> on the assumption that these characters are used only in Latin.
> 
> But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
> the acute accent marks the stressed vowel of a word in several languages
> with alphabets based on the Latin, Cyrillic, and Greek scripts.
> In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
> mentions how characters from other blocks are used in Cyrillic script.
> Moreover, the Combining Diacritical Marks block also
> contains several characters from the Greek script:
> COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS,
> COMBINING GREEK DIALYTIKA TONOS, and COMBINING GREEK YPOGEGRAMMENI.
> 
> I noticed this problem recently while helping to develop char-fold where
> GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
> alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.
> 
> Of course, it's possible to add exceptions for characters in this block
> in markchars-mode.  But before doing this, I'm asking for confirmation
> whether the Unicode data should be fixed in ‘char-script-table’, so e.g.
> 
>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> 
> could return
> 
>   (latin greek cyrillic)
> 
> instead of the current
> 
>   latin

char-script-table is documented to yield a single symbol, so returning
a list would be an incompatible change, which we should avoid.

More generally, I think what you describe is a clear conceptual bug in
markchars-mode: it should only pay attention to the script of the base
characters, not to the script of combining accents.  The latter is
mostly irrelevant, certainly so for the purpose of detecting
confusables.

So I think this should be fixed in markchars-mode, and the fact that
we somewhat arbitrarily assign those diacritics to the latin script is
not a serious problem, if at all.






* bug#36923: Combining Diacritical Marks are not Latin only
  2019-08-05 16:08 ` Eli Zaretskii
@ 2019-08-05 19:41   ` Juri Linkov
  2019-08-06 14:32     ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Juri Linkov @ 2019-08-05 19:41 UTC
  To: Eli Zaretskii; +Cc: 36923

>>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
>>
>> could return
>>
>>   (latin greek cyrillic)
>>
>> instead of the current
>>
>>   latin
>
> char-script-table is documented to yield a single symbol, so returning
> a list would be an incompatible change, which we should avoid.

The docstring of char-script-table says:

  Char table of script symbols.
  It has one extra slot whose value is a list of script symbols.

So it seems char-script-table should yield a list of script symbols?

I searched further for char-script-table in the documentation, and one
place where it's used is forward-word.  But I don't understand why
forward-word doesn't stop between “COMBINING ACUTE ACCENT” (which is
assigned to the Latin script) and non-Latin letters.

It is good that it doesn't stop there, and I'm just trying to
understand why, so that the same logic could be used in markchars-mode.
Maybe it doesn't stop because of special script handling in
‘find-word-boundary-function-table’?  Or because it ignores all
combining characters?

BTW, while looking at forward-word and right-word I noticed an inconsistency:
there are left-word and right-word commands, but no left-sexp and right-sexp
to accompany forward-sexp.

> More generally, I think what you describe is a clear conceptual bug in
> markchars-mode: it should only pay attention to the script of the base
> characters, not to the script of combining accents.  The latter is
> mostly irrelevant, certainly so for the purpose of detecting
> confusables.

Could you suggest a proper function to strip all combining characters
from the string?


* bug#36923: Combining Diacritical Marks are not Latin only
  2019-08-05 19:41   ` Juri Linkov
@ 2019-08-06 14:32     ` Eli Zaretskii
  2019-08-07 21:44       ` Juri Linkov
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2019-08-06 14:32 UTC
  To: Juri Linkov; +Cc: 36923

> From: Juri Linkov <juri@linkov.net>
> Cc: 36923@debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
> 
> >>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >>   (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >>   latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
> 
> The docstring of char-script-table says:
> 
>   Char table of script symbols.
>   It has one extra slot whose value is a list of script symbols.
> 
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot.  The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for each
     character, a symbol whose name is the script to which the character
     belongs, according to the Unicode Standard classification of the
     Unicode code space into script-specific blocks.  This char-table
     has a single extra slot whose value is the list of all script
     symbols.
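
A minimal sketch of that distinction, assuming a reasonably recent Emacs
(the `?\N{...}' character syntax needs Emacs 26 or later).  A lookup for
one character yields a single symbol; only the extra slot holds a list:

```elisp
;; Looking up one character yields a single script symbol.
(aref char-script-table ?a)                            ; => latin
(aref char-script-table ?\N{GREEK SMALL LETTER IOTA})  ; => greek

;; According to this thread, the combining acute accent is assigned
;; to `latin'; the exact value may depend on the Emacs version.
(aref char-script-table ?\N{COMBINING ACUTE ACCENT})

;; The list of all script symbols is stored in the single extra slot.
(char-table-extra-slot char-script-table 0)
```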

> I searched further for char-script-table in the documentation, and one
> place where it's used is forward-word.  But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (which is
> assigned to the Latin script) and non-Latin letters.

See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.
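
As a rough sketch of how to inspect this, the category of a character
can be tested with char-category-set, and the relevant entries of
word-combining-categories can be listed.  The helper name
`my-combining-char-p' is hypothetical and not part of Emacs:

```elisp
(require 'seq)

(defun my-combining-char-p (char)
  "Return non-nil if CHAR has the `^' (combining diacritic or mark) category.
`my-combining-char-p' is a hypothetical helper name, not part of Emacs."
  ;; `char-category-set' returns a bool-vector indexed by category character.
  (aref (char-category-set char) ?^))

(my-combining-char-p ?\N{COMBINING ACUTE ACCENT})  ; non-nil
(my-combining-char-p ?a)                           ; nil

;; List the entries of `word-combining-categories' that mention `^'.
(seq-filter (lambda (pair)
              (or (eq (car pair) ?^) (eq (cdr pair) ?^)))
            word-combining-categories)
```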

> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed an inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all L2R, so there's no need to move by sexps
in R2L direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents.  The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
> 
> Could you suggest a proper function to strip all combining characters
> from the string?

Each base character has a canonical combining class of zero, so you
could use

   (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.

Alternatively, you could go by categories: base characters have the
?. category set, combining characters have the ?^ category set.

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.
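
A minimal sketch of such a filter, assuming the seq library is
available.  The function name `my-strip-combining-chars' is
hypothetical, and a nil property value is treated as class 0 as a
defensive assumption:

```elisp
(require 'seq)

(defun my-strip-combining-chars (string)
  "Return STRING without characters whose canonical combining class is non-zero.
`my-strip-combining-chars' is a hypothetical helper name, not part of Emacs."
  (apply #'string
         (seq-filter
          (lambda (ch)
            (let ((ccc (get-char-code-property ch 'canonical-combining-class)))
              ;; Assumption: a missing (nil) value counts as class 0,
              ;; i.e. the character is kept as a base character.
              (or (null ccc) (zerop ccc))))
          string)))

;; "e" followed by COMBINING ACUTE ACCENT (U+0301).
(my-strip-combining-chars "e\u0301")  ; => "e"
```

Precomposed characters such as U+00E9 are left untouched, because they
are single characters with combining class 0; decompose the string
first if those accents should be stripped as well.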







* bug#36923: Combining Diacritical Marks are not Latin only
  2019-08-06 14:32     ` Eli Zaretskii
@ 2019-08-07 21:44       ` Juri Linkov
  0 siblings, 0 replies; 5+ messages in thread
From: Juri Linkov @ 2019-08-07 21:44 UTC
  To: Eli Zaretskii; +Cc: 36923-done

> Each base character has its canonical combining class attribute as
> zero, so you could use
>
>    (get-char-code-property CHAR 'canonical-combining-class)
>
> to filter out those CHARs for which the value is non-zero.
>
> Alternatively, you could go by categories: base characters have the
> ?. category set, combining characters have the ?^ category set.
>
> My recommendation is to use the canonical-combining-class property, as
> it is a more direct way of doing this.

Thanks, I fixed markchars-mode by using canonical-combining-class.
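
The actual change to markchars-mode is not shown in this thread.  A
rough sketch of the idea, with hypothetical helper names that are not
taken from markchars-mode, could look like this:

```elisp
(defun my-base-char-scripts (string)
  "Return the distinct scripts of the base characters in STRING.
Characters with a non-zero canonical combining class are skipped.
Hypothetical helper; not the code used in markchars-mode."
  (let (scripts)
    (dolist (ch (string-to-list string))
      (let ((ccc (get-char-code-property ch 'canonical-combining-class))
            (script (aref char-script-table ch)))
        ;; Only base characters (class 0, or nil by assumption) count.
        (when (and script
                   (or (null ccc) (zerop ccc))
                   (not (memq script scripts)))
          (push script scripts))))
    (nreverse scripts)))

(defun my-mixed-scripts-p (string)
  "Return non-nil if the base characters of STRING use more than one script."
  (cdr (my-base-char-scripts string)))

;; Greek iota followed by COMBINING GREEK DIALYTIKA TONOS: one script only.
(my-base-char-scripts
 (string ?\N{GREEK SMALL LETTER IOTA}
         ?\N{COMBINING GREEK DIALYTIKA TONOS}))  ; => (greek)
```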




