From: Eli Zaretskii <eliz@gnu.org>
To: Juri Linkov <juri@linkov.net>
Cc: 36923@debbugs.gnu.org
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Tue, 06 Aug 2019 17:32:33 +0300
Message-ID: <83a7cmcqke.fsf@gnu.org>
In-Reply-To: <87zhknzc7c.fsf@mail.linkov.net> (message from Juri Linkov on Mon, 05 Aug 2019 22:41:59 +0300)
> From: Juri Linkov <juri@linkov.net>
> Cc: 36923@debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
>
> >> (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >> (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >> latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
>
> The docstring of char-script-table says:
>
> Char table of script symbols.
> It has one extra slot whose value is a list of script symbols.
>
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot.  The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for each
     character, a symbol whose name is the script to which the character
     belongs, according to the Unicode Standard classification of the
     Unicode code space into script-specific blocks.  This char-table
     has a single extra slot whose value is the list of all script
     symbols.

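To illustrate the distinction, here is a small sketch that can be evaluated with M-: or in *scratch*.  The comments describe what the documentation above implies, not verified output:

```elisp
;; Each character maps to a single script symbol, not a list.
(aref char-script-table ?a)

;; For the combining acute accent the value is currently the symbol
;; `latin', as noted earlier in this thread.
(aref char-script-table ?\N{COMBINING ACUTE ACCENT})

;; The list of all script symbols lives only in extra slot 0.
(char-table-extra-slot char-script-table 0)
```
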
> I searched more for char-script-table in the documentation, and one
> place where it's used is forward-word. But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
> the Latin script) and non-Latin letters.

See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.

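As a rough way to check this in a running session (a sketch only; the exact contents of the variable depend on the Emacs version):

```elisp
;; Mnemonics of the categories assigned to U+0301; the result should
;; include "^" (combining diacritic or mark).
(category-set-mnemonics (char-category-set ?\N{COMBINING ACUTE ACCENT}))

;; Non-nil when the character belongs to category ?^.
(aref (char-category-set ?\N{COMBINING ACUTE ACCENT}) ?^)

;; The category pairs between which no word boundary is recognized.
word-combining-categories
```
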
> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all written left-to-right (L2R), so there's
no need to move by sexps in the right-to-left (R2L) direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents. The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
>
> Could you suggest a proper function to strip all combining characters
> from the string?

Every base character has a canonical combining class of zero, so you
could use

  (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.

Alternatively, you could go by categories: base characters have the
?. category set, combining characters have the ?^ category set.

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.
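
A minimal sketch of such a filter follows.  The function name my-strip-combining-chars is hypothetical, and treating a nil property value as zero is an assumption made for safety:

```elisp
(require 'seq)

(defun my-strip-combining-chars (string)
  "Return a copy of STRING without its combining characters.
A character is dropped when its canonical combining class is non-zero.
The name of this function is hypothetical; it is not part of Emacs."
  (concat
   (seq-remove
    (lambda (ch)
      ;; Treat a missing property value as class 0 (an assumption).
      (/= 0 (or (get-char-code-property ch 'canonical-combining-class)
                0)))
    string)))

;; Usage: "e" followed by U+0301 COMBINING ACUTE ACCENT.
(my-strip-combining-chars "e\u0301")
```

This only removes combining characters that are present as separate code points.  A precomposed character such as U+00E9 is left alone unless the string is decomposed first, for example with ucs-normalize-NFD-string.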