From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Tue, 06 Aug 2019 17:32:33 +0300
Message-ID: <83a7cmcqke.fsf@gnu.org>
References: <87lfw8r744.fsf@mail.linkov.net> <83k1brd28a.fsf@gnu.org>
 <87zhknzc7c.fsf@mail.linkov.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="95855"; mail-complaints-to="usenet@blaine.gmane.org"
Cc: 36923@debbugs.gnu.org
To: Juri Linkov <juri@linkov.net>
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Tue Aug 06 16:33:10 2019
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by blaine.gmane.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.89)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>)
	id 1hv0WT-000OpD-R5
	for geb-bug-gnu-emacs@m.gmane.org; Tue, 06 Aug 2019 16:33:10 +0200
Original-Received: from localhost ([::1]:33818 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.86_2)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>)
	id 1hv0WS-0002H6-D5
	for geb-bug-gnu-emacs@m.gmane.org; Tue, 06 Aug 2019 10:33:08 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:43514)
 by lists.gnu.org with esmtp (Exim 4.86_2)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hv0WO-0002Go-6k
 for bug-gnu-emacs@gnu.org; Tue, 06 Aug 2019 10:33:05 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hv0WN-00022W-0l
 for bug-gnu-emacs@gnu.org; Tue, 06 Aug 2019 10:33:04 -0400
Original-Received: from debbugs.gnu.org ([209.51.188.43]:57030)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1hv0WM-00022L-HV
 for bug-gnu-emacs@gnu.org; Tue, 06 Aug 2019 10:33:02 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hv0WM-0002LA-9i
 for bug-gnu-emacs@gnu.org; Tue, 06 Aug 2019 10:33:02 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Eli Zaretskii <eliz@gnu.org>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Tue, 06 Aug 2019 14:33:02 +0000
Resent-Message-ID: <handler.36923.B36923.15651019748979@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 36923
X-GNU-PR-Package: emacs
Original-Received: via spool by 36923-submit@debbugs.gnu.org id=B36923.15651019748979
 (code B ref 36923); Tue, 06 Aug 2019 14:33:02 +0000
Original-Received: (at 36923) by debbugs.gnu.org; 6 Aug 2019 14:32:54 +0000
Original-Received: from localhost ([127.0.0.1]:37618 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
 id 1hv0WD-0002Kl-Us
 for submit@debbugs.gnu.org; Tue, 06 Aug 2019 10:32:54 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:59345)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eliz@gnu.org>) id 1hv0WB-0002KX-UI
 for 36923@debbugs.gnu.org; Tue, 06 Aug 2019 10:32:52 -0400
Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:53619)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
 id 1hv0W6-0001z4-HJ; Tue, 06 Aug 2019 10:32:46 -0400
Original-Received: from [176.228.60.248] (port=1704 helo=home-c4e4a596f7)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <eliz@gnu.org>)
 id 1hv0W5-0002jF-PF; Tue, 06 Aug 2019 10:32:46 -0400
In-reply-to: <87zhknzc7c.fsf@mail.linkov.net> (message from Juri Linkov on
 Mon, 05 Aug 2019 22:41:59 +0300)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 209.51.188.43
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
 the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-gnu-emacs>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: "bug-gnu-emacs"
 <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.bugs:164680
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/164680>

> From: Juri Linkov <juri@linkov.net>
> Cc: 36923@debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
> 
> >>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >>   (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >>   latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
> 
> The docstring of char-script-table says:
> 
>   Char table of script symbols.
>   It has one extra slot whose value is a list of script symbols.
> 
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot.  The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for each
     character, a symbol whose name is the script to which the character
     belongs, according to the Unicode Standard classification of the
     Unicode code space into script-specific blocks.  This char-table
     has a single extra slot whose value is the list of all script
     symbols.
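
To see the difference between the per-character value and the extra
slot, you could evaluate something like the following (a quick sketch,
not part of any proposed change):

```elisp
;; Each character maps to a single script symbol.
(aref char-script-table ?a)     ; a Latin letter yields the symbol `latin'

;; The combining accent discussed in this bug report also maps to one symbol.
(aref char-script-table ?\N{COMBINING ACUTE ACCENT})

;; Only the extra slot (index 0) holds a list: all known script symbols.
(char-table-extra-slot char-script-table 0)
```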

> I searched more for char-script-table in the documentation, and one
> place where it's used is forward-word.  But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
> the Latin script) and non-Latin letters.

See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.
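
If you want to look at the relevant entries yourself, something along
these lines should list them.  This is only a sketch: the helper name
is made up for illustration, and it assumes the entries are pairs of
category characters, with nil standing for "any category".

```elisp
(require 'seq)

(defun my-combining-word-pairs ()
  "Return the entries of `word-combining-categories' that mention `^'.
The `^' category is the one for combining diacritics and marks."
  (seq-filter (lambda (pair)
                (or (eq (car pair) ?^)
                    (eq (cdr pair) ?^)))
              word-combining-categories))

(my-combining-word-pairs)
```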

> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all L2R, so there's no need to move by sexps
in the R2L direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents.  The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
> 
> Could you suggest a proper function to strip all combining characters
> from the string?

Every base character has a canonical combining class of zero, so you
could use

   (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.
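
For example, a minimal sketch of such a filter could look like this
(the function name is hypothetical, and the `or' guard against a nil
property value is just defensive):

```elisp
(require 'seq)

(defun my-strip-combining-chars (string)
  "Return a copy of STRING without its combining characters.
A character is dropped when its canonical combining class is non-zero."
  (concat
   (seq-remove
    (lambda (ch)
      ;; Base characters have class 0; treat a nil value as 0 as well.
      (/= 0 (or (get-char-code-property ch 'canonical-combining-class) 0)))
    string)))

(my-strip-combining-chars "e\N{COMBINING ACUTE ACCENT}")   ; => "e"
```

Note that this only drops combining characters that appear as separate
characters in the string; a precomposed character such as é has class
zero and is kept as is.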

Alternatively, you could go by categories: base characters have the
?. category set, combining characters have the ?^ category set.
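
A sketch of the category-based test, assuming the category set is
indexed by category character as the ELisp manual describes (the
predicate name is made up for illustration):

```elisp
(defun my-combining-char-p (ch)
  "Return non-nil if CH belongs to the `^' category.
That category holds combining diacritics and marks."
  (aref (char-category-set ch) ?^))

(my-combining-char-p ?\N{COMBINING ACUTE ACCENT})   ; non-nil
(my-combining-char-p ?a)                            ; nil
```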

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.