From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Tue, 06 Aug 2019 17:32:33 +0300
Message-ID: <83a7cmcqke.fsf@gnu.org>
References: <87lfw8r744.fsf@mail.linkov.net> <83k1brd28a.fsf@gnu.org> <87zhknzc7c.fsf@mail.linkov.net>
In-reply-to: <87zhknzc7c.fsf@mail.linkov.net> (message from Juri Linkov on Mon, 05 Aug 2019 22:41:59 +0300)
To: Juri Linkov
Cc: 36923@debbugs.gnu.org

> From: Juri Linkov
> Cc: 36923@debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
>
> >> (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >> (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >> latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
>
> The docstring of char-script-table says:
>
>   Char table of script symbols.
>   It has one extra slot whose value is a list of script symbols.
>
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot.  The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for
     each character, a symbol whose name is the script to which the
     character belongs, according to the Unicode Standard
     classification of the Unicode code space into script-specific
     blocks.  This char-table has a single extra slot whose value is
     the list of all script symbols.

> I searched more for char-script-table in the documentation, and one
> place where it's used is forward-word.  But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
> the Latin script) and non-Latin letters.

See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.

> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all L2R, so there's no need to move by sexps
in R2L direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents.  The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
>
> Could you suggest a proper function to strip all combining characters
> from the string?

Each base character has its canonical combining class attribute as
zero, so you could use

  (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.

Alternatively, you could go by categories: base characters have the ?.
category set, combining characters have the ?^ category set.

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.
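
For illustration, here is a minimal sketch of the canonical-combining-class
approach.  The names my-strip-combining-characters and
my-base-character-scripts are hypothetical helpers made up for this example,
not existing Emacs functions; only get-char-code-property and
char-script-table come from Emacs itself.  It assumes that a non-zero
combining class is a good enough test for the purpose; some marks (enclosing
marks, for example) have class zero and would be kept as if they were base
characters.

```elisp
(require 'seq)

;; Hypothetical helper: keep only characters whose canonical
;; combining class is zero, i.e. the base characters.
(defun my-strip-combining-characters (string)
  "Return STRING without characters of non-zero canonical combining class."
  (concat
   (seq-filter
    (lambda (ch)
      (let ((ccc (get-char-code-property ch 'canonical-combining-class)))
        ;; Treat a missing property value like class zero (an assumption).
        (or (null ccc) (zerop ccc))))
    string)))

;; Hypothetical helper: collect the scripts of the base characters only,
;; which is roughly what a confusables check would want to look at.
(defun my-base-character-scripts (string)
  "Return the list of distinct scripts of the base characters in STRING."
  (delete-dups
   (mapcar (lambda (ch) (aref char-script-table ch))
           (my-strip-combining-characters string))))
```

With these definitions, (my-strip-combining-characters "a\u0301") should
return "a", and (my-base-character-scripts "a\u0301\u0430\u0301") should
return (latin cyrillic), because the script of the two COMBINING ACUTE
ACCENT characters is never consulted.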