From: Eli Zaretskii
To: Robert Pluim
Cc: emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: ignoring combining diacritics in isearch
Date: Wed, 23 Nov 2022 20:02:52 +0200
Message-ID: <83wn7ly2er.fsf@gnu.org>
In-Reply-To: <878rk1fuo2.fsf@gmail.com> (message from Robert Pluim on Wed, 23 Nov 2022 18:27:25 +0100)

> From: Robert Pluim
> Date: Wed, 23 Nov 2022 18:27:25 +0100
>
> Over on Stack Overflow, someone has been trying to get char-folded
> isearch working for Arabic, and has been having some issues because
> char-folding only works for equivalent characters, not base characters
> followed by combining characters. So eg searching for 'ee' when the
> buffer contains
>
> éé
>
> (thatʼs 'e' followed by COMBINING ACUTE ACCENT) fails.
>
> The following patch fixes that, but itʼs a bit of a sledgehammer (the
> "\\c^*" bit probably needs to be configurable, because there are
> diacritic-like codepoints in Arabic that are not combining, such as
> U+0640 ARABIC TATWEEL)

Yes, this is definitely not the way.  There are many more "foldings"
that Latin scripts don't know about.  For example, it should be
possible to fold the initial, medial, and final forms of letters that
exist in some scripts (including Arabic).

I think we've all but reached the limit to which this quasi-folding
via regexps can be stretched.
Writing regexps by hand, or semi-mechanically based on Unicode
properties, can only go so far.  _Real_ character folding cannot work
this way.  We should work on infrastructure for folding text for
search purposes, and then we can build features on top of that.
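
To make the distinction concrete, here is a minimal sketch of folding
the text itself, rather than expanding the search string into a
regexp.  It is written in Python purely for illustration and is not
Emacs code or a proposed design.  The names fold_for_search,
folded_contains and EXTRA_IGNORABLE are hypothetical, and dropping
every nonspacing mark (category Mn) is a simplifying assumption that
would be too aggressive for some scripts.

```python
import unicodedata

# Assumption: characters to ignore even though they are not combining
# marks.  U+0640 ARABIC TATWEEL has general category Lm, so a rule that
# only skips combining marks never reaches it.
EXTRA_IGNORABLE = {"\u0640"}


def fold_for_search(text: str) -> str:
    """Return a hypothetical search key for TEXT (not an Emacs API)."""
    # NFKD splits precomposed letters into base + mark and maps
    # compatibility characters, such as the Arabic presentation forms
    # (isolated, initial, medial, final), to their base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch
        for ch in decomposed
        # Mn = nonspacing mark, e.g. COMBINING ACUTE ACCENT or harakat.
        if unicodedata.category(ch) != "Mn" and ch not in EXTRA_IGNORABLE
    ]
    return "".join(kept).casefold()


def folded_contains(haystack: str, needle: str) -> bool:
    """Return True if NEEDLE occurs in HAYSTACK once both are folded."""
    return fold_for_search(needle) in fold_for_search(haystack)


if __name__ == "__main__":
    # 'e' followed by COMBINING ACUTE ACCENT, twice, searched as "ee".
    print(folded_contains("e\u0301e\u0301", "ee"))
    # The four presentation forms of BEH, folded to the base letter.
    print(fold_for_search("\ufe8f\ufe90\ufe91\ufe92") == "\u0628" * 4)
```

The sketch only answers whether a match exists.  Positions in the
folded key no longer line up with positions in the original text, so a
real search feature would also need a mapping back to buffer
positions, which is one reason this belongs in shared infrastructure
rather than in ad hoc regexps.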