From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 13 Feb 2016 10:49:30 +0200
Message-ID: <834mdd6llx.fsf@gnu.org>
In-reply-to: <87oablfpn3.fsf@mail.linkov.net> (message from Juri Linkov on Sat, 13 Feb 2016 01:57:33 +0200)
To: Juri Linkov
Cc: ofv@wanadoo.es, emacs-devel@gnu.org

> From: Juri Linkov
> Cc: Óscar Fuentes, emacs-devel@gnu.org
> Date: Sat, 13 Feb 2016 01:57:33 +0200
>
> Can't we somehow use the same char-folding as is implemented in
> ICU String Search Service (this is also used for search in Chromium):
> http://userguide.icu-project.org/collation/icu-string-search-service
> that supports matching of accented letters, conjoined letters,
> and ignorable punctuation.
>
> As is described in http://userguide.icu-project.org/collation/concepts
> there are several levels of character matching:
>
> 1. Primary Level: differences between base characters
>
> 2. Secondary Level: Accents in the characters
>
> 3. Tertiary Level: Upper and lower case differences in characters
>
> 4. Quaternary Level: Punctuation is ignored (where e.g. snake-cased
>    “black_bird” matches camel-cased “blackBird”)
>
> 5. Identical Level
>
> Maybe our customization could provide options to choose
> between all these levels?

That's the final goal, yes.  The current implementation is just the
initial step, and it basically does just item #1.  (The list above is
about collation, not about searching, so the wording does not really
fit the searching use case.  Also, it just reiterates what the Unicode
TR#10, http://unicode.org/reports/tr10/, specifies.)

The implementation should really be on the C level, like the
case-folding support.  The current implementation isn't, and therefore
has several disadvantages, some of which were already pointed out
(e.g., the regexp it uses gets exposed in some situations and
surprises users).

For these and other reasons, I think we should replace the current
implementation with one that's in search_buffer, driven by tables
generated from the Unicode database.  I also think we will be unable
to move to the higher levels mentioned above without first moving the
implementation into search_buffer.

Volunteers are welcome to work on that.

Doing this will eventually require using the data in DUCET (Default
Unicode Collation Element Table) and CLDR (Common Locale Data
Repository), I think, to support both the language-independent and
language-dependent folding.  But this is only needed for the next
levels; the current level, which basically only looks at the base
character, doesn't need fancy databases apart from what we already
have.

At the time, no one stepped forward to do this on the C level, and the
current implementation was considered to be good enough for the first
step.
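
To make the "base character only" level concrete, here is a minimal
sketch in Python.  It is not how Emacs or search_buffer does it: it
assumes that canonical decomposition (NFD) plus dropping combining
marks is an acceptable stand-in for folding to the base character,
and the names fold_to_base and folded_contains are hypothetical,
made up for this illustration.  Case differences are deliberately
left alone, since case folding is a separate step.

```python
import unicodedata


def fold_to_base(text: str) -> str:
    """Reduce each character to its base character (illustrative only).

    Assumption: NFD decomposition followed by removal of characters with
    a non-zero canonical combining class approximates base-character
    folding.  No DUCET or CLDR data is consulted.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


def folded_contains(needle: str, haystack: str) -> bool:
    """Return True if needle occurs in haystack once both are folded."""
    return fold_to_base(needle) in fold_to_base(haystack)


if __name__ == "__main__":
    print(fold_to_base("Óscar"))
    print(folded_contains("resume", "send me your résumé"))
```

A search built this way compares folded text on both sides, so no
regexp is ever shown to the user; the higher levels (and any
language-dependent rules) would need the collation data mentioned
above rather than plain decomposition.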