From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: =?UTF-8?Q?Elias_M=C3=A5rtenson?= Newsgroups: gmane.emacs.devel Subject: Re: On language-dependent defaults for character-folding Date: Fri, 19 Feb 2016 17:22:18 +0800 Message-ID: References: <87io1xwq1e.fsf@wanadoo.es> <87vb5wvzfz.fsf@mail.linkov.net> <87io1wt4cc.fsf@wanadoo.es> <8737syoima.fsf@mail.linkov.net> <871t8iu277.fsf@wanadoo.es> <83d1s28kvh.fsf@gnu.org> <87r3gis7sm.fsf@wanadoo.es> <83twle71xy.fsf@gnu.org> <87io1us0te.fsf@wanadoo.es> <83pow26svf.fsf@gnu.org> <87a8n5srbp.fsf@wanadoo.es> <83d1s17npz.fsf@gnu.org> <87oablfpn3.fsf@mail.linkov.net> <834mdd6llx.fsf@gnu.org> <7fbb8bc7-9a97-4bad-a103-a6690a35241d@default> <834mdc5w6o.fsf@gnu.org> <838u2hu6aq.fsf@gnu.org> <871t899tde.fsf@gnus.org> <83y4ahru04.fsf@gnu.org> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=001a1143f9480c97a5052c1c0596 X-Trace: ger.gmane.org 1455873782 30317 80.91.229.3 (19 Feb 2016 09:23:02 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Fri, 19 Feb 2016 09:23:02 +0000 (UTC) Cc: Lars Ingebrigtsen , emacs-devel To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 19 10:22:59 2016 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1aWhH7-0000z0-GN for ged-emacs-devel@m.gmane.org; Fri, 19 Feb 2016 10:22:57 +0100 Original-Received: from localhost ([::1]:50282 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aWhH3-0003U2-50 for ged-emacs-devel@m.gmane.org; Fri, 19 Feb 2016 04:22:53 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:48957) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aWhGa-0003Ho-As for emacs-devel@gnu.org; Fri, 19 Feb 2016 04:22:28 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned 
(Exim 4.71) (envelope-from ) id 1aWhGV-00050M-P3 for emacs-devel@gnu.org; Fri, 19 Feb 2016 04:22:23 -0500 Original-Received: from mail-vk0-x22b.google.com ([2607:f8b0:400c:c05::22b]:36812) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aWhGV-00050E-Iz; Fri, 19 Feb 2016 04:22:19 -0500 Original-Received: by mail-vk0-x22b.google.com with SMTP id c3so68846432vkb.3; Fri, 19 Feb 2016 01:22:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=mVFWt5bPvnaOXHs3ogs+zwFdXF0r9BSfdx4KjLqftsU=; b=ryORrlcBDuGp2g0SRb+6mby209FE8ut7RSHAvkXQ5Y9yHaz+ZGzii/hrMg56JRI6ua 4dF214uNrgHAqYmQM96P78+l0hMcuEOLH4EbkB/46mz/dy9FlXkAvOiZ6/DJRYKhmkEf VrEVKWrNGtzhEv2PbISfLG1crzGYX9xjqyLi3NUAuz5SJxtl+FCCBJYxEeWPaLIAu9bZ 0MxV1Qf29G7mSgPYOLkMyh0FKksTGhCsDqzVc3L0mBWC7sbawfPY06ULAOD7shDg6Qm7 wU68fFE9w1yB/asyIhU3pWqgxyLS3eMaY8C1jrzdsbtACmf+9i0hmhAiQ64h1+vLVmnL 1SQA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc:content-type; bh=mVFWt5bPvnaOXHs3ogs+zwFdXF0r9BSfdx4KjLqftsU=; b=WKG3htKlMwQeMFTdbktfa58RX6in83doZLFzxF1bHBVnCOE7Zui4zcEUBQvIC7vbU2 nZgIacIapARFqC1FRP59u2UCEBKpgOkrsJQGklKt35TOQdIok4a5zf0zL/EL2LSXCt2q XHbnJAkKPsXaXv4J36m++81cs+RUWZMMvB49EMDWmZg+ZuLouFTXzutXSXeTycx/o9WR Tt1yMg0pqgb7BYi/whn7N50Mz8wvtKqiEMq/YSy0cDi3uZJ/A+Z/39XM7sWSoSIPE7N/ 7O6TyJcUtovWabPxyXvr7cNY8PLyxpAbS7i5AFGd8ffnQsMHuJf9/LrCyRfi2mbKCZvk q4BQ== X-Gm-Message-State: AG10YORabuS0b/yyFZxofKtoLqO1MA5kjCdhsETpp59/Y2/LsE6N6mrX7cP4IQj4WKoiGp93Ae3eFYXFfKeF8A== X-Received: by 10.31.56.151 with SMTP id f145mr10163490vka.107.1455873738968; Fri, 19 Feb 2016 01:22:18 -0800 (PST) Original-Received: by 10.176.3.146 with HTTP; Fri, 19 Feb 2016 01:22:18 -0800 (PST) In-Reply-To: <83y4ahru04.fsf@gnu.org> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x 
[generic]
X-Received-From: 2607:f8b0:400c:c05::22b
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:200191
Archived-At:

--001a1143f9480c97a5052c1c0596
Content-Type: text/plain; charset=UTF-8

On 19 February 2016 at 16:20, Eli Zaretskii wrote:

> > From: Lars Ingebrigtsen
> > Date: Fri, 19 Feb 2016 16:11:41 +1100
> >
> > Here's my vote: I think character folding is a good idea, and that it
> > should be turned on by default if it respects the locale.  If not, it
> > should be off by default.
>
> Thanks.  But what does "respect the locale" mean, in practical terms?
> A large portion of the characters that have some decomposition, and
> thus will be folded when searching, belong to scripts that are not
> related to any language or other locale-specific attribute.  What do
> you think should be done with them in the context of this feature?
>

The Unicode character decomposition was never meant to be used to
provide a feature such as character folding in Emacs. But, Unicode
really doesn't provide a good alternative. The standard itself states
that this belongs to the realm of localisation (IIRC, it even goes as
far as mentioning Swedish as a counterexample).

I readily agree that using the decomposition is a clever way to get the
functionality quite a long way, but in the cases where it breaks down,
it does so quite spectacularly, and that's what I (and others) have been
opposing.

My suggestion would be to apply several levels of comparisons:

  1. Check if the characters have locale-specific folding rules (for
     Swedish, this would be no more than 3-5 characters or so). If not:
  2. Check the equivalence according to the Unicode collation charts:
     http://unicode.org/charts/collation/
  3. (maybe) Use the decomposition trick

As for the per-locale exception tables mentioned in point 1, I don't
know if such information is easily available. It may be possible to
extract it from the localedata files from Glibc. But even if it isn't,
creating one for a language should be trivial since we only need a list
of character groups that should _not_ be folded, which for most
languages should be a very small list (in fact, for most(?) it's
probably empty).

Regards,
Elias

--001a1143f9480c97a5052c1c0596
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
On 19 February 2016 at 16:20, Eli Zaretskii <eliz@gnu.org> wrote:

> > From: Lars Ingebrigtsen <larsi@gnus.org>
> > Date: Fri, 19 Feb 2016 16:11:41 +1100
> >
> > Here's my vote: I think character folding is a good idea, and that it
> > should be turned on by default if it respects the locale.  If not, it
> > should be off by default.
>
> Thanks.  But what does "respect the locale" mean, in practical terms?
> A large portion of the characters that have some decomposition, and
> thus will be folded when searching, belong to scripts that are not
> related to any language or other locale-specific attribute.  What do
> you think should be done with them in the context of this feature?

The Unicode character decomposition was never meant to be used to provide a feature such as character folding in Emacs. But, Unicode really doesn't provide a good alternative. The standard itself states that this belongs to the realm of localisation (IIRC, it even goes as far as mentioning Swedish as a counterexample).

I readily agree that using the decomposition is a clever way to get the functionality quite a long way, but in the cases where it breaks down, it does so quite spectacularly, and that's what I (and others) have been opposing.

My suggestion would be to apply several levels of comparisons:

  1. Check if the characters have locale-specific folding rules (for Swedish, this would be no more than 3-5 characters or so). If not:
  2. Check the equivalence according to the Unicode collation charts: http://unicode.org/charts/collation/
  3. (maybe) Use the decomposition trick

As for the per-locale exception tables mentioned in point 1, I don't know if such information is easily available. It may be possible to extract it from the localedata files from Glibc. But even if it isn't, creating one for a language should be trivial since we only need a list of character groups that should _not_ be folded, which for most languages should be a very small list (in fact, for most(?) it's probably empty).
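
The three levels above could be sketched roughly as follows. This is only an illustration in Python, not Emacs Lisp and not anything that exists in Emacs: the names NO_FOLD, COLLATION_EQUIV, strip_marks, fold_char and fold are all hypothetical, the Swedish exception list is an assumption, and the level-2 table is a one-entry hand-written stand-in rather than data taken from the collation charts.

```python
import unicodedata

# Level 1 (hypothetical table): characters that must NOT be folded
# in a given locale. The Swedish entry is an assumption.
NO_FOLD = {
    "sv": set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"),  # å ä ö Å Ä Ö
}

# Level 2 (hypothetical stand-in): a tiny hand-written equivalence
# table. A real one would be derived from Unicode collation data.
COLLATION_EQUIV = {
    "\u00f8": "o",  # ø has no canonical decomposition, so level 3 misses it
}


def strip_marks(ch):
    """Level 3: canonical decomposition with combining marks dropped."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_char(ch, locale):
    """Fold one character, trying the three levels in order."""
    if ch in NO_FOLD.get(locale, set()):
        return ch  # locale says this character is a letter of its own
    if ch in COLLATION_EQUIV:
        return COLLATION_EQUIV[ch]
    return strip_marks(ch)


def fold(text, locale):
    """Fold every character of text for the given locale."""
    return "".join(fold_char(ch, locale) for ch in text)
```

With these tables, folding "Mårtenson" under "sv" leaves the å alone, while under a locale with no exception list (say "en") it becomes "Martenson". The point of putting the exception table first is that it only has to list what a language refuses to fold, so for most locales it can simply be missing.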

Regards,
Elias
--001a1143f9480c97a5052c1c0596--