From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: On language-dependent defaults for character-folding
Date: Tue, 23 Feb 2016 18:56:36 +0200
Message-ID: <83ziuricvv.fsf@gnu.org>
Reply-To: Eli Zaretskii
To: Achim Gratz
Cc: emacs-devel@gnu.org
In-reply-to: <87egc68opx.fsf@Rainer.invalid> (message from Achim Gratz on Sun, 21 Feb 2016 09:14:18 +0100)

> From: Achim Gratz
> Date: Sun, 21 Feb 2016 09:14:18 +0100
>
> Elias Mårtenson writes:
> > Because under the Unicode decomposition rules, ø is not decomposable. I
> > can't explain why that is the case (probably because there is no reason to
> > have a combining /. After all, the only languages that use ø are languages
> > that use it as a character of its own).
>
> AFAIK, for combining characters to be composable/decomposable the glyphs
> must not overlap. This is the same issue as with the polish »ł« to the
> best of my knowledge.

The definitive answer is here, for those interested:

  http://www.unicode.org/mail-arch/unicode-ml/y2016-m02/0106.html

> In other words, unicode composition/decomposition rules tell you more
> about the glyph construction than they do about useful strategies to
> search for multiple characters.

That conclusion is too radical, IMO.
You will see in the above message that the criterion you describe was
just a means for the UTC to draw a line somewhere, i.e. it was an
ad-hoc rule more than anything else.

> The idea of using the base character of the canonical decomposition
> in the search might still yield a useful shortcut in most cases, but
> I'm not sure it is correct in all languages even when that
> decomposition exists and, as the examples show, there are cases
> where the non-decomposed character has to be treated specially.

Language-specific tailoring is indeed needed for best results, but the
language-independent decompositions have their place. E.g., you will
see in the data files of the Unicode Collation Algorithm (UCA) a file
named decomps.txt that is basically a list of the decompositions from
UnicodeData.txt, with additions made specifically for collation,
searching, and matching (including ł, btw). This tells me that the
decomposition data in UnicodeData.txt is a good basis for these
features; it is not just about glyph construction.
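
To make the point above concrete, here is a minimal sketch (in Python rather than Emacs Lisp, purely for illustration) of folding by the base character of the canonical decomposition, supplemented by a small table for characters that UnicodeData.txt does not decompose. The names EXTRA_FOLDS, fold_char and fold are hypothetical, and the table entries are my own illustrative assumptions; they are not copied from decomps.txt and this is not how Emacs implements character folding.

```python
import unicodedata

# Hypothetical supplementary folds for characters that have no
# decomposition in UnicodeData.txt. These entries are illustrative
# assumptions, not data taken from decomps.txt.
EXTRA_FOLDS = {"ø": "o", "Ø": "O", "ł": "l", "Ł": "L"}


def fold_char(ch: str) -> str:
    """Fold one character to its base character, if it has one."""
    if ch in EXTRA_FOLDS:
        return EXTRA_FOLDS[ch]
    # Canonical decomposition (NFD), e.g. "é" -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", ch)
    # Keep only characters whose canonical combining class is 0,
    # i.e. drop the combining marks.
    base = "".join(c for c in decomposed if unicodedata.combining(c) == 0)
    # A lone combining mark would fold to nothing; keep it unchanged.
    return base or ch


def fold(text: str) -> str:
    """Fold every character of text for loose searching and matching."""
    return "".join(fold_char(ch) for ch in text)
```

With this, fold("Mårtenson") gives "Martenson" from the language-independent decomposition alone, while "złoty" folds to "zloty" only because of the supplementary table. Language-specific tailoring (e.g. not folding ø for Danish or Norwegian users) would be a further layer on top of this.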