From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nathan Trapuzzano
Newsgroups: gmane.emacs.bugs
Subject: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 14:31:52 -0400
Message-ID: <8761mwua93.fsf@nbtrap.com>
References: <87txair0g7.fsf@ivytech.edu> <83fvm2fhii.fsf@gnu.org>
 <87ob0qrugy.fsf@nbtrap.com> <83y4ztec5l.fsf@gnu.org>
 <87ob0pnptc.fsf@nbtrap.com> <83d2h5du2e.fsf@gnu.org>
 <87eh1lcdaj.fsf@nbtrap.com> <838urtdpwk.fsf@gnu.org>
 <87ioqxxbtk.fsf@nbtrap.com> <831txkewil.fsf@gnu.org>
In-Reply-To: <831txkewil.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 29 Mar 2014 20:37:38 +0300")
To: Eli Zaretskii
Cc: 17130@debbugs.gnu.org
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"

Eli Zaretskii writes:

>> > So you would need to search all characters to find those which have σ
>> > in the CANONICALIZE slot -- not very efficient, to say the least.
>>
>> Doesn't this already happen?
>
> No, not when that slot is used for case-insensitive search. You just
> use it to get the canonical equivalent, i.e. use the one-way mapping
> that it provides.

I still don't get it. What I say below may explain why.

>> If not, then what is the CANONICALIZE slot doing that couldn't be
>> done with the regular upcase/downcase slots by themselves?
>
> If that slot is "trivial", i.e. contains the lower-case variant of the
> character, then indeed this slot doesn't add information, I think,
> only utility. But it doesn't have to contain the lower-case variant.

I know. But if Emacs doesn't do Unicode folding, what is there other
than lower/upper variants?

>> > IOW, what you suggest will provide a one-way mapping, whereas we need
>> > a two-way mapping.
>>
>> Not sure I follow. Seems to me the CANONICALIZE slot is sufficient, at
>> least in principle.
>
> It is sufficient for mapping a character to its canonical equivalent,
> but not finding the non-canonical variants of a canonical character.
> IOW, it is not well suited to finding ς given just σ.

Finding the non-canonical variants is not something that happens (at
least in principle) during case-insensitive matching. You convert both
the matching string and the string being matched into their canonical
equivalents and see if they match. You never UNfold. Case folding is
by definition a one-way operation.

>> That said, making M-l downcase terminal Σ to ς would be a nice feature
>> that could be enabled, e.g., by enabling a minor mode or by modifying
>> some *-functions variable of functions that get called before the normal
>> behavior of M-l is applied, etc. But it shouldn't have anything to do
>> with Unicode-compliant case-insensitive searching.
>
> For searching, you only need the CANONICALIZE slot. But what about
> replacing the search string while keeping the letter case in the
> replacement? For that, CANONICALIZE alone is not enough, you need the
> reverse mapping.

There is no reverse mapping when it comes to folding. There can't be,
since multiple characters can fold into the same character.

I don't fully understand what "case-replace" does (e.g. case being a
property of characters and not strings, what does it mean to "preserve
case" when replacing a string of length x with a string of length y
where x != y), but I don't think Unicode folding would complicate it.
There are three cases in Unicode: lower, upper, and title. Upper and
title already overlap for the vast majority of codepoints, so there you
already have problems with a case-preserving replace. That said, "fold"
is not a case in Unicode; it's a one-way mapping of non-overlapping sets
of characters to a canonical equivalent, so it makes no sense to talk
about preserving case with respect to case folding.

Notandum: I was wrong about Unicode saying nothing about character
ordering for non-combining characters. The "special casing" document
(ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt) contains
context- and language-dependent case rules for certain characters,
including final sigma. Notably, the document says that Σ in terminal
position should (or "may"--I'm not really sure about how to interpret
the document) downcase to ς. That said, the document has _nothing_ to
do with case _folding_, which is always context- and
language-independent.

Rightly interpreted, therefore, case _conversion_ (such as in
case-preserving replace) and case-insensitive _searching_ (i.e. case
folding), according to Unicode, are orthogonal. We don't have to
address both at the same time.

>> Given the point about ß, you're probably right. Unless we can make
>> entries in the CANONICALIZE slot be strings rather than code points.
>
> This is Lisp; a vector slot can contain any Lisp object. But using
> CANONICALIZE for what you want would be wrong, I think, because it
> will screw up case-insensitive search, which expects to find there a
> single character.

Right, that's what I meant. Putting strings there would break
something.
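
To make the folding-versus-conversion distinction concrete, here is a
minimal sketch. It is in Python rather than Emacs Lisp and says nothing
about how Emacs case tables work; it only assumes that Python's built-in
str.casefold() and str.lower() are reasonable stand-ins for Unicode case
folding and case conversion. The helper name fold_equal is made up for
illustration.

```python
# Illustrative sketch only. Python's str.casefold() stands in for
# Unicode case folding and str.lower() for case conversion.
# `fold_equal` is a hypothetical helper, not anything found in Emacs.


def fold_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison: fold both sides, then compare."""
    return a.casefold() == b.casefold()


# Folding is many-to-one, hence not invertible: Σ, σ and ς all fold to σ.
sigmas = ["\u03a3", "\u03c3", "\u03c2"]
folded = {ch.casefold() for ch in sigmas}

# Lowercasing is context-dependent (final sigma); folding is not.
word = "\u039f\u0394\u039f\u03a3"  # ΟΔΟΣ
lowered = word.lower()             # ends in final sigma ς
folded_word = word.casefold()      # ends in plain sigma σ

# A fold can expand to more than one character: ß folds to "ss".
sharp_s_fold = "\u00df".casefold()

# Matching never needs to "unfold": both strings go through the same
# one-way mapping and the results are compared.
same = fold_equal(lowered, folded_word)
```

Nothing in the comparison ever asks "which characters fold to σ?"; it
only asks what each input folds to. The ß line also shows why a slot
that must hold a single character cannot represent every fold.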