From: Nathan Trapuzzano
Newsgroups: gmane.emacs.bugs
Subject: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 11:29:43 -0400
Message-ID: <87ioqxxbtk.fsf@nbtrap.com>
In-Reply-To: <838urtdpwk.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 29 Mar 2014 17:45:47 +0300")
To: Eli Zaretskii
Cc: 17130@debbugs.gnu.org

Eli Zaretskii writes:

>> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
>> fold to σ.
>
> So you would need to search all characters to find those which have σ
> in the CANONICALIZE slot -- not very efficient, to say the least.

Doesn't this already happen?  If not, then what is the CANONICALIZE slot
doing that couldn't be done with the regular upcase/downcase slots by
themselves?

> IOW, what you suggest will provide a one-way mapping, whereas we need
> a two-way mapping.

Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
least in principle.

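As a rough illustration of the two directions under discussion, here is a
minimal Python sketch.  It is not Emacs's case-table code: the names
CANONICALIZE, EQUIVALENTS, same_under_folding and equivalents are
hypothetical, and the table lists only the three sigmas.  It assumes the
one-way map is inverted once, so that finding every character with σ as
its canonical form is a lookup rather than a scan over all characters.

```python
from collections import defaultdict

# Hypothetical stand-in for a CANONICALIZE slot: each character maps to
# the single character it folds to.  Only the three sigmas are listed.
CANONICALIZE = {"σ": "σ", "ς": "σ", "Σ": "σ"}

# Invert the map once to get the other direction: canonical form -> the
# set of characters that fold to it.
EQUIVALENTS = defaultdict(set)
for char, canon in CANONICALIZE.items():
    EQUIVALENTS[canon].add(char)


def same_under_folding(a, b):
    """Return True if characters a and b share a canonical form."""
    return CANONICALIZE.get(a, a) == CANONICALIZE.get(b, b)


def equivalents(char):
    """Return every character known to fold to the same form as char."""
    canon = CANONICALIZE.get(char, char)
    # Characters missing from the table are equivalent only to themselves.
    return EQUIVALENTS.get(canon, {char})
```

Comparing two characters needs only the forward map; the inverted map
matters when a search has to enumerate all members of an equivalence
set.
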
>> > Besides, don't we also need to know that ς can only be present at the
>> > end of a word?
>>
>> Don't think so.  AFAIK, Unicode says nothing about ordering except when
>> it comes to combining characters.  But even if it did prescribe such a
>> rule, I don't think it would have anything to do with case folding.
>
> Who said this is only about case folding?

I should have said just "case", not "case folding".

> Emacs should use this data for up-casing and down-casing as well, for
> example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> the word.  Wouldn't users of Greek expect that?

Maybe.  I'm just saying that Unicode itself doesn't prescribe or even
recommend such behavior.  It defines case conversions independently of
ordering.

That said, making M-l downcase terminal Σ to ς would be a nice feature
that could be turned on, e.g., by a minor mode or by modifying some
*-functions variable whose functions get called before the normal
behavior of M-l is applied, etc.  But it shouldn't have anything to do
with Unicode-compliant case-insensitive searching.

>> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
>> what is the purpose of the CANONICALIZE slot except as a kind of
>> placeholder that gets autofilled?
>
> Whenever you need the canonical equivalent of a character, such as in
> case-insensitive search, you need that slot.

But there's nothing about the slot that mandates that only _pairs_ can
be case-equivalent under case folding.  Indeed, the manual speaks of
"sets" of characters that might be equivalent under case-folding, hence
my understanding that σ, ς, and Σ can all have σ in their CANONICALIZE
slot, and that's all it would take.  (Btw, I'm using "case-insensitive"
to mean the same as "under case-folding".)

>> Are there other kinds of case folding--other than traditional
>> upper/lower and Unicode--that I'm not aware of?
>
> There's "title case", of course.

I think title case would require an extra slot in the case table.

> There are also characters whose case pair is not a single character,
> but several, like the upper-case variant of ß in German.

Good point.  "ß" should fold to "ss".  I guess for the CANONICALIZE slot
to suffice, it would have to map to a string, not a code point.

> Personally, I think we need an additional slot for what you want, and
> code to use it.

Given the point about ß, you're probably right.  Unless we can make
entries in the CANONICALIZE slot be strings rather than code points.
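
To make the string-valued idea concrete, here is a small Python sketch.
It is an illustration only, not a proposal for how Emacs should store
the data: CANONICALIZE, fold and caseless_equal are hypothetical names,
the table holds just three entries, and characters not in the table fall
back to Python's str.lower() as an assumed default.

```python
# Hypothetical string-valued CANONICALIZE entries: a character may fold
# to a string of more than one character.
CANONICALIZE = {"ß": "ss", "ς": "σ", "Σ": "σ"}


def fold(text):
    """Fold text one character at a time.

    Characters missing from the table fall back to str.lower(), which is
    an assumption made to keep the sketch short.
    """
    return "".join(CANONICALIZE.get(ch, ch.lower()) for ch in text)


def caseless_equal(a, b):
    """Compare two strings after folding both."""
    return fold(a) == fold(b)
```

With these entries, caseless_equal("STRASSE", "straße") and
caseless_equal("ΟΔΟΣ", "οδος") both hold.  One consequence of string
values is that the folded text can be longer than the original, so
positions in the folded text no longer correspond one-to-one to
positions in the source text.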