From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [PATCHv5 05/11] Support casing characters which map into multiple code points (bug#24603)
Date: Tue, 21 Mar 2017 03:09:54 +0100
Organization: http://mina86.com/
References: <20170309215150.9562-1-mina86@mina86.com> <20170309215150.9562-6-mina86@mina86.com> <83d1dodvjm.fsf@gnu.org>
In-Reply-To: <83d1dodvjm.fsf@gnu.org>
To: Eli Zaretskii
Cc: 24603@debbugs.gnu.org
Content-Type: text/plain; charset=utf-8
Xref: news.gmane.org gmane.emacs.bugs:130771

On Sat, Mar 11 2017, Eli Zaretskii wrote:
>> From: Michal Nazarewicz
>> Date: Thu, 9 Mar 2017 22:51:44 +0100
>>
>> Implement unconditional special casing rules defined in Unicode standard.
>>
>> Among other things, they deal with cases when a single code point is
>> replaced by multiple ones because single character does not exist (e.g.
>> ‘ﬁ’ ligature turning into ‘FI’) or is not commonly used (e.g. ß turning
>> into SS).
>>
>> * admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
>> standard distribution.
>> * admin/unidata/README: Mention SpecialCasing.txt.
>>
>> * admin/unidata/unidata-gen.el (unidata-gen-table-special-casing): New
>> function for generating ‘special-casing’ character Unicode property
>> built from the SpecialCasing.txt Unicode data file.
>
> This new property is attainable via get-char-code-property, right?
> If so, it should be documented in the Elisp manual, in the "Character
> Properties" node.
>
> I think I'd also like to see a few simple tests for this property.

Done and done.  I’ve actually split this property into three separate
ones.  Previously, it was the only property that mapped a single
character to multiple values.

>> diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
>> index cf47db4a814..ba1cf2606ce 100644
>> --- a/doc/lispref/strings.texi
>> +++ b/doc/lispref/strings.texi
>> @@ -1166,6 +1166,29 @@ Case Conversion
>>  @end example
>>  @end defun
>>  
>> +  Note that case conversion is not a one-to-one mapping and the length
>> +of the result may differ from the length of the argument (including
>> +being shorter).  Furthermore, because passing a character forces
>> +return type to be a character, functions are unable to perform proper
>> +substitution and result may differ compared to treating
>> +a one-character string.  For example:
>> +
>> +@example
>> +@group
>> +(upcase "ﬁ")  ; note: single character, ligature "fi"
>> +     @result{} "FI"
>> +@end group
>> +@group
>> +(upcase ?ﬁ)
>> +     @result{} 64257  ; i.e. ?ﬁ
>> +@end group
>> +@end example
>> +
>> +  To avoid this, a character must first be converted into a string,
>> +using @code{string} function, before being passed to one of the casing
>> +functions.  Of course, no assumptions on the length of the result may
>> +be made.
>
> Once the ELisp manual describes the new special-casing property, the
> above text should include a cross-reference to that description.

Ah, I actually forgot about that one.  I don’t want to resend the patch,
but I’ll add:

+  Mappings for such special cases are taken from the
+@code{special-uppercase}, @code{special-lowercase} and
+@code{special-titlecase} properties.  @xref{Character Properties}.
+

before submitting.

>>  DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
>>         doc: /* Convert argument to upper case and return that.
>>  The argument may be a character or string.  The result has the same type.
>> -The argument object is not altered--the value is a copy.
>> +The argument object is not altered--the value is a copy.  If argument
>> +is a character, characters which map to multiple code points when
>> +cased, e.g. ﬁ, are returned unchanged.
>>  See also `capitalize', `downcase' and `upcase-initials'.  */)
>
> This (and other similar doc strings) should mention the special-casing
> property as the way to know in advance which characters will remain
> unchanged due to that.

Done.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»
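
For illustration only, here is a minimal sketch of the workaround the
proposed manual text describes: convert the character to a string
before casing it, and consult the new property to learn in advance
whether a bare character would miss its full mapping.  It assumes an
Emacs with this patch series applied, so that the `special-uppercase'
property exists and `upcase' on strings applies the SpecialCasing.txt
rules.  The names `my-upcase-char' and `my-char-has-special-upcase-p'
are hypothetical and not part of the patch.

```elisp
;; Sketch under the assumptions stated above; not code from the patch.

(defun my-upcase-char (ch)
  "Return the upper-case form of character CH as a string.
CH is turned into a one-character string first, so the result
may be longer than one character."
  (upcase (string ch)))

(defun my-char-has-special-upcase-p (ch)
  "Return t if CH has a `special-uppercase' mapping, else nil.
For such characters, (upcase CH) on the bare character cannot
return the full, possibly multi-character, result."
  (and (get-char-code-property ch 'special-uppercase) t))

(my-upcase-char ?a)                 ; "A"
(my-upcase-char ?ﬁ)                 ; "FI", as in the manual example
(my-char-has-special-upcase-p ?ﬁ)   ; t, given the property described above
```

Callers of `my-upcase-char' should not assume anything about the
length of the returned string.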