From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [PATCHv5 05/11] Support casing characters which map into multiple code points (bug#24603)
Date: Sat, 11 Mar 2017 11:14:53 +0200
Message-ID: <83d1dodvjm.fsf@gnu.org>
References: <20170309215150.9562-1-mina86@mina86.com> <20170309215150.9562-6-mina86@mina86.com>
In-reply-to: <20170309215150.9562-6-mina86@mina86.com> (message from Michal Nazarewicz on Thu, 9 Mar 2017 22:51:44 +0100)
Reply-To: Eli Zaretskii
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: 24603@debbugs.gnu.org
To: Michal Nazarewicz
X-GNU-PR-Message: followup 24603
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: patch
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.org gmane.emacs.bugs:130469

> From: Michal Nazarewicz
> Date: Thu, 9 Mar 2017 22:51:44 +0100
>
> Implement unconditional special casing rules defined in Unicode standard.
>
> Among other things, they deal with cases when a single code point is
> replaced by multiple ones because single character does not exist (e.g.
> ‘ﬁ’ ligature turning into ‘FI’) or is not commonly used (e.g. ß turning
> into SS).
>
> * admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
> standard distribution.
> * admin/unidata/README: Mention SpecialCasing.txt.
>
> * admin/unidata/unidata-gen.el (unidata-gen-table-special-casing): New
> function for generating ‘special-casing’ character Unicode property
> built from the SpecialCasing.txt Unicode data file.

This new property is accessible via get-char-code-property, right?  If
so, it should be documented in the Elisp manual, in the "Character
Properties" node.

I think I'd also like to see a few simple tests for this property.

> diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
> index cf47db4a814..ba1cf2606ce 100644
> --- a/doc/lispref/strings.texi
> +++ b/doc/lispref/strings.texi
> @@ -1166,6 +1166,29 @@ Case Conversion
>  @end example
>  @end defun
>
> +  Note that case conversion is not a one-to-one mapping and the length
> +of the result may differ from the length of the argument (including
> +being shorter).  Furthermore, because passing a character forces
> +return type to be a character, functions are unable to perform proper
> +substitution and result may differ compared to treating
> +a one-character string.  For example:
> +
> +@example
> +@group
> +(upcase "ﬁ")  ; note: single character, ligature "fi"
> +     @result{} "FI"
> +@end group
> +@group
> +(upcase ?ﬁ)
> +     @result{} 64257  ; i.e. ?ﬁ
> +@end group
> +@end example
> +
> +  To avoid this, a character must first be converted into a string,
> +using @code{string} function, before being passed to one of the casing
> +functions.  Of course, no assumptions on the length of the result may
> +be made.

Once the ELisp manual describes the new special-casing property, the
above text should include a cross-reference to that description.

>  DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
>         doc: /* Convert argument to upper case and return that.
>  The argument may be a character or string.  The result has the same type.
> -The argument object is not altered--the value is a copy.
> +The argument object is not altered--the value is a copy.  If argument
> +is a character, characters which map to multiple code points when
> +cased, e.g. ﬁ, are returned unchanged.
>  See also `capitalize', `downcase' and `upcase-initials'.  */)

This (and other similar doc strings) should mention the special-casing
property as the way to know in advance which characters will remain
unchanged due to that.
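
For readers without a patched Emacs at hand, the string-versus-character
difference discussed above can be sketched outside Emacs.  The following
is only an illustration in Python, not the Emacs implementation: it
relies on Python's str.upper applying Unicode full case mappings, and
the helper names full_upcase and simple_upcase_char are hypothetical,
chosen here to mimic the string and character forms of `upcase'.

```python
LIGATURE_FI = "\ufb01"  # LATIN SMALL LIGATURE FI, code point 64257
SHARP_S = "\u00df"      # LATIN SMALL LETTER SHARP S


def full_upcase(text: str) -> str:
    """String form: full case mapping, so the result may change length."""
    return text.upper()


def simple_upcase_char(ch: str) -> str:
    """Character form (assumed behaviour): one code point in, one out.

    If the upper-case form needs several code points, the input is
    returned unchanged, as the proposed doc string describes.
    """
    upper = ch.upper()
    return upper if len(upper) == 1 else ch


if __name__ == "__main__":
    print(full_upcase(LIGATURE_FI))              # two code points
    print(ord(simple_upcase_char(LIGATURE_FI)))  # still the ligature
    print(full_upcase(SHARP_S))
```

The character-returning helper cannot grow its result, so it is
converting to a string first that lets the multi-code-point mapping
apply, which is the advice the proposed manual text gives.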