From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [RFC 08/18] Support casing characters which map into multiple code points
Date: Fri, 07 Oct 2016 10:46:08 +0300
Message-ID: <837f9kk3en.fsf@gnu.org>
References: <1475543441-10493-1-git-send-email-mina86@mina86.com> <1475543441-10493-8-git-send-email-mina86@mina86.com> <838tu4o977.fsf@gnu.org>
To: Michal Nazarewicz
Cc: 24603@debbugs.gnu.org

> From: Michal Nazarewicz
> Cc: 24603@debbugs.gnu.org
> Date: Thu, 06 Oct 2016 23:40:11 +0200
>
> >> +#include "special-casing.h"
> >
> > Why not a shorter 'casing.h'?
>
> It includes data from SpecialCasing.txt only so I figured
> ‘special-casing.h’ would be a more descriptive name.  I can change it to
> ‘casing.h’ if you prefer.

Shorter names are easier to deal with.  Also, the "special" part might
beg the question: where's the "normal" part?

But it's a minor nit, admittedly.  If you feel strongly about your
name, I won't fight that.

> > Once again, this stores the casing rules in C, whereas I'd prefer to
> > have them in tables accessible from Lisp.
>
> There are a few reasons to hard-code the special casing rules in C.
>
> Some of them have conditions (those are implemented in later patches)
> which are non-trivial to encode in Lisp.  Some look backwards
> (e.g. After_Soft_Dotted) and some look forward (e.g. Not_Before_Dot) and
> not necessarily only one character forward (e.g. More_Above).
>
> By hard-coding the implementation, each of the predicates can be handled
> in a custom way such that the code only ever looks at current and one
> character forward.  Not to mention that it is likely faster.
>
> Furthermore, by not having the data in Lisp I can make certain
> assumptions.  For example that a single character will get changed into
> a sequence of at most six bytes.  Having to deal with arbitrary data
> that user may have put in the lisp data would further complicate the
> code, and the flexibility is not worth it.

It doesn't have to be arbitrary Lisp data.
It could be just a set of flags stored in a Lisp structure whose
implementation is in C.

It's IMO okay to have this hard-coded in C, if a Lisp based
implementation would be unreasonably complex and inelegant.  But I
don't see yet that it would be; maybe I'm missing something.  May I
suggest that you try designing this, and if it turns out to be too
cumbersome, come back with the evidence?

> There is also the aspect that not all of the language-dependent rules
> implemented in this patchset are part of Unicode.  Dutch IJ (when
> spelled as separate ASCII characters) is not covered by
> SpecialCasing.txt.

The way we deal with such augmentations is by having most of the data
auto-generated, and some of it maintained manually.  One example is
the current characters.el and the charscript.el it loads.  Can we use
a similar approach in this case?

Experience shows that maintaining everything manually is error-prone
and a huge maintenance headache in the long run, what with a new
version of the Unicode Standard available at least once a year.

> >> @@ -194,7 +276,9 @@ casify_object (enum case_action flag, Lisp_Object obj)
> >>  DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
> >>         doc: /* Convert argument to upper case and return that.
> >>  The argument may be a character or string.  The result has the same type.
> >> -The argument object is not altered--the value is a copy.
> >> +The argument object is not altered--the value is a copy.  If argument
> >> +is a character, characters which map to multiple code points when
> >> +cased, e.g. ﬁ, are returned unchanged.
> >>  See also `capitalize', `downcase' and `upcase-initials'.  */)
> >
> > I think this doc string should say what to do if the application wants
> > to convert ﬁ into "FI".
>
> Perhaps it would be better to describe it in Info page and link that
> from the docstrings?

Fine with me.

Thanks.
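
For concreteness, here is a minimal sketch of the "set of flags" idea
discussed above: the rules live in a data table, and the conditions
named in SpecialCasing.txt become flag bits that a small lookup
checks.  All identifiers below (casing_condition, special_casing_rule,
find_rule, special_upcase, special_downcase) are hypothetical and are
not taken from the patch series.  Only three well-known entries are
shown, and how such a table would be generated or exposed to Lisp is
left open.

```c
#include <stddef.h>

/* Hypothetical flag bits, named after conditions in SpecialCasing.txt.  */
enum casing_condition
{
  COND_NONE = 0,
  COND_FINAL_SIGMA = 1 << 0,
  COND_AFTER_SOFT_DOTTED = 1 << 1,
  COND_MORE_ABOVE = 1 << 2,
  COND_NOT_BEFORE_DOT = 1 << 3
};

/* One rule: a code point, the conditions that must all hold, and the
   UTF-8 replacement for each direction (NULL means no special mapping).  */
struct special_casing_rule
{
  unsigned int code_point;
  unsigned int conditions;
  const char *upper;
  const char *lower;
};

/* Illustrative subset only; a real table would be auto-generated.  */
static const struct special_casing_rule special_casing_rules[] = {
  { 0x00DF, COND_NONE, "SS", NULL },               /* sharp s */
  { 0xFB01, COND_NONE, "FI", NULL },               /* fi ligature */
  { 0x03A3, COND_FINAL_SIGMA, NULL, "\xCF\x82" },  /* sigma -> final sigma */
};

/* Return the first rule for C whose conditions are all set in CONTEXT.  */
const struct special_casing_rule *
find_rule (unsigned int c, unsigned int context)
{
  size_t n = sizeof special_casing_rules / sizeof special_casing_rules[0];
  for (size_t i = 0; i < n; i++)
    {
      const struct special_casing_rule *r = &special_casing_rules[i];
      if (r->code_point == c && (r->conditions & context) == r->conditions)
        return r;
    }
  return NULL;
}

/* Return the special upper-case mapping of C, or NULL if there is none.  */
const char *
special_upcase (unsigned int c, unsigned int context)
{
  const struct special_casing_rule *r = find_rule (c, context);
  return r ? r->upper : NULL;
}

/* Return the special lower-case mapping of C, or NULL if there is none.  */
const char *
special_downcase (unsigned int c, unsigned int context)
{
  const struct special_casing_rule *r = find_rule (c, context);
  return r ? r->lower : NULL;
}
```

A caller would compute the context bits from the surrounding text and
fall back to the ordinary one-to-one case table whenever the lookup
returns NULL.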