From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [RFC 10/18] Implement Turkic dotless and dotted i handling when casing strings
Date: Tue, 04 Oct 2016 10:12:37 +0300
Message-ID: <83eg3woae2.fsf@gnu.org>
References: <1475543441-10493-1-git-send-email-mina86@mina86.com> <1475543441-10493-10-git-send-email-mina86@mina86.com>
In-reply-to: <1475543441-10493-10-git-send-email-mina86@mina86.com> (message from Michal Nazarewicz on Tue, 4 Oct 2016 03:10:33 +0200)
To: Michal Nazarewicz
Cc: 24603@debbugs.gnu.org

> From: Michal Nazarewicz
> Date: Tue, 4 Oct 2016 03:10:33 +0200
>
> +
> +  /* FIXME: Is current-iso639-language the best source of that information? */
> +  lang = Vcurrent_iso639_language;
> +  tr = intern_c_string ("tr");
> +  az = intern_c_string ("az");
> +  if (SYMBOLP (lang))
> +    {
> +      l = lang;
> +      goto check_language;
> +    }
> +  while (CONSP (lang))
> +    {
> +      l = XCAR (lang);
> +      lang = XCDR (lang);
> +    check_language:
> +      if (EQ (l, tr) || EQ (l, az))
> +        {
> +          ctx->treat_turkic_i = true;
> +          break;
> +        }
> +    }

I'm not sure I like this mechanism.  AFAIU, current-iso639-language is
a read-only variable that conveys the outside locale's language.  So
the above would limit this feature to users in the corresponding
locales, which is against Emacs's design as a multilingual system.  We
should allow Lisp applications and users in _any_ locale to take
advantage of this feature.

So I suggest a separate variable which, when non-nil, will cause these
conversions to take effect.  Lisp applications could then bind that
variable when they want these special conversions.  (With an eye
toward future developments, as hinted by the rest of Unicode's
SpecialCasing.txt file, perhaps don't make the variable's name mention
a specific language, but instead make its value a language symbol,
such as 'tr or 'az.)  We could make it a defcustom, if we think users
will want to turn this on as their default.

> +/* Normalise CFG->flag and return CASE_UP, CASE_DOWN, CASE_CAPITALIZE or
      ^^^^^^^^^
A nit: we use US English spelling, so "Normalize".

> +static enum case_action
> +normalise_flag (struct casing_context *ctx)
   ^^^^^^^^^
Likewise.

> +{
> +  /* Normalise flag so its one of CASE_UP, CASE_DOWN or CASE_CAPITALIZE.  */

This comment repeats what was already said above.

>  /* In Greek, lower case sigma has two forms: one when used in the middle and one
> @@ -152,6 +192,13 @@ case_character_impl (struct casing_str_buf *buf,
>  #define CAPITAL_SIGMA 0x03A3
>  #define SMALL_SIGMA 0x03C3
>  #define SMALL_FINAL_SIGMA 0x03C2
> +
> +/* Azeri and Turkish have dotless and dotted i.  An upper case of i is
> +   İ while lower case of I is ı.  */
> +
> +#define CAPITAL_DOTTED_I 0x130
> +#define SMALL_DOTLESS_I 0x131
> +#define COMBINING_DOT_ABOVE 0x307

How about deriving these rules from SpecialCasing.txt and storing them
in some char-table, instead of hard-coding them in C?  That would
allow us to update these features more easily with each release of the
Unicode Standard.

> +  if (flag != CASE_NO_ACTION && __builtin_expect(ctx->treat_turkic_i, false))

I don't think we can use __builtin_expect here: AFAIK it's a GCC
extension, so it's not portable to other compilers.

> +      if (len_bytes > 0)
> +        src += len_bytes;
> +      size -= len_bytes > 0 ? 2 : 1;

Another nit: please use whitespace consistently in the indentation,
either TABs and spaces throughout, or just spaces.  (I think our
default is the former for now.)

Thanks.
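
P.S. To make the two suggestions above (an explicit language value
instead of the locale, and rules kept as data instead of hard-coded
constants) a bit more concrete, here is a minimal standalone sketch in
plain C.  It is only an illustration under stated assumptions: the
names special_casing_rule, rules and special_case are hypothetical and
are not existing Emacs APIs, and a real implementation would
presumably use a char-table generated from SpecialCasing.txt rather
than a static array.  Only the context-free Turkish and Azeri mappings
are shown; the context-sensitive ones (such as I followed by COMBINING
DOT ABOVE) are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one language-sensitive, context-free
   mapping of the kind listed in SpecialCasing.txt.  A value of -1
   means "this rule does not change that case direction".  */
struct special_casing_rule
{
  const char *lang;   /* ISO 639 language code, e.g. "tr" */
  int from;           /* source code point */
  int lower;          /* special lower-case mapping, or -1 */
  int upper;          /* special upper-case mapping, or -1 */
};

/* Illustrative data only: the dotted/dotless i rules for Turkish
   and Azeri.  */
static const struct special_casing_rule rules[] = {
  { "tr", 0x0049, 0x0131, -1 },   /* I -> dotless i */
  { "tr", 0x0130, 0x0069, -1 },   /* capital dotted I -> i */
  { "tr", 0x0069, -1, 0x0130 },   /* i -> capital dotted I */
  { "az", 0x0049, 0x0131, -1 },
  { "az", 0x0130, 0x0069, -1 },
  { "az", 0x0069, -1, 0x0130 },
};

/* Return the special mapping of code point C for language LANG, or C
   itself when no rule applies, in which case the caller would fall
   back to the regular case table.  LANG may be NULL, meaning "no
   special casing requested".  */
static int
special_case (const char *lang, int c, bool to_upper)
{
  if (lang)
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
      if (rules[i].from == c && strcmp (rules[i].lang, lang) == 0)
        {
          int to = to_upper ? rules[i].upper : rules[i].lower;
          if (to >= 0)
            return to;
        }
  return c;
}
```

With this shape, the language is an explicit argument that a caller
can set in any locale, and supporting another language from
SpecialCasing.txt means adding rows to the table rather than new
constants and branches in the casing code.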