From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [PATCHv5 10/11] Implement casing rules for Lithuanian (bug#24603)
Date: Thu, 9 Mar 2017 22:51:49 +0100
Message-ID: <20170309215150.9562-11-mina86@mina86.com>
References: <20170309215150.9562-1-mina86@mina86.com>
In-Reply-To: <20170309215150.9562-1-mina86@mina86.com>
To: 24603@debbugs.gnu.org, eliz@gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Mailer: git-send-email 2.12.0.246.ga2ecc84866-goog
Xref: news.gmane.org gmane.emacs.bugs:130405

In Lithuanian, the tittle above lower case i and j is retained even if
other diacritics are present above the letter.  For that to work, an
explicit combining dot above must be added after i and j; otherwise the
rendering engine will remove the tittle.
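To illustrate the rule, here is a minimal sketch that works on plain
arrays of code points rather than on Emacs buffers.  It is not the code
from this patch: the names lt_downcase, lt_upcase and is_accent_above and
the enum constants are hypothetical, characters other than I, J and
I-ogonek are left untouched, and the precomposed letters with grave,
acute or tilde are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points used by the rule; the names are local to this sketch.  */
enum {
  GRAVE = 0x300, ACUTE = 0x301, TILDE = 0x303,
  DOT_ABOVE = 0x307, OGONEK = 0x328,
  CAP_I_OGONEK = 0x12E, SMALL_I_OGONEK = 0x12F
};

static int
is_accent_above (uint32_t c)
{
  return c == GRAVE || c == ACUTE || c == TILDE;
}

/* Lower-case I, J and I-ogonek in SRC (N code points) into DST and
   insert a combining dot above before the first accent above that
   follows them.  Return the number of code points written.  */
size_t
lt_downcase (const uint32_t *src, size_t n, uint32_t *dst, size_t cap)
{
  size_t len = 0;
  int pending = 0;  /* Set right after lower-casing I, J or I-ogonek.  */
  assert (cap >= 2 * n);  /* At most one dot is added per code point.  */
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = src[i];
      if (pending && is_accent_above (c))
        {
          dst[len++] = DOT_ABOVE;
          pending = 0;
        }
      else if (pending && c != OGONEK)
        pending = 0;  /* An ogonek in between does not end the lookout.  */

      if (c == 'I' || c == 'J')
        {
          c += 'a' - 'A';
          pending = 1;
        }
      else if (c == CAP_I_OGONEK)
        {
          c = SMALL_I_OGONEK;
          pending = 1;
        }
      dst[len++] = c;
    }
  return len;
}

/* Upper-case i, j and i-ogonek in SRC into DST and drop a combining dot
   above that follows them, unless another accent above comes first.  */
size_t
lt_upcase (const uint32_t *src, size_t n, uint32_t *dst, size_t cap)
{
  size_t len = 0;
  int pending = 0;  /* Set right after upper-casing i, j or i-ogonek.  */
  assert (cap >= n);
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = src[i];
      if (pending)
        {
          if (c == DOT_ABOVE)
            {
              pending = 0;
              continue;  /* The explicit tittle is removed.  */
            }
          if (c != OGONEK)
            pending = 0;
        }
      if (c == 'i' || c == 'j')
        {
          c -= 'a' - 'A';
          pending = 1;
        }
      else if (c == SMALL_I_OGONEK)
        {
          c = CAP_I_OGONEK;
          pending = 1;
        }
      dst[len++] = c;
    }
  return len;
}
```

With this sketch, lower-casing I + U+0328 + U+0300 gives i + U+0328 +
U+0307 + U+0300, and upper-casing i + U+0328 + U+0307 gives I + U+0328,
which matches the expectations in the tests added below.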

* src/casefiddle.c (struct casing_context, prepare_casing_context): Add
SPECIAL_LT, SPECIAL_LT_DEL_DOT_ABOVE and SPECIAL_LT_INS_DOT_ABOVE special
flag values for handling Lithuanian.  Set the flag to SPECIAL_LT if buffer
is in Lithuanian.
(maybe_case_lithuanian): New function which implements Lithuanian rules.
(case_characters): Make use of maybe_case_lithuanian.

* test/src/casefiddle-tests.el (casefiddle-tests-casing): Add test cases
for Lithuanian rules.
---
 src/casefiddle.c             | 178 ++++++++++++++++++++++++++++++++++++++++---
 test/src/casefiddle-tests.el |  27 ++++++-
 2 files changed, 195 insertions(+), 10 deletions(-)

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 4785ebaddc4..a33bac7d21e 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -77,7 +77,15 @@ struct casing_context {
     SPECIAL_NL_UPCASE_J,
 
     /* Handle Azerbaijani and Turkish dotted and dotless i.  */
-    SPECIAL_TR
+    SPECIAL_TR,
+
+    /* Apply Lithuanian rules for i’s and j’s tittle.  */
+    SPECIAL_LT,
+    /* As above plus look out for combining dot above to delete.  */
+    SPECIAL_LT_DEL_DOT_ABOVE,
+    /* As above plus look out for diacritics combining above because
+       we may need to inject dot above before them.  */
+    SPECIAL_LT_INS_DOT_ABOVE,
   } special;
 };
 
@@ -116,6 +124,9 @@ prepare_casing_context (struct casing_context *ctx,
         case ('t' << 8) | 'r':  /* Turkish */
         case ('a' << 8) | 'z':  /* Azerbaijani */
           ctx->special = SPECIAL_TR;
+          break;
+        case ('l' << 8) | 't':  /* Lithuanian */
+          ctx->special = SPECIAL_LT;
         }
     }
 
@@ -362,6 +373,142 @@ maybe_case_turkic (struct casing_str_buf *buf, struct casing_context *ctx,
   return ch == cased ? RES_NO_CHANGE : RES_CHANGED;
 }
 
+/* Lithuanian retains tittle in lower case i and j when there are more
+   accents above those letters.  */
+
+#define CAPITAL_I_WITH_GRAVE  0x0CC
+#define CAPITAL_I_WITH_ACUTE  0x0CD
+#define CAPITAL_I_WITH_TILDE  0x128
+#define CAPITAL_I_WITH_OGONEK 0x12E
+#define SMALL_I_WITH_OGONEK   0x12F
+#define COMBINING_GRAVE_ABOVE 0x300
+#define COMBINING_ACUTE_ABOVE 0x301
+#define COMBINING_TILDE_ABOVE 0x303
+#define COMBINING_OGONEK      0x328
+
+/* Save in BUF result of casing character CH if Lithuanian casing rules apply.
+
+   Some rules depend on preceding state, which is tracked in CTX->special: it
+   records whether a combining dot above has to be deleted or inserted after
+   the i or j which has just been cased.
+
+   FLAG is a normalised flag (as returned by normalise_flag function).
+
+   Return -2 (RES_NOT_TOUCHED) if Lithuanian rules did not apply, no changes
+   were made and other casing rules should be tried.  Otherwise, meaning of
+   return values is the same as in case_characters function.  */
+static int
+maybe_case_lithuanian (struct casing_str_buf *buf, struct casing_context *ctx,
+                       enum case_action flag, int ch)
+{
+  switch (ctx->special) {
+  case SPECIAL_LT:
+    break;
+
+  case SPECIAL_LT_DEL_DOT_ABOVE:
+    /* When upper-casing i or j, a combining dot above that follows it must be
+       removed.  This is true even if there’s a combining ogonek in between.
+       But, if there’s another character combining above in between, combining
+       dot needs to stay (since the dot will be rendered above the other
+       diacritic).  */
+    switch (ch) {
+    case COMBINING_DOT_ABOVE:
+      buf->len_chars = buf->len_bytes = 0;
+      ctx->special = SPECIAL_LT;
+      return RES_CHANGED;
+    case COMBINING_GRAVE_ABOVE:
+    case COMBINING_ACUTE_ABOVE:
+    case COMBINING_TILDE_ABOVE:
+      ctx->special = SPECIAL_LT;
+      return RES_NOT_TOUCHED;
+    case COMBINING_OGONEK:
+      return RES_NOT_TOUCHED;
+    default:
+      ctx->special = SPECIAL_LT;
+    }
+    break;
+
+  case SPECIAL_LT_INS_DOT_ABOVE:
+    /* When lower-casing I or J, if the letter has any accents above,
+       a combining dot above must be added before them.  If we are here, it
+       means that we have lower cased I or J and we’re now on the lookout for
+       accents combining above.  */
+    switch (ch) {
+    case COMBINING_GRAVE_ABOVE:
+    case COMBINING_ACUTE_ABOVE:
+    case COMBINING_TILDE_ABOVE:
+      buf->len_chars = 2;
+      buf->len_bytes = CHAR_STRING (COMBINING_DOT_ABOVE, buf->data);
+      buf->len_bytes += CHAR_STRING (ch, buf->data + buf->len_bytes);
+      ctx->special = SPECIAL_LT;
+      return RES_CHANGED;
+    case COMBINING_OGONEK:
+      return RES_NOT_TOUCHED;
+    default:
+      ctx->special = SPECIAL_LT;
+    }
+    break;
+
+  default:
+    return RES_NOT_TOUCHED;
+  }
+
+  switch (flag) {
+  case CASE_UP:
+  case CASE_CAPITALIZE:
+    if (ch == 'i' || ch == 'j')
+      {
+        buf->data[0] = ch ^ ('i' ^ 'I');
+        buf->len_bytes = 1;
+      }
+    else if (ch == SMALL_I_WITH_OGONEK)
+      buf->len_bytes = CHAR_STRING (CAPITAL_I_WITH_OGONEK, buf->data);
+    else
+      break;
+    buf->len_chars = 1;
+    /* Change the state so we’re on the lookout for combining dot above.  */
+    ctx->special = SPECIAL_LT_DEL_DOT_ABOVE;
+    return RES_CHANGED;
+
+  case CASE_DOWN:
+    /* Turning I or J to lower case requires combining dot above to be included
+       IF there are any other characters combining above present.  This is so
+       that the tittle is preserved.  */
+    switch (ch) {
+    case CAPITAL_I_WITH_GRAVE:
+      ch = 0x80;  /* U+300, "\xCC\x80", combining grave accent */
+      goto has_accent;
+    case CAPITAL_I_WITH_ACUTE:
+      ch = 0x81;  /* U+301, "\xCC\x81", combining acute accent */
+      goto has_accent;
+    case CAPITAL_I_WITH_TILDE:
+      ch = 0x83;  /* U+303, "\xCC\x83", combining tilde */
+    has_accent:
+      memcpy (buf->data, "i\xCC\x87\xCC", 4);
+      buf->data[4] = ch;
+      buf->len_chars = 3;
+      buf->len_bytes = 5;
+      return RES_CHANGED;
+
+    case 'I':
+    case 'J':
+      buf->data[0] = ch ^ ('i' ^ 'I');
+      buf->len_bytes = 1;
+      if (false)
+    case CAPITAL_I_WITH_OGONEK:
+        buf->len_bytes = CHAR_STRING (SMALL_I_WITH_OGONEK, buf->data);
+      buf->len_chars = 1;
+      /* Change the state so we’re on the lookout for diacritics combining
+         above.  If one is found, we need to add combining dot above.  */
+      ctx->special = SPECIAL_LT_INS_DOT_ABOVE;
+      return RES_CHANGED;
+    }
+    break;
+  }
+
+  return RES_NOT_TOUCHED;
+}
+
 /* Save in BUF result of casing character CH.
 
    If not-NULL, NEXT points to the next character in the cased string.  If NULL,
@@ -381,17 +528,30 @@ case_characters (struct casing_str_buf *buf, struct casing_context *ctx,
                  int ch, const unsigned char *next)
 {
   enum case_action flag = normalise_flag (ctx);
-  int ret;
+  int ret = RES_NOT_TOUCHED;
+
+  switch (ctx->special) {
+  case SPECIAL_NONE:
+    break;
+
+  case SPECIAL_TR:
+    ret = maybe_case_turkic (buf, ctx, flag, ch, next);
+    break;
+
+  default:
+    /* case SPECIAL_LT: */
+    /* case SPECIAL_LT_DEL_DOT_ABOVE: */
+    /* case SPECIAL_LT_INS_DOT_ABOVE: */
+    ret = maybe_case_lithuanian (buf, ctx, flag, ch);
+  }
 
-  ret = maybe_case_turkic (buf, ctx, flag, ch, next);
-  if (ret != RES_NOT_TOUCHED)
-    return ret;
+  if (ret == RES_NOT_TOUCHED)
+    ret = maybe_case_greek (buf, ctx, flag, ch, next);
 
-  ret = maybe_case_greek (buf, ctx, flag, ch, next);
-  if (ret != RES_NOT_TOUCHED)
-    return ret;
+  if (ret == RES_NOT_TOUCHED)
+    ret = case_character_impl (buf, ctx, flag, ch);
 
-  return case_character_impl (buf, ctx, flag, ch);
+  return ret;
 }
 
 static Lisp_Object
diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el
index ce1bb18dd40..f7b0da41029 100644
--- a/test/src/casefiddle-tests.el
+++ b/test/src/casefiddle-tests.el
@@ -241,7 +241,32 @@ casefiddle-tests--test-casing
                 ("I\u0307si\u0307s" "I\u0307Sİ\u0307S" "isi\u0307s"
                  "I\u0307si\u0307s" "I\u0307si\u0307s" "tr")
                 ("I\u0307sI\u0307s" "I\u0307SI\u0307S" "isis"
-                 "I\u0307sis" "I\u0307sI\u0307s" "tr"))))))
+                 "I\u0307sis" "I\u0307sI\u0307s" "tr")
+
+                ;; Test combining dot above is inserted when needed when lower
+                ;; casing I or J.
+                ("I\u0328\u0300"              ; I + ogonek + grave
+                 "I\u0328\u0300" "i\u0328\u0307\u0300"
+                 "I\u0328\u0300" "I\u0328\u0300" "lt")
+
+                ("J\u0328\u0300"              ; J + ogonek + grave
+                 "J\u0328\u0300" "j\u0328\u0307\u0300"
+                 "J\u0328\u0300" "J\u0328\u0300" "lt")
+
+                ("Į\u0300"                    ; I-ogonek + grave
+                 "Į\u0300" "į\u0307\u0300" "Į\u0300" "Į\u0300" "lt")
+
+                ("Ì Í Ĩ"
+                 "Ì Í Ĩ" "i\u0307\u0300 i\u0307\u0301 i\u0307\u0303"
+                 "Ì Í Ĩ" "Ì Í Ĩ" "lt")
+
+                ;; Test combining dot above is removed when upper casing i or j.
+                ("i\u0328\u0307"              ; i + ogonek + dot above
+                 "I\u0328" "i\u0328\u0307" "I\u0328" "I\u0328" "lt")
+                ("j\u0328\u0307"              ; j + ogonek + dot above
+                 "J\u0328" "j\u0328\u0307" "J\u0328" "J\u0328" "lt")
+                ("į\u0307"                    ; i-ogonek + dot above
+                 "Į" "į\u0307" "Į" "Į" "lt"))))))
 
 (ert-deftest casefiddle-tests-casing-byte8 ()
   (should-not
-- 
2.12.0.246.ga2ecc84866-goog
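
A note on the hard-coded bytes in the CASE_DOWN branch of
maybe_case_lithuanian: the literal "i\xCC\x87\xCC" followed by 0x80,
0x81 or 0x83 is the UTF-8 encoding of i + U+0307 + U+0300, U+0301 or
U+0303.  The sketch below derives the same bytes from the code points.
It is only an illustration: utf8_encode_short and lt_small_i_with_accent
are hypothetical helpers, not Emacs functions, and the encoder assumes
code points below U+0800.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-8 into OUT and return the number of bytes
   written.  Only the one-byte and two-byte forms are covered.  */
size_t
utf8_encode_short (uint32_t cp, unsigned char *out)
{
  assert (cp < 0x800);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (cp >> 6));    /* 110xxxxx */
  out[1] = (unsigned char) (0x80 | (cp & 0x3F));  /* 10xxxxxx */
  return 2;
}

/* Write i + combining dot above + ACCENT into OUT (at least 5 bytes)
   and return the length in bytes.  ACCENT must be below U+0800.  */
size_t
lt_small_i_with_accent (uint32_t accent, unsigned char *out)
{
  size_t len = 0;
  len += utf8_encode_short ('i', out + len);
  len += utf8_encode_short (0x307, out + len);  /* combining dot above */
  len += utf8_encode_short (accent, out + len);
  return len;
}
```

For the three accents handled by the patch, the first four bytes are
always 69 CC 87 CC and only the fifth byte differs, which is why the
patch can copy a fixed four-byte prefix and store the last byte
separately.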