From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [RFC 11/18] Implement casing rules for Lithuanian
Date: Tue, 4 Oct 2016 03:10:34 +0200
Message-ID: <1475543441-10493-11-git-send-email-mina86@mina86.com>
References: <1475543441-10493-1-git-send-email-mina86@mina86.com>
In-Reply-To: <1475543441-10493-1-git-send-email-mina86@mina86.com>
To: 24603@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.org gmane.emacs.bugs:124005

In Lithuanian, the tittle above lower-case i and j is retained even
when other diacritics are present above the letter.  For that to work,
an explicit combining dot above must be added after the i or j;
otherwise the rendering engine will remove the tittle.

* src/casefiddle.c (struct casing_context, prepare_casing_context): Add
lithuanian_tittle member to hold state of Lithuanian rules handling.
(case_lithuanian): New function which implements Lithuanian rules.
(case_characters): Make use of case_lithuanian.

* test/src/casefiddle-tests.el (casefiddle-tests-casing): Add test cases
for Lithuanian rules.
---
 src/casefiddle.c             | 149 +++++++++++++++++++++++++++++++++++++++++--
 test/src/casefiddle-tests.el |  27 +++++++-
 2 files changed, 170 insertions(+), 6 deletions(-)

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2a7aa64..0377fe6 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -56,6 +56,16 @@ struct casing_context {
   bool inword;
   /* Whether to apply Azeri/Turkish rules for dotted and dotless i. */
   bool treat_turkic_i;
+
+  /* Whether to use Lithuanian rules for i’s and j’s tittle. */
+  unsigned char lithuanian_tittle;
+#define LT_OFF 0  /* No */
+#define LT_ON 1  /* Yes */
+#define LT_DEL_DOT_ABOVE 2  /* Yes and look out for combining dot above to
+                               delete. */
+#define LT_INS_DOT_ABOVE 3  /* Yes and look out for diacritics combining above
+                               because we may need to inject dot above before
+                               them. */
 };
 
 /* Initialise CTX structure and prepares related global data for casing
@@ -64,7 +74,7 @@ static void
 prepare_casing_context (struct casing_context *ctx,
                         enum case_action flag, bool inbuffer)
 {
-  Lisp_Object lang, l, tr, az;
+  Lisp_Object lang, l, tr, az, lt;
 
   ctx->flag = flag;
   ctx->inbuffer = inbuffer;
@@ -74,6 +84,7 @@ prepare_casing_context (struct casing_context *ctx,
     : Qnil;
 
   ctx->treat_turkic_i = false;
+  ctx->lithuanian_tittle = LT_OFF;
 
   /* If the case table is flagged as modified, rescan it. */
   if (NILP (XCHAR_TABLE (BVAR (current_buffer, downcase_table))->extras[1]))
@@ -86,6 +97,7 @@ prepare_casing_context (struct casing_context *ctx,
   lang = Vcurrent_iso639_language;
   tr = intern_c_string ("tr");
   az = intern_c_string ("az");
+  lt = intern_c_string ("lt");
   if (SYMBOLP (lang))
     {
       l = lang;
@@ -97,10 +109,9 @@ prepare_casing_context (struct casing_context *ctx,
           lang = XCDR (lang);
         check_language:
           if (EQ (l, tr) || EQ (l, az))
-            {
-              ctx->treat_turkic_i = true;
-              break;
-            }
+            ctx->treat_turkic_i = true;
+          else if (EQ (l, lt))
+            ctx->lithuanian_tittle = LT_ON;
         }
     }
 
@@ -199,6 +210,131 @@ case_character_impl (struct casing_str_buf *buf,
 #define CAPITAL_DOTTED_I 0x130
 #define SMALL_DOTLESS_I 0x131
 #define COMBINING_DOT_ABOVE 0x307
+
+/* Lithuanian retains tittle in lower case i and j when there are more
+   accents above those letters. */
+
+#define CAPITAL_I_WITH_GRAVE 0x0CC
+#define CAPITAL_I_WITH_ACUTE 0x0CD
+#define CAPITAL_I_WITH_TILDE 0x128
+#define CAPITAL_I_WITH_OGONEK 0x12E
+#define SMALL_I_WITH_OGONEK 0x12F
+#define COMBINING_GRAVE_ABOVE 0x300
+#define COMBINING_ACUTE_ABOVE 0x301
+#define COMBINING_TILDE_ABOVE 0x303
+#define COMBINING_OGONEK 0x328
+
+/* Attempt to case CH using rules for Lithuanian i and j. Return true if
+   character has been cased (in which case it’s saved in BUF), false otherwise.
+   If CTX->lithuanian_tittle is LT_OFF, return false. */
+static bool
+case_lithuanian (struct casing_str_buf *buf, struct casing_context *ctx,
+                 enum case_action flag, int ch)
+{
+  switch (__builtin_expect(ctx->lithuanian_tittle, LT_OFF)) {
+  case LT_OFF:
+    return false;
+
+  case LT_DEL_DOT_ABOVE:
+    /* When upper-casing i or j, a combining dot above that follows it must be
+       removed. This is true even if there’s a combining ogonek in between.
+       But, if there’s another character combining above in between, combining
+       dot needs to stay (since the dot will be rendered above the other
+       diacritic). */
+    switch (ch) {
+    case COMBINING_DOT_ABOVE:
+      buf->len_chars = buf->len_bytes = 0;
+      ctx->lithuanian_tittle = LT_ON;
+      return true;
+    case COMBINING_GRAVE_ABOVE:
+    case COMBINING_ACUTE_ABOVE:
+    case COMBINING_TILDE_ABOVE:
+      ctx->lithuanian_tittle = LT_ON;
+      return false;
+    case COMBINING_OGONEK:
+      return false;
+    default:
+      ctx->lithuanian_tittle = LT_ON;
+    }
+    break;
+
+  case LT_INS_DOT_ABOVE:
+    /* When lower-casing I or J, if the letter has any accents above,
+       a combining dot above must be added before them. If we are here, it
+       means that we have lower cased I or J and we’re now on the lookout for
+       accents combining above. */
+    switch (ch) {
+    case COMBINING_GRAVE_ABOVE:
+    case COMBINING_ACUTE_ABOVE:
+    case COMBINING_TILDE_ABOVE:
+      buf->len_chars = 2;
+      buf->len_bytes = CHAR_STRING (COMBINING_DOT_ABOVE, buf->data);
+      buf->len_bytes += CHAR_STRING (ch, buf->data + buf->len_bytes);
+      ctx->lithuanian_tittle = LT_ON;
+      return true;
+    case COMBINING_OGONEK:
+      return false;
+    default:
+      ctx->lithuanian_tittle = LT_ON;
+    }
+    break;
+  }
+
+  switch (flag) {
+  case CASE_UP:
+  case CASE_CAPITALIZE:
+    if (ch == 'i' || ch == 'j')
+      {
+        buf->data[0] = ch ^ ('i' ^ 'I');
+        buf->len_bytes = 1;
+      }
+    else if (ch == SMALL_I_WITH_OGONEK)
+      buf->len_bytes = CHAR_STRING (CAPITAL_I_WITH_OGONEK, buf->data);
+    else
+      break;
+    buf->len_chars = 1;
+    /* Change the state so we’re on the lookout for combining dot above. */
+    ctx->lithuanian_tittle = LT_DEL_DOT_ABOVE;
+    return true;
+
+  case CASE_DOWN:
+    /* Turning I or J to lower case requires combining dot above to be included
+       IF there are any other characters combining above present. This is so
+       that the tittle is preserved. */
+    switch (ch) {
+    case CAPITAL_I_WITH_GRAVE:
+      ch = 0x80;  /* U+300, "\xCC\x80", combining grave accent */
+      goto has_accent;
+    case CAPITAL_I_WITH_ACUTE:
+      ch = 0x81;  /* U+301, "\xCC\x81", combining acute accent */
+      goto has_accent;
+    case CAPITAL_I_WITH_TILDE:
+      ch = 0x83;  /* U+303, "\xCC\x83", combining tilde */
+    has_accent:
+      memcpy (buf->data, "i\xCC\x87\xCC", 4);
+      buf->data[4] = ch;
+      buf->len_chars = 3;
+      buf->len_bytes = 5;
+      return true;
+
+    case 'I':
+    case 'J':
+      buf->data[0] = ch ^ ('i' ^ 'I');
+      buf->len_bytes = 1;
+      if (false)
+    case CAPITAL_I_WITH_OGONEK:
+        buf->len_bytes = CHAR_STRING (SMALL_I_WITH_OGONEK, buf->data);
+      buf->len_chars = 1;
+      /* Change the state so we’re on the lookout for diacritics combining
+         above. If one is found, we need to add combining dot above. */
+      ctx->lithuanian_tittle = LT_INS_DOT_ABOVE;
+      return true;
+    }
+    break;
+  }
+
+  return false;
+}
 
 /* Based on CTX, case character CH accordingly. Update CTX as necessary.
    Return cased character.
@@ -234,6 +370,9 @@ case_characters (struct casing_str_buf *buf, struct casing_context *ctx,
 {
   enum case_action flag = normalise_flag (ctx);
 
+  if (case_lithuanian (buf, ctx, flag, ch))
+    return 0;
+
   if (flag != CASE_NO_ACTION && __builtin_expect(ctx->treat_turkic_i, false))
     {
       bool dot_above = false;
diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el
index 9f5e43f..bae4242 100644
--- a/test/src/casefiddle-tests.el
+++ b/test/src/casefiddle-tests.el
@@ -185,7 +185,32 @@ casefiddle-tests--characters
             ("I\u0307si\u0307s" "I\u0307Sİ\u0307S" "isi\u0307s"
              "I\u0307si\u0307s" "I\u0307si\u0307s" 'tr)
             ("I\u0307sI\u0307s" "I\u0307SI\u0307S" "isis"
-             "I\u0307sis" "I\u0307sI\u0307s" 'tr))
+             "I\u0307sis" "I\u0307sI\u0307s" 'tr)
+
+            ;; Test that combining dot above is inserted when needed when
+            ;; lower casing I or J.
+            ("I\u0328\u0300"            ; I + ogonek + grave
+             "I\u0328\u0300" "i\u0328\u0307\u0300"
+             "I\u0328\u0300" "I\u0328\u0300" 'lt)
+
+            ("J\u0328\u0300"            ; J + ogonek + grave
+             "J\u0328\u0300" "j\u0328\u0307\u0300"
+             "J\u0328\u0300" "J\u0328\u0300" 'lt)
+
+            ("Į\u0300"                  ; I-ogonek + grave
+             "Į\u0300" "į\u0307\u0300" "Į\u0300" "Į\u0300" 'lt)
+
+            ("Ì Í Ĩ"
+             "Ì Í Ĩ" "i\u0307\u0300 i\u0307\u0301 i\u0307\u0303"
+             "Ì Í Ĩ" "Ì Í Ĩ" 'lt)
+
+            ;; Test that combining dot above is removed when upper casing i or j.
+            ("i\u0328\u0307"            ; i + ogonek + dot above
+             "I\u0328" "i\u0328\u0307" "I\u0328" "I\u0328" 'lt)
+            ("j\u0328\u0307"            ; j + ogonek + dot above
+             "J\u0328" "j\u0328\u0307" "J\u0328" "J\u0328" 'lt)
+            ("į\u0307"                  ; i-ogonek + dot above
+             "Į" "į\u0307" "Į" "Į" 'lt))
           (nreverse errors))
     (let* ((input (string-to-multibyte (car test)))
            (expected (cdr test))
-- 
2.8.0.rc3.226.g39d4020
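
For readers who want to try the rule outside of Emacs, here is a
minimal standalone sketch of the two state transitions the patch
relies on.  It is only an illustration under simplifying assumptions,
not the code from casefiddle.c: the names lt_downcase and lt_upcase
are hypothetical, the input is an array of already decoded code points
rather than a casing_str_buf, only ASCII I/J and the grave, acute and
tilde accents are handled, precomposed letters are left untouched, and
no other character is cased.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points used by the sketch (same values as in the patch).  */
enum {
  LT_GRAVE = 0x300, LT_ACUTE = 0x301, LT_TILDE = 0x303,
  LT_DOT_ABOVE = 0x307, LT_OGONEK = 0x328
};

static int
lt_combines_above (uint32_t c)
{
  return c == LT_GRAVE || c == LT_ACUTE || c == LT_TILDE;
}

/* Lower-case I and J; if an accent above follows (possibly after an
   ogonek), put a combining dot above in front of it.  Returns the
   number of code points written.  CAP must be at least 2 * N.  */
static size_t
lt_downcase (const uint32_t *in, size_t n, uint32_t *out, size_t cap)
{
  size_t len = 0;
  int pending = 0;  /* Just saw I or J, watching for accents above.  */

  assert (cap >= 2 * n);
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = in[i];
      if (c == 'I' || c == 'J')
        {
          out[len++] = c + ('a' - 'A');
          pending = 1;
          continue;
        }
      if (pending && lt_combines_above (c))
        {
          out[len++] = LT_DOT_ABOVE;  /* Keep the tittle.  */
          pending = 0;
        }
      else if (c != LT_OGONEK)
        pending = 0;  /* An ogonek does not end the lookout.  */
      out[len++] = c;
    }
  return len;
}

/* Upper-case i and j; drop a combining dot above that follows
   (possibly after an ogonek).  CAP must be at least N.  */
static size_t
lt_upcase (const uint32_t *in, size_t n, uint32_t *out, size_t cap)
{
  size_t len = 0;
  int pending = 0;  /* Just saw i or j, watching for dot above.  */

  assert (cap >= n);
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = in[i];
      if (c == 'i' || c == 'j')
        {
          out[len++] = c - ('a' - 'A');
          pending = 1;
          continue;
        }
      if (pending && c == LT_DOT_ABOVE)
        {
          pending = 0;  /* Delete the now redundant dot.  */
          continue;
        }
      if (c != LT_OGONEK)
        pending = 0;
      out[len++] = c;
    }
  return len;
}
```

With these helpers, lower-casing I + ogonek + grave yields i + ogonek
+ dot above + grave, and upper-casing i + ogonek + dot above yields
I + ogonek, which mirrors the first and fifth 'lt test cases above.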