From mboxrd@z Thu Jan  1 00:00:00 1970
From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [RFC 12/18] Implement rules for title-casing Dutch ij ‘letter’
Date: Tue, 4 Oct 2016 03:10:35 +0200
Message-ID: <1475543441-10493-12-git-send-email-mina86@mina86.com>
References: <1475543441-10493-1-git-send-email-mina86@mina86.com>
In-Reply-To: <1475543441-10493-1-git-send-email-mina86@mina86.com>
To: 24603@debbugs.gnu.org
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020

Dutch treats ‘ij’ as a single letter, so when a word is capitalised the
digraph should be capitalised as a whole (i.e. ‘ij’ becomes ‘IJ’).
Implement that.

* src/casefiddle.c (casing_context): Add treat_dutch_ij member for
determining whether special handling of ij is necessary.
(prepare_casing_context): Set treat_dutch_ij when in Dutch locale and
capitalising.
(case_character_impl): Upcase ‘i’ and a directly following ‘j’ together
when capitalising with treat_dutch_ij set.
---
 src/casefiddle.c             | 49 +++++++++++++++++++++++++++++++++++++++++++-
 test/src/casefiddle-tests.el |  6 +++++-
 2 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 0377fe6..0de7814 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -66,6 +66,27 @@ struct casing_context {
 #define LT_INS_DOT_ABOVE 3  /* Yes and look out for diacritics combining above
                                because we may need to inject dot above before
                                them. */
+
+  /* In Dutch, ‘ij’ is a digraph and when capitalised the whole thing is upper
+     cased.  Unicode has ‘ĳ’ (U+0133) and ‘Ĳ’ (U+0132), with proper casing
+     mappings, but they aren’t always used so we cannot/should not rely on them.
+
+     Note that the rule for capitalising ‘ij’ as a single letter is not present
+     in Unicode 9.0’s SpecialCasing.txt.  On the flip side, Firefox implements
+     this as well so we’re not completely alone.
+
+     There are words where ‘ij’ are two separate letters (such as bijectie or
+     bijoux) in which case the capitalisation rules do not apply.  I (mina86)
+     have googled this a little and couldn’t find a Dutch word which begins with
+     ‘ij’ that is not a digraph so we should be in the clear since we only care
+     about the initial.  */
+
+  /* Whether to apply Dutch rules for title-casing ij as IJ.  Non-zero
+     value implies flag is CASE_CAPITALIZE or CASE_CAPITALIZE_UP.  */
+  unsigned char treat_dutch_ij;
+#define NL_OFF 0  /* No */
+#define NL_ON 1  /* Yes */
+#define NL_UPCASE_J 2  /* Yes and the previous character was upcased ‘i’.  */
 };
 
 /* Initialise CTX structure and prepares related global data for casing
@@ -74,7 +95,7 @@ static void
 prepare_casing_context (struct casing_context *ctx,
                         enum case_action flag, bool inbuffer)
 {
-  Lisp_Object lang, l, tr, az, lt;
+  Lisp_Object lang, l, tr, az, lt, nl;
 
   ctx->flag = flag;
   ctx->inbuffer = inbuffer;
@@ -85,6 +106,7 @@ prepare_casing_context (struct casing_context *ctx,
 
   ctx->treat_turkic_i = false;
   ctx->lithuanian_tittle = LT_OFF;
+  ctx->treat_dutch_ij = NL_OFF;
 
   /* If the case table is flagged as modified, rescan it.  */
   if (NILP (XCHAR_TABLE (BVAR (current_buffer, downcase_table))->extras[1]))
@@ -98,6 +120,7 @@ prepare_casing_context (struct casing_context *ctx,
   tr = intern_c_string ("tr");
   az = intern_c_string ("az");
   lt = intern_c_string ("lt");
+  nl = intern_c_string ("nl");
   if (SYMBOLP (lang))
     {
       l = lang;
@@ -112,6 +135,8 @@ prepare_casing_context (struct casing_context *ctx,
         ctx->treat_turkic_i = true;
       else if (EQ (l, lt))
         ctx->lithuanian_tittle = LT_ON;
+      else if (EQ (l, nl))
+        ctx->treat_dutch_ij = (int) flag >= (int) CASE_CAPITALIZE;
     }
 }
 
@@ -154,6 +179,28 @@ case_character_impl (struct casing_str_buf *buf,
   ctx->inword = SYNTAX (ch) == Sword &&
     (!ctx->inbuffer || ctx->inword || !syntax_prefix_flag_p (ch));
 
+  /* Handle Dutch ij.  We need to do it here before the flag == CASE_NO_ACTION
+     check.  Note that non-zero treat_dutch_ij implies ctx->flag being ≥
+     CASE_CAPITALIZE.  */
+  switch (__builtin_expect (ctx->treat_dutch_ij, NL_OFF)) {
+  case NL_ON:
+    if (ch == 'i' && flag == CASE_CAPITALIZE)
+      {
+        ctx->treat_dutch_ij = NL_UPCASE_J;
+        cased = 'I';
+        goto done;
+      }
+    break;
+  case NL_UPCASE_J:
+    ctx->treat_dutch_ij = NL_ON;
+    if (ch == 'j')
+      {
+        cased = 'J';
+        goto done;
+      }
+  }
+
+  /* We are inside of a word and capitalising initials only.  */
   if (flag == CASE_NO_ACTION)
     {
       cased = ch;
diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el
index bae4242..3857f08 100644
--- a/test/src/casefiddle-tests.el
+++ b/test/src/casefiddle-tests.el
@@ -210,7 +210,11 @@ casefiddle-tests--characters
               ("j\u0328\u0307"  ; j + ogonek + dot above
                "J\u0328" "j\u0328\u0307" "J\u0328" "J\u0328" 'lt)
               ("į\u0307"        ; i-ogonek + dot above
-               "Į" "į\u0307" "Į" "Į" 'lt))
+               "Į" "į\u0307" "Į" "Į" 'lt)
+
+              ;; Dutch 'ij' is capitalised as single digraph.
+              ("ijsland" "IJSLAND" "ijsland" "Ijsland" "Ijsland")
+              ("ijsland" "IJSLAND" "ijsland" "IJsland" "IJsland" 'nl))
             (nreverse errors))
     (let* ((input (string-to-multibyte (car test)))
            (expected (cdr test))
-- 
2.8.0.rc3.226.g39d4020
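
For readers who want to try the NL_ON / NL_UPCASE_J idea outside of Emacs,
here is a minimal standalone sketch of the same two-state approach.  It is
not the code from this patch: capitalize_words, enum ij_state and the IJ_*
names are hypothetical, it handles ASCII only, and it uses isalpha as a
stand-in for the syntax-table word test.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical states mirroring NL_OFF, NL_ON and NL_UPCASE_J.  */
enum ij_state { IJ_OFF, IJ_ON, IJ_UPCASE_J };

/* Hypothetical helper, not part of Emacs: title-case every word of the
   ASCII string S in place.  When DUTCH is true, a word-initial lowercase
   'i' directly followed by 'j' is upcased as the pair "IJ".  */
void
capitalize_words (char *s, bool dutch)
{
  enum ij_state st = dutch ? IJ_ON : IJ_OFF;
  bool inword = false;

  for (; *s; s++)
    {
      unsigned char ch = (unsigned char) *s;
      bool was_inword = inword;
      inword = isalpha (ch) != 0;

      /* The previous character was a word-initial 'i' that we upcased.
         The pending state lasts for exactly one character.  */
      if (st == IJ_UPCASE_J)
        {
          st = IJ_ON;
          if (ch == 'j')
            {
              *s = 'J';
              continue;
            }
        }

      if (!inword)
        continue;

      if (was_inword)
        *s = (char) tolower (ch);   /* Rest of the word is downcased.  */
      else
        {
          *s = (char) toupper (ch); /* Word initial is upcased.  */
          if (st == IJ_ON && ch == 'i')
            st = IJ_UPCASE_J;       /* Remember to check for 'j' next.  */
        }
    }
}
```

With dutch set, "ijsland" becomes "IJsland" while "bijoux" becomes "Bijoux",
because only a word-initial 'i' arms the pending state.  Like the ch == 'i'
test in the patch, the sketch only reacts to a lowercase input 'i', so an
input that already reads "IJSLAND" is not treated specially.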