From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [PATCHv5 08/11] Implement rules for title-casing Dutch ij ‘letter’ (bug#24603)
Date: Thu, 9 Mar 2017 22:51:47 +0100
Message-ID: <20170309215150.9562-9-mina86@mina86.com>
References: <20170309215150.9562-1-mina86@mina86.com>
In-Reply-To: <20170309215150.9562-1-mina86@mina86.com>
To: 24603@debbugs.gnu.org, eliz@gnu.org

Dutch treats ‘ij’ as a single letter and when capitalising a word it
should be capitalised as such (i.e. ‘ij’ becomes ‘IJ’).  Implement
that.

* src/casefiddle.c (struct casing_context): Add a ‘special’ field which
determines if any special casing rules are in effect.
(prepare_casing_context): Interpret the ‘buffer-language’ variable and set
ctx->special accordingly.  This allows for per-language special rules.
For now only Dutch (‘nl’) is handled specially.
(case_character_impl): Add handling of a Dutch ‘ij’ letter.

* test/src/casefiddle-tests.el (casefiddle-tests--test-casing): Add test
cases for Dutch ‘ij’.
---
 src/casefiddle.c             | 56 ++++++++++++++++++++++++++++++++++++++++++++
 test/src/casefiddle-tests.el |  7 +++++-
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2f573782115..d59684c7b8e 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -49,6 +49,32 @@ struct casing_context {
   bool inbuffer;
   /* Whether we are inside of a word.  */
   bool inword;
+
+  /* Determines which special casing rules need to be applied as well as tracks
+     state for some of the transformations.  */
+  enum {
+    /* No special casing rules need to be applied.  */
+    SPECIAL_NONE,
+
+    /* In Dutch, ‘ij’ is a digraph and when capitalised the whole thing is upper
+       cased.  Unicode has ‘ĳ’ and ‘Ĳ’ (with proper casing mappings) but they
+       aren’t always used so we cannot/should not rely on them.
+
+       Note that the rule for capitalising ‘ij’ as a single letter is not present
+       in Unicode 9.0’s SpecialCasing.txt.  On the flip side, Firefox implements
+       this as well so we’re not completely alone.
+
+       There are words where ‘ij’ are two separate letters (such as bijectie or
+       bijoux) in which case the capitalisation rules do not apply.  I (mina86)
+       have googled this a little and couldn’t find a Dutch word which begins
+       with ‘ij’ that is not a digraph so we should be in the clear since we
+       only care about the initial.  */
+    /* Apply Dutch rules for capitalising ‘ij’.  */
+    SPECIAL_NL,
+    /* As above and the previous character was upcased ‘i’ so if we now see ‘j’
+       it needs to be upcased as well.  */
+    SPECIAL_NL_UPCASE_J
+  } special;
 };
 
 /* Initialise CTX structure and prepares related global data for casing
@@ -57,6 +83,8 @@ static void
 prepare_casing_context (struct casing_context *ctx,
                         enum case_action flag, bool inbuffer)
 {
+  Lisp_Object lang;
+
   ctx->flag = flag;
   ctx->inbuffer = inbuffer;
   ctx->inword = false;
@@ -65,6 +93,7 @@ prepare_casing_context (struct casing_context *ctx,
     : Qnil;
   ctx->specialcase_char_table =
     uniprop_table (intern_c_string ("special-casing"));
+  ctx->special = SPECIAL_NONE;
 
   /* If the case table is flagged as modified, rescan it.  */
   if (NILP (XCHAR_TABLE (BVAR (current_buffer, downcase_table))->extras[1]))
@@ -72,6 +101,14 @@ prepare_casing_context (struct casing_context *ctx,
 
   if (inbuffer && (int) flag >= (int) CASE_CAPITALIZE)
     SETUP_BUFFER_SYNTAX_TABLE ();	/* For syntax_prefix_flag_p.  */
+
+  lang = BVAR(current_buffer, language);
+  if (STRINGP (lang) && SCHARS (lang) >= 2)
+    switch ((SREF(lang, 0) << 8) | SREF(lang, 1) | 0x2020u) {
+    case ('n' << 8) | 'l':  /* Dutch */
+      if ((int) flag >= (int) CASE_CAPITALIZE)
+        ctx->special = SPECIAL_NL;
+    }
 }
 
 struct casing_str_buf {
@@ -95,6 +132,25 @@ case_character_impl (struct casing_str_buf *buf,
   bool was_inword;
   int cased;
 
+  /* Handle Dutch ij.  Note that SPECIAL_NL and SPECIAL_NL_UPCASE_J imply that
+     ctx->flag ≥ CASE_CAPITALIZE.  */
+  if (ctx->special == SPECIAL_NL && ch == 'i' && !ctx->inword)
+    {
+      ctx->special = SPECIAL_NL_UPCASE_J;
+      ctx->inword = true;
+      cased = 'I';
+      goto done;
+    }
+  else if (ctx->special == SPECIAL_NL_UPCASE_J)
+    {
+      ctx->special = SPECIAL_NL;
+      if (ch == 'j')
+        {
+          cased = 'J';
+          goto done;
+        }
+    }
+
   /* Update inword state */
   was_inword = ctx->inword;
   ctx->inword = SYNTAX (ch) == Sword &&
diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el
index 10450360eab..5e38a97d256 100644
--- a/test/src/casefiddle-tests.el
+++ b/test/src/casefiddle-tests.el
@@ -135,6 +135,7 @@ casefiddle-tests--test-casing
            (lambda (errors test)
              (let* ((input (car test))
                     (expected (cdr test))
+                    (buffer-language (or (nth 5 test) "en_GB"))
                     (func-pairs '((upcase upcase-region)
                                   (downcase downcase-region)
                                   (capitalize capitalize-region)
@@ -200,7 +201,11 @@ casefiddle-tests--test-casing
              ("Σ Σ" "Σ Σ" "σ σ" "Σ Σ" "Σ Σ")
              ("όσος" "ΌΣΟΣ" "όσος" "Όσος" "Όσος")
              ;; If sigma is already lower case, we don’t want to change it.
-             ("όσοσ" "ΌΣΟΣ" "όσοσ" "Όσοσ" "Όσοσ"))))))
+             ("όσοσ" "ΌΣΟΣ" "όσοσ" "Όσοσ" "Όσοσ")
+
+             ;; Dutch ‘ij’ is capitalised as a single digraph.
+             ("ijsland" "IJSLAND" "ijsland" "Ijsland" "Ijsland")
+             ("ijsland" "IJSLAND" "ijsland" "IJsland" "IJsland" "nl"))))))
 
 (ert-deftest casefiddle-tests-casing-byte8 ()
   (should-not
-- 
2.12.0.246.ga2ecc84866-goog
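
For readers who want to follow the two-state logic outside the Emacs tree,
here is a minimal standalone sketch in C of the same idea.  It is not code
from the patch: capitalize_ascii and enum nl_state are hypothetical names
made up for illustration, it handles ASCII input only, and it assumes
isalpha is a good enough stand-in for Emacs' syntax-table word test.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's SPECIAL_NL and
   SPECIAL_NL_UPCASE_J states.  */
enum nl_state { NL_IDLE, NL_UPCASE_J };

/* Capitalise every ASCII word of IN into OUT, which must have room for
   strlen (IN) + 1 bytes.  When DUTCH is true, a word-initial "ij" is
   written as "IJ" instead of "Ij".  */
static void
capitalize_ascii (const char *in, char *out, bool dutch)
{
  assert (in != NULL && out != NULL);

  enum nl_state state = NL_IDLE;
  bool inword = false;
  size_t n = strlen (in);

  for (size_t i = 0; i < n; i++)
    {
      unsigned char ch = (unsigned char) in[i];
      bool was_inword = inword;
      inword = isalpha (ch) != 0;

      /* The previous character was a word-initial 'i': upcase a
         following 'j', otherwise fall back to the normal rules.  */
      if (state == NL_UPCASE_J)
        {
          state = NL_IDLE;
          if (ch == 'j')
            {
              out[i] = 'J';
              continue;
            }
        }

      /* Remember that a word-initial 'i' was seen (Dutch only).  */
      if (dutch && ch == 'i' && !was_inword)
        state = NL_UPCASE_J;

      if (!inword)
        out[i] = (char) ch;               /* Not a letter: copy as is.  */
      else if (was_inword)
        out[i] = (char) tolower (ch);     /* Inside a word.  */
      else
        out[i] = (char) toupper (ch);     /* First letter of a word.  */
    }
  out[n] = '\0';
}
```

As in the patch, only a lower-case ‘i’ at the start of a word arms the
second state, so ‘ij’ inside a word (as in bijectie) is left to the
ordinary rules.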