From: Eli Zaretskii <eliz@gnu.org>
To: Michal Nazarewicz <mina86@mina86.com>
Cc: 24603@debbugs.gnu.org
Subject: bug#24603: [PATCHv5 05/11] Support casing characters which map into multiple code points (bug#24603)
Date: Sat, 11 Mar 2017 11:14:53 +0200 [thread overview]
Message-ID: <83d1dodvjm.fsf@gnu.org> (raw)
In-Reply-To: <20170309215150.9562-6-mina86@mina86.com> (message from Michal Nazarewicz on Thu, 9 Mar 2017 22:51:44 +0100)
> From: Michal Nazarewicz <mina86@mina86.com>
> Date: Thu, 9 Mar 2017 22:51:44 +0100
>
> Implement unconditional special casing rules defined in Unicode standard.
>
> Among other things, they deal with cases when a single code point is
> replaced by multiple ones because single character does not exist (e.g.
> ‘fi’ ligature turning into ‘FL’) or is not commonly used (e.g. ß turning
> into SS).
>
> * admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
> standard distribution.
> * admin/unidata/README: Mention SpecialCasing.txt.
>
> * admin/unidata/unidata-get.el (unidata-gen-table-special-casing): New
> function for generating ‘special-casing’ character Unicode property
> built from the SpecialCasing.txt Unicode data file.
This new property is attainable via get-char-code-property, right? If
so, it should be documented in the Elisp manual, in the "Character
Properties" node.
I think I'd also like to see a few simple tests for this property.
> diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
> index cf47db4a814..ba1cf2606ce 100644
> --- a/doc/lispref/strings.texi
> +++ b/doc/lispref/strings.texi
> @@ -1166,6 +1166,29 @@ Case Conversion
> @end example
> @end defun
>
> + Note that case conversion is not a one-to-one mapping and the length
> +of the result may differ from the length of the argument (including
> +being shorter). Furthermore, because passing a character forces
> +return type to be a character, functions are unable to perform proper
> +substitution and result may differ compared to treating
> +a one-character string. For example:
> +
> +@example
> +@group
> +(upcase "fi") ; note: single character, ligature "fi"
> + @result{} "FI"
> +@end group
> +@group
> +(upcase ?fi)
> + @result{} 64257 ; i.e. ?fi
> +@end group
> +@end example
> +
> + To avoid this, a character must first be converted into a string,
> +using @code{string} function, before being passed to one of the casing
> +functions. Of course, no assumptions on the length of the result may
> +be made.
Once the ELisp manual describes the new special-casing property, the
above text should include a cross-reference to that description.
> DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
> doc: /* Convert argument to upper case and return that.
> The argument may be a character or string. The result has the same type.
> -The argument object is not altered--the value is a copy.
> +The argument object is not altered--the value is a copy. If argument
> +is a character, characters which map to multiple code points when
> +cased, e.g. fi, are returned unchanged.
> See also `capitalize', `downcase' and `upcase-initials'. */)
This (and other similar doc strings) should mention the special-casing
property as the way to know in advance which characters will remain
unchanged due to that.
next prev parent reply other threads:[~2017-03-11 9:14 UTC|newest]
Thread overview: 89+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-10-04 1:05 bug#24603: [RFC 00/18] Improvement to casing Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 01/18] Add tests for casefiddle.c Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 02/18] Generate upcase and downcase tables from Unicode data Michal Nazarewicz
2016-10-04 7:27 ` Eli Zaretskii
2016-10-04 14:54 ` Michal Nazarewicz
2016-10-04 15:06 ` Eli Zaretskii
2016-10-04 16:57 ` Michal Nazarewicz
2016-10-04 17:27 ` Eli Zaretskii
2016-10-04 17:44 ` Eli Zaretskii
2016-10-06 20:29 ` Michal Nazarewicz
2016-10-07 6:52 ` Eli Zaretskii
2016-10-04 1:10 ` bug#24603: [RFC 03/18] Don’t assume character can be either upper- or lower-case when casing Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 04/18] Split casify_object into multiple functions Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 05/18] Introduce case_character function Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 06/18] Add support for title-casing letters Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 07/18] Split up casify_region function Michal Nazarewicz
2016-10-04 7:17 ` Eli Zaretskii
2016-10-18 2:27 ` Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 08/18] Support casing characters which map into multiple code points Michal Nazarewicz
2016-10-04 7:38 ` Eli Zaretskii
2016-10-06 21:40 ` Michal Nazarewicz
2016-10-07 7:46 ` Eli Zaretskii
2017-01-28 23:48 ` Michal Nazarewicz
2017-02-10 9:12 ` Eli Zaretskii
2016-10-04 1:10 ` bug#24603: [RFC 09/18] Implement special sigma casing rule Michal Nazarewicz
2016-10-04 7:22 ` Eli Zaretskii
2016-10-04 1:10 ` bug#24603: [RFC 10/18] Implement Turkic dotless and dotted i handling when casing strings Michal Nazarewicz
2016-10-04 7:12 ` Eli Zaretskii
2016-10-04 1:10 ` bug#24603: [RFC 11/18] Implement casing rules for Lithuanian Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 12/18] Implement rules for title-casing Dutch ij ‘letter’ Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 13/18] Add some tricky Unicode characters to regex test Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 14/18] Factor out character category lookup to separate function Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 15/18] Base lower- and upper-case tests on Unicode properties Michal Nazarewicz
2016-10-04 6:54 ` Eli Zaretskii
2016-10-04 1:10 ` bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case Michal Nazarewicz
2016-10-04 7:48 ` Eli Zaretskii
2016-10-17 13:22 ` Michal Nazarewicz
2016-11-06 19:26 ` Michal Nazarewicz
2016-11-06 19:44 ` Eli Zaretskii
2016-12-20 14:32 ` Michal Nazarewicz
2016-12-20 16:39 ` Eli Zaretskii
2016-12-22 14:02 ` Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 17/18] Optimise character class matching in regexes Michal Nazarewicz
2016-10-04 1:10 ` bug#24603: [RFC 18/18] Fix case-fold-search character class matching Michal Nazarewicz
2016-10-17 22:03 ` bug#24603: [PATCH 0/3] Case table updates Michal Nazarewicz
2016-10-17 22:03 ` bug#24603: [PATCH 1/3] Add tests for casefiddle.c Michal Nazarewicz
2016-10-17 22:03 ` bug#24603: [PATCH 2/3] Generate upcase and downcase tables from Unicode data Michal Nazarewicz
2016-10-17 22:03 ` bug#24603: [PATCH 3/3] Don’t generate ‘X maps to X’ entries in case tables Michal Nazarewicz
2016-10-18 6:36 ` bug#24603: [PATCH 0/3] Case table updates Eli Zaretskii
2016-10-24 15:11 ` Michal Nazarewicz
2016-10-24 15:33 ` Eli Zaretskii
2017-03-09 21:51 ` bug#24603: [PATCHv5 00/11] Casing improvements Michal Nazarewicz
2017-03-09 21:51 ` bug#24603: [PATCHv5 01/11] Split casify_object into multiple functions Michal Nazarewicz
2017-03-10 9:00 ` Andreas Schwab
2017-03-09 21:51 ` bug#24603: [PATCHv5 02/11] Introduce case_character function Michal Nazarewicz
2017-03-09 21:51 ` bug#24603: [PATCHv5 03/11] Add support for title-casing letters (bug#24603) Michal Nazarewicz
2017-03-11 9:03 ` Eli Zaretskii
2017-03-09 21:51 ` bug#24603: [PATCHv5 04/11] Split up casify_region function (bug#24603) Michal Nazarewicz
2017-03-09 21:51 ` bug#24603: [PATCHv5 05/11] Support casing characters which map into multiple code points (bug#24603) Michal Nazarewicz
2017-03-11 9:14 ` Eli Zaretskii [this message]
2017-03-21 2:09 ` Michal Nazarewicz
2017-03-09 21:51 ` bug#24603: [PATCHv5 06/11] Implement special sigma casing rule (bug#24603) Michal Nazarewicz
2017-03-09 21:51 ` bug#24603: [PATCHv5 07/11] Introduce ‘buffer-language’ buffer-locar variable Michal Nazarewicz
2017-03-11 9:29 ` Eli Zaretskii
2017-03-09 21:51 ` bug#24603: [PATCHv5 08/11] Implement rules for title-casing Dutch ij ‘letter’ (bug#24603) Michal Nazarewicz
2017-03-11 9:40 ` Eli Zaretskii
2017-03-16 21:30 ` Michal Nazarewicz
2017-03-17 13:43 ` Eli Zaretskii
2017-03-09 21:51 ` bug#24603: [PATCHv5 09/11] Implement Turkic dotless and dotted i casing rules (bug#24603) Michal Nazarewicz
2017-03-09 21:51 ` bug#24603: [PATCHv5 10/11] Implement casing rules for Lithuanian (bug#24603) Michal Nazarewicz
2017-03-09 21:51 ` bug#24603: [PATCHv5 11/11] Implement Irish casing rules (bug#24603) Michal Nazarewicz
2017-03-11 9:44 ` Eli Zaretskii
2017-03-16 22:16 ` Michal Nazarewicz
2017-03-17 8:20 ` Eli Zaretskii
2017-03-11 10:00 ` bug#24603: [PATCHv5 00/11] Casing improvements Eli Zaretskii
2017-03-21 1:27 ` bug#24603: [PATCHv6 0/6] Casing improvements, language-independent part Michal Nazarewicz
2017-03-21 1:27 ` bug#24603: [PATCHv6 1/6] Split casify_object into multiple functions Michal Nazarewicz
2017-03-21 1:27 ` bug#24603: [PATCHv6 2/6] Introduce case_character function Michal Nazarewicz
2017-03-21 1:27 ` bug#24603: [PATCHv6 3/6] Add support for title-casing letters (bug#24603) Michal Nazarewicz
2017-03-21 1:27 ` bug#24603: [PATCHv6 4/6] Split up casify_region function (bug#24603) Michal Nazarewicz
2017-03-21 1:27 ` bug#24603: [PATCHv6 5/6] Support casing characters which map into multiple code points (bug#24603) Michal Nazarewicz
2017-03-22 16:06 ` Eli Zaretskii
2017-04-03 9:01 ` Michal Nazarewicz
2017-04-03 14:52 ` Eli Zaretskii
2019-06-25 0:09 ` Lars Ingebrigtsen
2019-06-25 0:29 ` Michał Nazarewicz
2020-08-11 13:46 ` Lars Ingebrigtsen
2021-05-10 11:51 ` bug#24603: [RFC 00/18] Improvement to casing Lars Ingebrigtsen
2017-03-21 1:27 ` bug#24603: [PATCHv6 6/6] Implement special sigma casing rule (bug#24603) Michal Nazarewicz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/emacs/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=83d1dodvjm.fsf@gnu.org \
--to=eliz@gnu.org \
--cc=24603@debbugs.gnu.org \
--cc=mina86@mina86.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/emacs.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).