* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-11 22:05 UTC
To: 20789

Package: emacs
Version: 25.0.50

Current master on x86_64 RHEL 7.1.

emacs -Q:

All looks fine, but there is a *Warnings* buffer with contents:

Error (initialization): Creation of the default fontsets failed: (error
Invalid script or charset name: cuneiform-numbers-and-punctuation)

A second bug: the *Warnings* buffer is not shown at startup, *scratch* is.

* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-11 22:24 UTC
To: 20789

Glenn Morris wrote:

> Error (initialization): Creation of the default fontsets failed: (error
> Invalid script or charset name: cuneiform-numbers-and-punctuation)

I fixed a typo that seems to have caused that.

I don't suppose that big list can be auto-generated from the inputs?

> A second bug: the *Warnings* buffer is not shown at startup, *scratch* is.

* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-12 8:28 UTC
To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Date: Thu, 11 Jun 2015 18:24:06 -0400
>
> Glenn Morris wrote:
>
> > Error (initialization): Creation of the default fontsets failed: (error
> > Invalid script or charset name: cuneiform-numbers-and-punctuation)
>
> I fixed a typo that seems to have caused that.

Sorry about that.

> I don't suppose that big list can be auto-generated from the inputs?

It's not trivial. I describe below some of the issues, in the hope
that Someone™ will volunteer:

 . Most of the script names come from the corresponding Unicode
   blocks, with trivial transformations (downcase words and replace
   blanks with a hyphen). So basically, we will need to use the
   information in Blocks.txt, a file that is part of the Unicode
   Character Database (UCD), but with quirks described below.

 . The first quirk is that we lump together all the blocks that
   belong to the same script, like "Basic Latin", "Latin Extended-A",
   "Latin-1 Supplement", etc. -- these all go to the single script
   called 'latin'. Likewise with other similar blocks that are
   either "SOMETHING Extended" or "Supplement" or whatever.

 . The second quirk is with the CJK characters: those are divided
   into several broad scripts like 'han', 'kana', and 'cjk-misc'
   whose exact rules I don't know.

 . The third quirk is with the 'symbol' pseudo-script: we lump there
   all punctuation characters and all symbol characters (those for
   which the General Category is one of Pc, Pd, Ps, Pe, Pi, Pf, Po,
   Sm, Sc, Sk, So), but with the following notable exception:
   punctuation characters that belong to blocks that include
   non-punctuation characters are left in those blocks -- those are
   punctuation characters used only with the scripts named by those
   blocks, like U+05BE HEBREW PUNCTUATION MAQAF, which is only used by
   the Hebrew script.

 . Another quirk is that mathematical alphanumerics (which are just
   letters from the Unicode POV) are lumped into a separate script
   'mathematical'.

Alternatively, one could use Scripts.txt from the UCD, and then the
only problem is to subdivide what they call "Common" into the scripts
we use.

For the general category of a character, one can do in Emacs:

  (get-char-code-property CHAR 'general-category)

Alternatively, one can search UnicodeData.txt directly: the General
Category is the 3rd field there.

Patches are welcome to do all of the above automatically, perhaps with
some small database that expresses the more tricky of the above rules.

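The "trivial transformation" and the General Category lookup described in the message above can be sketched roughly as follows. This is only an illustration in Python, not what Emacs does: the helper names block_to_script and parse_blocks_line are hypothetical, the suffix pattern is an assumption that covers just the "Extended"/"Supplement"/"Basic" cases named in the message, and none of the CJK, 'symbol' or 'mathematical' quirks are handled.

```python
import re
import unicodedata

# Assumed pattern for the "SOMETHING Extended" / "Supplement" blocks that
# the message says are lumped into their base script (illustration only).
_SUFFIX = re.compile(r"(-\d+)?[ -](extended|supplement)\b.*$")


def block_to_script(block_name):
    """Downcase a Unicode block name and replace blanks with hyphens."""
    name = _SUFFIX.sub("", block_name.lower())
    if name.startswith("basic "):
        name = name[len("basic "):]
    return name.replace(" ", "-")


def parse_blocks_line(line):
    """Parse one 'START..END; Name' line of Blocks.txt.

    Returns (start, end, name), or None for comments and blank lines.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    span, name = line.split(";", 1)
    start, end = span.split("..")
    return int(start, 16), int(end, 16), name.strip()


def general_category(char):
    """Rough Python analogue of (get-char-code-property CHAR 'general-category)."""
    return unicodedata.category(char)
```

With these definitions, block_to_script("Cuneiform Numbers and Punctuation") yields "cuneiform-numbers-and-punctuation", and "Basic Latin", "Latin Extended-A" and "Latin-1 Supplement" all collapse to "latin".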
* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-16 0:22 UTC
To: Eli Zaretskii; +Cc: 20789

[-- Attachment #1: Type: text/plain, Size: 1281 bytes --]

Eli Zaretskii wrote:

>> I don't suppose that big list can be auto-generated from the inputs?
>
> It's not trivial. I describe below some of the issues, in the hope
> that Someone™ will volunteer:

Thanks. Script that processes Blocks.txt attached. Some questions:

1. In Blocks.txt:

FF00..FFEF; Halfwidth and Fullwidth Forms

In Emacs:

(#xFF00 #xFF5F cjk-misc)
(#xFF61 #xFF9F kana)
(#xFFE0 #xFFEF cjk-misc)

Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?

2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?

3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
In Emacs, it ends at 1457F. Typo?

4. In Blocks.txt:

20000..2A6DF; CJK Unified Ideographs Extension B
2A700..2B73F; CJK Unified Ideographs Extension C
2B740..2B81F; CJK Unified Ideographs Extension D
2B820..2CEAF; CJK Unified Ideographs Extension E
2F800..2FA1F; CJK Compatibility Ideographs Supplement

In Emacs:

(#x20000 #x2CEAF han)
(#x2F800 #x2FFFF han)

Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
not cover. Intentional?

5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
(The case-insensitive source says "Sutton SignWriting".)

[-- Attachment #2: blocks.awk --]
[-- Type: application/octet-stream, Size: 6859 bytes --]

#!/usr/bin/awk -f

## Copyright (C) 2015 Free Software Foundation, Inc.

## Author: Glenn Morris <rgm@gnu.org>

## This file is part of GNU Emacs.

## GNU Emacs is free software: you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.

## GNU Emacs is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.

## You should have received a copy of the GNU General Public License
## along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>.

### Commentary:

## This script takes as input Unicode's Blocks.txt
## (http://www.unicode.org/Public/UNIDATA/Blocks.txt)
## and produces output for Emacs's lisp/international/charscript.el.

## It lumps together all the blocks belonging to the same language.
## E.g., "Basic Latin", "Latin-1 Supplement", "Latin Extended-A",
## etc. are all lumped together under "latin".

## The Unicode blocks actually extend past some of these ranges with
## undefined codepoints.

## For additional details, see <http://debbugs.gnu.org/20789#11>.

### Code:

BEGIN {
    ## Hard-coded names. See name2alias for the rest.
    alias["ipa extensions"] = "phonetic"
    alias["letterlike symbols"] = "symbol"
    alias["number forms"] = "symbol"
    alias["miscellaneous technical"] = "symbol"
    alias["control pictures"] = "symbol"
    alias["optical character recognition"] = "symbol"
    alias["enclosed alphanumerics"] = "symbol"
    alias["box drawing"] = "symbol"
    alias["block elements"] = "symbol"
    alias["miscellaneous symbols"] = "symbol"
    alias["cjk strokes"] = "cjk-misc"
    alias["cjk symbols and punctuation"] = "cjk-misc"
    alias["halfwidth and fullwidth forms"] = "cjk-misc"
    alias["common indic number forms"] = "north-indic-number"

    tohex["a"] = 10
    tohex["b"] = 11
    tohex["c"] = 12
    tohex["d"] = 13
    tohex["e"] = 14
    tohex["f"] = 15

    fix_start["0080"] = "00A0"
    fix_end["2A6DF"] = "2A6FF"
    fix_end["2FA1F"] = "2FFFF"
}

## From admin/charsets/.
## With gawk's --non-decimal-data switch we wouldn't need this.
function decode_hex(str , n, len, i, c) {
    n = 0
    len = length(str)
    for (i = 1; i <= len; i++) {
        c = substr (str, i, 1)
        if (c >= "0" && c <= "9")
            n = n * 16 + (c - "0")
        else
            n = n * 16 + tohex[tolower(c)]
    }
    return n
}

function name2alias(name , w, w2) {
    name = tolower(name)
    if (alias[name]) return alias[name]
    else if (name ~ /for symbols/) return "symbol"
    else if (name ~ /latin|combining .* marks|spacing modifier|tone letters|alphabetic presentation/) return "latin"
    else if (name ~ /cjk|yijing|enclosed ideograph|kangxi/) return "han"
    else if (name ~ /arabic/) return "arabic"
    else if (name ~ /^greek/) return "greek"
    else if (name ~ /^coptic/) return "coptic"
    else if (name ~ /cuneiform number/) return "cuneiform-numbers-and-punctuation"
    else if (name ~ /cuneiform/) return "cuneiform"
    else if (name ~ /mathematical alphanumeric symbol/) return "mathematical"
    else if (name ~ /punctuation|mathematical|arrows|currency|superscript|small form variants|geometric|dingbats|enclosed|alchemical|pictograph|emoticon|transport/) return "symbol"
    else if (name ~ /canadian aboriginal/) return "canadian-aboriginal"
    else if (name ~ /katakana|hiragana/) return "kana"
    else if (name ~ /myanmar/) return "burmese"
    else if (name ~ /hangul/) return "hangul"
    else if (name ~ /khmer/) return "khmer"
    else if (name ~ /braille/) return "braille"
    else if (name ~ /^yi /) return "yi"
    else if (name ~ /surrogates|private use|variation selectors/) return 0
    else if (name ~/^(specials|tags)$/) return 0
    else if (name ~ /linear b/) return "linear-b"
    else if (name ~ /aramaic/) return "aramaic"
    else if (name ~ /rumi num/) return "rumi-number"
    else if (name ~ /duployan|shorthand/) return "duployan-shorthand"
    else if (name ~ /sutton signwriting/) return "sutton-sign-writing"

    sub(/ (extended|extensions|supplement).*/, "", name)
    sub(/numbers/, "number", name)
    sub(/numerals/, "numeral", name)
    sub(/symbols/, "symbol", name)
    sub(/forms$/, "form", name)
    sub(/tiles$/, "tile", name)
    sub(/^new /, "", name)
    sub(/ (characters|hieroglyphs|cursive)$/, "", name)
    gsub(/ /, "-", name)

    return name
}

/^[0-9A-F]/ {
    sep = index($1, "..")
    len = length($1)
    s = substr($1,1,sep-1)
    e = substr($1,sep+2,len-sep-2)
    $1 = ""
    sub(/^ */, "", $0)
    i++
    start[i] = fix_start[s] ? fix_start[s] : s
    end[i] = fix_end[e] ? fix_end[e]: e
    name[i] = $0

    alt[i] = name2alias(name[i])

    if (!alt[i]) {
        i--
        next
    }

    ## Combine adjacent ranges with the same name.
    if (alt[i] == alt[i-1] && decode_hex(start[i]) == 1 + decode_hex(end[i-1])) {
        end[i-1] = end[i]
        name[i-1] = (name[i-1] ", " name[i])
        i--
    }

    ## Some hard-coded splits.
    if (start[i] == "0370") {
        end[i] = "03E1"
        i++
        start[i] = "03E2"
        end[i] = "03EF"
        alt[i] = "coptic"
        i++
        start[i] = "03F0"
        end[i] = "03FF"
        alt[i] = "greek"
    } else if (start[i] == "FB00") {
        end[i] = "FB06"
        i++
        start[i] = "FB13"
        end[i] = "FB17"
        alt[i] = "armenian"
        i++
        start[i] = "FB1D"
        end[i] = "FB4F"
        alt[i] = "hebrew"
    } else if (start[i] == "FF00") {
        end[i] = "FF5F"
        i++
        start[i] = "FF61"
        end[i] = "FF9F"
        alt[i] = "kana"
        i++
        start[i] = "FFE0"
        end[i] = "FFEF"
        alt[i] = "cjk-misc"
    }
}

END {
    print ";;; charscript.el --- character script table -*- no-byte-compile: t -*-"
    print ";;; Automatically generated from admin/unidata/Blocks.txt"
    print "(let (script-list)"
    print " (dolist (elt '("

    for (j=1;j<=i;j++) {
        printf(" (#x%s #x%s %s)", start[j], end[j], alt[j])
        ## Fuzz to decide whether worth printing original name as a comment.
        if (name[j] && alt[j] != tolower(name[j]) && alt[j] !~ /-/)
            printf(" ; %s", name[j])
        printf("\n")
    }

    print " ))"
    print " (set-char-table-range char-script-table"
    print " (cons (car elt) (nth 1 elt)) (nth 2 elt))"
    print " (or (memq (nth 2 elt) script-list)"
    print " (setq script-list (cons (nth 2 elt) script-list))))"
    print " (set-char-table-extra-slot char-script-table 0 (nreverse script-list)))"
    print ""
    print "(provide 'charscript)"
}

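The "Combine adjacent ranges with the same name" step of the attached script may be easier to follow in isolation. The following Python sketch restates that one step under the assumption that the ranges arrive sorted and already mapped to script names; merge_adjacent and to_elisp are hypothetical names, and the output format only imitates the (#xSTART #xEND script) entries that the awk script prints.

```python
def merge_adjacent(ranges):
    """Merge consecutive (start, end, script) triples.

    Two triples are merged only when they carry the same script name and
    the second starts exactly one code point after the first ends, which
    mirrors the test in the awk script.
    """
    merged = []
    for start, end, script in ranges:
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == start:
            prev_start = merged[-1][0]
            merged[-1] = (prev_start, end, script)
        else:
            merged.append((start, end, script))
    return merged


def to_elisp(ranges):
    """Format triples roughly like the entries written to charscript.el."""
    return ["(#x%04X #x%04X %s)" % triple for triple in ranges]


if __name__ == "__main__":
    sample = [
        (0x0100, 0x017F, "latin"),     # Latin Extended-A
        (0x0180, 0x024F, "latin"),     # Latin Extended-B
        (0x0250, 0x02AF, "phonetic"),  # IPA Extensions
    ]
    for entry in to_elisp(merge_adjacent(sample)):
        print(entry)
```

Note that a gap between two ranges of the same script (as with the 0080 to 00A0 start fix-up in the script) keeps them as separate entries.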
* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-16 14:41 UTC
To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: 20789@debbugs.gnu.org
> Date: Mon, 15 Jun 2015 20:22:07 -0400
>
> Eli Zaretskii wrote:
>
> >> I don't suppose that big list can be auto-generated from the inputs?
> >
> > It's not trivial. I describe below some of the issues, in the hope
> > that Someone™ will volunteer:
>
> Thanks. Script that processes Blocks.txt attached. Some questions:
>
> 1. In Blocks.txt:
>
> FF00..FFEF; Halfwidth and Fullwidth Forms
>
> In Emacs:
>
> (#xFF00 #xFF5F cjk-misc)
> (#xFF61 #xFF9F kana)
> (#xFFE0 #xFFEF cjk-misc)
>
> Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?

AFAICT, there's a small mess around there. Based on the names of the
pertinent characters, I think we should have this instead of the above
3 ranges:

  (#xFF00 #xFF60 cjk-misc)
  (#xFF61 #xFF9F kana)
  (#xFFA0 #xFFDF hangul)
  (#xFFE0 #xFFEF cjk-misc)

> 2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?

Yes, please.

> 3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
> In Emacs, it ends at 1457F. Typo?

Yes.

> 4. In Blocks.txt:
>
> 20000..2A6DF; CJK Unified Ideographs Extension B
> 2A700..2B73F; CJK Unified Ideographs Extension C
> 2B740..2B81F; CJK Unified Ideographs Extension D
> 2B820..2CEAF; CJK Unified Ideographs Extension E
> 2F800..2FA1F; CJK Compatibility Ideographs Supplement
>
> In Emacs:
>
> (#x20000 #x2CEAF han)
> (#x2F800 #x2FFFF han)
>
> Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
> not cover. Intentional?

I don't know, but probably not intentional. I think we had better
make it consistent with the UCD.

> 5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
> (The case-insensitive source says "Sutton SignWriting".)

Well, "signwriting" is not a word, AFAIK, it's 2 words (and the funny
camel-case seems to agree with me). AFAIU, they used "SignWriting"
because it's the commercial name. But if you insist, I won't...

Thank you for doing this.

P.S. Does the script work with mawk? (Some systems have it as their
default Awk, I think.)

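One way to sanity-check the four ranges proposed in the message above is to print the character names at the range boundaries. The sketch below does that in Python; the table is copied from the message, while script_of and describe are hypothetical helpers, and the names come from whatever Unicode version the local Python ships.

```python
import unicodedata

# The four ranges proposed in the message above.
PROPOSED = [
    (0xFF00, 0xFF60, "cjk-misc"),
    (0xFF61, 0xFF9F, "kana"),
    (0xFFA0, 0xFFDF, "hangul"),
    (0xFFE0, 0xFFEF, "cjk-misc"),
]


def script_of(cp, table=PROPOSED):
    """Return the script assigned to code point CP, or None if uncovered."""
    for start, end, script in table:
        if start <= cp <= end:
            return script
    return None


def describe(cp):
    """One line per code point: number, proposed script, Unicode name."""
    name = unicodedata.name(chr(cp), "<unassigned>")
    return "U+%04X %-9s %s" % (cp, script_of(cp), name)


if __name__ == "__main__":
    for cp in (0xFF5F, 0xFF60, 0xFF61, 0xFF9F, 0xFFA0, 0xFFDC, 0xFFE0):
        print(describe(cp))
```

U+FF60 is FULLWIDTH RIGHT WHITE PARENTHESIS and U+FFA0 is HALFWIDTH HANGUL FILLER, which is consistent with the first and third proposed ranges.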
* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-17 6:52 UTC
To: Eli Zaretskii; +Cc: 20789

Eli Zaretskii wrote:

> Well, "signwriting" is not a word, AFAIK, it's 2 words [...]

It's a word (in the OED), but in the sense of painting commercial signs.

I don't really care, it's just that ~ 50% of the script is transforming
the Unicode names to the (seemingly randomly chosen) Emacs names. If the
latter were more straightforwardly derived from the former, things would
be simpler. But one more special rule makes no difference.

> P.S. Does the script work with mawk?

Yes, and with Sun OS 5.10's /usr/xpg4/bin/awk (but not /usr/bin/awk).
I don't believe it uses any more features than admin/charsets/*.awk.

Is there anything else in international/ that could benefit from being
auto-generated?

* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-17 16:27 UTC
To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: 20789@debbugs.gnu.org
> Date: Wed, 17 Jun 2015 02:52:48 -0400
>
> Is there anything else in international/ that could benefit from being
> auto-generated?

Some. Things I've spotted:

 . characters.el:

   . The modify-category-entry calls -- they basically can be derived
     from Blocks.txt

   . The modify-syntax-entry and set-case-syntax calls can be derived
     from the values of the 'general-category' property returned by
     'get-char-code-property', perhaps augmented by 'paired-bracket'
     and 'paired-type' properties

   . The set-case-syntax-pair calls (perhaps use the data in
     CaseFolding.txt, or even the case mapping information in
     UnicodeData.txt)

   . The setup of char-width-table -- I think the information is in
     EastAsianWidth.txt, with background information described in
     UAX#11 (http://www.unicode.org/reports/tr11/)

   . The setup of char-acronym-table: at least some of the data is in
     NameAliases.txt and NameList.txt

 . fontset.el:

   . The setup of script-representative-chars

 . mule-cmds.el:

   . The setting of locale-language-names -- the data is available in
     IANA's Language Subtag Registry
     (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
     and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
     http://www.loc.gov/standards/iso639-2/php/English_list.php)

TIA

P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
reminder to fetch all those reference files and regenerate their
dependencies, before we prepare a release.

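As a rough illustration of the suggestion that the set-case-syntax-pair calls could come from Unicode case mapping data, the sketch below lists upper/lower pairs for a code point range. It is an assumption-laden stand-in: it relies on Python's built-in str.lower and str.upper (which draw on the case mappings of the Python build's Unicode version) rather than reading CaseFolding.txt or UnicodeData.txt, and simple_case_pairs is a hypothetical name.

```python
def simple_case_pairs(start, end):
    """Return (upper, lower) pairs for code points START..END inclusive.

    Only one-to-one, round-tripping mappings are kept; characters whose
    lowercase form expands to several characters are skipped.
    """
    pairs = []
    for cp in range(start, end + 1):
        upper = chr(cp)
        lower = upper.lower()
        if lower != upper and len(lower) == 1 and lower.upper() == upper:
            pairs.append((upper, lower))
    return pairs


if __name__ == "__main__":
    # Latin-1 capital letters; U+00D7 MULTIPLICATION SIGN has no case.
    for upper, lower in simple_case_pairs(0xC0, 0xDE):
        print("U+%04X U+%04X" % (ord(upper), ord(lower)))
```

Each resulting pair corresponds to one candidate set-case-syntax-pair call; which of them Emacs would actually want is a separate question that this sketch does not answer.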
* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-20 23:34 UTC
To: Eli Zaretskii; +Cc: 20789

I spent some time looking at some of these.
In no case could I see a clear path from the inputs to the outputs.

Eli Zaretskii wrote:

> . characters.el:
>
>   . The modify-category-entry calls -- they basically can be derived
>     from Blocks.txt

I looked at it briefly. I can see that they are somewhat related, but
not precisely how. Eg:

Emacs: 2E80:312F and 3190:33FF are "line breakable".
Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.

Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.

I didn't look any further.

>   . The modify-syntax-entry and set-case-syntax calls can be derived
>     from the values of the 'general-category' property returned by
>     'get-char-code-property', perhaps augmented by 'paired-bracket'
>     and 'paired-type' properties

I didn't look at this yet.

>   . The set-case-syntax-pair calls (perhaps use the data in
>     CaseFolding.txt, or even the case mapping information in
>     UnicodeData.txt)

I didn't look at this yet.

>   . The setup of char-width-table -- I think the information is in
>     EastAsianWidth.txt, with background information described in
>     UAX#11 (http://www.unicode.org/reports/tr11/)

Looks somewhat promising, but could you be more specific?
There's nothing in that file that defines "zero width" characters, so I
don't see where Emacs's width 0 characters come from.
The width 2 characters look like they might be the "W" and "F" characters,
but just doing that gives a list that has many differences to the list
Emacs uses.

>   . The setup of char-acronym-table: at least some of the data is in
>     NameAliases.txt and NameList.txt

Looks somewhat promising.
I can see how most of this comes from NameAliases.txt.
But there are many oddities:

Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
or EOF)?

0019 is EOM in the source but EM in Emacs.
0080 is PAD in the source but XXX in Emacs.
0081 is HOP in the source but XXX in Emacs.
008F is SS3 in the source but SS1 in Emacs.
0099 is SGC in the source but XXX in Emacs.

How does Emacs choose which entries to list? There are many more in the
source. Could it do any harm to add more?

Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.

Why does Emacs list two Khmer entries, and nothing else? There are loads
more of them.

> . fontset.el:
>
>   . The setup of script-representative-chars

I don't see how. It seems to be "for some of, but not all, the entries
in char-script-table, choose a single character somewhere in the range."
There seems to be no pattern to how the character is chosen within the
range. Often the first one, but by no means always.

> . mule-cmds.el:
>
>   . The setting of locale-language-names -- the data is available in
>     IANA's Language Subtag Registry
>     (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
>     and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
>     http://www.loc.gov/standards/iso639-2/php/English_list.php)

Again, I don't see how. Eg nowhere in those source files do I see Welsh
associated with iso-8859-14, and the comment in mule-cmds says that the
last part is "implementation dependent".

> P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> reminder to fetch all those reference files and regenerate their
> dependencies, before we prepare a release.

admin/FOR-RELEASE contains that kind of thing.

* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-21 15:00 UTC
To: Glenn Morris, Kenichi Handa; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: 20789@debbugs.gnu.org
> Date: Sat, 20 Jun 2015 19:34:01 -0400
>
> I spent some time looking at some of these.
> In no case could I see a clear path from the inputs to the outputs.

Thanks for looking into this.

Let me first make a general comment: we can always convert only
certain parts of the setup to an automated procedure, and leave the
rest in its present form, more or less. That's especially true where
Emacs has specialized needs or defines properties not in Unicode.

> > . characters.el:
> >
> >   . The modify-category-entry calls -- they basically can be derived
> >     from Blocks.txt
>
> I looked at it briefly. I can see that they are somewhat related, but
> not precisely how. Eg:
>
> Emacs: 2E80:312F and 3190:33FF are "line breakable".
> Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.
>
> Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
> Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.
>
> I didn't look any further.

When I said "derived from Blocks.txt", I meant the categories that are
related to script names, like ASCII, Latin, Arabic, Chinese, etc.
Sorry for not saying that explicitly.

Other categories need other sources. Here's my attempt to decipher
some of them:

 . ?| -- "line breakable"

   The data seems to be in LineBreak.txt, described in detail in
   UAX#14 (http://unicode.org/reports/tr14/). It looks like
   characters with the ?| category are those whose line-break
   properties are ID or CJ or NS. Therefore, the exclusion of Hangul
   Compatibility Jamo is a mistake (or maybe an omission, since the
   comment says "Chinese"); in particular, UAX#14 explicitly says, in
   section 5.1 under "ID", that the characters in the range 3130..318F
   are treated as class ID.

   This category is currently used only by kinsoku.el, which has its
   own data (and sets the ?< and ?> categories). So this will only
   become important if we ever implement in Emacs something more
   general, like the algorithm described in UAX#14.

 . "2-byte han" -- I think this is related to their legacy encoding;
   I don't see this used anywhere. Likewise with other 2-byte
   categories. Perhaps Handa-san (CC'ed) could comment on their
   necessity. If this is still needed, we should probably leave these
   alone.

 . ?0 - ?9 -- I don't see how to get this data from the UCD or any
   other source. Some of it seems to be in
   IndicSyllabicCategory.txt, FWIW.

 . ?R and ?L -- already set up using the Unicode data, so no change
   is needed.

 . ?^ -- should be set for any character whose general-category is
   Mn. Since we already do this, the manual setting around line 820
   is redundant and should be deleted.

 . ?. -- already set using Unicode data, no change needed.

> >   . The setup of char-width-table -- I think the information is in
> >     EastAsianWidth.txt, with background information described in
> >     UAX#11 (http://www.unicode.org/reports/tr11/)
>
> Looks somewhat promising, but could you be more specific?
> There's nothing in that file that defines "zero width" characters, so I
> don't see where Emacs's width 0 characters come from.

The following rules regarding zero-width characters are due to Markus
Kuhn, and are excerpted from the description in comments to his
implementation of 'wcwidth'
(http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c):

 . The null character (U+0000) has a column width of 0.

 . Non-spacing and enclosing combining characters (general category
   code Mn or Me in the Unicode database) have a column width of 0.

 . ZERO WIDTH SPACE (U+200B) and format characters (general category
   code Cf in the Unicode database), except SOFT HYPHEN (U+00AD),
   have a column width of 0.

 . Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
   have a column width of 0.

> The width 2 characters look like they might be the "W" and "F" characters,

Yes.

> but just doing that gives a list that has many differences to the list
> Emacs uses.

I don't see any significant differences, except perhaps in unassigned
codepoints (see paragraph 6.1 of UAX#11 for the treatment of
unassigned CJK codepoints). I think any differences beyond that
should be treated as errors in Emacs in this case.

> >   . The setup of char-acronym-table: at least some of the data is in
> >     NameAliases.txt and NameList.txt
>
> Looks somewhat promising.
> I can see how most of this comes from NameAliases.txt.
> But there are many oddities:
>
> Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
> or EOF)?

This table is set for the 'acronym' method of glyphless-char-display,
so I guess these omissions are for characters for which no one
envisioned them to be ever displayed as glyphless. I'd include them
in the table anyway, just in case, and also to keep our exceptions vs
the UCD to the bare minimum.

> 0019 is EOM in the source but EM in Emacs.

Typo, I think.

> 0080 is PAD in the source but XXX in Emacs.
> 0081 is HOP in the source but XXX in Emacs.
> 008F is SS3 in the source but SS1 in Emacs.
> 0099 is SGC in the source but XXX in Emacs.

I think these are typos and perhaps acronyms that whoever wrote this
didn't know.

> How does Emacs choose which entries to list? There are many more in the
> source. Could it do any harm to add more?

As long as you take only "abbreviations", i.e. they are short, I think
we should use all of them, given their use in Emacs.

> Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.

AFAIK, that's the official name of that character. At least that's
what I glean from Google; I know nothing about the Khmer script.

> Why does Emacs list two Khmer entries, and nothing else? There are loads
> more of them.

These are the only 2 that have such abbreviations; see
https://en.wikipedia.org/wiki/Khmer_alphabet (assuming by "loads more"
you meant the Khmer letters).

> > . fontset.el:
> >
> >   . The setup of script-representative-chars
>
> I don't see how. It seems to be "for some of, but not all, the entries
> in char-script-table, choose a single character somewhere in the range."

We should have a representative character for each entry in
char-script-table. They are used with some font back-ends (xfont and
x?ftfont, AFAIR) to probe candidate fonts for coverage of the required
script, so we should have the full information about that. I think
the only reason for the partial information we have now is that it is
maintained manually, so it includes whatever the people who worked on
that bothered to add.

> There seems to be no pattern to how the character is chosen within the
> range. Often the first one, but by no means always.

I think the rule is to choose the first one that is a letter, i.e. its
general-category is either one of Lu, Ll, Lt, Lo, or Lm.

> > . mule-cmds.el:
> >
> >   . The setting of locale-language-names -- the data is available in
> >     IANA's Language Subtag Registry
> >     (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
> >     and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
> >     http://www.loc.gov/standards/iso639-2/php/English_list.php)
>
> Again, I don't see how. Eg nowhere in those source files do I see Welsh
> associated with iso-8859-14, and the comment in mule-cmds says that the
> last part is "implementation dependent".

The bulk of the data is the correspondence between the ISO 639
2-letter names and the country/culture name. The few cases where we
also have the encoding could be set up with a very small database once
the main data is set, by adding the encoding to those few that need
it.

If by "last part" you mean IPA and "Nonstandard or obsolete language
codes", then these are very few and can be added manually.

> > P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> > reminder to fetch all those reference files and regenerate their
> > dependencies, before we prepare a release.
>
> admin/FOR-RELEASE contains that kind of thing.

Right, I will add the information there.

Thanks.

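Two of the rules stated in the message above lend themselves to a short sketch: the zero-width rules attributed to Markus Kuhn's wcwidth comments, and the guess that a script's representative character is the first letter in its range. The Python below is only an approximation under those stated rules; is_zero_width and first_letter are hypothetical names, and the categories come from the Unicode version bundled with the local Python, not from Emacs.

```python
import unicodedata

LETTER_CATEGORIES = ("Lu", "Ll", "Lt", "Lo", "Lm")


def is_zero_width(cp):
    """Apply the four zero-width rules quoted in the message above."""
    if cp == 0x0000:
        return True                      # the null character
    if cp == 0x00AD:
        return False                     # SOFT HYPHEN is the stated exception
    if cp == 0x200B:
        return True                      # ZERO WIDTH SPACE
    if unicodedata.category(chr(cp)) in ("Mn", "Me", "Cf"):
        return True                      # combining and format characters
    return 0x1160 <= cp <= 0x11FF        # Hangul Jamo medials and finals


def first_letter(start, end):
    """First code point in START..END whose category is a letter, or None."""
    for cp in range(start, end + 1):
        if unicodedata.category(chr(cp)) in LETTER_CATEGORIES:
            return cp
    return None


if __name__ == "__main__":
    print(is_zero_width(0x0301))                 # COMBINING ACUTE ACCENT
    print("U+%04X" % first_letter(0x0590, 0x05FF))
```

For the Hebrew block (0590..05FF), first_letter skips the leading points and punctuation and lands on U+05D0 HEBREW LETTER ALEF.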
* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-27 2:02 UTC
To: Eli Zaretskii; +Cc: 20789

Eli Zaretskii wrote:

>> The width 2 characters look like they might be the "W" and "F" characters,
>
> Yes.
>
>> but just doing that gives a list that has many differences to the list
>> Emacs uses.

This is the list of "F" and "W" characters, compared to the 11 ranges that
Emacs uses:

(#x1100 . #x115F)
(#x2329 . #x232A)
(#x2E80 . #x2E99)
(#x2E9B . #x2EF3)
(#x2F00 . #x2FD5)
(#x2FF0 . #x2FFB)
(#x3000 . #x303E)
(#x3041 . #x3096)
(#x3099 . #x30FF)
(#x3105 . #x312D)
(#x3131 . #x318E)
(#x3190 . #x31BA)
(#x31C0 . #x31E3)
(#x31F0 . #x321E)
(#x3220 . #x3247)
(#x3250 . #x32FE)
(#x3300 . #x4DBF)
(#x4E00 . #xA48C)
(#xA490 . #xA4C6)
(#xA960 . #xA97C)
(#xAC00 . #xD7A3)
(#xF900 . #xFAFF)
(#xFE10 . #xFE19)
(#xFE30 . #xFE52)
(#xFE54 . #xFE66)
(#xFE68 . #xFE6B)
(#xFF01 . #xFF60)
(#xFFE0 . #xFFE6)
(#x1B000 . #x1B001)
(#x1F200 . #x1F202)
(#x1F210 . #x1F23A)
(#x1F240 . #x1F248)
(#x1F250 . #x1F251)
(#x20000 . #x2FFFD)
(#x30000 . #x3FFFD)

> I don't see any significant differences, except perhaps in unassigned
> codepoints (see paragraph 6.1 of UAX#11 for the treatment of
> unassigned CJK codepoints).

I don't know if this means that the above needs modifying?

* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-27 7:42 UTC
To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: Kenichi Handa <handa@gnu.org>, 20789@debbugs.gnu.org
> Date: Fri, 26 Jun 2015 22:02:36 -0400
>
> Eli Zaretskii wrote:
>
> >> The width 2 characters look like they might be the "W" and "F" characters,
> >
> > Yes.
> >
> >> but just doing that gives a list that has many differences to the list
> >> Emacs uses.
>
> This is list of "F" and "W" characters, compared to the 11 ranges that
> Emacs uses:

Looks good to me. The 11 ranges we have now are either identical or
more coarse than the list derived from the UCD that you show.

> > I don't see any significant differences, except perhaps in unassigned
> > codepoints (see paragraph 6.1 of UAX#11 for the treatment of
> > unassigned CJK codepoints).
>
> I don't know if this means that the above needs modifying?

I was saying that we need to augment the list with the 5 ranges of
unassigned codepoints that belong to the CJK planes, as described in
that section of UAX#11. An unassigned codepoint has its
'general-category' property set to 'Cn', and the list of the 5 planes
could be in some defconst, because it will probably never change.

Thanks.
