From: Glenn Morris <rgm@gnu.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 20789@debbugs.gnu.org
Subject: bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date: Mon, 15 Jun 2015 20:22:07 -0400
Message-ID: <ozy4jkh58w.fsf@fencepost.gnu.org>
In-Reply-To: <21zj45kiix.fsf@fencepost.gnu.org>
[-- Attachment #1: Type: text/plain, Size: 1281 bytes --]
Eli Zaretskii wrote:
>> I don't suppose that big list can be auto-generated from the inputs?
>
> It's not trivial. I describe below some of the issues, in the hope
> that Someone™ will volunteer:
Thanks. Script that processes Blocks.txt attached. Some questions:
1. In Blocks.txt:
FF00..FFEF; Halfwidth and Fullwidth Forms
In Emacs:
(#xFF00 #xFF5F cjk-misc)
(#xFF61 #xFF9F kana)
(#xFFE0 #xFFEF cjk-misc)
Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?
2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?
3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
In Emacs, it ends at 1457F. Typo?
4. In Blocks.txt:
20000..2A6DF; CJK Unified Ideographs Extension B
2A700..2B73F; CJK Unified Ideographs Extension C
2B740..2B81F; CJK Unified Ideographs Extension D
2B820..2CEAF; CJK Unified Ideographs Extension E
2F800..2FA1F; CJK Compatibility Ideographs Supplement
In Emacs:
(#x20000 #x2CEAF han)
(#x2F800 #x2FFFF han)
Emacs adds the ranges 2A6E0..2A6FF and 2FA20..2FFFF, which Blocks.txt does
not cover. Intentional?
5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
(The case-insensitive source says "Sutton SignWriting".)
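
Questions 1 and 4 both come down to comparing the ranges in Blocks.txt with the ranges Emacs assigns. A minimal sketch of that comparison in Python, using only the ranges quoted above. The helper name `uncovered` is hypothetical and is not part of Emacs or of the attached script. Because the sketch only sees the three Emacs ranges quoted in question 1, it also reports FFA0..FFDF as a gap; whether Emacs assigns that range elsewhere is not shown in this message.

```python
def uncovered(outer, inner):
    """Return the sub-ranges of `outer` that no range in `inner` covers.

    Ranges are inclusive (start, end) pairs of code points.
    `uncovered` is a hypothetical helper for illustration only.
    """
    lo, hi = outer
    gaps = []
    pos = lo  # first code point not yet known to be covered
    for start, end in sorted(inner):
        if end < pos or start > hi:
            continue  # already passed, or entirely beyond `outer`
        if start > pos:
            gaps.append((pos, start - 1))
        pos = max(pos, end + 1)
    if pos <= hi:
        gaps.append((pos, hi))
    return gaps


def show(gaps):
    """Format gaps in the START..END notation that Blocks.txt uses."""
    return [f"{a:X}..{b:X}" for a, b in gaps]


# Question 1: the Blocks.txt block versus the three Emacs ranges quoted above.
print(show(uncovered((0xFF00, 0xFFEF),
                     [(0xFF00, 0xFF5F), (0xFF61, 0xFF9F), (0xFFE0, 0xFFEF)])))
# prints ['FF60..FF60', 'FFA0..FFDF']

# Question 4: the Emacs ranges versus the Blocks.txt blocks quoted above.
cjk_blocks = [(0x20000, 0x2A6DF), (0x2A700, 0x2B73F),
              (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF)]
print(show(uncovered((0x20000, 0x2CEAF), cjk_blocks)))
# prints ['2A6E0..2A6FF']
print(show(uncovered((0x2F800, 0x2FFFF), [(0x2F800, 0x2FA1F)])))
# prints ['2FA20..2FFFF']
```

The same function answers both directions of the question: pass a Blocks.txt block as `outer` to see what Emacs leaves out, or pass an Emacs range as `outer` to see what it adds beyond Blocks.txt.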
[-- Attachment #2: blocks.awk --]
[-- Type: application/octet-stream, Size: 6859 bytes --]
#!/usr/bin/awk -f
## Copyright (C) 2015 Free Software Foundation, Inc.
## Author: Glenn Morris <rgm@gnu.org>
## This file is part of GNU Emacs.
## GNU Emacs is free software: you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## GNU Emacs is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
## You should have received a copy of the GNU General Public License
## along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>.
### Commentary:
## This script takes as input Unicode's Blocks.txt
## (http://www.unicode.org/Public/UNIDATA/Blocks.txt)
## and produces output for Emacs's lisp/international/charscript.el.
## It lumps together all the blocks belonging to the same language.
## E.g., "Basic Latin", "Latin-1 Supplement", "Latin Extended-A",
## etc. are all lumped together under "latin".
## The Unicode blocks actually extend past some of these ranges with
## undefined codepoints.
## For additional details, see <http://debbugs.gnu.org/20789#11>.
### Code:
BEGIN {
    ## Hard-coded names.  See name2alias for the rest.
    alias["ipa extensions"] = "phonetic"
    alias["letterlike symbols"] = "symbol"
    alias["number forms"] = "symbol"
    alias["miscellaneous technical"] = "symbol"
    alias["control pictures"] = "symbol"
    alias["optical character recognition"] = "symbol"
    alias["enclosed alphanumerics"] = "symbol"
    alias["box drawing"] = "symbol"
    alias["block elements"] = "symbol"
    alias["miscellaneous symbols"] = "symbol"
    alias["cjk strokes"] = "cjk-misc"
    alias["cjk symbols and punctuation"] = "cjk-misc"
    alias["halfwidth and fullwidth forms"] = "cjk-misc"
    alias["common indic number forms"] = "north-indic-number"

    tohex["a"] = 10
    tohex["b"] = 11
    tohex["c"] = 12
    tohex["d"] = 13
    tohex["e"] = 14
    tohex["f"] = 15

    fix_start["0080"] = "00A0"
    fix_end["2A6DF"] = "2A6FF"
    fix_end["2FA1F"] = "2FFFF"
}
## From admin/charsets/.
## With gawk's --non-decimal-data switch we wouldn't need this.
function decode_hex(str   , n, len, i, c) {
    n = 0
    len = length(str)
    for (i = 1; i <= len; i++)
    {
        c = substr (str, i, 1)
        if (c >= "0" && c <= "9")
            n = n * 16 + (c - "0")
        else
            n = n * 16 + tohex[tolower(c)]
    }
    return n
}
function name2alias(name   , w, w2) {
    name = tolower(name)
    if (alias[name]) return alias[name]
    else if (name ~ /for symbols/) return "symbol"
    else if (name ~ /latin|combining .* marks|spacing modifier|tone letters|alphabetic presentation/) return "latin"
    else if (name ~ /cjk|yijing|enclosed ideograph|kangxi/) return "han"
    else if (name ~ /arabic/) return "arabic"
    else if (name ~ /^greek/) return "greek"
    else if (name ~ /^coptic/) return "coptic"
    else if (name ~ /cuneiform number/) return "cuneiform-numbers-and-punctuation"
    else if (name ~ /cuneiform/) return "cuneiform"
    else if (name ~ /mathematical alphanumeric symbol/) return "mathematical"
    else if (name ~ /punctuation|mathematical|arrows|currency|superscript|small form variants|geometric|dingbats|enclosed|alchemical|pictograph|emoticon|transport/) return "symbol"
    else if (name ~ /canadian aboriginal/) return "canadian-aboriginal"
    else if (name ~ /katakana|hiragana/) return "kana"
    else if (name ~ /myanmar/) return "burmese"
    else if (name ~ /hangul/) return "hangul"
    else if (name ~ /khmer/) return "khmer"
    else if (name ~ /braille/) return "braille"
    else if (name ~ /^yi /) return "yi"
    else if (name ~ /surrogates|private use|variation selectors/) return 0
    else if (name ~ /^(specials|tags)$/) return 0
    else if (name ~ /linear b/) return "linear-b"
    else if (name ~ /aramaic/) return "aramaic"
    else if (name ~ /rumi num/) return "rumi-number"
    else if (name ~ /duployan|shorthand/) return "duployan-shorthand"
    else if (name ~ /sutton signwriting/) return "sutton-sign-writing"

    sub(/ (extended|extensions|supplement).*/, "", name)
    sub(/numbers/, "number", name)
    sub(/numerals/, "numeral", name)
    sub(/symbols/, "symbol", name)
    sub(/forms$/, "form", name)
    sub(/tiles$/, "tile", name)
    sub(/^new /, "", name)
    sub(/ (characters|hieroglyphs|cursive)$/, "", name)
    gsub(/ /, "-", name)

    return name
}
/^[0-9A-F]/ {
    sep = index($1, "..")
    len = length($1)
    s = substr($1,1,sep-1)
    e = substr($1,sep+2,len-sep-2)
    $1 = ""
    sub(/^ */, "", $0)
    i++
    start[i] = fix_start[s] ? fix_start[s] : s
    end[i] = fix_end[e] ? fix_end[e]: e
    name[i] = $0

    alt[i] = name2alias(name[i])

    if (!alt[i])
    {
        i--
        next
    }

    ## Combine adjacent ranges with the same name.
    if (alt[i] == alt[i-1] && decode_hex(start[i]) == 1 + decode_hex(end[i-1]))
    {
        end[i-1] = end[i]
        name[i-1] = (name[i-1] ", " name[i])
        i--
    }

    ## Some hard-coded splits.
    if (start[i] == "0370")
    {
        end[i] = "03E1"
        i++
        start[i] = "03E2"
        end[i] = "03EF"
        alt[i] = "coptic"
        i++
        start[i] = "03F0"
        end[i] = "03FF"
        alt[i] = "greek"
    }
    else if (start[i] == "FB00")
    {
        end[i] = "FB06"
        i++
        start[i] = "FB13"
        end[i] = "FB17"
        alt[i] = "armenian"
        i++
        start[i] = "FB1D"
        end[i] = "FB4F"
        alt[i] = "hebrew"
    }
    else if (start[i] == "FF00")
    {
        end[i] = "FF5F"
        i++
        start[i] = "FF61"
        end[i] = "FF9F"
        alt[i] = "kana"
        i++
        start[i] = "FFE0"
        end[i] = "FFEF"
        alt[i] = "cjk-misc"
    }
}
END {
    print ";;; charscript.el --- character script table -*- no-byte-compile: t -*-"
    print ";;; Automatically generated from admin/unidata/Blocks.txt"
    print "(let (script-list)"
    print " (dolist (elt '("

    for (j=1;j<=i;j++)
    {
        printf(" (#x%s #x%s %s)", start[j], end[j], alt[j])
        ## Fuzz to decide whether worth printing original name as a comment.
        if (name[j] && alt[j] != tolower(name[j]) && alt[j] !~ /-/)
            printf(" ; %s", name[j])
        printf("\n")
    }

    print " ))"
    print " (set-char-table-range char-script-table"
    print " (cons (car elt) (nth 1 elt)) (nth 2 elt))"
    print " (or (memq (nth 2 elt) script-list)"
    print " (setq script-list (cons (nth 2 elt) script-list))))"
    print " (set-char-table-extra-slot char-script-table 0 (nreverse script-list)))"
    print ""
    print "(provide 'charscript)"
}
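
The "combine adjacent ranges with the same name" step in the script above can be sketched in Python as follows. This is an illustration of the idea, not a port of the script: `merge_adjacent` is a hypothetical name, the input is assumed to be already mapped to script names, and the hard-coded fixes and splits are left out. The sample ranges are the Blocks.txt ranges for Latin Extended-A, Latin Extended-B, IPA Extensions, Cyrillic and Cyrillic Supplement.

```python
def merge_adjacent(ranges):
    """Merge consecutive (start, end, script) entries that carry the same
    script name when the second starts exactly one past the end of the first.

    `merge_adjacent` is a hypothetical helper for illustration only.
    """
    merged = []
    for start, end, script in ranges:
        prev = merged[-1] if merged else None
        if prev and prev[2] == script and start == prev[1] + 1:
            # Extend the previous entry instead of adding a new one.
            merged[-1] = (prev[0], end, script)
        else:
            merged.append((start, end, script))
    return merged


blocks = [
    (0x0100, 0x017F, "latin"),     # Latin Extended-A
    (0x0180, 0x024F, "latin"),     # Latin Extended-B
    (0x0250, 0x02AF, "phonetic"),  # IPA Extensions
    (0x0400, 0x04FF, "cyrillic"),  # Cyrillic
    (0x0500, 0x052F, "cyrillic"),  # Cyrillic Supplement
]

for start, end, script in merge_adjacent(blocks):
    print(f"(#x{start:04X} #x{end:04X} {script})")
# prints:
# (#x0100 #x024F latin)
# (#x0250 #x02AF phonetic)
# (#x0400 #x052F cyrillic)
```

Entries with the same script name but a gap between them stay separate, which matches the adjacency test in the awk rule (start equal to one plus the previous end).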