bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation

From: Glenn Morris <rgm@gnu.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 20789@debbugs.gnu.org
Subject: bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date: Mon, 15 Jun 2015 20:22:07 -0400
Message-ID: <ozy4jkh58w.fsf@fencepost.gnu.org>
In-Reply-To: <21zj45kiix.fsf@fencepost.gnu.org>

[-- Attachment #1: Type: text/plain, Size: 1281 bytes --]

Eli Zaretskii wrote:

>> I don't suppose that big list can be auto-generated from the inputs?
>
> It's not trivial.  I describe below some of the issues, in the hope
> that Someone™ will volunteer:

Thanks. Script that processes Blocks.txt attached. Some questions:

1. In Blocks.txt:

  FF00..FFEF; Halfwidth and Fullwidth Forms

In Emacs:

  (#xFF00 #xFF5F cjk-misc)
  (#xFF61 #xFF9F kana)
  (#xFFE0 #xFFEF cjk-misc)

Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?


2. In Emacs, "olt-italic" looks like a typo for "old-italic". Can it be renamed?


3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
In Emacs, it ends at 1457F. Typo?


4. In Blocks.txt:

  20000..2A6DF; CJK Unified Ideographs Extension B
  2A700..2B73F; CJK Unified Ideographs Extension C
  2B740..2B81F; CJK Unified Ideographs Extension D
  2B820..2CEAF; CJK Unified Ideographs Extension E
  2F800..2FA1F; CJK Compatibility Ideographs Supplement

In Emacs:

  (#x20000 #x2CEAF han)
  (#x2F800 #x2FFFF han)

Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
not cover. Intentional?


5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
(The case-insensitive source says "Sutton SignWriting".)
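
The extra ranges in question 4 can be reproduced mechanically. What follows is only an illustrative sketch and is not part of the attached script: `find_gaps` is a hypothetical shell helper wrapping POSIX awk, and it assumes the "START END" hex ranges on its input are sorted and non-overlapping.

```shell
# Hypothetical helper (not part of blocks.awk): print the parts of the
# hex range LO..HI that are not covered by the "START END" hex ranges
# read from stdin.  Input is assumed sorted and non-overlapping.
find_gaps() {
  awk -v lo="$1" -v hi="$2" '
    # Portable hex decoder, same idea as decode_hex in the attachment.
    function hex(s,   n, i) {
      n = 0
      s = toupper(s)
      for (i = 1; i <= length(s); i++)
        n = n * 16 + index("0123456789ABCDEF", substr(s, i, 1)) - 1
      return n
    }
    BEGIN { cursor = hex(lo); last = hex(hi) }
    {
      s = hex($1); e = hex($2)
      # Anything between the cursor and this range is uncovered.
      if (s > cursor) printf("%X %X\n", cursor, s - 1)
      if (e + 1 > cursor) cursor = e + 1
    }
    # Report a trailing gap up to HI, if there is one.
    END { if (cursor <= last) printf("%X %X\n", cursor, last) }
  '
}
```

Feeding it the four Extension B..E ranges with `find_gaps 20000 2CEAF` prints `2A6E0 2A6FF`, and feeding it `2F800 2FA1F` with `find_gaps 2F800 2FFFF` prints `2FA20 2FFFF`, which are the two ranges named in question 4.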



[-- Attachment #2: blocks.awk --]
[-- Type: application/octet-stream, Size: 6859 bytes --]

#!/usr/bin/awk -f

## Copyright (C) 2015 Free Software Foundation, Inc.

## Author: Glenn Morris <rgm@gnu.org>

## This file is part of GNU Emacs.

## GNU Emacs is free software: you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.

## GNU Emacs is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.

## You should have received a copy of the GNU General Public License
## along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.

### Commentary:

## This script takes as input Unicode's Blocks.txt
## (http://www.unicode.org/Public/UNIDATA/Blocks.txt)
## and produces output for Emacs's lisp/international/charscript.el.

## It lumps together all the blocks belonging to the same script.
## E.g., "Basic Latin", "Latin-1 Supplement", "Latin Extended-A",
## etc. are all lumped together under "latin".

## The Unicode blocks actually extend past some of these ranges with
## undefined codepoints.

## For additional details, see <http://debbugs.gnu.org/20789#11>.

### Code:

BEGIN {
    ## Hard-coded names.  See name2alias for the rest.
    alias["ipa extensions"] = "phonetic"
    alias["letterlike symbols"] = "symbol"
    alias["number forms"] = "symbol"
    alias["miscellaneous technical"] = "symbol"
    alias["control pictures"] = "symbol"
    alias["optical character recognition"] = "symbol"
    alias["enclosed alphanumerics"] = "symbol"
    alias["box drawing"] = "symbol"
    alias["block elements"] = "symbol"
    alias["miscellaneous symbols"] = "symbol"
    alias["cjk strokes"] = "cjk-misc"
    alias["cjk symbols and punctuation"] = "cjk-misc"
    alias["halfwidth and fullwidth forms"] = "cjk-misc"
    alias["common indic number forms"] = "north-indic-number"

    tohex["a"] = 10
    tohex["b"] = 11
    tohex["c"] = 12
    tohex["d"] = 13
    tohex["e"] = 14
    tohex["f"] = 15

    fix_start["0080"] = "00A0"
    fix_end["2A6DF"] = "2A6FF"
    fix_end["2FA1F"] = "2FFFF"
}

## From admin/charsets/.
## With gawk's --non-decimal-data switch we wouldn't need this.
function decode_hex(str   , n, len, i, c) {
  n = 0
  len = length(str)
  for (i = 1; i <= len; i++)
    {
      c = substr (str, i, 1)
      if (c >= "0" && c <= "9")
	n = n * 16 + (c - "0")
      else
	n = n * 16 + tohex[tolower(c)]
    }
  return n
}

function name2alias(name   , w, w2) {
    name = tolower(name)
    if (alias[name]) return alias[name]
    else if (name ~ /for symbols/) return "symbol"
    else if (name ~ /latin|combining .* marks|spacing modifier|tone letters|alphabetic presentation/) return "latin"
    else if (name ~ /cjk|yijing|enclosed ideograph|kangxi/) return "han"
    else if (name ~ /arabic/) return "arabic"
    else if (name ~ /^greek/) return "greek"
    else if (name ~ /^coptic/) return "coptic"
    else if (name ~ /cuneiform number/) return "cuneiform-numbers-and-punctuation"
    else if (name ~ /cuneiform/) return "cuneiform"
    else if (name ~ /mathematical alphanumeric symbol/) return "mathematical"
    else if (name ~ /punctuation|mathematical|arrows|currency|superscript|small form variants|geometric|dingbats|enclosed|alchemical|pictograph|emoticon|transport/) return "symbol"
    else if (name ~ /canadian aboriginal/) return "canadian-aboriginal"
    else if (name ~ /katakana|hiragana/) return "kana"
    else if (name ~ /myanmar/) return "burmese"
    else if (name ~ /hangul/) return "hangul"
    else if (name ~ /khmer/) return "khmer"
    else if (name ~ /braille/) return "braille"
    else if (name ~ /^yi /) return "yi"
    else if (name ~ /surrogates|private use|variation selectors/) return 0
    else if (name ~ /^(specials|tags)$/) return 0
    else if (name ~ /linear b/) return "linear-b"
    else if (name ~ /aramaic/) return "aramaic"
    else if (name ~ /rumi num/) return "rumi-number"
    else if (name ~ /duployan|shorthand/) return "duployan-shorthand"
    else if (name ~ /sutton signwriting/) return "sutton-sign-writing"

    sub(/ (extended|extensions|supplement).*/, "", name)
    sub(/numbers/, "number", name)
    sub(/numerals/, "numeral", name)
    sub(/symbols/, "symbol", name)
    sub(/forms$/, "form", name)
    sub(/tiles$/, "tile", name)
    sub(/^new /, "", name)
    sub(/ (characters|hieroglyphs|cursive)$/, "", name)
    gsub(/ /, "-", name)

    return name
}

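/^[0-9A-F]/ {
    ## $1 is "START..END;".  Split it at the ".." and drop the
    ## trailing semicolon from END.
    sep = index($1, "..")
    len = length($1)
    s = substr($1,1,sep-1)
    e = substr($1,sep+2,len-sep-2)
    ## Clear the range field so that $0 holds only the block name.
    $1 = ""
    sub(/^ */, "", $0)
    i++
    ## Apply the hard-coded range adjustments from BEGIN, if any.
    start[i] = fix_start[s] ? fix_start[s] : s
    end[i] = fix_end[e] ? fix_end[e] : e
    name[i] = $0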

    alt[i] = name2alias(name[i])

    ## Skip blocks with no script (surrogates, private use, etc.).
    if (!alt[i])
    {
        i--
        next
    }

    ## Combine adjacent ranges with the same name.
    if (alt[i] == alt[i-1] && decode_hex(start[i]) == 1 + decode_hex(end[i-1]))
    {
        end[i-1] = end[i]
        name[i-1] = (name[i-1] ", " name[i])
        i--
    }

    ## Some hard-coded splits.
    if (start[i] == "0370")
    {
        end[i] = "03E1"
        i++
        start[i] = "03E2"
        end[i] = "03EF"
        alt[i] = "coptic"
        i++
        start[i] = "03F0"
        end[i] = "03FF"
        alt[i] = "greek"
    }
    else if (start[i] == "FB00")
    {
        end[i] = "FB06"
        i++
        start[i] = "FB13"
        end[i] = "FB17"
        alt[i] = "armenian"
        i++
        start[i] = "FB1D"
        end[i] = "FB4F"
        alt[i] = "hebrew"
    }
    else if (start[i] == "FF00")
    {
        end[i] = "FF5F"
        i++
        start[i] = "FF61"
        end[i] = "FF9F"
        alt[i] = "kana"
        i++
        start[i] = "FFE0"
        end[i] = "FFEF"
        alt[i] = "cjk-misc"
    }
}

END {
    print ";;; charscript.el --- character script table -*- no-byte-compile: t -*-"
    print ";;; Automatically generated from admin/unidata/Blocks.txt"
    print "(let (script-list)"
    print "  (dolist (elt '("

    for (j=1;j<=i;j++)
    {
        printf("    (#x%s #x%s %s)", start[j], end[j], alt[j])
        ## Fuzz to decide whether worth printing original name as a comment.
        if (name[j] && alt[j] != tolower(name[j]) && alt[j] !~ /-/)
            printf(" ; %s", name[j])
        printf("\n")
    }

    print "    ))"
    print "    (set-char-table-range char-script-table"
    print "			  (cons (car elt) (nth 1 elt)) (nth 2 elt))"
    print "    (or (memq (nth 2 elt) script-list)"
    print "	(setq script-list (cons (nth 2 elt) script-list))))"
    print "  (set-char-table-extra-slot char-script-table 0 (nreverse script-list)))"
    print ""
    print "(provide 'charscript)"
}
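
The "Combine adjacent ranges with the same name" step above can be tried in isolation. The sketch below is a simplified stand-in, not the script's own code: `merge_ranges` is a hypothetical shell helper wrapping POSIX awk, it reads already-aliased "START END NAME" lines rather than Blocks.txt, and it leaves out the hard-coded fixes and splits.

```shell
# Hypothetical helper (not part of blocks.awk): merge consecutive
# "START END NAME" hex ranges when they share a name and the second
# starts exactly one code point after the first ends.
merge_ranges() {
  awk '
    # Portable hex decoder, same idea as decode_hex in the attachment.
    function hex(s,   n, i) {
      n = 0
      s = toupper(s)
      for (i = 1; i <= length(s); i++)
        n = n * 16 + index("0123456789ABCDEF", substr(s, i, 1)) - 1
      return n
    }
    {
      if (NR > 1 && $3 == name && hex($1) == hex(end) + 1) {
        end = $2 ""                # extend the pending range
      } else {
        if (NR > 1) print start, end, name
        # Appending "" keeps the fields as strings, so leading zeros survive.
        start = $1 ""; end = $2 ""; name = $3
      }
    }
    END { if (NR > 0) print start, end, name }
  '
}
```

With the input lines `0000 007F latin`, `00A0 00FF latin`, `0100 017F latin` and `0370 03FF greek`, the first range stays separate because 00A0 does not directly follow 007F (this is the effect of the `fix_start["0080"]` adjustment), while the second and third merge into `00A0 017F latin`.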

Thread overview: 11+ messages
2015-06-11 22:05 bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation Glenn Morris
2015-06-11 22:24 ` Glenn Morris
2015-06-12  8:28   ` Eli Zaretskii
2015-06-16  0:22     ` Glenn Morris [this message]
2015-06-16 14:41       ` Eli Zaretskii
2015-06-17  6:52         ` Glenn Morris
2015-06-17 16:27           ` Eli Zaretskii
2015-06-20 23:34             ` Glenn Morris
2015-06-21 15:00               ` Eli Zaretskii
2015-06-27  2:02                 ` Glenn Morris
2015-06-27  7:42                   ` Eli Zaretskii
