unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-11 22:05 UTC (permalink / raw)
  To: 20789

Package: emacs
Version: 25.0.50

Current master on x86_64 RHEL 7.1.

emacs -Q: All looks fine, but there is a *Warnings* buffer with contents:

  Error (initialization): Creation of the default fontsets failed: (error
  Invalid script or charset name: cuneiform-numbers-and-punctuation)

A second bug: the *Warnings* buffer is not shown at startup, *scratch* is.






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-11 22:24 UTC (permalink / raw)
  To: 20789

Glenn Morris wrote:

>   Error (initialization): Creation of the default fontsets failed: (error
>   Invalid script or charset name: cuneiform-numbers-and-punctuation)

I fixed a typo that seems to have caused that.

I don't suppose that big list can be auto-generated from the inputs?

> A second bug: the *Warnings* buffer is not shown at startup, *scratch* is.






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-12  8:28 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Date: Thu, 11 Jun 2015 18:24:06 -0400
> 
> Glenn Morris wrote:
> 
> >   Error (initialization): Creation of the default fontsets failed: (error
> >   Invalid script or charset name: cuneiform-numbers-and-punctuation)
> 
> I fixed a typo that seems to have caused that.

Sorry about that.

> I don't suppose that big list can be auto-generated from the inputs?

It's not trivial.  I describe below some of the issues, in the hope
that Someone™ will volunteer:

  . Most of the script names come from the corresponding Unicode
    blocks, with trivial transformations (downcase words and replace
    blanks with a hyphen).  So basically, we will need to use the
    information in Blocks.txt, a file that is part of the Unicode
    Character Database (UCD), but with quirks described below.

  . The first quirk is that we lump together all the blocks that
    belong to the same script, like "Basic Latin", "Latin Extended-A",
    "Latin-1 Supplement", etc. -- these all go to the single script
    called 'latin'.  Likewise with other similar blocks that are
    either "SOMETHING Extended" or "Supplement" or whatever.

  . The second quirk is with the CJK characters: those are divided
    into several broad scripts like 'han', 'kana', and 'cjk-misc'
    whose exact rules I don't know.

  . The third quirk is with the 'symbol' pseudo-script: we lump there
    all punctuation characters and all symbol characters (those for
    which the General Category is one of Pc, Pd, Ps, Pe, Pi, Pf, Po,
    Sm, Sc, Sk, So), but with the following notable exception:
    punctuation characters that belong to blocks that include
    non-punctuation characters are left in those blocks -- those are
    punctuation characters used only with the scripts named by those
    blocks, like U+05BE HEBREW PUNCTUATION MAQAF, which is only used
    by the Hebrew script.

  . Another quirk is that mathematical alphanumerics (which are just
    letters from the Unicode POV) are lumped into a separate script
    'mathematical'.
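
The plain part of the first rule above (downcase the block name and
replace blanks with hyphens) can be sketched in Python as follows.
This is only an illustration, not how Emacs builds its table: the
function names parse_block_line and block_to_script are hypothetical,
and none of the quirks listed above (lumping of related blocks, CJK,
'symbol', 'mathematical') are handled.

```python
def parse_block_line(line):
    """Parse one data line of Blocks.txt, e.g. '0590..05FF; Hebrew'."""
    codes, name = line.split(";", 1)
    first, last = codes.strip().split("..")
    return int(first, 16), int(last, 16), name.strip()


def block_to_script(block_name):
    """Downcase the block name and replace blanks with hyphens."""
    return block_name.lower().replace(" ", "-")


# Example: one Blocks.txt entry turned into a (start, end, script) triple.
start, end, name = parse_block_line("10300..1032F; Old Italic")
print(hex(start), hex(end), block_to_script(name))
```

Anything beyond this trivial mapping needs extra rules or a small
exceptions table, which is the non-trivial part.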

Alternatively, one could use Scripts.txt from the UCD, and then the
only problem is to subdivide what they call "Common" into the scripts
we use.

For the general category of a character, one can do in Emacs:

      (get-char-code-property CHAR 'general-category)

Alternatively, one can search UnicodeData.txt directly: the General
Category is the 3rd field there.
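
Outside Emacs, the same lookup can be sketched with Python's standard
unicodedata module, together with the "3rd field" reading of a
UnicodeData.txt line.  The names SYMBOL_CATEGORIES, is_symbol_candidate
and general_category_field are made up for this sketch, and the set
only covers the General Category test for the 'symbol' pseudo-script,
not the per-block exception described above.

```python
import unicodedata

# General Category values named in the 'symbol' rule above.
SYMBOL_CATEGORIES = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
                     "Sm", "Sc", "Sk", "So"}


def is_symbol_candidate(ch):
    """True if ch is punctuation or a symbol by General Category."""
    return unicodedata.category(ch) in SYMBOL_CATEGORIES


def general_category_field(line):
    """Return the 3rd semicolon-separated field of a UnicodeData.txt line."""
    return line.split(";")[2]


print(is_symbol_candidate("+"), is_symbol_candidate("a"))
print(general_category_field("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"))
```

A character such as U+05BE passes this test but would still stay in
its own script under the exception, so the category check alone is not
sufficient.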

Patches are welcome to do all of the above automatically, perhaps with
some small database that expresses the more tricky of the above rules.






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-16  0:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 20789


Eli Zaretskii wrote:

>> I don't suppose that big list can be auto-generated from the inputs?
>
> It's not trivial.  I describe below some of the issues, in the hope
> that Someone™ will volunteer:

Thanks. Script that processes Blocks.txt attached. Some questions:

1. In Blocks.txt:

  FF00..FFEF; Halfwidth and Fullwidth Forms

In Emacs:

  (#xFF00 #xFF5F cjk-misc)
  (#xFF61 #xFF9F kana)
  (#xFFE0 #xFFEF cjk-misc)

Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?


2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?


3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
In Emacs, it ends at 1457F. Typo?


4. In Blocks.txt:

  20000..2A6DF; CJK Unified Ideographs Extension B
  2A700..2B73F; CJK Unified Ideographs Extension C
  2B740..2B81F; CJK Unified Ideographs Extension D
  2B820..2CEAF; CJK Unified Ideographs Extension E
  2F800..2FA1F; CJK Compatibility Ideographs Supplement

In Emacs:

  (#x20000 #x2CEAF han)
  (#x2F800 #x2FFFF han)

Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
not cover. Intentional?


5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
(The case-insensitive source says "Sutton SignWriting".)
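
The kind of comparison behind question 4 can be sketched in Python: given
the Blocks.txt ranges, report which parts of an Emacs range no block
covers.  This is a hypothetical helper (the name uncovered is not from
Emacs or the attached script), shown only to make the check
reproducible.

```python
def uncovered(span, blocks):
    """Return the sub-ranges of span that no block in blocks covers."""
    lo, hi = span
    gaps = []
    cursor = lo
    for b_lo, b_hi in sorted(blocks):
        if b_hi < cursor or b_lo > hi:
            continue  # block lies entirely outside what is left of span
        if b_lo > cursor:
            gaps.append((cursor, b_lo - 1))
        cursor = max(cursor, b_hi + 1)
    if cursor <= hi:
        gaps.append((cursor, hi))
    return gaps


# Blocks.txt ranges quoted in question 4.
BLOCKS = [(0x20000, 0x2A6DF), (0x2A700, 0x2B73F), (0x2B740, 0x2B81F),
          (0x2B820, 0x2CEAF), (0x2F800, 0x2FA1F)]

for emacs_range in [(0x20000, 0x2CEAF), (0x2F800, 0x2FFFF)]:
    for gap_lo, gap_hi in uncovered(emacs_range, BLOCKS):
        print(f"{gap_lo:X}..{gap_hi:X}")
```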



[-- Attachment #2: blocks.awk --]
[-- Type: application/octet-stream, Size: 6859 bytes --]

#!/usr/bin/awk -f

## Copyright (C) 2015 Free Software Foundation, Inc.

## Author: Glenn Morris <rgm@gnu.org>

## This file is part of GNU Emacs.

## GNU Emacs is free software: you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.

## GNU Emacs is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.

## You should have received a copy of the GNU General Public License
## along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.

### Commentary:

## This script takes as input Unicode's Blocks.txt
## (http://www.unicode.org/Public/UNIDATA/Blocks.txt)
## and produces output for Emacs's lisp/international/charscript.el.

## It lumps together all the blocks belonging to the same script.
## E.g., "Basic Latin", "Latin-1 Supplement", "Latin Extended-A",
## etc. are all lumped together under "latin".

## The Unicode blocks actually extend past some of these ranges with
## undefined codepoints.

## For additional details, see <http://debbugs.gnu.org/20789#11>.

### Code:

BEGIN {
    ## Hard-coded names.  See name2alias for the rest.
    alias["ipa extensions"] = "phonetic"
    alias["letterlike symbols"] = "symbol"
    alias["number forms"] = "symbol"
    alias["miscellaneous technical"] = "symbol"
    alias["control pictures"] = "symbol"
    alias["optical character recognition"] = "symbol"
    alias["enclosed alphanumerics"] = "symbol"
    alias["box drawing"] = "symbol"
    alias["block elements"] = "symbol"
    alias["miscellaneous symbols"] = "symbol"
    alias["cjk strokes"] = "cjk-misc"
    alias["cjk symbols and punctuation"] = "cjk-misc"
    alias["halfwidth and fullwidth forms"] = "cjk-misc"
    alias["common indic number forms"] = "north-indic-number"

    tohex["a"] = 10
    tohex["b"] = 11
    tohex["c"] = 12
    tohex["d"] = 13
    tohex["e"] = 14
    tohex["f"] = 15

    fix_start["0080"] = "00A0"
    fix_end["2A6DF"] = "2A6FF"
    fix_end["2FA1F"] = "2FFFF"
}

## From admin/charsets/.
## With gawk's --non-decimal-data switch we wouldn't need this.
function decode_hex(str   , n, len, i, c) {
  n = 0
  len = length(str)
  for (i = 1; i <= len; i++)
    {
      c = substr (str, i, 1)
      if (c >= "0" && c <= "9")
	n = n * 16 + (c - "0")
      else
	n = n * 16 + tohex[tolower(c)]
    }
  return n
}

function name2alias(name   , w, w2) {
    name = tolower(name)
    if (alias[name]) return alias[name]
    else if (name ~ /for symbols/) return "symbol"
    else if (name ~ /latin|combining .* marks|spacing modifier|tone letters|alphabetic presentation/) return "latin"
    else if (name ~ /cjk|yijing|enclosed ideograph|kangxi/) return "han"
    else if (name ~ /arabic/) return "arabic"
    else if (name ~ /^greek/) return "greek"
    else if (name ~ /^coptic/) return "coptic"
    else if (name ~ /cuneiform number/) return "cuneiform-numbers-and-punctuation"
    else if (name ~ /cuneiform/) return "cuneiform"
    else if (name ~ /mathematical alphanumeric symbol/) return "mathematical"
    else if (name ~ /punctuation|mathematical|arrows|currency|superscript|small form variants|geometric|dingbats|enclosed|alchemical|pictograph|emoticon|transport/) return "symbol"
    else if (name ~ /canadian aboriginal/) return "canadian-aboriginal"
    else if (name ~ /katakana|hiragana/) return "kana"
    else if (name ~ /myanmar/) return "burmese"
    else if (name ~ /hangul/) return "hangul"
    else if (name ~ /khmer/) return "khmer"
    else if (name ~ /braille/) return "braille"
    else if (name ~ /^yi /) return "yi"
    else if (name ~ /surrogates|private use|variation selectors/) return 0
    else if (name ~/^(specials|tags)$/) return 0
    else if (name ~ /linear b/) return "linear-b"
    else if (name ~ /aramaic/) return "aramaic"
    else if (name ~ /rumi num/) return "rumi-number"
    else if (name ~ /duployan|shorthand/) return "duployan-shorthand"
    else if (name ~ /sutton signwriting/) return "sutton-sign-writing"

    sub(/ (extended|extensions|supplement).*/, "", name)
    sub(/numbers/, "number", name)
    sub(/numerals/, "numeral", name)
    sub(/symbols/, "symbol", name)
    sub(/forms$/, "form", name)
    sub(/tiles$/, "tile", name)
    sub(/^new /, "", name)
    sub(/ (characters|hieroglyphs|cursive)$/, "", name)
    gsub(/ /, "-", name)

    return name
}

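/^[0-9A-F]/ {
    ## $1 looks like "0000..007F;": split it into the start and end
    ## codepoints, dropping the trailing semicolon from the end.
    sep = index($1, "..")
    len = length($1)
    s = substr($1,1,sep-1)
    e = substr($1,sep+2,len-sep-2)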
    $1 = ""
    sub(/^ */, "", $0)
    i++
    start[i] = fix_start[s] ? fix_start[s] : s
    end[i] = fix_end[e] ? fix_end[e]: e
    name[i] = $0

    alt[i] = name2alias(name[i])

    if (!alt[i])
    {
        i--
        next
    }

    ## Combine adjacent ranges with the same name.
    if (alt[i] == alt[i-1] && decode_hex(start[i]) == 1 + decode_hex(end[i-1]))
    {
        end[i-1] = end[i]
        name[i-1] = (name[i-1] ", " name[i])
        i--
    }

    ## Some hard-coded splits.
    if (start[i] == "0370")
    {
        end[i] = "03E1"
        i++
        start[i] = "03E2"
        end[i] = "03EF"
        alt[i] = "coptic"
        i++
        start[i] = "03F0"
        end[i] = "03FF"
        alt[i] = "greek"
    }
    else if (start[i] == "FB00")
    {
        end[i] = "FB06"
        i++
        start[i] = "FB13"
        end[i] = "FB17"
        alt[i] = "armenian"
        i++
        start[i] = "FB1D"
        end[i] = "FB4F"
        alt[i] = "hebrew"
    }
    else if (start[i] == "FF00")
    {
        end[i] = "FF5F"
        i++
        start[i] = "FF61"
        end[i] = "FF9F"
        alt[i] = "kana"
        i++
        start[i] = "FFE0"
        end[i] = "FFEF"
        alt[i] = "cjk-misc"
    }
}

END {
    print ";;; charscript.el --- character script table -*- no-byte-compile: t -*-"
    print ";;; Automatically generated from admin/unidata/Blocks.txt"
    print "(let (script-list)"
    print "  (dolist (elt '("

    for (j=1;j<=i;j++)
    {
        printf("    (#x%s #x%s %s)", start[j], end[j], alt[j])
        ## Fuzz to decide whether worth printing original name as a comment.
        if (name[j] && alt[j] != tolower(name[j]) && alt[j] !~ /-/)
            printf(" ; %s", name[j])
        printf("\n")
    }

    print "    ))"
    print "    (set-char-table-range char-script-table"
    print "			  (cons (car elt) (nth 1 elt)) (nth 2 elt))"
    print "    (or (memq (nth 2 elt) script-list)"
    print "	(setq script-list (cons (nth 2 elt) script-list))))"
    print "  (set-char-table-extra-slot char-script-table 0 (nreverse script-list)))"
    print ""
    print "(provide 'charscript)"
}
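
For readers less familiar with awk, the "combine adjacent ranges with
the same name" step of the script above can be restated as a short
Python sketch.  It is a paraphrase under the assumption that the input
is already sorted by start codepoint; the name merge_adjacent is
hypothetical, and the hard-coded splits and name fixes of the awk
script are not reproduced.

```python
def merge_adjacent(ranges):
    """Merge consecutive (start, end, script) triples that touch
    and carry the same script name."""
    merged = []
    for start, end, script in ranges:
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == start:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], end, script)
        else:
            merged.append((start, end, script))
    return merged


sample = [(0x0000, 0x007F, "latin"),
          (0x00A0, 0x00FF, "latin"),   # starts at 00A0, as fix_start does
          (0x0100, 0x017F, "latin")]
for start, end, script in merge_adjacent(sample):
    print(f"(#x{start:04X} #x{end:04X} {script})")
```

Because the second range starts at 00A0 rather than 0080, it is not
merged with the first; the second and third touch and collapse into one.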


* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-16 14:41 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: 20789@debbugs.gnu.org
> Date: Mon, 15 Jun 2015 20:22:07 -0400
> 
> Eli Zaretskii wrote:
> 
> >> I don't suppose that big list can be auto-generated from the inputs?
> >
> > It's not trivial.  I describe below some of the issues, in the hope
> > that Someone™ will volunteer:
> 
> Thanks. Script that processes Blocks.txt attached. Some questions:
> 
> 1. In Blocks.txt:
> 
>   FF00..FFEF; Halfwidth and Fullwidth Forms
> 
> In Emacs:
> 
>   (#xFF00 #xFF5F cjk-misc)
>   (#xFF61 #xFF9F kana)
>   (#xFFE0 #xFFEF cjk-misc)
> 
> Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?

AFAICT, there's a small mess around there.  Based on the names of the
pertinent characters, I think we should have this instead of the above
3 ranges:

  (#xFF00 #xFF60 cjk-misc)
  (#xFF61 #xFF9F kana)
  (#xFFA0 #xFFDF hangul)
  (#xFFE0 #xFFEF cjk-misc)
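
The character names these boundaries are based on can be listed with
Python's unicodedata module.  This is just a way to inspect the names,
not a rule for assigning scripts; the helper name assigned_names is
made up for the sketch.

```python
import unicodedata


def assigned_names(lo, hi):
    """Map each assigned codepoint in lo..hi to its character name."""
    names = {}
    for cp in range(lo, hi + 1):
        name = unicodedata.name(chr(cp), None)  # None for unassigned
        if name is not None:
            names[cp] = name
    return names


# Show the names around the FF60/FF61 boundary.
for cp, name in assigned_names(0xFF5F, 0xFF61).items():
    print(f"U+{cp:04X} {name}")
```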

> 2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?

Yes, please.

> 3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
> In Emacs, it ends at 1457F. Typo?

Yes.

> 4. In Blocks.txt:
> 
>   20000..2A6DF; CJK Unified Ideographs Extension B
>   2A700..2B73F; CJK Unified Ideographs Extension C
>   2B740..2B81F; CJK Unified Ideographs Extension D
>   2B820..2CEAF; CJK Unified Ideographs Extension E
>   2F800..2FA1F; CJK Compatibility Ideographs Supplement
> 
> In Emacs:
> 
>   (#x20000 #x2CEAF han)
>   (#x2F800 #x2FFFF han)
> 
> Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
> not cover. Intentional?

I don't know, but probably not intentional.  I think we had better
make it consistent with the UCD.

> 5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
> (The case-insensitive source says "Sutton SignWriting".)

Well, "signwriting" is not a word, AFAIK, it's 2 words (and the funny
camel-case seems to agree with me).  AFAIU, they used "SignWriting"
because it's the commercial name.  But if you insist, I won't...

Thank you for doing this.

P.S. Does the script work with mawk?  (Some systems have it as their
default Awk, I think.)






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-17  6:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 20789

Eli Zaretskii wrote:

> Well, "signwriting" is not a word, AFAIK, it's 2 words [...]

It's a word (in the OED), but in the sense of painting commercial signs.
I don't really care, it's just that ~ 50% of the script is transforming
the Unicode names to the (seemingly randomly chosen) Emacs names.
If the latter were more straightforwardly derived from the former,
things would be simpler. But one more special rule makes no difference.

> P.S. Does the script work with mawk?

Yes, and with Sun OS 5.10's /usr/xpg4/bin/awk (but not /usr/bin/awk).
I don't believe it uses any more features than admin/charsets/*.awk.


Is there anything else in international/ that could benefit from being
auto-generated?






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-17 16:27 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: 20789@debbugs.gnu.org
> Date: Wed, 17 Jun 2015 02:52:48 -0400
> 
> Is there anything else in international/ that could benefit from being
> auto-generated?

Some.  Things I've spotted:

  . characters.el:

    . The modify-category-entry calls -- they basically can be derived
      from Blocks.txt

    . The modify-syntax-entry and set-case-syntax calls can be derived
      from the values of the 'general-category' property returned by
      'get-char-code-property', perhaps augmented by 'paired-bracket'
      and 'paired-type' properties

    . The set-case-syntax-pair calls (perhaps use the data in
      CaseFolding.txt, or even the case mapping information in
      UnicodeData.txt)

    . The setup of char-width-table -- I think the information is in
      EastAsianWidth.txt, with background information described in
      UAX#11 (http://www.unicode.org/reports/tr11/)

    . The setup of char-acronym-table: at least some of the data is in
      NameAliases.txt and NameList.txt

  . fontset.el:

    . The setup of script-representative-chars

  . mule-cmds.el:

    . The setting of locale-language-names -- the data is available in
      IANA's Language Subtag Registry
      (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
      and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
      http://www.loc.gov/standards/iso639-2/php/English_list.php)
      
TIA

P.S. It would be good to add somewhere (admin/make-tarball.txt?) a
reminder to fetch all those reference files and regenerate their
dependencies, before we prepare a release.






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-20 23:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 20789


I spent some time looking at some of these.
In no case could I see a clear path from the inputs to the outputs.

Eli Zaretskii wrote:

>   . characters.el:
>
>     . The modify-category-entry calls -- they basically can be derived
>       from Blocks.txt

I looked at it briefly. I can see that they are somewhat related, but
not precisely how. Eg:

Emacs: 2E80:312F and 3190:33FF are "line breakable".
Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.

Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.

I didn't look any further.

>     . The modify-syntax-entry and set-case-syntax calls can be derived
>       from the values of the 'general-category' property returned by
>       'get-char-code-property', perhaps augmented by 'paired-bracket'
>       and 'paired-type' properties

I didn't look at this yet.

>     . The set-case-syntax-pair calls (perhaps use the data in
>       CaseFolding.txt, or even the case mapping information in
>       UnicodeData.txt)

I didn't look at this yet.

>     . The setup of char-width-table -- I think the information is in
>       EastAsianWidth.txt, with background information described in
>       UAX#11 (http://www.unicode.org/reports/tr11/)

Looks somewhat promising, but could you be more specific?
There's nothing in that file that defines "zero width" characters, so I
don't see where Emacs's width 0 characters come from.

The width 2 characters look like they might be the "W" and "F" characters,
but just doing that gives a list that has many differences to the list
Emacs uses.

>     . The setup of char-acronym-table: at least some of the data is in
>       NameAliases.txt and NameList.txt

Looks somewhat promising.
I can see how most of this comes from NameAliases.txt.
But there are many oddities:

Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
or EOL)?
0019 is EOM in the source but EM in Emacs.

0080 is PAD in the source but XXX in Emacs.
0081 is HOP in the source but XXX in Emacs.
008F is SS3 in the source but SS1 in Emacs.
0099 is SGC in the source but XXX in Emacs.

How does Emacs choose which entries to list? There are many more in the
source. Could it do any harm to add more?

Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.
Why does Emacs list two Khmer entries, and nothing else? There are loads
more of them.

>   . fontset.el:
>
>     . The setup of script-representative-chars

I don't see how. It seems to be "for some of, but not all, the entries
in char-script-table, choose a single character somewhere in the range."
There seems to be no pattern to how the character is chosen within the
range. Often the first one, but by no means always.

>   . mule-cmds.el:
>
>     . The setting of locale-language-names -- the data is available in
>       IANA's Language Subtag Registry
>       (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
>       and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
>       http://www.loc.gov/standards/iso639-2/php/English_list.php)

Again, I don't see how. Eg nowhere in those source files do I see Welsh
associated with iso-8859-14, and the comment in mule-cmds says that the
last part is "implementation dependent".

> P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> reminder to fetch all those reference files and regenerate their
> dependencies, before we prepare a release.

admin/FOR-RELEASE contains that kind of thing.






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-21 15:00 UTC (permalink / raw)
  To: Glenn Morris, Kenichi Handa; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: 20789@debbugs.gnu.org
> Date: Sat, 20 Jun 2015 19:34:01 -0400
> 
> I spent some time looking at some of these.
> In no case could I see a clear path from the inputs to the outputs.

Thanks for looking into this.

Let me first make a general comment: we can always convert only
certain parts of the setup to an automated procedure, and leave the
rest in its present form, more or less.  That's especially true where
Emacs has specialized needs or defines properties not in Unicode.

> >   . characters.el:
> >
> >     . The modify-category-entry calls -- they basically can be derived
> >       from Blocks.txt
> 
> I looked at it briefly. I can see that they are somewhat related, but
> not precisely how. Eg:
> 
> Emacs: 2E80:312F and 3190:33FF are "line breakable".
> Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.
> 
> Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
> Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.
> 
> I didn't look any further.

When I said "derived from Blocks.txt", I meant the categories that are
related to script names, like ASCII, Latin, Arabic, Chinese, etc.
Sorry for not saying that explicitly.

Other categories need other sources.  Here's my attempt to decipher
some of them:

 . ?| -- "line breakable"

   The data seems to be in LineBreak.txt, described in detail in
   UAX#14 (http://unicode.org/reports/tr14/).  It looks like
   characters with the ?| category are those whose line-break
   properties are ID or CJ or NS.  Therefore, the exclusion of Hangul
   Compatibility Jamo is a mistake (or maybe an omission, since the
   comment says "Chinese"); in particular, UAX#14 explicitly says, in
   section 5.1 under "ID", that the characters in the range 3130..318F
   are treated as class ID.

   This category is currently used only by kinsoku.el, which has its
   own data (and sets the ?< and ?> categories).  So this will only
   become important if we ever implement in Emacs something more
   general, like the algorithm described in UAX#14.

 . "2-byte han" -- I think this is related to their legacy encoding; I
   don't see this used anywhere.  Likewise with other 2-byte
   categories.  Perhaps Handa-san (CC'ed) could comment on their
   necessity.  If this is still needed, we should probably leave these
   alone.

 . ?0 - ?9 -- I don't see how to get this data from the UCD or any
   other source.  Some of it seems to be in IndicSyllabicCategory.txt,
   FWIW.

 . ?R and ?L -- already set up using the Unicode data, so no change is
   needed.

 . ?^ -- should be set for any character whose general-category is
   Mn.  Since we already do this, the manual setting around line 820
   is redundant and should be deleted.

 . ?. -- already set using Unicode data, no change needed.

> >     . The setup of char-width-table -- I think the information is in
> >       EastAsianWidth.txt, with background information described in
> >       UAX#11 (http://www.unicode.org/reports/tr11/)
> 
> Looks somewhat promising, but could you be more specific?
> There's nothing in that file that defines "zero width" characters, so I
> don't see where Emacs's width 0 characters come from.

The following rules regarding zero-width characters are due to Markus
Kuhn, and are excerpted from the description in comments to his
implementation of 'wcwidth' (http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c):

 . The null character (U+0000) has a column width of 0.
 . Non-spacing and enclosing combining characters (general category
   code Mn or Me in the Unicode database) have a column width of 0. 
 . ZERO WIDTH SPACE (U+200B) and format characters (general category
   code Cf in the Unicode database), except SOFT HYPHEN (U+00AD), have
   a column width of 0.
 . Hangul Jamo medial vowels and final consonants (U+1160-U+11FF) have
   a column width of 0.
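
These four rules translate almost directly into code.  A minimal
Python sketch using the standard unicodedata module follows; it is an
illustration of the rules as quoted, not Emacs's or Markus Kuhn's
actual implementation, and the function name is_zero_width is
hypothetical.

```python
import unicodedata


def is_zero_width(cp):
    """Apply the four zero-width rules quoted above to codepoint cp."""
    if cp == 0x0000:                      # the null character
        return True
    if cp == 0x00AD:                      # SOFT HYPHEN is the exception
        return False
    if cp == 0x200B:                      # ZERO WIDTH SPACE
        return True
    if unicodedata.category(chr(cp)) in ("Mn", "Me", "Cf"):
        return True
    if 0x1160 <= cp <= 0x11FF:            # Hangul Jamo medials and finals
        return True
    return False


for cp in (0x0041, 0x0301, 0x00AD, 0x1160, 0x200B):
    print(f"U+{cp:04X} zero-width: {is_zero_width(cp)}")
```

The General Category data comes from whatever Unicode version the
Python build ships, so results for recently added characters may
differ from those of a given UCD release.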

> The width 2 characters look like they might be the "W" and "F" characters,

Yes.

> but just doing that gives a list that has many differences to the list
> Emacs uses.

I don't see any significant differences, except perhaps in unassigned
codepoints (see paragraph 6.1 of UAX#11 for the treatment of
unassigned CJK codepoints).  I think any differences beyond that
should be treated as errors in Emacs in this case.

> >     . The setup of char-acronym-table: at least some of the data is in
> >       NameAliases.txt and NameList.txt
> 
> Looks somewhat promising.
> I can see how most of this comes from NameAliases.txt.
> But there are many oddities:
> 
> Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
> or EOL)?

This table is set for the 'acronym' method of glyphless-char-display,
so I guess these were omitted because no one envisioned them ever
being displayed as glyphless.  I'd include them in the table anyway,
just in case, and also to keep our exceptions vs the UCD to the bare
minimum.

> 0019 is EOM in the source but EM in Emacs.

Typo, I think.

> 0080 is PAD in the source but XXX in Emacs.
> 0081 is HOP in the source but XXX in Emacs.
> 008F is SS3 in the source but SS1 in Emacs.
> 0099 is SGC in the source but XXX in Emacs.

I think these are typos and perhaps acronyms that whoever wrote this
didn't know.

> How does Emacs choose which entries to list? There are many more in the
> source. Could it do any harm to add more?

As long as you take only "abbreviations", i.e. they are short, I think
we should use all of them, given their use in Emacs.

> Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.

AFAIK, that's the official name of that character.  At least that's
what I glean from Google; I know nothing about the Khmer script.

> Why does Emacs list two Khmer entries, and nothing else? There are loads
> more of them.

These are the only 2 that have such abbreviations; see
https://en.wikipedia.org/wiki/Khmer_alphabet (assuming by "loads more"
you meant the Khmer letters).

> >   . fontset.el:
> >
> >     . The setup of script-representative-chars
> 
> I don't see how. It seems to be "for some of, but not all, the entries
> in char-script-table, choose a single character somewhere in the range."

We should have a representative character for each entry in
char-script-table.  They are used with some font back-ends (xfont and
x?ftfont, AFAIR) to probe candidate fonts for coverage of the required
script, so we should have the full information about that.  I think
the only reason for the partial information we have now is that it is
maintained manually, so it includes whatever the people who worked on
that bothered to add.

> There seems to be no pattern to how the character is chosen within the
> range. Often the first one, but by no means always.

I think the rule is to choose the first one that is a letter, i.e. its
general-category is one of Lu, Ll, Lt, Lo, or Lm.
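
If that guess is right, a representative character could be picked
mechanically.  A minimal Python sketch of the "first letter in the
range" rule, with the hypothetical helper name first_letter; it is not
known to reproduce the current contents of script-representative-chars.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Lm"}


def first_letter(lo, hi):
    """Return the first codepoint in lo..hi whose General Category
    is a letter category, or None if the range has no letters."""
    for cp in range(lo, hi + 1):
        if unicodedata.category(chr(cp)) in LETTER_CATEGORIES:
            return cp
    return None


# Hebrew block: the combining marks and punctuation at the start
# of the block are skipped.
print(f"U+{first_letter(0x0590, 0x05FF):04X}")
```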

> >   . mule-cmds.el:
> >
> >     . The setting of locale-language-names -- the data is available in
> >       IANA's Language Subtag Registry
> >       
> > (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
> >       and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
> >       http://www.loc.gov/standards/iso639-2/php/English_list.php)
> 
> Again, I don't see how. Eg nowhere in those source files do I see Welsh
> associated with iso-8859-14, and the comment in mule-cmds says that the
> last part is "implementation dependent".

The bulk of the data is the correspondence between the ISO 639
2-letter names and the country/culture name.  The few cases where we
also have the encoding could be set up with a very small database once
the main data is set, by adding the encoding to those few that need
it.

If by "last part" you mean IPA and "Nonstandard or obsolete language
codes", then these are very few and can be added manually.

> > P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> > reminder to fetch all those reference files and regenerate their
> > dependencies, before we prepare a release.
> 
> admin/FOR-RELEASE contains that kind of thing.

Right, I will add the information there.

Thanks.






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Glenn Morris @ 2015-06-27  2:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 20789

Eli Zaretskii wrote:

>> The width 2 characters look like they might be the "W" and "F" characters,
>
> Yes.
>
>> but just doing that gives a list that has many differences to the list
>> Emacs uses.

This is the list of "F" and "W" characters, for comparison with the 11
ranges that Emacs uses:

(#x1100 . #x115F)
(#x2329 . #x232A)
(#x2E80 . #x2E99)
(#x2E9B . #x2EF3)
(#x2F00 . #x2FD5)
(#x2FF0 . #x2FFB)
(#x3000 . #x303E)
(#x3041 . #x3096)
(#x3099 . #x30FF)
(#x3105 . #x312D)
(#x3131 . #x318E)
(#x3190 . #x31BA)
(#x31C0 . #x31E3)
(#x31F0 . #x321E)
(#x3220 . #x3247)
(#x3250 . #x32FE)
(#x3300 . #x4DBF)
(#x4E00 . #xA48C)
(#xA490 . #xA4C6)
(#xA960 . #xA97C)
(#xAC00 . #xD7A3)
(#xF900 . #xFAFF)
(#xFE10 . #xFE19)
(#xFE30 . #xFE52)
(#xFE54 . #xFE66)
(#xFE68 . #xFE6B)
(#xFF01 . #xFF60)
(#xFFE0 . #xFFE6)
(#x1B000 . #x1B001)
(#x1F200 . #x1F202)
(#x1F210 . #x1F23A)
(#x1F240 . #x1F248)
(#x1F250 . #x1F251)
(#x20000 . #x2FFFD)
(#x30000 . #x3FFFD)
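
A list of this shape can be produced by keeping the "W" and "F" entries
of EastAsianWidth.txt and merging ranges that touch.  The Python sketch
below shows the idea on a few sample lines written in that file's
"code;property" format; it is not the procedure actually used to make
the list above, it assumes the input is sorted by codepoint, and the
name wide_ranges is hypothetical.

```python
def wide_ranges(lines):
    """Collect merged (lo, hi) ranges whose width property is W or F."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments
        if not data:
            continue
        codes, prop = (field.strip() for field in data.split(";"))
        if prop not in ("W", "F"):
            continue
        first, _, last = codes.partition("..")
        lo = int(first, 16)
        hi = int(last or first, 16)            # single codepoint entry
        if ranges and ranges[-1][1] + 1 == lo:
            ranges[-1] = (ranges[-1][0], hi)   # touches previous range
        else:
            ranges.append((lo, hi))
    return ranges


SAMPLE = ["# illustrative excerpt", "2329;W", "232A;W",
          "232B..237A;N", "3000;F", "3001..3003;W"]
for lo, hi in wide_ranges(SAMPLE):
    print(f"(#x{lo:X} . #x{hi:X})")
```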

> I don't see any significant differences, except perhaps in unassigned
> codepoints (see paragraph 6.1 of UAX#11 for the treatment of
> unassigned CJK codepoints).

I don't know if this means that the above needs modifying?






* bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
From: Eli Zaretskii @ 2015-06-27  7:42 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 20789

> From: Glenn Morris <rgm@gnu.org>
> Cc: Kenichi Handa <handa@gnu.org>,  20789@debbugs.gnu.org
> Date: Fri, 26 Jun 2015 22:02:36 -0400
> 
> Eli Zaretskii wrote:
> 
> >> The width 2 characters look like they might be the "W" and "F" characters,
> >
> > Yes.
> >
> >> but just doing that gives a list that has many differences to the list
> >> Emacs uses.
> 
> This is list of "F" and "W" characters, compared to the 11 ranges that
> Emacs uses:

Looks good to me.  The 11 ranges we have now are either identical or
more coarse than the list derived from the UCD that you show.

> > I don't see any significant differences, except perhaps in unassigned
> > codepoints (see paragraph 6.1 of UAX#11 for the treatment of
> > unassigned CJK codepoints).
> 
> I don't know if this means that the above needs modifying?

I was saying that we need to augment the list with the 5 ranges of
unassigned codepoints that belong to the CJK blocks and planes, as
described in that section of UAX#11.  An unassigned codepoint has its
'general-category' property set to 'Cn', and the list of the 5 ranges
could be in some defconst, because it will probably never change.

Thanks.







Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
