From: Glenn Morris <rgm@gnu.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 20789@debbugs.gnu.org
Subject: bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date: Sat, 20 Jun 2015 19:34:01 -0400
Message-ID: <6pp4qlzti.fsf@fencepost.gnu.org>
In-Reply-To: <21zj45kiix.fsf@fencepost.gnu.org>


I spent some time looking at some of these.
In no case could I see a clear path from the inputs to the outputs.

Eli Zaretskii wrote:

>   . characters.el:
>
>     . The modify-category-entry calls -- they basically can be derived
>       from Blocks.txt

I looked at it briefly. I can see that they are somewhat related, but
not precisely how. E.g.:

Emacs: 2E80:312F and 3190:33FF are "line breakable".
Which means that "Hangul Compatibility Jamo" (3130..318F) isn't. I have
no idea why.

Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
Which means that "Yijing Hexagram Symbols" (4DC0..4DFF) aren't. Again, I
have no idea why.

I didn't look any further.
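
One way to make the mismatch concrete is to check each Blocks.txt block
against the Emacs ranges quoted above. The Python sketch below does that
for a small hand-copied excerpt of Blocks.txt. The excerpt, the helper
names (parse_blocks, coverage) and the "full / partial / none" test are
assumptions made for illustration; this is not how characters.el is
actually produced.

```python
import re

# Hand-copied excerpt in Blocks.txt format ("START..END; Block Name").
# A real comparison would read the complete file from the Unicode
# Character Database instead.
BLOCKS_EXCERPT = """\
2E80..2EFF; CJK Radicals Supplement
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
"""

# Emacs ranges as quoted in the message (inclusive code points).
LINE_BREAKABLE = [(0x2E80, 0x312F), (0x3190, 0x33FF)]
TWO_BYTE_HAN = [(0x3400, 0x4DBF), (0x4E00, 0x9FAF)]


def parse_blocks(text):
    """Return a list of (start, end, name) tuples from Blocks.txt text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+);\s*(.+)", line)
        if match:
            start, end, name = match.groups()
            blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks


def coverage(block, ranges):
    """Classify a block against a list of ranges.

    "full" means a single range contains the whole block, "partial"
    means some range overlaps it, "none" means no overlap at all.
    """
    start, end, _name = block
    if any(lo <= start and end <= hi for lo, hi in ranges):
        return "full"
    if any(start <= hi and lo <= end for lo, hi in ranges):
        return "partial"
    return "none"


for block in parse_blocks(BLOCKS_EXCERPT):
    print("%04X..%04X %-40s line-breakable=%-8s 2-byte-han=%s" % (
        block[0], block[1], block[2],
        coverage(block, LINE_BREAKABLE),
        coverage(block, TWO_BYTE_HAN)))
```

Run over the full Blocks.txt, any block reported as "partial" or "none"
where one would expect "full" is a candidate for the kind of unexplained
gap described above.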

>     . The modify-syntax-entry and set-case-syntax calls can be derived
>       from the values of the 'general-category' property returned by
>       'get-char-code-property', perhaps augmented by 'paired-bracket'
>       and 'paired-type' properties

I didn't look at this yet.

>     . The set-case-syntax-pair calls (perhaps use the data in
>       CaseFolding.txt, or even the case mapping information in
>       UnicodeData.txt)

I didn't look at this yet.
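
For reference, the CaseFolding.txt suggestion could start from something
like the Python sketch below, which keeps only the one-to-one mappings
(status C and S) and skips full (F) and Turkic (T) foldings. The excerpt
is hand-copied, simple_case_pairs is a hypothetical helper, and treating
a folding as an upper/lower pair is itself an assumption: case folding
is not defined to be identical to lowercasing.

```python
# Hand-copied excerpt in CaseFolding.txt format:
#   <code>; <status>; <mapping>; # <name>
CASE_FOLDING_EXCERPT = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def simple_case_pairs(text):
    """Return (code, folded) pairs for single-character foldings only."""
    pairs = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status in ("C", "S"):  # common and simple foldings
            pairs.append((int(code, 16), int(mapping, 16)))
    return pairs


for code, folded in simple_case_pairs(CASE_FOLDING_EXCERPT):
    print("%04X -> %04X" % (code, folded))
```

Whether pairs produced this way match the existing set-case-syntax-pair
calls in characters.el has not been checked here.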

>     . The setup of char-width-table -- I think the information is in
>       EastAsianWidth.txt, with background information described in
>       UAX#11 (http://www.unicode.org/reports/tr11/)

Looks somewhat promising, but could you be more specific?
There's nothing in that file that defines "zero width" characters, so I
don't see where Emacs's width 0 characters come from.

The width 2 characters look like they might be the "W" and "F" characters,
but just selecting those gives a list that differs in many places from
the list Emacs uses.
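
The "just doing that" experiment can be sketched roughly as follows in
Python: collect every range whose East Asian Width property is W or F
and merge adjacent ranges. The excerpt is hand-copied in the
EastAsianWidth.txt layout and wide_ranges is a hypothetical helper. The
Emacs list to compare against would have to be taken from characters.el
and is not reproduced here.

```python
# Hand-copied excerpt in EastAsianWidth.txt format; not the full file.
EAW_EXCERPT = """\
0041..005A;Na # Lu [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
1100..115F;W  # Lo [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER
3000;F        # Zs      IDEOGRAPHIC SPACE
FF01..FF03;F  # Po  [3] FULLWIDTH EXCLAMATION MARK..FULLWIDTH NUMBER SIGN
"""


def wide_ranges(text, classes=("W", "F")):
    """Return merged, sorted (lo, hi) ranges whose width class is in CLASSES."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        span, prop = [f.strip() for f in data.split(";")[:2]]
        if prop not in classes:
            continue
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point line
        ranges.append((lo, hi))

    # Merge ranges that touch or overlap so the result is easy to diff.
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


for lo, hi in wide_ranges(EAW_EXCERPT):
    print("%04X..%04X" % (lo, hi))
```

Nothing in this approach yields width 0 entries, which is consistent
with the observation that the file does not define them.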

>     . The setup of char-acronym-table: at least some of the data is in
>       NameAliases.txt and NameList.txt

Looks somewhat promising.
I can see how most of this comes from NameAliases.txt.
But there are many oddities:

Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
or EOL)?
0019 is EOM in the source but EM in Emacs.

0080 is PAD in the source but XXX in Emacs.
0081 is HOP in the source but XXX in Emacs.
008F is SS3 in the source but SS1 in Emacs.
0099 is SGC in the source but XXX in Emacs.

How does Emacs choose which entries to list? There are many more in the
source. Could it do any harm to add more?

Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.
Why does Emacs list two Khmer entries, and nothing else? There are loads
more of them.
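
A comparison like the one above can be sketched in Python by pulling the
"abbreviation" entries out of NameAliases.txt and checking them against
a table of Emacs acronyms. The excerpt below is hand-copied, the Emacs
values are only the handful quoted in this message, and the helper names
(parse_abbreviations, mismatches) are made up for illustration.

```python
# Hand-copied excerpt in NameAliases.txt format:
#   <code>;<alias>;<type>
NAME_ALIASES_EXCERPT = """\
0009;CHARACTER TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
0019;END OF MEDIUM;control
0019;EOM;abbreviation
0080;PAD;abbreviation
0081;HOP;abbreviation
008F;SS3;abbreviation
0099;SGC;abbreviation
"""

# Emacs acronyms as reported in the message; 0009 has no entry.
EMACS_ACRONYMS = {
    0x19: "EM",
    0x80: "XXX",
    0x81: "XXX",
    0x8F: "SS1",
    0x99: "XXX",
}


def parse_abbreviations(text):
    """Map code point -> list of aliases whose type is 'abbreviation'."""
    result = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        code, alias, kind = [f.strip() for f in data.split(";")[:3]]
        if kind == "abbreviation":
            result.setdefault(int(code, 16), []).append(alias)
    return result


def mismatches(source, emacs):
    """Return (code, source_aliases, emacs_value) where Emacs disagrees.

    emacs_value is None when Emacs has no entry for that code point.
    """
    out = []
    for code in sorted(source):
        value = emacs.get(code)
        if value not in source[code]:
            out.append((code, source[code], value))
    return out


for code, aliases, value in mismatches(
        parse_abbreviations(NAME_ALIASES_EXCERPT), EMACS_ACRONYMS):
    print("%04X: source=%s emacs=%s" % (code, "/".join(aliases), value))
```

This only finds entries where the source has an abbreviation that Emacs
lacks or spells differently. Emacs-only entries such as "KIVAQ" would
need the reverse check.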

>   . fontset.el:
>
>     . The setup of script-representative-chars

I don't see how. It seems to be "for some of, but not all, the entries
in char-script-table, choose a single character somewhere in the range."
There seems to be no pattern to how the character is chosen within the
range. Often the first one, but by no means always.

>   . mule-cmds.el:
>
>     . The setting of locale-language-names -- the data is available in
>       IANA's Language Subtag Registry
>       (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
>       and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
>       http://www.loc.gov/standards/iso639-2/php/English_list.php)

Again, I don't see how. E.g., nowhere in those source files do I see Welsh
associated with iso-8859-14, and the comment in mule-cmds.el says that the
last part is "implementation dependent".

> P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> reminder to fetch all those reference files and regenerate their
> dependencies, before we prepare a release.

admin/FOR-RELEASE contains that kind of thing.




