Sounds good to me.

On Thu, Dec 20, 2018 at 1:58 PM Eli Zaretskii wrote:

> Ping!  Could someone on the Harfbuzz team please comment on the
> thoughts below?  Khaled, Mohammad, Behdad?
>
> > Date: Mon, 17 Dec 2018 17:55:52 +0200
> > From: Eli Zaretskii
> > Cc: dr.khaled.hosny@gmail.com, behdad@behdad.org, 33729@debbugs.gnu.org,
> >     far.nasiri.m@gmail.com, kaushal.modi@gmail.com
> >
> > > From: Glenn Morris
> > > Cc: far.nasiri.m@gmail.com, dr.khaled.hosny@gmail.com,
> > >     behdad@behdad.org, 33729@debbugs.gnu.org, kaushal.modi@gmail.com
> > > Date: Sun, 16 Dec 2018 19:30:00 -0500
> > >
> > > > After some thinking, my conclusion is that we should import the
> > > > ISO 15924 database from https://unicode.org/iso15924/, use a script
> > > > similar to admin/unidata/blocks.awk to generate an alist from it that
> > > > maps Emacs script names to ISO 15924 tags, and then access that alist
> > > > from uni_script to get the correct script information to Harfbuzz.
> > > >
> > > > Patches implementing that are welcome.
> > >
> > > I live to write awk scripts.  I'm not 100% sure what you want, but as a
> > > first example, the following takes
> > > http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt
> > > as input and outputs lines of the form "(gujr . gujarati)".
> > >
> > > The aliases are so that the RHS matches charscript.el.
> > >
> > > If this is not right, please clarify exactly what the inputs and output
> > > should be.
> >
> > Thanks.
> >
> > It turns out I didn't have this figured out completely, and your
> > proposal forced me to dig some more into the relevant parts of Unicode
> > and Emacs.  I found a few additional issues and considerations; for at
> > least some of them I'd like to hear the opinions of the Harfbuzz
> > developers.
> >
> > Here are the issues:
> >
> >  . Contrary to my original thoughts, I now tend to think that a
> >    separate char-table, say char-iso15924-tag-table, that maps
> >    character codepoints directly to the script tags, is a better
> >    solution:
> >      - it will allow a faster look up, obviously
> >      - the subdivision of characters into scripts, as shown in
> >        Unicode's Scripts.txt, is slightly different from what
> >        char-script-table does, so a simple mapping from Emacs scripts
> >        to ISO 15924 script tags will not do.  For example, many
> >        characters Emacs puts into 'latin' or 'symbol' scripts are in
> >        the Common script according to Scripts.txt, and similarly for
> >        the Inherited script.  I imagine this is important for
> >        Harfbuzz.
> >
> >  . Whether to produce the character-to-script-tag mapping using the
> >    UCD files, such as Scripts.txt and PropertyValueAliases.txt, or the
> >    canonical ISO 15924 tags from https://unicode.org/iso15924/,
> >    depends on whether the slight differences mentioned in
> >    https://www.unicode.org/reports/tr24/#Relation_To_ISO15924 matter
> >    for Harfbuzz.  For example, ISO 15924 has separate tags for the
> >    Fraktur and Gaelic varieties of the Latin script: does this
> >    distinction matter for Harfbuzz?
> >
> >  . Does Harfbuzz handle the issues mentioned in
> >    https://www.unicode.org/reports/tr24/#Script_Anomalies, and in
> >    particular the use case of decomposed characters which yield a
> >    different script than their precomposed variants?  This use case is
> >    quite common in handling of character compositions, so it's
> >    important to understand its implications before we decide on the
> >    implementation.
> >
> > To summarize, unless the Harfbuzz guys advise differently, I'd prefer
> > processing Scripts.txt and PropertyValueAliases.txt into a list
> > similar to the one we produce in charscript.el, then generate a
> > char-table from that list.
> >
> > Thanks again for working on this.
> >
>

--
behdad
http://behdad.org/
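
For illustration only, here is a minimal Python sketch of the approach
summarized in the quoted message: read the script aliases from
PropertyValueAliases.txt, read the codepoint ranges from Scripts.txt,
and build a sorted range list that maps a codepoint directly to its
four-letter tag.  This is not the Emacs implementation (which would
generate a char-table, presumably via awk and Lisp).  The function
names are hypothetical, and the embedded data are abbreviated samples
in the format of the two UCD files, not the full files.

```python
import bisect

# Abbreviated sample lines in the format of PropertyValueAliases.txt.
PROPERTY_VALUE_ALIASES = """\
# Script (sc)
sc ; Gujr ; Gujarati
sc ; Latn ; Latin
sc ; Zinh ; Inherited ; Qaai
sc ; Zyyy ; Common
gc ; Lu ; Uppercase_Letter
"""

# Abbreviated sample lines in the format of Scripts.txt.
SCRIPTS = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0020          ; Common # Zs      SPACE
0300..036F    ; Inherited # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
0A81..0A82    ; Gujarati # Mn   [2] GUJARATI SIGN CANDRABINDU..GUJARATI SIGN ANUSVARA
"""


def split_fields(line):
    """Drop the trailing '#' comment and split on ';'."""
    data = line.split("#", 1)[0].strip()
    return [field.strip() for field in data.split(";")]


def parse_script_aliases(text):
    """Map long script names (as used in Scripts.txt) to four-letter tags."""
    aliases = {}
    for line in text.splitlines():
        fields = split_fields(line)
        # Only 'sc' (Script) records matter: sc ; <tag> ; <long name> [; ...]
        if len(fields) >= 3 and fields[0] == "sc":
            aliases[fields[2]] = fields[1]
    return aliases


def parse_script_ranges(text, aliases):
    """Return a sorted list of (start, end, tag) tuples, end inclusive."""
    ranges = []
    for line in text.splitlines():
        fields = split_fields(line)
        if len(fields) != 2:
            continue  # blank or comment-only line
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end, aliases[fields[1]]))
    ranges.sort()
    return ranges


def script_tag(ranges, codepoint, default="Zzzz"):
    """Look up the tag for a codepoint; 'Zzzz' (Unknown) if not listed."""
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, codepoint) - 1
    if i >= 0 and ranges[i][1] >= codepoint:
        return ranges[i][2]
    return default


ALIASES = parse_script_aliases(PROPERTY_VALUE_ALIASES)
RANGES = parse_script_ranges(SCRIPTS, ALIASES)

for cp in (ord("A"), 0x0301, 0x0A81):
    print(f"U+{cp:04X} -> {script_tag(RANGES, cp)}")
```

Note that a combining mark such as U+0301 comes out as Inherited here
rather than as the script of its base character, which is the kind of
case the Script_Anomalies question in the quoted message is about.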