* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Maxime Devos @ 2022-02-22 16:42 UTC
  To: 54111

Hi guix,

Looking at
<https://git.savannah.gnu.org/cgit/guile.git/commit/?id=2f9bc7fe61d39658a24a15526b7b88bbd184961b>,
I noticed that Guile bundles a binary variant of UnicodeData.txt in
srfi-14.i.c.  It seems like it should instead be compiled with the
'unidata_to_charset.pl' script (assuming that there are no bootstrapping
concerns).

Greetings,
Maxime.
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Ludovic Courtès @ 2022-02-27 13:52 UTC
  To: Maxime Devos; +Cc: 54111

Hi,

Maxime Devos <maximedevos@telenet.be> skribis:

> Looking at
> <https://git.savannah.gnu.org/cgit/guile.git/commit/?id=2f9bc7fe61d39658a24a15526b7b88bbd184961b>,
> I noticed that Guile bundles a binary variant of UnicodeData.txt in
> srfi-14.i.c.  It seems like it should instead be compiled with the
> 'unidata_to_charset.pl' script (assuming that there are no
> bootstrapping concerns).

It would add a dependency on Perl, which is not great (I’m not sure
whether it complicates bootstrapping since Perl is already present early
on, but it’s safer to avoid it.)

We could rewrite ‘unidata_to_charset.pl’ in Scheme, but then Guile would
still need to provide a pre-compiled version of srfi-14.i.c for
bootstrapping purposes.  Or we could rewrite it in Awk, since Guile
already depends on Awk anyway.

Thoughts?

Ludo’.
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Maxime Devos @ 2022-02-27 19:45 UTC
  To: Ludovic Courtès; +Cc: 54111

Ludovic Courtès schreef op zo 27-02-2022 om 14:52 [+0100]:
> It would add a dependency on Perl, which is not great (I’m not sure
> whether it complicates bootstrapping since Perl is already present early
> on, but it’s safer to avoid it.)
>
> We could rewrite ‘unidata_to_charset.pl’ in Scheme, but then Guile would
> still need to provide a pre-compiled version of srfi-14.i.c for
> bootstrapping purposes.  Or we could rewrite it in Awk, since Guile
> already depends on Awk anyway.
>
> Thoughts?

The ‘blob’ seems relatively harmless to the compilation process, so if
there are bootstrapping problems, I think we can leave it in.

However, all this Unicode data is important for some other things
(e.g. some DNS and filesystem things).  So it would be nice to validate
that no attacker with access to the Guile repo stealthily introduced
some wrong information during an otherwise routine update of the
Unicode information.  Hence, the following proposal:

  * Make Perl an optional dependency of Guile (upstream) and add a
    '--with-unicode-data=[...]' configure flag or something like that.

    If Perl is detected by './configure' and '--with-unicode-data=...'
    is set, then let one of the makefiles run 'unidata_to_charset.pl'
    and compare the 'new' srfi-14.i.c against the old srfi-14.i.c.
    In case of a mismatch, bail out.

    When there is no Perl or no '--with-unicode-data', just use the
    bundled srfi-14.i.c.

  * Add 'perl' (or 'perl-boot0', because that perl is probably good
    enough?) to the native-inputs of guile.

Actually, the second is already done in 'guile-final'.  Optionally,
this can be combined with rewriting the script in Scheme or some other
language.

Greetings,
Maxime.
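In build terms, the check proposed above amounts to roughly the
following sketch.  It is only an illustration, not code from the
thread: the '$with_unicode_data' variable and the comparison step are
assumptions, while the hard-coded UnicodeData.txt and srfi-14.i.c file
names come from the Perl script itself.

# Sketch of the proposed check, not actual Guile build code.
set -e
cp srfi-14.i.c srfi-14.i.c.bundled     # set the shipped table aside
cp "$with_unicode_data" UnicodeData.txt  # path from --with-unicode-data (assumed to be the data file)
perl unidata_to_charset.pl             # regenerates srfi-14.i.c from ./UnicodeData.txt
if ! cmp -s srfi-14.i.c.bundled srfi-14.i.c; then
  echo "error: bundled srfi-14.i.c does not match UnicodeData.txt" >&2
  exit 1
fi

If the files match, the build proceeds with the bundled table exactly
as it does today; the check only adds confidence that the blob really
corresponds to the published Unicode data.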
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Maxime Devos @ 2022-02-27 19:52 UTC
  To: Ludovic Courtès; +Cc: 54111

Maxime Devos schreef op zo 27-02-2022 om 20:45 [+0100]:
> * Add 'perl' (or 'perl-boot0', because that perl is probably good
>   enough?) to the native-inputs of guile.
>
> Actually, the second is already done in 'guile-final'.

Maybe this being done in 'guile-final' and 'guile-3.0-latest' is
sufficient?  Exactly which package does the verification doesn't seem
important, as long as some package does it.

Greetings,
Maxime.
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Bengt Richter @ 2022-02-27 23:07 UTC
  To: Maxime Devos; +Cc: 54111

Hi guix,

On +2022-02-27 20:52:38 +0100, Maxime Devos wrote:
> Maxime Devos schreef op zo 27-02-2022 om 20:45 [+0100]:
> > * Add 'perl' (or 'perl-boot0', because that perl is probably good
> >   enough?) to the native-inputs of guile.
> >
> > Actually, the second is already done in 'guile-final'.
>
> Maybe this being done in 'guile-final' and 'guile-3.0-latest' is
> sufficient?  Exactly which package does the verification doesn't seem
> important, as long as some package does it.
>
> Greetings,
> Maxime.

I'm wondering how many lines of Perl code would actually have to be
translated to Guile to eliminate this Perl dependency.

Does the upstream Perl code change too often for keeping up with it to
be an acceptable chore?

(I guess I'm assuming the code is like one screenful with a hot loop
accessing a bunch of static tables.  I haven't chased it :)

--
Regards,
Bengt Richter
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Ludovic Courtès @ 2022-02-28 11:45 UTC
  To: Maxime Devos; +Cc: 54111

Hi,

Maxime Devos <maximedevos@telenet.be> skribis:

> Ludovic Courtès schreef op zo 27-02-2022 om 14:52 [+0100]:

[...]

>> We could rewrite ‘unidata_to_charset.pl’ in Scheme, but then Guile would
>> still need to provide a pre-compiled version of srfi-14.i.c for
>> bootstrapping purposes.  Or we could rewrite it in Awk, since Guile
>> already depends on Awk anyway.
>>
>> Thoughts?
>
> The ‘blob’ seems relatively harmless to the compilation process, so if
> there are bootstrapping problems, I think we can leave it in.
>
> However, all this Unicode data is important for some other things
> (e.g. some DNS and filesystem things).  So it would be nice to validate
> that no attacker with access to the Guile repo stealthily introduced
> some wrong information during an otherwise routine update of the
> Unicode information.

The threat model is that the repository is trusted (that’s a strong
assumption, but that’s how it is).  You cannot protect against someone
with access to the repository.  We could use ‘guix git authenticate’ to
improve on that.

> Hence, the following proposal:
>
>   * Make Perl an optional dependency of Guile (upstream) and add a
>     '--with-unicode-data=[...]' configure flag or something like that.
>
>     If Perl is detected by './configure' and '--with-unicode-data=...'
>     is set, then let one of the makefiles run 'unidata_to_charset.pl'
>     and compare the 'new' srfi-14.i.c against the old srfi-14.i.c.
>     In case of a mismatch, bail out.
>
>     When there is no Perl or no '--with-unicode-data', just use the
>     bundled srfi-14.i.c.
>
>   * Add 'perl' (or 'perl-boot0', because that perl is probably good
>     enough?) to the native-inputs of guile.
>
> Actually, the second is already done in 'guile-final'.  Optionally,
> this can be combined with rewriting the script in Scheme or some other
> language.

It might be easier to rewrite it in Awk and build srfi-14.i.c
unconditionally, no?  We can also add ‘--with-unicode-data’, though
that’s orthogonal.

Thanks,
Ludo’.
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Maxime Devos @ 2022-02-28 17:46 UTC
  To: Ludovic Courtès; +Cc: 54111

Ludovic Courtès schreef op ma 28-02-2022 om 12:45 [+0100]:
> It might be easier to rewrite it in Awk and build srfi-14.i.c
> unconditionally, no?

I don't know any Awk, and it seems to be quite different from the
languages I know, so for me doing that isn't easier.  But for someone
who knows some Awk, sure!

Greetings,
Maxime.
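Since the discussion turns on how approachable Awk is, here is a
minimal, stand-alone illustration of the pattern/action style such a
rewrite would use; it is not from the thread and not part of any patch.
UnicodeData.txt is semicolon-separated, with the code point in field 1
and the general category in field 3.

# count-titlecase.awk (illustrative file name): count the UnicodeData.txt
# entries whose general category (field 3) is Lt, i.e. titlecase letters.
# Run as:  awk -f count-titlecase.awk < UnicodeData.txt
BEGIN { FS = ";" }              # fields are separated by semicolons
$3 == "Lt" { n++; last = $1 }   # count matches and remember the last code point seen
END { printf("%d titlecase entries, last at U+%s\n", n, last) }

Each "pattern { action }" pair is run once per input line; the
generator script discussed below is essentially a longer list of such
pairs, one per SRFI-14 character set.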
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Timothy Sample @ 2022-03-14 18:27 UTC
  To: Maxime Devos; +Cc: 54111

Hi Maxime,

Maxime Devos <maximedevos@telenet.be> writes:

> Ludovic Courtès schreef op ma 28-02-2022 om 12:45 [+0100]:
>
>> It might be easier to rewrite it in Awk and build srfi-14.i.c
>> unconditionally, no?
>
> I don't know any Awk, and it seems to be quite different from the
> languages I know, so for me doing that isn't easier.  But for someone
> who knows some Awk, sure!

Well, I don’t consider myself an Awk person, but I had to implement Awk
for Gash-Utils, so I know it well enough!  This may not be the most
idiomatic Awk program, but to my eyes it is no less readable than the
Perl version.

Note that this Awk script needs to be invoked using something like:

    $ awk -f unidata_to_charset.awk < UnicodeData.txt > srfi-14.i.c

That is, the Perl version had the file names hard-coded, but the Awk
version reads from stdin and writes to stdout.  Also, the Awk version
does not shell out to 'indent' to post-process the file.  That was
basically a no-op in the Perl version, so I removed it.

There are a few differences in how the script is structured, and I had
to convert all the hex literals to decimal, but the logical behaviour
should be exactly the same.  I preserved all the comments and
predicates exactly from the Perl version.  There are probably some
differences in error handling, but the input data is so simple that it
shouldn’t matter.

It runs with “gawk --posix”.  If I run “gawk --lint”, I get warnings,
but I’m pretty sure they are spurious (they may even be Gawk bugs, but I
would have to double-check the relevant specs and docs).  If the lint
warnings are a problem, you can append the empty string to the argument
of the ‘hex’ function to make them go away.  Also, as a bonus, as of
commit 62c56f9 the Gash-Utils version of Awk can run this script!  :)

Of course, to use this script as part of the Guile build, someone™ will
have to double-check that we can legally redistribute the Unicode data
file (probably okay, but always good to check), and update the build
rules to generate the C file.  I can’t guarantee that I’ll get to it....

-- Tim

[-- Attachment #2: unidata_to_charset.awk --]

# unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt
#
# Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc.
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

# Utilities
###########

# Print MESSAGE to standard error, and exit with STATUS.
function die(status, message) { print "unidata_to_charset.awk:", message | "cat 1>&2"; exit_status = status; exit exit_status; } # Parse the string S as a hexadecimal number. Note that R, C, and B are # local variables that need not be set by callers. Most Awk # implementations have an 'strtonum' function that we could use, but it # is not part of POSIX. function hex(s, r, c, b) { if (length(s) == 0) { die(1, "Cannot parse empty string as hexadecimal."); } r = 0; for (i = 1; i <= length(s); i++) { c = substr(s, i, 1); b = 0; if (c == "0") { b = 0; } else if (c == "1") { b = 1; } else if (c == "2") { b = 2; } else if (c == "3") { b = 3; } else if (c == "4") { b = 4; } else if (c == "5") { b = 5; } else if (c == "6") { b = 6; } else if (c == "7") { b = 7; } else if (c == "8") { b = 8; } else if (c == "9") { b = 9; } else if (c == "A") { b = 10; } else if (c == "B") { b = 11; } else if (c == "C") { b = 12; } else if (c == "D") { b = 13; } else if (c == "E") { b = 14; } else if (c == "F") { b = 15; } else { die(1, "Invalid hexadecimal character: " c); } r *= 16; r += b; } return r; } # Program initialization ######################## BEGIN { # The columns are separated by semicolons. FS = ";"; # This will help us handle errors. exit_status = 0; # List of charsets. all_charsets_count = 0; all_charsets[all_charsets_count++] = "lower_case"; all_charsets[all_charsets_count++] = "upper_case"; all_charsets[all_charsets_count++] = "title_case"; all_charsets[all_charsets_count++] = "letter"; all_charsets[all_charsets_count++] = "digit"; all_charsets[all_charsets_count++] = "hex_digit"; all_charsets[all_charsets_count++] = "letter_plus_digit"; all_charsets[all_charsets_count++] = "graphic"; all_charsets[all_charsets_count++] = "whitespace"; all_charsets[all_charsets_count++] = "printing"; all_charsets[all_charsets_count++] = "iso_control"; all_charsets[all_charsets_count++] = "punctuation"; all_charsets[all_charsets_count++] = "symbol"; all_charsets[all_charsets_count++] = "blank"; all_charsets[all_charsets_count++] = "ascii"; all_charsets[all_charsets_count++] = "empty"; all_charsets[all_charsets_count++] = "designated"; # Initialize charset state table. for (i in all_charsets) { cs = all_charsets[i]; state[cs, "start"] = -1; state[cs, "end"] = -1; state[cs, "count"] = 0; } } # Record initialization ####################### # In this block we give names to each field, and do some basic # initialization. { codepoint = hex($1); name = $2; category = $3; uppercase = $13; lowercase = $14; codepoint_end = codepoint; charset_index = 0; for (i in charsets) { delete charsets[i]; } } # Some pairs of lines in UnicodeData.txt delimit ranges of # characters. name ~ /First>$/ { getline; last_name = name; sub(/First>$/, "Last>", last_name); if (last_name != $2) { die(1, "Invalid range in Unicode data."); exit_status = 1; exit 1; } codepoint_end = hex($1); } # Character set predicates ########################## ## The lower_case character set ############################### # For Unicode, we follow Java's specification: a character is # lowercase if # * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and # * the Unicode attribute table does not give a lowercase mapping # for it, and # * at least one of the following is true: # o the Unicode attribute table gives a mapping to uppercase # for the character, or # o the name for the character in the Unicode attribute table # contains the words "SMALL LETTER" or "SMALL LIGATURE". 
(codepoint < 8192 || codepoint > 12287) && lowercase == "" && (uppercase != "" || name ~ /(SMALL LETTER|SMALL LIGATURE)/) { charsets[charset_index++] = "lower_case"; } ## The upper_case character set ############################### # For Unicode, we follow Java's specification: a character is # uppercase if # * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and # * the Unicode attribute table does not give an uppercase mapping # for it (this excludes titlecase characters), and # * at least one of the following is true: # o the Unicode attribute table gives a mapping to lowercase # for the character, or # o the name for the character in the Unicode attribute table # contains the words "CAPITAL LETTER" or "CAPITAL LIGATURE". (codepoint < 8192 || codepoint > 12287) && uppercase == "" && (lowercase != "" || name ~ /(CAPITAL LETTER|CAPITAL LIGATURE)/) { charsets[charset_index++] = "upper_case"; } ## The title_case character set ############################### # A character is titlecase if it has the category Lt in the character # attribute database. category == "Lt" { charsets[charset_index++] = "title_case"; } ## The letter character set ########################### # A letter is any character with one of the letter categories (Lu, Ll, # Lt, Lm, Lo) in the Unicode character database. category == "Lu" || category == "Ll" || category == "Lt" || category == "Lm" || category == "Lo" { charsets[charset_index++] = "letter"; charsets[charset_index++] = "letter_plus_digit"; } ## The digit character set ########################## # A character is a digit if it has the category Nd in the character # attribute database. In Latin-1 and ASCII, the only such characters # are 0123456789. In Unicode, there are other digit characters in # other code blocks, such as Gujarati digits and Tibetan digits. category == "Nd" { charsets[charset_index++] = "digit"; charsets[charset_index++] = "letter_plus_digit"; } ## The hex_digit character set ############################## # The only hex digits are 0123456789abcdefABCDEF. (codepoint >= 48 && codepoint <= 57) || (codepoint >= 65 && codepoint <= 70) || (codepoint >= 97 && codepoint <= 102) { charsets[charset_index++] = "hex_digit"; } ## The graphic character set ############################ # Characters that would 'use ink' when printed category ~ /L|M|N|P|S/ { charsets[charset_index++] = "graphic"; charsets[charset_index++] = "printing"; } ## The whitespace character set ############################### # A whitespace character is either # * a character with one of the space, line, or paragraph separator # categories (Zs, Zl or Zp) of the Unicode character database. # * U+0009 (09) Horizontal tabulation (\t control-I) # * U+000A (10) Line feed (\n control-J) # * U+000B (11) Vertical tabulation (\v control-K) # * U+000C (12) Form feed (\f control-L) # * U+000D (13) Carriage return (\r control-M) category ~ /Zs|Zl|Zp/ || (codepoint >= 9 && codepoint <= 13) { charsets[charset_index++] = "whitespace"; charsets[charset_index++] = "printing"; } ## The iso_control character set ################################ # The ISO control characters are the Unicode/Latin-1 characters in the # ranges [U+0000,U+001F] ([0,31]) and [U+007F,U+009F] ([127,159]). 
(codepoint >= 0 && codepoint <= 31) || (codepoint >= 127 && codepoint <= 159) { charsets[charset_index++] = "iso_control"; } ## The punctuation character set ################################ # A punctuation character is any character that has one of the # punctuation categories in the Unicode character database (Pc, Pd, # Ps, Pe, Pi, Pf, or Po.) # Note that srfi-14 gives conflicting requirements!! It claims that # only the Unicode punctuation is necessary, but, explicitly calls out # the soft hyphen character (U+00AD) as punctution. Current versions # of Unicode consider U+00AD to be a formatting character, not # punctuation. category ~ /P/ { charsets[charset_index++] = "punctuation"; } ## The symbol character set ########################### # A symbol is any character that has one of the symbol categories in # the Unicode character database (Sm, Sc, Sk, or So). category ~ /S/ { charsets[charset_index++] = "symbol"; } ## The blank character set ########################## # Blank chars are horizontal whitespace. A blank character is either # * a character with the space separator category (Zs) in the # Unicode character database. # * U+0009 (9) Horizontal tabulation (\t control-I) category ~ /Zs/ || codepoint == 9 { charsets[charset_index++] = "blank"; } ## The ascii character set ########################## codepoint <= 127 { charsets[charset_index++] = "ascii"; } ## The designated character set ############################### category !~ /Cs/ { charsets[charset_index++] = "designated"; } ## Other character sets ####################### # Note that the "letter_plus_digit" and "printing" character sets, which # are unions of other character sets, are included in the patterns # matching their constituent parts (i.e., the "letter_plus_digit" # character set is included as part of the "letter" and "digit" # patterns). # # Also, the "empty" character is computed by doing precisely nothing! # Keeping track of state ######################## # Update the state for each charset. { for (i in charsets) { cs = charsets[i]; if (state[cs, "start"] == -1) { state[cs, "start"] = codepoint; state[cs, "end"] = codepoint_end; } else if (state[cs, "end"] + 1 == codepoint) { state[cs, "end"] = codepoint_end; } else { count = state[cs, "count"]; state[cs, "count"]++; state[cs, "ranges", count, 0] = state[cs, "start"]; state[cs, "ranges", count, 1] = state[cs, "end"]; state[cs, "start"] = codepoint; state[cs, "end"] = codepoint_end; } } } # Printing and error handling ############################# END { # Normally, an exit statement runs all the 'END' blocks before # actually exiting. We use the 'exit_status' variable to short # circuit the rest of the 'END' block by reissuing the exit # statement. if (exit_status != 0) { exit exit_status; } # Write a bit of a header. print("/* srfi-14.i.c -- standard SRFI-14 character set data */"); print(""); print("/* This file is #include'd by srfi-14.c. */"); print(""); print("/* This file was generated from"); print(" http://unicode.org/Public/UNIDATA/UnicodeData.txt"); print(" with the unidata_to_charset.awk script. */"); print(""); for (i = 0; i < all_charsets_count; i++) { cs = all_charsets[i]; # Extra logic to ensure that the last range is included. 
if (state[cs, "start"] != -1) { count = state[cs, "count"]; state[cs, "count"]++; state[cs, "ranges", count, 0] = state[cs, "start"]; state[cs, "ranges", count, 1] = state[cs, "end"]; } count = state[cs, "count"]; print("static const scm_t_char_range cs_" cs "_ranges[] = {"); for (j = 0; j < count; j++) { rstart = state[cs, "ranges", j, 0]; rend = state[cs, "ranges", j, 1]; if (j + 1 < count) { printf(" {0x%04x, 0x%04x},\n", rstart, rend); } else { printf(" {0x%04x, 0x%04x}\n", rstart, rend); } } print("};"); print(""); count = state[cs, "count"]; printf("static const size_t cs_%s_len = %d;\n", cs, count); if (i + 1 < all_charsets_count) { print(""); } } } # And we're done. ^ permalink raw reply [flat|nested] 12+ messages in thread
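A straightforward way to check that the Awk rewrite above is faithful
is to run both generators over the same UnicodeData.txt and diff the
results.  This is only a sketch: the '.perl' and '.awk' suffixes are
made up for the comparison, and the Perl script hard-codes its input
and output names.

$ perl unidata_to_charset.pl             # reads ./UnicodeData.txt, writes ./srfi-14.i.c
$ mv srfi-14.i.c srfi-14.i.c.perl        # set the Perl output aside
$ awk -f unidata_to_charset.awk < UnicodeData.txt > srfi-14.i.c.awk
$ diff -u srfi-14.i.c.perl srfi-14.i.c.awk   # expect only the generator name in the header comment to differ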
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Ludovic Courtès @ 2022-03-16 10:47 UTC
  To: Timothy Sample; +Cc: 54111

Hi Tim,

Timothy Sample <samplet@ngyro.com> skribis:

> Well, I don’t consider myself an Awk person, but I had to implement Awk
> for Gash-Utils, so I know it well enough!  This may not be the most
> idiomatic Awk program, but to my eyes it is no less readable than the
> Perl version.

You rock!

[...]

> It runs with “gawk --posix”.  If I run “gawk --lint”, I get warnings,
> but I’m pretty sure they are spurious (they may even be Gawk bugs, but I
> would have to double-check the relevant specs and docs).  If the lint
> warnings are a problem, you can append the empty string to the argument
> of the ‘hex’ function to make them go away.  Also, as a bonus, as of
> commit 62c56f9 the Gash-Utils version of Awk can run this script!  :)

Incredible.  :-)

> Of course, to use this script as part of the Guile build, someone™ will
> have to double-check that we can legally redistribute the Unicode data
> file (probably okay, but always good to check), and update the build
> rules to generate the C file.  I can’t guarantee that I’ll get to it....

I’ll check with Andy if he’s fine with this option.  Would you like to
turn it into a patch against Guile?  If not, I could do that.

> # unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt
> #
> # Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc.

Is this correct?  (Maybe yes because it’s a translation of the original
Perl script, right?)

Thanks a lot!
Ludo’.
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Timothy Sample @ 2022-03-16 23:42 UTC
  To: Ludovic Courtès; +Cc: 54111

Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> Of course, to use this script as part of the Guile build, someone™ will
>> have to double-check that we can legally redistribute the Unicode data
>> file (probably okay, but always good to check), and update the build
>> rules to generate the C file.  I can’t guarantee that I’ll get to it....
>
> I’ll check with Andy if he’s fine with this option.  Would you like to
> turn it into a patch against Guile?  If not, I could do that.

I’ll do it.  It always feels good to submit a patch!

>> # unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt
>> #
>> # Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc.
>
> Is this correct?  (Maybe yes because it’s a translation of the original
> Perl script, right?)

That’s my understanding.  This is technically a modification of the
original work, so the old copyright years are still relevant.

-- Tim
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Timothy Sample @ 2022-03-19 18:20 UTC
  To: Ludovic Courtès; +Cc: 54111

Hi again,

Timothy Sample <samplet@ngyro.com> writes:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Timothy Sample <samplet@ngyro.com> skribis:
>>
>>> Of course, to use this script as part of the Guile build, someone™ will
>>> have to double-check that we can legally redistribute the Unicode data
>>> file (probably okay, but always good to check), and update the build
>>> rules to generate the C file.  I can’t guarantee that I’ll get to it....
>>
>> I’ll check with Andy if he’s fine with this option.  Would you like to
>> turn it into a patch against Guile?  If not, I could do that.
>
> I’ll do it.  It always feels good to submit a patch!

I’ve attached two patches, the second of which is gzipped (the
UnicodeData.txt file is nearly 2M).

The first patch replaces the Perl script with the Awk script.  The Awk
script produces an identical ‘srfi-14.i.c’, except for changing “.pl” to
“.awk” in a comment.

The second patch removes ‘srfi-14.i.c’, adds ‘UnicodeData.txt’, and
teaches the build machinery how to generate the former from the latter.
I did my best with the Makefile, but I’m still a noob when it comes to
Automake conventions.  This is the part that warrants the most review!

Finally, I added support for comments to the Awk script so that I could
put the Unicode license text in the data file itself.  This is probably
the simplest way to discharge our legal obligations to Unicode, Inc. (and
follow the guidelines of the FSF).  For all the details, see
<https://www.unicode.org/copyright.html> and
<https://www.gnu.org/licenses/license-list.html#Unicode>.

-- Tim

[-- Attachment #2: 0001-Reimplement-unidata_to_charset.pl-in-Awk.patch --]

From b3c8be22f8ab5f4cc852cd56f960079ed4e84c49 Mon Sep 17 00:00:00 2001
From: Timothy Sample <samplet@ngyro.com>
Date: Wed, 16 Mar 2022 21:13:45 -0600
Subject: [PATCH 1/2] Reimplement 'unidata_to_charset.pl' in Awk.

* libguile/unidata_to_charset.pl: Delete file.
* libguile/unidata_to_charset.awk: New file.
* libguile/Makefile.am (EXTRA_DIST): Adjust accordingly.
--- libguile/Makefile.am | 2 +- libguile/unidata_to_charset.awk | 409 ++++++++++++++++++++++++++++++++ libguile/unidata_to_charset.pl | 401 ------------------------------- 3 files changed, 410 insertions(+), 402 deletions(-) create mode 100644 libguile/unidata_to_charset.awk delete mode 100755 libguile/unidata_to_charset.pl diff --git a/libguile/Makefile.am b/libguile/Makefile.am index 40619d379..b2a7d1c51 100644 --- a/libguile/Makefile.am +++ b/libguile/Makefile.am @@ -728,7 +728,7 @@ EXTRA_DIST = ChangeLog-scm ChangeLog-threads \ guile-func-name-check \ cpp-E.syms cpp-E.c cpp-SIG.syms cpp-SIG.c \ c-tokenize.lex \ - scmconfig.h.top libgettext.h unidata_to_charset.pl libguile.map \ + scmconfig.h.top libgettext.h unidata_to_charset.awk libguile.map \ vm-operations.h libguile-@GUILE_EFFECTIVE_VERSION@-gdb.scm \ $(lightening_c_files) $(lightening_extra_files) # $(DOT_DOC_FILES) $(EXTRA_DOT_DOC_FILES) \ diff --git a/libguile/unidata_to_charset.awk b/libguile/unidata_to_charset.awk new file mode 100644 index 000000000..11dfb2686 --- /dev/null +++ b/libguile/unidata_to_charset.awk @@ -0,0 +1,409 @@ +# unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt +# +# Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc. +# +# This library is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 3 of the License, or (at your option) any later version. +# +# This library is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with this library; if not, write to the Free Software +# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +# Utilities +########### + +# Print MESSAGE to standard error, and exit with STATUS. +function die(status, message) { + print "unidata_to_charset.awk:", message | "cat 1>&2"; + exit_status = status; + exit exit_status; +} + +# Parse the string S as a hexadecimal number. Note that R, C, and B are +# local variables that need not be set by callers. Most Awk +# implementations have an 'strtonum' function that we could use, but it +# is not part of POSIX. +function hex(s, r, c, b) { + if (length(s) == 0) { + die(1, "Cannot parse empty string as hexadecimal."); + } + r = 0; + for (i = 1; i <= length(s); i++) { + c = substr(s, i, 1); + b = 0; + if (c == "0") { b = 0; } + else if (c == "1") { b = 1; } + else if (c == "2") { b = 2; } + else if (c == "3") { b = 3; } + else if (c == "4") { b = 4; } + else if (c == "5") { b = 5; } + else if (c == "6") { b = 6; } + else if (c == "7") { b = 7; } + else if (c == "8") { b = 8; } + else if (c == "9") { b = 9; } + else if (c == "A") { b = 10; } + else if (c == "B") { b = 11; } + else if (c == "C") { b = 12; } + else if (c == "D") { b = 13; } + else if (c == "E") { b = 14; } + else if (c == "F") { b = 15; } + else { die(1, "Invalid hexadecimal character: " c); } + r *= 16; + r += b; + } + return r; +} + +# Program initialization +######################## + +BEGIN { + # The columns are separated by semicolons. + FS = ";"; + + # This will help us handle errors. + exit_status = 0; + + # List of charsets. 
+ all_charsets_count = 0; + all_charsets[all_charsets_count++] = "lower_case"; + all_charsets[all_charsets_count++] = "upper_case"; + all_charsets[all_charsets_count++] = "title_case"; + all_charsets[all_charsets_count++] = "letter"; + all_charsets[all_charsets_count++] = "digit"; + all_charsets[all_charsets_count++] = "hex_digit"; + all_charsets[all_charsets_count++] = "letter_plus_digit"; + all_charsets[all_charsets_count++] = "graphic"; + all_charsets[all_charsets_count++] = "whitespace"; + all_charsets[all_charsets_count++] = "printing"; + all_charsets[all_charsets_count++] = "iso_control"; + all_charsets[all_charsets_count++] = "punctuation"; + all_charsets[all_charsets_count++] = "symbol"; + all_charsets[all_charsets_count++] = "blank"; + all_charsets[all_charsets_count++] = "ascii"; + all_charsets[all_charsets_count++] = "empty"; + all_charsets[all_charsets_count++] = "designated"; + + # Initialize charset state table. + for (i in all_charsets) { + cs = all_charsets[i]; + state[cs, "start"] = -1; + state[cs, "end"] = -1; + state[cs, "count"] = 0; + } +} + +# Record initialization +####################### + +# In this block we give names to each field, and do some basic +# initialization. +{ + codepoint = hex($1); + name = $2; + category = $3; + uppercase = $13; + lowercase = $14; + + codepoint_end = codepoint; + charset_count = 0; +} + +# Some pairs of lines in UnicodeData.txt delimit ranges of +# characters. +name ~ /First>$/ { + getline; + last_name = name; + sub(/First>$/, "Last>", last_name); + if (last_name != $2) { + die(1, "Invalid range in Unicode data."); + exit_status = 1; + exit 1; + } + codepoint_end = hex($1); +} + +# Character set predicates +########################## + +## The lower_case character set +############################### + +# For Unicode, we follow Java's specification: a character is +# lowercase if +# * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and +# * the Unicode attribute table does not give a lowercase mapping +# for it, and +# * at least one of the following is true: +# o the Unicode attribute table gives a mapping to uppercase +# for the character, or +# o the name for the character in the Unicode attribute table +# contains the words "SMALL LETTER" or "SMALL LIGATURE". + +(codepoint < 8192 || codepoint > 12287) && +lowercase == "" && +(uppercase != "" || name ~ /(SMALL LETTER|SMALL LIGATURE)/) { + charsets[charset_count++] = "lower_case"; +} + +## The upper_case character set +############################### + +# For Unicode, we follow Java's specification: a character is +# uppercase if +# * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and +# * the Unicode attribute table does not give an uppercase mapping +# for it (this excludes titlecase characters), and +# * at least one of the following is true: +# o the Unicode attribute table gives a mapping to lowercase +# for the character, or +# o the name for the character in the Unicode attribute table +# contains the words "CAPITAL LETTER" or "CAPITAL LIGATURE". + +(codepoint < 8192 || codepoint > 12287) && +uppercase == "" && +(lowercase != "" || name ~ /(CAPITAL LETTER|CAPITAL LIGATURE)/) { + charsets[charset_count++] = "upper_case"; +} + +## The title_case character set +############################### + +# A character is titlecase if it has the category Lt in the character +# attribute database. 
+ +category == "Lt" { + charsets[charset_count++] = "title_case"; +} + +## The letter character set +########################### + +# A letter is any character with one of the letter categories (Lu, Ll, +# Lt, Lm, Lo) in the Unicode character database. + +category == "Lu" || +category == "Ll" || +category == "Lt" || +category == "Lm" || +category == "Lo" { + charsets[charset_count++] = "letter"; + charsets[charset_count++] = "letter_plus_digit"; +} + +## The digit character set +########################## + +# A character is a digit if it has the category Nd in the character +# attribute database. In Latin-1 and ASCII, the only such characters +# are 0123456789. In Unicode, there are other digit characters in +# other code blocks, such as Gujarati digits and Tibetan digits. + +category == "Nd" { + charsets[charset_count++] = "digit"; + charsets[charset_count++] = "letter_plus_digit"; +} + +## The hex_digit character set +############################## + +# The only hex digits are 0123456789abcdefABCDEF. + +(codepoint >= 48 && codepoint <= 57) || +(codepoint >= 65 && codepoint <= 70) || +(codepoint >= 97 && codepoint <= 102) { + charsets[charset_count++] = "hex_digit"; +} + +## The graphic character set +############################ + +# Characters that would 'use ink' when printed + +category ~ /L|M|N|P|S/ { + charsets[charset_count++] = "graphic"; + charsets[charset_count++] = "printing"; +} + +## The whitespace character set +############################### + +# A whitespace character is either +# * a character with one of the space, line, or paragraph separator +# categories (Zs, Zl or Zp) of the Unicode character database. +# * U+0009 (09) Horizontal tabulation (\t control-I) +# * U+000A (10) Line feed (\n control-J) +# * U+000B (11) Vertical tabulation (\v control-K) +# * U+000C (12) Form feed (\f control-L) +# * U+000D (13) Carriage return (\r control-M) + +category ~ /Zs|Zl|Zp/ || +(codepoint >= 9 && codepoint <= 13) { + charsets[charset_count++] = "whitespace"; + charsets[charset_count++] = "printing"; +} + +## The iso_control character set +################################ + +# The ISO control characters are the Unicode/Latin-1 characters in the +# ranges [U+0000,U+001F] ([0,31]) and [U+007F,U+009F] ([127,159]). + +(codepoint >= 0 && codepoint <= 31) || +(codepoint >= 127 && codepoint <= 159) { + charsets[charset_count++] = "iso_control"; +} + +## The punctuation character set +################################ + +# A punctuation character is any character that has one of the +# punctuation categories in the Unicode character database (Pc, Pd, +# Ps, Pe, Pi, Pf, or Po.) + +# Note that srfi-14 gives conflicting requirements!! It claims that +# only the Unicode punctuation is necessary, but, explicitly calls out +# the soft hyphen character (U+00AD) as punctution. Current versions +# of Unicode consider U+00AD to be a formatting character, not +# punctuation. + +category ~ /P/ { + charsets[charset_count++] = "punctuation"; +} + +## The symbol character set +########################### + +# A symbol is any character that has one of the symbol categories in +# the Unicode character database (Sm, Sc, Sk, or So). + +category ~ /S/ { + charsets[charset_count++] = "symbol"; +} + +## The blank character set +########################## + +# Blank chars are horizontal whitespace. A blank character is either +# * a character with the space separator category (Zs) in the +# Unicode character database. 
+# * U+0009 (9) Horizontal tabulation (\t control-I) + +category ~ /Zs/ || codepoint == 9 { + charsets[charset_count++] = "blank"; +} + +## The ascii character set +########################## + +codepoint <= 127 { + charsets[charset_count++] = "ascii"; +} + +## The designated character set +############################### + +# Designated -- All characters except for the surrogates + +category !~ /Cs/ { + charsets[charset_count++] = "designated"; +} + +## Other character sets +####################### + +# Note that the "letter_plus_digit" and "printing" character sets, which +# are unions of other character sets, are included in the patterns +# matching their constituent parts (i.e., the "letter_plus_digit" +# character set is included as part of the "letter" and "digit" +# patterns). +# +# Also, the "empty" character is computed by doing precisely nothing! + +# Keeping track of state +######################## + +# Update the state for each charset. +{ + for (i = 0; i < charset_count; i++) { + cs = charsets[i]; + if (state[cs, "start"] == -1) { + state[cs, "start"] = codepoint; + state[cs, "end"] = codepoint_end; + } else if (state[cs, "end"] + 1 == codepoint) { + state[cs, "end"] = codepoint_end; + } else { + count = state[cs, "count"]; + state[cs, "count"]++; + state[cs, "ranges", count, 0] = state[cs, "start"]; + state[cs, "ranges", count, 1] = state[cs, "end"]; + state[cs, "start"] = codepoint; + state[cs, "end"] = codepoint_end; + } + } +} + +# Printing and error handling +############################# + +END { + # Normally, an exit statement runs all the 'END' blocks before + # actually exiting. We use the 'exit_status' variable to short + # circuit the rest of the 'END' block by reissuing the exit + # statement. + if (exit_status != 0) { + exit exit_status; + } + + # Write a bit of a header. + print("/* srfi-14.i.c -- standard SRFI-14 character set data */"); + print(""); + print("/* This file is #include'd by srfi-14.c. */"); + print(""); + print("/* This file was generated from"); + print(" https://unicode.org/Public/UNIDATA/UnicodeData.txt"); + print(" with the unidata_to_charset.awk script. */"); + print(""); + + for (i = 0; i < all_charsets_count; i++) { + cs = all_charsets[i]; + + # Extra logic to ensure that the last range is included. + if (state[cs, "start"] != -1) { + count = state[cs, "count"]; + state[cs, "count"]++; + state[cs, "ranges", count, 0] = state[cs, "start"]; + state[cs, "ranges", count, 1] = state[cs, "end"]; + } + + count = state[cs, "count"]; + + print("static const scm_t_char_range cs_" cs "_ranges[] = {"); + for (j = 0; j < count; j++) { + rstart = state[cs, "ranges", j, 0]; + rend = state[cs, "ranges", j, 1]; + if (j + 1 < count) { + printf(" {0x%04x, 0x%04x},\n", rstart, rend); + } else { + printf(" {0x%04x, 0x%04x}\n", rstart, rend); + } + } + print("};"); + print(""); + + count = state[cs, "count"]; + printf("static const size_t cs_%s_len = %d;\n", cs, count); + if (i + 1 < all_charsets_count) { + print(""); + } + } +} + +# And we're done. diff --git a/libguile/unidata_to_charset.pl b/libguile/unidata_to_charset.pl deleted file mode 100755 index 9cd7e6e71..000000000 --- a/libguile/unidata_to_charset.pl +++ /dev/null @@ -1,401 +0,0 @@ -#!/usr/bin/perl -# unidata_to_charset.pl --- Compute SRFI-14 charsets from UnicodeData.txt -# -# Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc. 
-# -# This library is free software; you can redistribute it and/or -# modify it under the terms of the GNU Lesser General Public -# License as published by the Free Software Foundation; either -# version 3 of the License, or (at your option) any later version. -# -# This library is distributed in the hope that it will be useful, -# but WITHOUT ANY WARRANTY; without even the implied warranty of -# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -# Lesser General Public License for more details. -# -# You should have received a copy of the GNU Lesser General Public -# License along with this library; if not, write to the Free Software -# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA - -open(my $in, "<", "UnicodeData.txt") or die "Can't open UnicodeData.txt: $!"; -open(my $out, ">", "srfi-14.i.c") or die "Can't open srfi-14.i.c: $!"; - -# For Unicode, we follow Java's specification: a character is -# lowercase if -# * it is not in the range [U+2000,U+2FFF], and -# * the Unicode attribute table does not give a lowercase mapping -# for it, and -# * at least one of the following is true: -# o the Unicode attribute table gives a mapping to uppercase -# for the character, or -# o the name for the character in the Unicode attribute table -# contains the words "SMALL LETTER" or "SMALL LIGATURE". - -sub lower_case { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (($codepoint < 0x2000 || $codepoint > 0x2FFF) - && (!defined($lowercase) || $lowercase eq "") - && ((defined($uppercase) && $uppercase ne "") - || ($name =~ /(SMALL LETTER|SMALL LIGATURE)/))) { - return 1; - } else { - return 0; - } -} - -# For Unicode, we follow Java's specification: a character is -# uppercase if -# * it is not in the range [U+2000,U+2FFF], and -# * the Unicode attribute table does not give an uppercase mapping -# for it (this excludes titlecase characters), and -# * at least one of the following is true: -# o the Unicode attribute table gives a mapping to lowercase -# for the character, or -# o the name for the character in the Unicode attribute table -# contains the words "CAPITAL LETTER" or "CAPITAL LIGATURE". - -sub upper_case { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (($codepoint < 0x2000 || $codepoint > 0x2FFF) - && (!defined($uppercase) || $uppercase eq "") - && ((defined($lowercase) && $lowercase ne "") - || ($name =~ /(CAPITAL LETTER|CAPITAL LIGATURE)/))) { - return 1; - } else { - return 0; - } -} - -# A character is titlecase if it has the category Lt in the character -# attribute database. - -sub title_case { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (defined($category) && $category eq "Lt") { - return 1; - } else { - return 0; - } -} - -# A letter is any character with one of the letter categories (Lu, Ll, -# Lt, Lm, Lo) in the Unicode character database. - -sub letter { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (defined($category) && ($category eq "Lu" - || $category eq "Ll" - || $category eq "Lt" - || $category eq "Lm" - || $category eq "Lo")) { - return 1; - } else { - return 0; - } -} - -# A character is a digit if it has the category Nd in the character -# attribute database. In Latin-1 and ASCII, the only such characters -# are 0123456789. In Unicode, there are other digit characters in -# other code blocks, such as Gujarati digits and Tibetan digits. 
- -sub digit { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (defined($category) && $category eq "Nd") { - return 1; - } else { - return 0; - } -} - -# The only hex digits are 0123456789abcdefABCDEF. - -sub hex_digit { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (($codepoint >= 0x30 && $codepoint <= 0x39) - || ($codepoint >= 0x41 && $codepoint <= 0x46) - || ($codepoint >= 0x61 && $codepoint <= 0x66)) { - return 1; - } else { - return 0; - } -} - -# The union of char-set:letter and char-set:digit. - -sub letter_plus_digit { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (letter($codepoint, $name, $category, $uppercase, $lowercase) - || digit($codepoint, $name, $category, $uppercase, $lowercase)) { - return 1; - } else { - return 0; - } -} - -# Characters that would 'use ink' when printed -sub graphic { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if ($category =~ (/L|M|N|P|S/)) { - return 1; - } else { - return 0; - } -} - -# A whitespace character is either -# * a character with one of the space, line, or paragraph separator -# categories (Zs, Zl or Zp) of the Unicode character database. -# * U+0009 Horizontal tabulation (\t control-I) -# * U+000A Line feed (\n control-J) -# * U+000B Vertical tabulation (\v control-K) -# * U+000C Form feed (\f control-L) -# * U+000D Carriage return (\r control-M) - -sub whitespace { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if ($category =~ (/Zs|Zl|Zp/) - || $codepoint == 0x9 - || $codepoint == 0xA - || $codepoint == 0xB - || $codepoint == 0xC - || $codepoint == 0xD) { - return 1; - } else { - return 0; - } -} - -# A printing character is one that would occupy space when printed, -# i.e., a graphic character or a space character. char-set:printing is -# the union of char-set:whitespace and char-set:graphic. - -sub printing { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (whitespace($codepoint, $name, $category, $uppercase, $lowercase) - || graphic($codepoint, $name, $category, $uppercase, $lowercase)) { - return 1; - } else { - return 0; - } -} - -# The ISO control characters are the Unicode/Latin-1 characters in the -# ranges [U+0000,U+001F] and [U+007F,U+009F]. - -sub iso_control { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if (($codepoint >= 0x00 && $codepoint <= 0x1F) - || ($codepoint >= 0x7F && $codepoint <= 0x9F)) { - return 1; - } else { - return 0; - } -} - -# A punctuation character is any character that has one of the -# punctuation categories in the Unicode character database (Pc, Pd, -# Ps, Pe, Pi, Pf, or Po.) - -# Note that srfi-14 gives conflicting requirements!! It claims that -# only the Unicode punctuation is necessary, but, explicitly calls out -# the soft hyphen character (U+00AD) as punctution. Current versions -# of Unicode consider U+00AD to be a formatting character, not -# punctuation. - -sub punctuation { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if ($category =~ (/P/)) { - return 1; - } else { - return 0; - } -} - -# A symbol is any character that has one of the symbol categories in -# the Unicode character database (Sm, Sc, Sk, or So). - -sub symbol { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if ($category =~ (/S/)) { - return 1; - } else { - return 0; - } -} - -# Blank chars are horizontal whitespace. A blank character is either -# * a character with the space separator category (Zs) in the -# Unicode character database. 
-# * U+0009 Horizontal tabulation (\t control-I) -sub blank { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if ($category =~ (/Zs/) - || $codepoint == 0x9) { - return 1; - } else { - return 0; - } -} - -# ASCII -sub ascii { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if ($codepoint <= 0x7F) { - return 1; - } else { - return 0; - } -} - -# Empty -sub empty { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - return 0; -} - -# Designated -- All characters except for the surrogates -sub designated { - my($codepoint, $name, $category, $uppercase, $lowercase)= @_; - if ($category =~ (/Cs/)) { - return 0; - } else { - return 1; - } -} - - -# The procedure generates the two C structures necessary to describe a -# given category. -sub compute { - my($f) = @_; - my $start = -1; - my $end = -1; - my $len = 0; - my @rstart = (-1); - my @rend = (-1); - - seek($in, 0, 0) or die "Can't seek to beginning of file: $!"; - - print "$f\n"; - - while (<$in>) { - # Parse the 14 column, semicolon-delimited UnicodeData.txt - # file - chomp; - my(@fields) = split(/;/); - - # The codepoint: an integer - my $codepoint = hex($fields[0]); - - # If this is a character range, the last character in this - # range - my $codepoint_end = $codepoint; - - # The name of the character - my $name = $fields[1]; - - # A two-character category code, such as Ll (lower-case - # letter) - my $category = $fields[2]; - - # The codepoint of the uppercase version of this char - my $uppercase = $fields[12]; - - # The codepoint of the lowercase version of this char - my $lowercase = $fields[13]; - - my $pass = &$f($codepoint,$name,$category,$uppercase,$lowercase); - if ($pass == 1) { - - # Some pairs of lines in UnicodeData.txt delimit ranges of - # characters. - if ($name =~ /First/) { - $line = <$in>; - die $! if $!; - $codepoint_end = hex( (split(/;/, $line))[0] ); - } - - # Compute ranges of characters [start:end] that meet the - # criteria. Store the ranges. - if ($start == -1) { - $start = $codepoint; - $end = $codepoint_end; - } elsif ($end + 1 == $codepoint) { - $end = $codepoint_end; - } else { - $rstart[$len] = $start; - $rend[$len] = $end; - $len++; - $start = $codepoint; - $end = $codepoint_end; - } - } - } - - # Extra logic to ensure that the last range is included - if ($start != -1) { - if ($len > 0 && $rstart[@rstart-1] != $start) { - $rstart[$len] = $start; - $rend[$len] = $end; - $len++; - } elsif ($len == 0) { - $rstart[0] = $start; - $rend[0] = $end; - $len++; - } - } - - # Print the C struct that contains the range list. - print $out "static const scm_t_char_range cs_" . $f . "_ranges[] = {\n"; - if ($rstart[0] != -1) { - for (my $i=0; $i<@rstart-1; $i++) { - printf $out " {0x%04x, 0x%04x},\n", $rstart[$i], $rend[$i]; - } - printf $out " {0x%04x, 0x%04x}\n", $rstart[@rstart-1], $rend[@rstart-1]; - } - print $out "};\n\n"; - - # Print the C struct that contains the range list length and - # pointer to the range list. - print $out "static const size_t cs_${f}_len = $len;\n\n"; -} - -# Write a bit of a header -print $out "/* srfi-14.i.c -- standard SRFI-14 character set data */\n\n"; -print $out "/* This file is #include'd by srfi-14.c. */\n\n"; -print $out "/* This file was generated from\n"; -print $out " http://unicode.org/Public/UNIDATA/UnicodeData.txt\n"; -print $out " with the unidata_to_charset.pl script. 
*/\n\n"; - -# Write the C structs for each SRFI-14 charset -compute "lower_case"; -compute "upper_case"; -compute "title_case"; -compute "letter"; -compute "digit"; -compute "hex_digit"; -compute "letter_plus_digit"; -compute "graphic"; -compute "whitespace"; -compute "printing"; -compute "iso_control"; -compute "punctuation"; -compute "symbol"; -compute "blank"; -compute "ascii"; -compute "empty"; -compute "designated"; - -close $in; -close $out; - -exec ('indent srfi-14.i.c') or print STDERR "call to 'indent' failed: $!"; - -# And we're done. - - - - - - -- 2.34.0 [-- Attachment #3: 0002-Create-srfi-14.i.c-during-build.patch.gz --] [-- Type: application/octet-stream, Size: 289030 bytes --] ^ permalink raw reply related [flat|nested] 12+ messages in thread
* bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
  From: Ludovic Courtès @ 2022-03-24 13:33 UTC
  To: Timothy Sample; +Cc: 54111-done

Hello,

Timothy Sample <samplet@ngyro.com> skribis:

> I’ve attached two patches, the second of which is gzipped (the
> UnicodeData.txt file is nearly 2M).
>
> The first patch replaces the Perl script with the Awk script.  The Awk
> script produces an identical ‘srfi-14.i.c’, except for changing “.pl” to
> “.awk” in a comment.
>
> The second patch removes ‘srfi-14.i.c’, adds ‘UnicodeData.txt’, and
> teaches the build machinery how to generate the former from the latter.
> I did my best with the Makefile, but I’m still a noob when it comes to
> Automake conventions.  This is the part that warrants the most review!
>
> Finally, I added support for comments to the Awk script so that I could
> put the Unicode license text in the data file itself.  This is probably
> the simplest way to discharge our legal obligations to Unicode, Inc. (and
> follow the guidelines of the FSF).  For all the details, see
> <https://www.unicode.org/copyright.html> and
> <https://www.gnu.org/licenses/license-list.html#Unicode>.

This all looks good to me.  Pushed in Guile as commit
9f8e05e513399985021643c34217f45d65c66392, thank you!

Ludo’.