From: Timothy Sample
To: Maxime Devos
Cc: 54111@debbugs.gnu.org
Subject: bug#54111: guile bundles (a compiled version of) UnicodeData.txt and binaries
Date: Mon, 14 Mar 2022 12:27:14 -0600
Message-ID: <877d8w5r0d.fsf@ngyro.com>
In-Reply-To: <9953e99b32693fa2393fa9919973323207413063.camel@telenet.be>
References: <9953e99b32693fa2393fa9919973323207413063.camel@telenet.be> <87h78kwh5c.fsf@gnu.org> <87wnhfdxjq.fsf@gnu.org> <3a91920b7a028ed9c132a810e6fd0751154d3f73.camel@telenet.be>
List-Id: Bug reports for GNU Guix
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi Maxime,

Maxime Devos writes:

> Ludovic Courtès wrote on Mon 28-02-2022 at 12:45 [+0100]:
>
>> It might be easier to rewrite in Awk in build srfi-14.i.c
>> unconditionally no?
>
> I don't know any Awk and it seems to be quite different from languages
> I know, so for me doing that isn't easier.  But for someone who knows
> some Awk, sure!

Well, I don’t consider myself an Awk person, but I had to implement it
for Gash-Utils, so I know it well enough!  This may not be the most
idiomatic Awk program, but to my eyes it is no less readable than the
Perl version.

Note that this Awk script needs to be invoked using something like:

    $ awk -f unidata_to_charset.awk < UnicodeData.txt > srfi-14.i.c

That is, the Perl version had the file names hard-coded, but the Awk
version reads from stdin and writes to stdout.  Also, the Awk version
does not shell out to 'indent' to post-process the file.  That was
basically a no-op in the Perl version, so I removed it.

There are a few differences in how the script is structured, and I had
to convert all the hex literals to decimal, but the logical behaviour
should be exactly the same.  I preserved all the comments and
predicates exactly from the Perl version.  There are probably some
differences in error handling, but the input data is so simple that it
shouldn’t matter.

It runs with “gawk --posix”.  If I run “gawk --lint”, I get warnings,
but I’m pretty sure they are spurious (they may even be Gawk bugs, but
I would have to double check the relevant specs and docs).  If the
lint warnings are a problem, you can append the empty string to the
argument of the ‘hex’ function to make them go away.  Also, (as a
bonus) as of commit 62c56f9 the Gash-Utils version of Awk can run this
script!  :)

Of course, to use this script as part of the Guile build, someone™
will have to double check that we can legally redistribute the Unicode
data file (probably okay, but always good to check), and update the
build rules to generate the C file.  I can’t guarantee that I’ll get
to it....

-- Tim

--=-=-=
Content-Type: text/plain
Content-Disposition: inline; filename=unidata_to_charset.awk

# unidata_to_charset.awk --- Compute SRFI-14 charsets from UnicodeData.txt
#
# Copyright (C) 2009, 2010, 2022 Free Software Foundation, Inc.
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

# Utilities
###########

# Print MESSAGE to standard error, and exit with STATUS.
function die(status, message) {
    print "unidata_to_charset.awk:", message | "cat 1>&2";
    exit_status = status;
    exit exit_status;
}

# Parse the string S as a hexadecimal number.  Note that R, C, and B are
# local variables that need not be set by callers.  Most Awk
# implementations have an 'strtonum' function that we could use, but it
# is not part of POSIX.
function hex(s, r, c, b) {
    if (length(s) == 0) {
        die(1, "Cannot parse empty string as hexadecimal.");
    }
    r = 0;
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1);
        b = 0;
        if (c == "0") { b = 0; }
        else if (c == "1") { b = 1; }
        else if (c == "2") { b = 2; }
        else if (c == "3") { b = 3; }
        else if (c == "4") { b = 4; }
        else if (c == "5") { b = 5; }
        else if (c == "6") { b = 6; }
        else if (c == "7") { b = 7; }
        else if (c == "8") { b = 8; }
        else if (c == "9") { b = 9; }
        else if (c == "A") { b = 10; }
        else if (c == "B") { b = 11; }
        else if (c == "C") { b = 12; }
        else if (c == "D") { b = 13; }
        else if (c == "E") { b = 14; }
        else if (c == "F") { b = 15; }
        else { die(1, "Invalid hexadecimal character: " c); }
        r *= 16;
        r += b;
    }
    return r;
}

# Program initialization
########################

BEGIN {
    # The columns are separated by semicolons.
    FS = ";";

    # This will help us handle errors.
    exit_status = 0;

    # List of charsets.
    all_charsets_count = 0;
    all_charsets[all_charsets_count++] = "lower_case";
    all_charsets[all_charsets_count++] = "upper_case";
    all_charsets[all_charsets_count++] = "title_case";
    all_charsets[all_charsets_count++] = "letter";
    all_charsets[all_charsets_count++] = "digit";
    all_charsets[all_charsets_count++] = "hex_digit";
    all_charsets[all_charsets_count++] = "letter_plus_digit";
    all_charsets[all_charsets_count++] = "graphic";
    all_charsets[all_charsets_count++] = "whitespace";
    all_charsets[all_charsets_count++] = "printing";
    all_charsets[all_charsets_count++] = "iso_control";
    all_charsets[all_charsets_count++] = "punctuation";
    all_charsets[all_charsets_count++] = "symbol";
    all_charsets[all_charsets_count++] = "blank";
    all_charsets[all_charsets_count++] = "ascii";
    all_charsets[all_charsets_count++] = "empty";
    all_charsets[all_charsets_count++] = "designated";

    # Initialize charset state table.
    for (i in all_charsets) {
        cs = all_charsets[i];
        state[cs, "start"] = -1;
        state[cs, "end"] = -1;
        state[cs, "count"] = 0;
    }
}

# Record initialization
#######################

# In this block we give names to each field, and do some basic
# initialization.
{
    codepoint = hex($1);
    name = $2;
    category = $3;
    uppercase = $13;
    lowercase = $14;

    codepoint_end = codepoint;
    charset_index = 0;

    for (i in charsets) {
        delete charsets[i];
    }
}

# Some pairs of lines in UnicodeData.txt delimit ranges of
# characters.
name ~ /First>$/ {
    getline;
    last_name = name;
    sub(/First>$/, "Last>", last_name);
    if (last_name != $2) {
        die(1, "Invalid range in Unicode data.");
        exit_status = 1;
        exit 1;
    }
    codepoint_end = hex($1);
}

# Character set predicates
##########################

## The lower_case character set
###############################

# For Unicode, we follow Java's specification: a character is
# lowercase if
#    * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and
#    * the Unicode attribute table does not give a lowercase mapping
#      for it, and
#    * at least one of the following is true:
#          o the Unicode attribute table gives a mapping to uppercase
#            for the character, or
#          o the name for the character in the Unicode attribute table
#            contains the words "SMALL LETTER" or "SMALL LIGATURE".
(codepoint < 8192 || codepoint > 12287) &&
lowercase == "" &&
(uppercase != "" || name ~ /(SMALL LETTER|SMALL LIGATURE)/) {
    charsets[charset_index++] = "lower_case";
}

## The upper_case character set
###############################

# For Unicode, we follow Java's specification: a character is
# uppercase if
#    * it is not in the range [U+2000,U+2FFF] ([8192,12287]), and
#    * the Unicode attribute table does not give an uppercase mapping
#      for it (this excludes titlecase characters), and
#    * at least one of the following is true:
#          o the Unicode attribute table gives a mapping to lowercase
#            for the character, or
#          o the name for the character in the Unicode attribute table
#            contains the words "CAPITAL LETTER" or "CAPITAL LIGATURE".
(codepoint < 8192 || codepoint > 12287) &&
uppercase == "" &&
(lowercase != "" || name ~ /(CAPITAL LETTER|CAPITAL LIGATURE)/) {
    charsets[charset_index++] = "upper_case";
}

## The title_case character set
###############################

# A character is titlecase if it has the category Lt in the character
# attribute database.
category == "Lt" {
    charsets[charset_index++] = "title_case";
}

## The letter character set
###########################

# A letter is any character with one of the letter categories (Lu, Ll,
# Lt, Lm, Lo) in the Unicode character database.
category == "Lu" ||
category == "Ll" ||
category == "Lt" ||
category == "Lm" ||
category == "Lo" {
    charsets[charset_index++] = "letter";
    charsets[charset_index++] = "letter_plus_digit";
}

## The digit character set
##########################

# A character is a digit if it has the category Nd in the character
# attribute database.  In Latin-1 and ASCII, the only such characters
# are 0123456789.  In Unicode, there are other digit characters in
# other code blocks, such as Gujarati digits and Tibetan digits.
category == "Nd" {
    charsets[charset_index++] = "digit";
    charsets[charset_index++] = "letter_plus_digit";
}

## The hex_digit character set
##############################

# The only hex digits are 0123456789abcdefABCDEF.
(codepoint >= 48 && codepoint <= 57) ||
(codepoint >= 65 && codepoint <= 70) ||
(codepoint >= 97 && codepoint <= 102) {
    charsets[charset_index++] = "hex_digit";
}

## The graphic character set
############################

# Characters that would 'use ink' when printed
category ~ /L|M|N|P|S/ {
    charsets[charset_index++] = "graphic";
    charsets[charset_index++] = "printing";
}

## The whitespace character set
###############################

# A whitespace character is either
#    * a character with one of the space, line, or paragraph separator
#      categories (Zs, Zl or Zp) of the Unicode character database.
#    * U+0009 (09) Horizontal tabulation (\t control-I)
#    * U+000A (10) Line feed (\n control-J)
#    * U+000B (11) Vertical tabulation (\v control-K)
#    * U+000C (12) Form feed (\f control-L)
#    * U+000D (13) Carriage return (\r control-M)
category ~ /Zs|Zl|Zp/ ||
(codepoint >= 9 && codepoint <= 13) {
    charsets[charset_index++] = "whitespace";
    charsets[charset_index++] = "printing";
}

## The iso_control character set
################################

# The ISO control characters are the Unicode/Latin-1 characters in the
# ranges [U+0000,U+001F] ([0,31]) and [U+007F,U+009F] ([127,159]).
(codepoint >= 0 && codepoint <= 31) ||
(codepoint >= 127 && codepoint <= 159) {
    charsets[charset_index++] = "iso_control";
}

## The punctuation character set
################################

# A punctuation character is any character that has one of the
# punctuation categories in the Unicode character database (Pc, Pd,
# Ps, Pe, Pi, Pf, or Po.)

# Note that srfi-14 gives conflicting requirements!!  It claims that
# only the Unicode punctuation is necessary, but, explicitly calls out
# the soft hyphen character (U+00AD) as punctuation.
# Current versions of Unicode consider U+00AD to be a formatting
# character, not punctuation.
category ~ /P/ {
    charsets[charset_index++] = "punctuation";
}

## The symbol character set
###########################

# A symbol is any character that has one of the symbol categories in
# the Unicode character database (Sm, Sc, Sk, or So).
category ~ /S/ {
    charsets[charset_index++] = "symbol";
}

## The blank character set
##########################

# Blank chars are horizontal whitespace.  A blank character is either
#    * a character with the space separator category (Zs) in the
#      Unicode character database.
#    * U+0009 (9) Horizontal tabulation (\t control-I)
category ~ /Zs/ || codepoint == 9 {
    charsets[charset_index++] = "blank";
}

## The ascii character set
##########################

codepoint <= 127 {
    charsets[charset_index++] = "ascii";
}

## The designated character set
###############################

category !~ /Cs/ {
    charsets[charset_index++] = "designated";
}

## Other character sets
#######################

# Note that the "letter_plus_digit" and "printing" character sets, which
# are unions of other character sets, are included in the patterns
# matching their constituent parts (i.e., the "letter_plus_digit"
# character set is included as part of the "letter" and "digit"
# patterns).
#
# Also, the "empty" character set is computed by doing precisely nothing!

# Keeping track of state
########################

# Update the state for each charset.
{
    for (i in charsets) {
        cs = charsets[i];
        if (state[cs, "start"] == -1) {
            state[cs, "start"] = codepoint;
            state[cs, "end"] = codepoint_end;
        } else if (state[cs, "end"] + 1 == codepoint) {
            state[cs, "end"] = codepoint_end;
        } else {
            count = state[cs, "count"];
            state[cs, "count"]++;
            state[cs, "ranges", count, 0] = state[cs, "start"];
            state[cs, "ranges", count, 1] = state[cs, "end"];
            state[cs, "start"] = codepoint;
            state[cs, "end"] = codepoint_end;
        }
    }
}

# Printing and error handling
#############################

END {
    # Normally, an exit statement runs all the 'END' blocks before
    # actually exiting.  We use the 'exit_status' variable to short
    # circuit the rest of the 'END' block by reissuing the exit
    # statement.
    if (exit_status != 0) {
        exit exit_status;
    }

    # Write a bit of a header.
    print("/* srfi-14.i.c -- standard SRFI-14 character set data */");
    print("");
    print("/* This file is #include'd by srfi-14.c.  */");
    print("");
    print("/* This file was generated from");
    print("   http://unicode.org/Public/UNIDATA/UnicodeData.txt");
    print("   with the unidata_to_charset.awk script.  */");
    print("");

    for (i = 0; i < all_charsets_count; i++) {
        cs = all_charsets[i];

        # Extra logic to ensure that the last range is included.
        if (state[cs, "start"] != -1) {
            count = state[cs, "count"];
            state[cs, "count"]++;
            state[cs, "ranges", count, 0] = state[cs, "start"];
            state[cs, "ranges", count, 1] = state[cs, "end"];
        }

        count = state[cs, "count"];

        print("static const scm_t_char_range cs_" cs "_ranges[] = {");
        for (j = 0; j < count; j++) {
            rstart = state[cs, "ranges", j, 0];
            rend = state[cs, "ranges", j, 1];
            if (j + 1 < count) {
                printf("  {0x%04x, 0x%04x},\n", rstart, rend);
            } else {
                printf("  {0x%04x, 0x%04x}\n", rstart, rend);
            }
        }
        print("};");
        print("");

        count = state[cs, "count"];
        printf("static const size_t cs_%s_len = %d;\n", cs, count);
        if (i + 1 < all_charsets_count) {
            print("");
        }
    }
}

# And we're done.

--=-=-=--