From: Tom Lord
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Worrying development
Date: Thu, 22 Jan 2004 13:47:36 -0800 (PST)
Message-ID: <200401222147.NAA21462@morrowfield.regexps.com>
References: <1074246064.6729.23.camel@localhost>
	<87vfn9ufvw.fsf@zagadka.ping.de>
	<200401182158.NAA01914@morrowfield.regexps.com>
Original-To: guile-devel@gnu.org
In-reply-to: <200401182158.NAA01914@morrowfield.regexps.com> (message from Tom Lord on Sun, 18 Jan 2004 13:58:47 -0800 (PST))
> From: Tom Lord
>
> > From: Marius Vollmer
> >
> > I'll try to put forth a proposal in the next days for the string
> > part of the type conversion API that allows Unicode and shared
> > substrings.
>
> If you don't mind, I'd like to do that too -- independently.  But in
> a screwy way.
>
> I need to write a spec for the string API of Pika.  So far, the Pika
> FFI is such that it could be implemented for Guile very easily.  And,
> imo, it's also pretty good as an internal API for almost everything.
> If used (gradually migrated to, in the case of Guile) as an internal
> API -- I think it's pretty liberating (permitting lots of freedom for
> different object representations, thread support, GC strategies,
> etc.).
>
> Maybe in the longer term -- unifying over the Pika APIs would be a
> general win.

So, with apologies since most readers have probably already seen it in
other forums:

		Unicode Meets Scheme Meets C
		----------------------------

Specifying Unicode string and character support for Pika is an
exercise in language design (because standard Scheme needs to be
modified and extended), Scheme interface design, C interface design,
and implementation design.

Additionally, we would like a subset of the Pika C interface to
characters and strings to be something that _any_ C-compatible Scheme
implementation can provide (including Schemes that support only
ISO8859-* character sets or other subsets of Unicode).
Additionally, we would like the new and modified Scheme interfaces to
strings and characters to be such that other implementors may choose
to provide them and future revisions to the Scheme standard may
incorporate them.

This document provides some background on the problem, specifies the
three interfaces, identifies the (hopefully) portable subset of the C
interface, recommends changes to the Scheme standard, and, finally,
describes how all of these will be implemented in Pika.

The reader is presumed to be somewhat familiar with the Unicode
standard.

Contents

  * Scheme Characters Meet Unicode
  * Scheme Strings Meet Unicode
  * The Scheme Interface to Unicode Characters and Strings
  * The Portable FFI Interface to Unicode Characters and Strings
  * The Pika FFI Interface to Unicode Characters and Strings

Non-contents (unresolved issues)

  * Unicode and Ports (what does READ-CHAR mean?)

* Scheme Characters

** Required and Expected Characters

R5RS requires that all members of the character set used to specify
the syntax of Scheme must be representable as Scheme characters.
These include various punctuation marks, space, newline, digits, and
the lowercase letters `a..z'.  The specifications of CHAR-UPCASE,
CHAR-DOWNCASE, and CHAR-ALPHABETIC? strongly imply that `A..Z' must
also be present.  (The only way not to read this as a requirement is
to imagine an implementation in which, for example, `a' is not an
alphabetic character.)

By reasonable expectation, tab and form-feed should also be present.
Implementations which do not provide them are possible, but would be
rather exceptional in that regard.

/=== R6RS Recommendation:

  R6RS should explicitly require that tab, form-feed, and both `a..z'
  and `A..Z' are present, that the letters are alphabetic characters,
  and that they are case-mapped as expected.

|=== Portable FFI Recommendation:

  A portable FFI should assume that that requirement is in place.

|=== Pika Design:

  Pika assumes that that requirement is in place.
\========

Perhaps more controversially: all members of the character set
comprising the first half (U+0000..U+007F) of iso8859-1 should be
representable as Scheme characters.  These characters are found in
ASCII, in all iso8859-* character sets, and in Unicode itself.  They
are used almost universally and are likely to remain so indefinitely.

/=== R6RS Recommendation:

  R6RS should explicitly and strongly encourage the presence of all
  ASCII characters in the CHAR? type.

|=== Portable FFI Recommendation:

  A portable FFI should require that all characters representable in
  portable C as non-octal-escaped character constants are
  representable in Scheme.  It should strongly encourage that all
  ASCII characters are representable in Scheme.

|=== Pika Design:

  The Pika CHAR? type will include representations for all of ASCII.

\========

By somewhat reasonable expectation, there must be at least 256
distinct Scheme characters, and INTEGER->CHAR must be defined for all
integers in the range `0..255'.  There are many circumstances in which
conversions between octets and characters are desirable, and this
expectation guarantees that such conversion is always possible.

It is quite possible to imagine implementations in which this is not
the case: in which, for example, a (fully general) octet stream can
not be read and written using READ-CHAR and DISPLAY (applied to
characters).  Such an implementation might introduce non-standard
procedures for reading and writing octets and for representing arrays
of octets.  While such non-standard extensions may be desirable for
independent reasons, I see no good reason not to define at least a
subset of Scheme characters which is mapped to the set of octet
values.

/=== R6RS Recommendation:

  R6RS should explicitly and strongly encourage the presence of at
  least 256 characters and that INTEGER->CHAR is defined for the
  entire range 0..255 (at least).
|=== Portable FFI Recommendation:

  A portable FFI should require that the intersection of the set of
  integers -128..127 and the set of values representable as a `char'
  value are representable in Scheme as CHAR? values.

  The preferred 8-bit character representation in the FFI should be
  `unsigned char', and the Scheme representation (if any) of any
  unsigned character `uc' should be the same as that of `(char)uc'.

  Note: stating these requirements is greatly simplified if the FFI
  simply requires that `char' and `unsigned char' are 8-bit types.

|=== Pika Design:

  Pika will satisfy the FFI requirement and require that `char' is an
  8-bit integer.

\========

*** Remaining Degrees of Freedom for the CHAR? Type

Scheme implementations consistent with our proposed requirements so
far are likely to partition into four broad classes:

  ~ those providing exactly 256 distinct (under CHAR=?) CHAR? values

  ~ those providing approximately 2^16 CHAR? values

  ~ those providing approximately 2^21 CHAR? values

  ~ those providing an infinite set of CHAR? values

/=== Pika Design:

  Pika is of the "approximately 2^21 characters" variety.

  Specifically, the Pika CHAR? type will in effect be a _superset_ of
  the set of Unicode codepoints.  Each 21-bit codepoint will
  correspond to a Pika character.  For each such character, there
  will be 2^4 - 1 (i.e., 15) additional related characters
  representing the basic codepoint modified by a combination of any
  of four "buckybits".

  For example, the Unicode codepoint U+0041 can be written in Pika
  as:

	#\A

  and by applying buckybits (shift, meta, alt, hyper) an additional
  15 characters can be formed, giving the total set of 16 "A"
  characters:

	#\A        #\S-A      #\M-A      #\H-A
	#\A-A      #\S-M-A    #\S-H-A    #\S-A-A
	#\M-H-A    #\M-A-A    #\H-A-A    #\S-M-H-A
	#\S-M-A-A  #\S-H-A-A  #\M-H-A-A  #\S-M-H-A-A

\========

** Case Mappings

Strictly speaking, R5RS does not require that the character set
contain any upper or lower case letters.  For example, it must
contain `a' but it does not require that CHAR-LOWER-CASE?
of `a' is true.  However, in an implementation in which `a' is not
lower case, `a' must also not be alphabetic.

/=== R6RS Recommendation:

  R6RS should explicitly require that `a..z' and `A..Z' are
  alphabetic and cased in the expected way.

|=== Pika Design:

  Pika will satisfy the proposed requirement.

\========

R5RS requires a partial ordering of characters in which upper and
lower case variants of "the same character" are treated as equal.

Most problematically: R5RS requires that every alphabetic character
have both an upper and a lower case variant.  This is a problem
because Unicode defines abstract characters which, at least
intuitively, are alphabetic -- but which lack such case mappings.
We'll explore the topic further, later, but briefly: it does not
appear that "good Unicode support" and "R5RS requirements for case
mappings" are compatible -- at least not in a simple way.

/=== R6RS Recommendation:

  R6RS should drop the requirement from the closing sentence of
  section 6.3.4 which says:

	In addition, if char is alphabetic, then the result of
	char-upcase is upper case and the result of char-downcase is
	lower case.

  Instead, it should say:

	There is no requirement that all alphabetic characters have
	upper- and lowercase mappings and no requirement that all
	alphabetic characters return true for one of CHAR-UPPER-CASE?
	or CHAR-LOWER-CASE?.  There is no requirement that if a
	character is upper or lower case, it has a case mapping which
	is itself a character.  However, it is required that the
	characters `a..z' are lowercase, `A..Z' are uppercase, and
	that CHAR-UPCASE and CHAR-DOWNCASE convert between those two
	ranges.

  And:

	The case mapping procedures and predicates _must_ be
	consistent with the case mappings that determine equivalence
	of identifiers and symbol names.
	In many environments they are usable for case mappings in a
	broader linguistic sense, but programmers are cautioned that,
	in general, they are not appropriate for such uses in
	portable applications: some alphabets lack the concept of
	case entirely; others have the concept of case but lack a 1:1
	mapping between upper and lowercase characters.  Different
	case mapping procedures should be used in portable
	linguistically-oriented applications.

|=== Pika Design:

  Pika will include CHAR? values such as Unicode's eszett (U+00DF)
  with properties that cannot satisfy R5RS's requirements, as in
  these examples:

	(char-alphabetic? eszett)                => #t
	(char-lower-case? eszett)                => #t
	(char-lower-case? (char-upcase eszett))  => #t
	(char=? eszett (char-upcase eszett))     => #t

\========

** Character Ordering and Enumeration

R5RS requires a bijective mapping between characters and a set of
integers (the CHAR->INTEGER and INTEGER->CHAR procedures).  Let's
call this mapping the "enumeration of characters".

R5RS requires a total ordering of characters and requires that that
ordering be isomorphic to the numeric ordering of the enumeration.

R5RS underspecifies the total ordering of characters.  It requires
that the alphabetic characters `a..z' and (if present) `A..Z' be
(respectively) lexically ordered.  It requires that decimal digits be
decimally ordered.  It requires that the digits either all precede or
all follow all uppercase letters.

The ordering requirements of R5RS are problematic for Unicode.  While
the iso8859-* digits `0..9' easily satisfy the requirement, Unicode
defines additional decimal digits which do not.  Intuitively, it
seems that either CHAR-NUMERIC? must not behave as one would like on
some Unicode abstract characters, or the ordering requirement will
have to change in R6RS.

/=== R6RS Recommendation:

  R6RS should modify the requirement from section 6.3.4 which says:

	These procedures impose a total ordering on the set of
	characters.
	It is guaranteed that under this ordering:

	  * The upper case characters are in order.  For example,
	    (char<? #\A #\B) returns #t.

	  * The lower case characters are in order.  For example,
	    (char<? #\a #\b) returns #t.

	  * The digits are in order.  For example, (char<? #\0 #\9)
	    returns #t.

	  * Either all the digits precede all the upper case letters,
	    or vice versa.

	  * Either all the digits precede all the lower case letters,
	    or vice versa.

[...]

* The Portable FFI Interface to Unicode Characters and Strings

  ~ t_unicode;

	An unsigned integer type sufficiently large to hold a Unicode
	codepoint.

  ~ enum uni_encoding_scheme;

	Valid values include but are not limited to:

	  uni_utf8
	  uni_utf16
	  uni_utf16be
	  uni_utf16le
	  uni_utf32
	  uni_iso8859_1 ... uni_iso8859_16
	  uni_ascii

** Character Conversions

  ~ t_scm_error scm_character_to_codepoint (t_unicode * answer,
					    t_scm_arena instance,
					    t_scm_word * chr)

	Normally, return (via `*answer') the Unicode codepoint value
	corresponding to the Scheme value `*chr'.  Return 0.

	If `*chr' is not a character or is not representable as a
	Unicode codepoint, set `*answer' to U+FFFD and return an
	error code.

  ~ t_scm_error scm_character_to_ascii (char * answer,
					t_scm_arena instance,
					t_scm_word * chr)

	Normally, return (via `*answer') the ASCII codepoint value
	corresponding to the Scheme value `*chr'.  Return 0.

	If `*chr' is not an ASCII character or is not representable
	as a Unicode codepoint, set `*answer' to 0 and return an
	error code.

  ~ t_scm_error scm_codepoint_to_character (t_scm_word * answer,
					    t_scm_arena instance,
					    t_unicode codepoint)

	Normally, return (via `*answer') the Scheme character
	corresponding to `codepoint', which must be in the range
	0..(2^21 - 1).  Return 0.

	If `codepoint' is not representable as a Scheme character,
	set `*answer' to an unspecified Scheme value and return an
	error code.

  ~ void scm_ascii_to_character (t_scm_word * answer,
				 t_scm_arena instance,
				 char chr)

	Return (via `*answer') the Scheme character corresponding to
	`chr', which must be representable as an ASCII character.
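The error-reporting convention above (return 0 on success; on failure,
store a distinguished fallback value such as U+FFFD and return a
nonzero code) can be sketched in plain C.  This is a toy model, not
the real Pika FFI: the error code is a stand-in, and a "Scheme
character" is faked here as a plain integer rather than a t_scm_word.

```c
#include <stdint.h>

typedef uint32_t t_unicode;
typedef int t_scm_error;          /* 0 = success, nonzero = an error code */

#define TOY_ERR_NOT_A_CHAR 1      /* hypothetical error code */

/* Toy stand-in for scm_character_to_codepoint: on success, store the
 * codepoint through `answer' and return 0; on failure, store U+FFFD
 * (REPLACEMENT CHARACTER) there and return an error code.  A "Scheme
 * character" is faked as a long, with out-of-range values playing the
 * role of "not a character". */
t_scm_error
toy_character_to_codepoint (t_unicode * answer, long chr)
{
  if (chr < 0 || chr >= (1L << 21))   /* Pika codepoints are 21-bit */
    {
      *answer = 0xFFFD;
      return TOY_ERR_NOT_A_CHAR;
    }
  *answer = (t_unicode)chr;
  return 0;
}
```

The point of the convention is that `*answer' is always left holding a
well-defined value, so a caller that forgets to check the return code
still sees the replacement character rather than garbage.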
** String Conversions

  ~ t_scm_error scm_extract_string8 (t_uchar * answer,
				     size_t * answer_len,
				     enum uni_encoding_scheme enc,
				     t_scm_arena instance,
				     t_scm_word * str)

	Normally, convert `str' to the indicated encoding (which must
	be one of `uni_utf8', `uni_iso8859_*', or `uni_ascii'),
	storing the result in the memory addressed by `answer' and
	the number of bytes stored in `*answer_len'.  Return 0.

	On input, `*answer_len' should indicate the amount of storage
	available at the address `answer'.  If there is insufficient
	memory available, `*answer_len' will be set to the number of
	bytes needed and the value `scm_err_too_short' returned.

  ~ t_scm_error scm_extract_string16 (t_uint16 * answer,
				      size_t * answer_len,
				      enum uni_encoding_scheme enc,
				      t_scm_arena instance,
				      t_scm_word * str)

	Normally, convert `str' to the indicated encoding (which must
	be one of `uni_utf16', `uni_utf16be', or `uni_utf16le'),
	storing the result in the memory addressed by `answer' and
	the number of 16-bit values stored in `*answer_len'.  Return
	0.

	On input, `*answer_len' should indicate the amount of storage
	available at the address `answer' (measured in 16-bit
	values).  If there is insufficient memory available,
	`*answer_len' will be set to the number of 16-bit values
	needed and the value `scm_err_too_short' returned.

  ~ t_scm_error scm_extract_string32 (t_uint32 * answer,
				      size_t * answer_len,
				      t_scm_arena instance,
				      t_scm_word * str)

	Normally, convert `str' to UTF-32, storing the result in the
	memory addressed by `answer' and the number of 32-bit values
	stored in `*answer_len'.  Return 0.

	On input, `*answer_len' should indicate the amount of storage
	available at the address `answer' (measured in 32-bit
	values).  If there is insufficient memory available,
	`*answer_len' will be set to the number of 32-bit values
	needed and the value `scm_err_too_short' returned.
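The in/out `*answer_len' convention used by the scm_extract_string*
family is the familiar two-call pattern: probe for the needed size,
allocate, then call again.  A minimal self-contained sketch of that
convention, with toy names (an ordinary C string stands in for a
Scheme STRING? value, and the error value is a stand-in for
`scm_err_too_short'):

```c
#include <stddef.h>
#include <string.h>

#define TOY_ERR_TOO_SHORT 2   /* stand-in for scm_err_too_short */

/* Toy version of the scm_extract_string8 calling convention: on
 * input, `*answer_len' is the capacity available at `answer'; if that
 * is insufficient, the needed length is stored there and an error is
 * returned, so a second call with a large-enough buffer succeeds. */
int
toy_extract_string8 (char * answer, size_t * answer_len, const char * str)
{
  size_t needed = strlen (str);

  if (*answer_len < needed)
    {
      *answer_len = needed;
      return TOY_ERR_TOO_SHORT;
    }
  memcpy (answer, str, needed);
  *answer_len = needed;
  return 0;
}
```

A caller can pass `*answer_len' of 0 on the first call purely to learn
the required size, then allocate exactly that much and call again.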
  ~ t_scm_error scm_make_string8_n (t_scm_word * answer,
				    t_scm_arena instance,
				    t_uchar * str,
				    enum uni_encoding_scheme enc,
				    size_t str_len)

	Convert `str', of length `str_len' and in encoding `enc', to
	a Scheme value (returned in `*answer') if possible, and
	return 0.  Otherwise, return an error code.

  ~ t_scm_error scm_make_string16_n (t_scm_word * answer,
				     t_scm_arena instance,
				     t_uint16 * str,
				     enum uni_encoding_scheme enc,
				     size_t str_len)

	Convert `str', of length `str_len' and in encoding `enc', to
	a Scheme value (returned in `*answer') if possible, and
	return 0.  Otherwise, return an error code.

  ~ t_scm_error scm_make_string32_n (t_scm_word * answer,
				     t_scm_arena instance,
				     t_uint32 * str,
				     enum uni_encoding_scheme enc,
				     size_t str_len)

	Convert `str', of length `str_len' and in encoding `enc', to
	a Scheme value (returned in `*answer') if possible, and
	return 0.  Otherwise, return an error code.

** Other FFI Functions

Various standard Scheme procedures (e.g., STRING-REF) ought to be
present in the FFI as well.  Their mappings into C are
straightforward, given the proposed requirement that string indexes
refer to codepoint offsets.

* The Pika FFI Interface to Unicode Characters and Strings

The standard Scheme string procedures map into the native Pika FFI in
the expected way (their C function prototypes and semantics can be
inferred trivially from their standard Scheme specifications).  As
such, they are not specified here.

** Pika FFI Access to String Data

(Note: in earlier discussions with some people I had suggested that
Pika strings might be represented by a tree structure.  I have since
decided that that should not be the case: I want Pika to live up to
the expectation that STRING? values are represented as arrays.  The
tree-structured-string-like type will be present in Pika, but it will
be distinct from STRING?.)

There is a touchy design issue regarding strings.
On the one hand, for efficiency's sake, C code needs fairly direct
access (for both reading and writing) to string data.  On the other
hand, that necessarily means exposing to C the otherwise internal
details of value representation, and addressing the possibility of a
multi-threaded Pika.

  ~ t_scm_error scm_lock_string_data (t_udstr * answer,
				      t_scm_arena instance,
				      t_scm_word * str)

	Obtain the `t_udstr' (see libhackerlab) corresponding to
	`str'.  Return 0.

	If `*str' is not a string, set `*answer' to 0 and return an
	error code.

	A string obtained by this call must be released by
	`scm_release_string_data' (see below).  It is critical that
	between calls to `scm_lock_string_data' and
	`scm_release_string_data' no other calls to Pika FFI
	functions be made, and that the number of instructions
	between those calls is sharply bounded.

  ~ t_scm_error scm_lock_string_data2 (t_udstr * answer1,
				       t_udstr * answer2,
				       t_scm_arena instance,
				       t_scm_word * str1,
				       t_scm_word * str2)

	Like `scm_lock_string_data' but return the `t_udstr's for two
	strings instead of one.

  ~ t_scm_error scm_release_string_data (t_scm_arena instance,
					 t_scm_word * str)

	Release a string locked by `scm_lock_string_data'.

  ~ t_scm_error scm_release_string_data2 (t_scm_arena instance,
					  t_scm_word * str1,
					  t_scm_word * str2)

	Release strings locked by `scm_lock_string_data2'.

** Pika String Internals

libhackerlab contains a string-like type, `t_udstr', defined
(internally) as a pointer to a structure of the form:

	struct udstr_handle
	{
	  enum uni_encoding_scheme enc;
	  size_t length;
	  uni_string data;
	  alloc_limits limits;
	  int refs;
	};

The intention here is to represent a string in an arbitrary Unicode
encoding, with an explicitly recorded length measured in coding
units.

libhackerlab contains a (currently small but growing) number of
"string primitives" for operating on `t_udstr' values.  For example,
one can concatenate two strings without regard to what encoding
scheme each is in.
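The `refs' field above suggests how the lock/release pairing might
work underneath: locking bumps a reference count so the handle stays
valid while C code touches the data, and releasing drops it, with
storage reclaimed only when the last reference goes away.  This is a
speculative sketch under that assumption; the real libhackerlab/Pika
behavior may differ, and all names here are invented:

```c
#include <stdlib.h>
#include <string.h>

/* Speculative model of the `refs' field of struct udstr_handle: a
 * lock increments the count, a release decrements it, and storage is
 * reclaimed only when the count reaches zero. */
struct toy_udstr
{
  int refs;
  char * data;
};

struct toy_udstr *
toy_udstr_new (const char * s)
{
  struct toy_udstr * h = malloc (sizeof *h);
  size_t n = strlen (s) + 1;

  h->refs = 1;                  /* the owning Scheme object's reference */
  h->data = malloc (n);
  memcpy (h->data, s, n);
  return h;
}

void
toy_lock (struct toy_udstr * h)
{
  h->refs++;
}

/* Return 1 if this was the last reference and the handle was freed. */
int
toy_release (struct toy_udstr * h)
{
  if (--h->refs == 0)
    {
      free (h->data);
      free (h);
      return 1;
    }
  return 0;
}
```

This also illustrates why the window between lock and release must be
sharply bounded: while any C-side reference is outstanding, the
storage is pinned and cannot be reclaimed or relocated by the GC.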
In Pika, strings shall be represented as `t_udstr' values, pointed to
by an scm_vtable object.  Further restrictions apply, however: Pika
strings will have these properties:

  ~ any string including a character in the range U+10000..U+1FFFFF
    will be stored in a t_udstr using the uni_utf32 encoding;

  ~ any other string including a character in the range
    U+0100..U+FFFF will be stored in a t_udstr using the uni_utf16
    encoding;

  ~ all remaining strings will be stored in a t_udstr using the
    uni_iso8859_1 encoding.

In that way, the length of the t_udstr (measured in encoding units)
will always be equal to the length of the Scheme string (measured in
codepoints).  Nearly all Scheme string operations involving string
indexes can find the referenced characters in O(1) time.  The primary
exception is STRING-SET!, if the character being stored requires the
string it is being stored into to change encoding.  At the same time,
Scheme strings will have a space-efficient representation.
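The three rules above amount to: scan for the widest codepoint in the
string and pick the narrowest encoding whose code units can hold it,
so that one code unit always equals one codepoint.  A sketch of that
selection rule (toy enum values; the real representation lives in
libhackerlab's uni_encoding_scheme):

```c
#include <stddef.h>
#include <stdint.h>

enum toy_encoding { TOY_ISO8859_1, TOY_UTF16, TOY_UTF32 };

/* Pick the narrowest of the three permitted encodings able to hold
 * every codepoint in the string, per the rules above: any character
 * above U+FFFF forces uni_utf32, any in U+0100..U+FFFF forces at
 * least uni_utf16, and otherwise uni_iso8859_1 suffices. */
enum toy_encoding
toy_pick_encoding (const uint32_t * cps, size_t n)
{
  enum toy_encoding e = TOY_ISO8859_1;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (cps[i] >= 0x10000)
	return TOY_UTF32;         /* U+10000..U+1FFFFF */
      if (cps[i] >= 0x100)
	e = TOY_UTF16;            /* U+0100..U+FFFF */
    }
  return e;
}
```

Note that under this scheme a surrogate pair never occurs in stored
data: a codepoint needing more than 16 bits promotes the whole string
to 32-bit units, which is what keeps indexing O(1).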