* Worrying development
@ 2004-01-16 9:41 Roland Orre
  2004-01-16 11:59 ` tomas
  2004-01-18 21:05 ` Marius Vollmer
  0 siblings, 2 replies; 26+ messages in thread
From: Roland Orre @ 2004-01-16 9:41 UTC
Cc: guile-user

At the moment I'm somewhat in a crisis about how to continue with guile. When something as fundamental as shared substrings is removed between guile 1.6 and guile 1.7 without being replaced, I see a serious risk here of a split, doing as Thien-Thi Nguyen did.

OK, guile 1.7 is a development version so I should not rely on it. On the other hand, as we are using guile as our main tool for the analysis we do, we of course also want to adapt to the development version to be able to utilize facilities which are only available there.

I see this as an important policy issue and I quickly need to decide how to go further. One way is of course that I patch specific parts of the code, as in the shared substring case tags.h, strings.c and gc.c. This is, however, also a risky thing to do, and it means more maintenance work for every new version.

I suggest that shared substrings are moved back to guile. It is quite serious when something which has been promoted in the reference manual, and which cannot trivially be fixed in user code, is removed.

Roland Orre

_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel
* Re: Worrying development
  2004-01-16 9:41 Worrying development Roland Orre
@ 2004-01-16 11:59 ` tomas
  2004-01-18 21:05 ` Marius Vollmer
  1 sibling, 0 replies; 26+ messages in thread
From: tomas @ 2004-01-16 11:59 UTC
Cc: guile-user, guile-devel

On Fri, Jan 16, 2004 at 10:41:05AM +0100, Roland Orre wrote:
> At the moment I'm somewhat in a crisis how to continue with guile.

[about the disappearing of shared substrings in Guile]

Hrm. I wasn't very happy with this decision either. But since Dirk et al were doing the grunt work and I had nothing to contribute, I decided to shut up.

As far as I remember, the main reason for this step was simplicity: with shared substrings you have to choose between changes propagating from substrings to the containing string (which you sometimes want, but not always) and a copy-on-write implementation. Which is right? Besides, with non-shared substrings, you may guarantee null-termination, which helps naïve C interfacing (not a major advantage to my eyes, but...).

I'd propose to implement a higher-level library (perhaps using GOOPS) in which strings (or more generally arrays) are implemented as (lists/trees/arrays) of arrays. That'd help designing an interface, and when done (and if necessary), re-implementing as smobs in C (or steal a datatype ;-)

Opinions?

-- tomás
* Re: Worrying development
  2004-01-16 9:41 Worrying development Roland Orre
  2004-01-16 11:59 ` tomas
@ 2004-01-18 21:05 ` Marius Vollmer
  2004-01-18 21:58 ` Tom Lord
  2004-01-22 16:11 ` Dirk Herrmann
  1 sibling, 2 replies; 26+ messages in thread
From: Marius Vollmer @ 2004-01-18 21:05 UTC
Cc: guile-user, guile-devel

Roland Orre <orre@nada.kth.se> writes:
> I suggest that shared substrings are moved back to guile.

Agreed. I'm sorry for previously giving the impression that shared substrings won't come back.

There is no problem on the Scheme side of things: we can just add shared substrings and make it a proper subtype of 'string'. The problem lies with C code and there only with the low level API consisting of SCM_STRINGP, SCM_STRING_CHARS etc. Functions like scm_c_string2str can be updated to just continue to work.

Shared substrings also touch on the issues of using Unicode in Guile and on making sure we have a nice type conversion API that can replace gh_ in all respects. I'd like to do it in this order:

- type conversion API (which allows for different encodings of strings, but doesn't need it immediately) (the first part of this was the 'frame' stuff for handling unwinds in C).
- Unicode (with shared substrings in mind).
- shared substrings

Of course, we shouldn't do too much lest 1.8 won't happen...

I'll try to put forth a proposal in the next days for the string part of the type conversion API that allows Unicode and shared substrings.

-- GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405

_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user
* Re: Worrying development
  2004-01-18 21:05 ` Marius Vollmer
@ 2004-01-18 21:58 ` Tom Lord
  2004-01-22 21:47 ` Tom Lord
  2004-01-22 16:11 ` Dirk Herrmann
  1 sibling, 1 reply; 26+ messages in thread
From: Tom Lord @ 2004-01-18 21:58 UTC
Cc: guile-user, guile-devel

> From: Marius Vollmer <mvo@zagadka.de>
> I'll try to put forth a proposal in the next days for the string part
> of the type conversion API that allows Unicode and shared substrings.

If you don't mind, I'd like to do that too -- independently. But in a screwy way.

I need to write a spec for the string API of Pika. So far, the Pika FFI is such that it could be implemented for Guile very easily. And, imo, it's also pretty good as an internal API for almost everything. If used (gradually migrated to, in the case of Guile) as an internal API -- I think it's pretty liberating (permitting lots of freedom for different object representations, thread support, GC strategies, etc.).

Maybe in the longer term -- unifying over the Pika APIs would be a general win.

-t
* Re: Worrying development 2004-01-18 21:58 ` Tom Lord @ 2004-01-22 21:47 ` Tom Lord 0 siblings, 0 replies; 26+ messages in thread From: Tom Lord @ 2004-01-22 21:47 UTC (permalink / raw) > From: Tom Lord <lord@emf.net> > > From: Marius Vollmer <mvo@zagadka.de> > > I'll try to put forth a proposal in the next days for the string part > > of the type conversion API that allows Unicode and shared substrings. > If you don't mind, I'd like to do that too -- independently. But in > a screwy way. > I need to write a spec for the string API of Pika. So far, the Pika > FFI is such that it could be implemented for Guile very easily. And, > imo, it's also pretty good as an internal API for almost everything. > If used (gradually migrated to, in the case of Guile) as an internal > API -- I think it's pretty liberating (permitting lots of freedom for > different object representations, thread support, GC strategies, > etc.). > Maybe in the longer term -- unifying over the Pika APIs would be a > general win. So, with apologies since most readers have probably already seen it in other forums: Unicode Meets Scheme Meets C ---------------------------- Specifying Unicode string and character support for Pika is an exercise in language design (because standard Scheme needs to be modified and extended), Scheme interface design, C interface design, and implementation design. Additionally, we would like a subset of the Pika C interface to characters and strings to be something that _any_ C-compatible Scheme implementation can provide (including Schemes that support only ISO8859-* character sets or other subsets of Unicode). Additionally, we would like the new and modified Scheme interfaces to strings and characters to be such that other implementors may choose to provide them and future revisions to the Scheme standard may incorporate them. 
This document provides some background on the problem, specifies the three interfaces, identifies the (hopefully) portable subset of the C interface, recommends changes to the Scheme standard, and finally, describes how all of these will be implemented in Pika. The reader is presumed to be somewhat familiar with the Unicode standard. Contents * Scheme Characters Meet Unicode * Scheme Strings Meet Unicode * The Scheme Interface to Unicode Characters and Strings * The Portable FFI Interface to Unicode Characters and Strings * The Pika FFI Interface to Unicode Characters and Strings Non-contents (unresolved issues) * Unicode and Ports (what does READ-CHAR mean?) * Scheme Characters ** Required and Expected Characters R5RS requires that all members of the character set used to specify the syntax of Scheme must be representable as Scheme characters. These include various punctuation marks, space, newline, digits, and the lowercase letters `a..z'. The specifications of CHAR-UPCASE, CHAR-DOWNCASE, and CHAR-ALPHABETIC? strongly imply that `A..Z' must also be present. The only way to not read this as a requirement is to imagine an implementation in which, for example, `a' is not an alphabetic character. By reasonable expectation, tab and form-feed should also be present. Implementations are possible which do not provide these but would be rather exceptional in that regard. /=== R6RS Recommendation: R6RS should explicitly require that tab, form-feed, and both `a..z' and `A..Z' are present, that the letters are alphabetic characters, and that they are case-mapped as expected. |=== Portable FFI Recommendation: A portable FFI should assume that that requirement is in place. |=== Pika Design: Pika assumes that that requirement is in place. \======== Perhaps more controversially: all members of the character set comprising the first half (U+0000..U+007F) of iso8859-1 should be representable as Scheme characters. 
These characters are found in ASCII, all iso8859-* character sets, and Unicode itself. They are used almost universally and are likely to remain so indefinitely. /=== R6RS Recommendation: R6RS should explicitly and strongly encourage the presence of all ASCII characters in the CHAR? type. |=== Portable FFI Recommendation: A portable FFI should require that all characters representable in portable C as non-octal-escaped character constants are representable in Scheme. It should strongly encourage that all ASCII characters are representable in Scheme. |=== Pika Design: The Pika CHAR? type will include representations for all of ASCII. \======== By somewhat reasonable expectation, there must be at least 256 distinct Scheme characters and INTEGER->CHAR must be defined for all integers in the range `0..255'. There are many circumstances in which conversions between octets and characters are desirable and the requirements of this expectation say that such conversion is always possible. It is quite possible to imagine implementations in which this is not the case: in which, for example, a (fully general) octet stream can not be read and written using READ-CHAR and DISPLAY (applied to characters). Such an implementation might introduce non-standard procedures for reading and writing octets and representing arrays of octets. While such non-standard extensions may be desirable for independent reasons, I see no good reason not to define at least a subset of Scheme characters which is mapped to the set of octet values. /=== R6RS Recommendation: R6RS should explicitly and strongly encourage the presence of at least 256 characters and that INTEGER->CHAR is defined for the entire range 0..255 (at least). |=== Portable FFI Recommendation: A portable FFI should require that the intersection of the set of integers -128..127 and the set of values representable as a `char' value are representable in Scheme as CHAR? values. 
The preferred 8-bit character representation in the FFI should be `unsigned char' and the scheme representation (if any) for any unsigned character `uc' should be the same as that for `(char)uc'. Note: stating these requirements is greatly simplified if the FFI simply requires that `char' and `unsigned char' are 8-bit types. |=== Pika Design: Pika will satisfy the FFI requirement and require that `char' is an 8-bit integer. \======== *** Remaining Degrees of Freedom for the CHAR? Type Scheme implementations consistent with our proposed requirements so far are likely to partition into four broad classes: ~ Those providing exactly 256 distinct (under CHAR=?) CHAR? values ~ Those providing approximately 2^16 CHAR? values ~ Those providing approximately 2^21 CHAR? values ~ Those providing an infinite set of CHAR? values /=== Pika Design: Pika is of the "approximately 2^21 characters" variety. Specifically, the Pika CHAR? type will in effect be a _superset_ of the set of Unicode codepoints. Each 21-bit codepoint will correspond to a Pika character. For each such character, there will be (2^4-1) (15) additional related characters representing the basic code point modified by a combination of any of four "buckybits". For example, the Unicode codepoint U+0041 can be written in Pika as: #\A and by applying buckybits (shift, meta, alt, hyper) an additional 15 characters can be formed giving the total set of 16 "A" characters: #\A #\S-A #\M-A #\H-A #\A-A #\S-M-A #\S-H-A #\S-A-A #\M-H-A #\M-A-A #\H-A-A #\S-M-H-A #\S-M-A-A #\S-H-A-A #\M-H-A-A #\S-M-H-A-A \======== ** Case Mappings Strictly speaking, R5RS does not require that the character set contain any upper or lower case letters. For example, it must contain `a' but it does not require that CHAR-LOWER-CASE? of `a' is true. However, in an implementation in which `a' is not lower case, `a' must also not be alphabetic. 
/=== R6RS Recommendation:

R6RS should explicitly require that `a..z' and `A..Z' are alphabetic and cased in the expected way.

|=== Pika Design:

Pika will satisfy the proposed requirement.

\========

R5RS requires a partial ordering of characters in which upper and lower case variants of "the same character" are treated as equal.

Most problematically: R5RS requires that every alphabetic character have both an upper and lower case variant. This is a problem because Unicode defines abstract characters which, at least intuitively, are alphabetic -- but which lack such case mappings. We'll explore the topic further, later, but briefly: it does not appear that "good Unicode support" and "R5RS requirements for case mappings" are compatible -- at least not in a simple way.

/=== R6RS Recommendation:

R6RS should drop the requirement from the closing sentence of section 6.3.4 which says:

    In addition, if char is alphabetic, then the result of char-upcase is upper case and the result of char-downcase is lower case.

Instead, it should say:

    There is no requirement that all alphabetic characters have upper and lower case mappings and no requirement that all alphabetic characters return true for one of CHAR-UPPER-CASE? or CHAR-LOWER-CASE?. There is no requirement that if a character is upper or lower case that it has a case mapping which is itself a character. However, it is required that the characters `a..z' are lowercase, `A..Z' are uppercase, and that CHAR-UPCASE and CHAR-DOWNCASE convert between those two ranges.

And:

    The case mapping procedures and predicates _must_ be consistent with the case mappings that determine equivalence of identifiers and symbol names.
    In many environments they are usable for case mappings in a broader linguistic sense but programmers are cautioned that, in general, they are not appropriate for such uses in portable applications: some alphabets lack the concept of case entirely; others have the concept of case but lack a 1:1 mapping between upper and lowercase characters. Different case mapping procedures should be used in portable linguistically-oriented applications.

|=== Pika Design:

Pika will include CHAR? values such as Unicode's eszett (U+00DF) with properties that can't satisfy R5RS' requirements, as in the examples:

    (char-alphabetic? eszett) => #t
    (char-lower-case? eszett) => #t
    (char-lower-case? (char-upcase eszett)) => #t
    (char=? eszett (char-upcase eszett)) => #t

\========

** Character Ordering and Enumeration

R5RS requires a bijective mapping between characters and a set of integers (the CHAR->INTEGER and INTEGER->CHAR procedures). Let's call this mapping the "enumeration of characters".

R5RS requires a total ordering of characters and requires that that ordering is isomorphic to the integer ordering of the enumeration.

R5RS underspecifies the total ordering of characters. It requires that alphabetic characters `a..z' and (if present) `A..Z' be (respectively) lexically ordered. It requires that decimal digits be decimally ordered. It requires that digits either all precede or all follow all uppercase letters.

The ordering requirements of R5RS are problematic for Unicode. While iso8859-* digits `0..9' easily satisfy the requirement, Unicode defines additional decimal digits which do not. Intuitively, it seems that either CHAR-NUMERIC? must not behave as one would like on some Unicode abstract characters or the ordering requirement will have to change in R6RS.

/=== R6RS Recommendation:

R6RS should modify the requirement from section 6.3.4 which says:

    These procedures impose a total ordering on the set of characters.
    It is guaranteed that under this ordering:

    * The upper case characters are in order. For example, (char<? #\A #\B) returns #t.
    * The lower case characters are in order. For example, (char<? #\a #\b) returns #t.
    * The digits are in order. For example, (char<? #\0 #\9) returns #t.
    * Either all the digits precede all the upper case letters, or vice versa.
    * Either all the digits precede all the lower case letters, or vice versa.

Instead, it should say:

    These procedures impose a total ordering on the set of characters. It is guaranteed that under this ordering:

    * The characters `A..Z' are in order. For example, (char<? #\A #\B) returns #t.
    * The characters `a..z' are in order. For example, (char<? #\a #\b) returns #t.
    * The digits `0..9' are in order. For example, (char<? #\0 #\9) returns #t.

    Programmers are cautioned that the ordering of characters by these procedures is not expected to have linguistic significance suitable for portable applications. At the same time, implementors are strongly encouraged to define the ordering in such a way that a list of strings, sorted lexically by the character ordering, is at least likely to be in an order that is suitable for presentation to users -- even though the sorting may not be culturally optimal.

|=== Pika Design:

Pika CHAR? values will map to their Unicode codepoint equivalents, with buckybits added as additional high-order bits. This satisfies the proposed modified requirements and results in a situation where, for example, all characters with the META bit set will sort after all characters with no buckybits set.

\========

** Character Classes

R5RS defines 5 classes of characters by introducing 5 predicates:

    char-alphabetic?
    char-numeric?
    char-whitespace?
    char-upper-case?
    char-lower-case?

In the context of Unicode, there is a three-way ambiguity between linguistic categorization of characters, categorizations defined by the Unicode standard, and categorizations as they apply to Scheme syntax.
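The Pika design note on ordering (codepoint in the low bits, buckybits as additional high-order bits) can be illustrated with a small C sketch. The exact bit positions, the constant names, and the helper functions here are assumptions made for illustration; the proposal only says that the buckybits sit above the 21-bit codepoint.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 21 bits = Unicode codepoint, next 4 bits =
   buckybits.  The source fixes neither the bit positions nor these names. */
enum {
  BUCKY_SHIFT = 1u << 21,
  BUCKY_META  = 1u << 22,
  BUCKY_ALT   = 1u << 23,
  BUCKY_HYPER = 1u << 24
};

#define CODEPOINT_MASK 0x1FFFFFu   /* 2^21 - 1 */

/* Combine a codepoint with any OR-ed combination of buckybits. */
static uint32_t make_char(uint32_t codepoint, uint32_t bucky)
{
  assert(codepoint <= CODEPOINT_MASK);
  return codepoint | bucky;
}

static uint32_t char_codepoint(uint32_t c) { return c & CODEPOINT_MASK; }
static uint32_t char_buckybits(uint32_t c) { return c & ~CODEPOINT_MASK; }

/* With this layout CHAR<? is plain integer comparison, so every
   character carrying a buckybit sorts after every plain codepoint. */
static int char_less(uint32_t a, uint32_t b) { return a < b; }
```

Under these assumptions, `make_char(0x41, BUCKY_META)` would stand for #\M-A, and it compares greater than even the highest plain codepoint, which is the sorting behaviour the design note describes.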
/=== R6RS Recommendation:

Section 6.3.4 should be modified. Instead of saying:

    These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper case, or lower case characters, respectively, otherwise they return #f. The following remarks, which are specific to the ASCII character set, are intended only as a guide: The alphabetic characters are the 52 upper and lower case letters. The numeric characters are the ten decimal digits. The whitespace characters are space, tab, line feed, form feed, and carriage return.

it should say:

    These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper case, or lower case characters, respectively, otherwise they return #f.

    These procedures _must_ be consistent with the procedure READ provided by the implementation. For example, if a character is CHAR-ALPHABETIC?, then it must also be suitable for use as the first character of an identifier.

    `a..z' and `A..Z' _must_ be alphabetic and _must_ be respectively lower and upper case. #\space, #\tab, and #\formfeed _must_ be CHAR-WHITESPACE?. `0..9' _must_ be CHAR-NUMERIC?.

    No character may cause more than one of the procedures CHAR-ALPHABETIC?, CHAR-NUMERIC? and CHAR-WHITESPACE? to return #t. No character may cause more than one of the procedures CHAR-UPPER-CASE? and CHAR-LOWER-CASE? to return #t.

    Programmers are advised that these procedures are unlikely to be suitable for linguistic programming in portable code while implementors are strongly encouraged to define them in ways that make them a reasonable approximation of their linguistic counterparts.

|=== Pika Design:

Pika will return #t from CHAR-NUMERIC? _only_ for `0..9'. Pika will return #t from CHAR-WHITESPACE? _only_ for space, tab, newline, carriage return, and formfeed. The set of characters for which Pika will return #t from CHAR-ALPHABETIC? is undetermined at this time. Pika will return #t from CHAR-UPPER-CASE?
for those which have the Unicode general category property Lu or Lt; from CHAR-LOWER-CASE? for those which have the general category property Ll.

\========

** What _is_ a Character, Anyway

We have observed the need for various changes to the Scheme standard for the CHAR? type but have done so (deliberately) without "nailing down" exactly what a character is.

The good reason for preserving this ambiguity is that there are several reasonable choices for the CHAR? type -- implementations should not be required to all be the same in this regard. In particular, these strategies seem to me to be the most viable:

~ A CHAR? value roughly corresponds to a Unicode codepoint.

~ A CHAR? value roughly corresponds to a member of exactly one of the iso8859-* family of character sets.

  (An implementation in this class would not have full Unicode support, of course. Nevertheless, the iso8859-* character sets are (abstractly, not by numeric character values) a subset of Unicode -- they are consistent with Unicode. It would seem gratuitous to rule an implementation non-standard simply because it doesn't support a huge character set.)

~ A CHAR? value corresponds to an 8-bit integer, with characters 0..127 corresponding to ASCII characters, and the meaning of characters 128..255 depending on the host environment (such as a LOCALE setting).

  (This is a second class of implementations without full Unicode support but consistent with Unicode. Implementations of this class would be consistent with a common form of internationalization practices used in C programming.)

~ A CHAR? value corresponds to a "combining sequence" of an arbitrary number of Unicode codepoints.

  (At this time, such implementations are mostly theoretical. This is the kind of implementation Ray Dillinger has been working on and he makes an interesting case for it.)

Related to these possibilities is an important question of _syntax_.
A Scheme program can contain a lexeme denoting a character constant:

    #\X

and the question is "What can X be in a portable program?"

/=== R6RS Recommendation:

R6RS should explicitly define a _portable_character_set_ containing the characters mentioned earlier: `a..z', `A..Z', space, formfeed, newline, carriage return, and the punctuation required by Scheme syntax.

Additionally, R6RS should define an _optional_ syntax for Unicode codepoints. I propose:

    #\U+XXXXX

and in strings:

    \U+XXXXX.

where XXXXX is an (arbitrary length) string of hexadecimal digits.

There is some "touchiness" with this optional syntax. Conceivably, for example, implementations may support Unicode to a degree but not support codepoints which are not assigned characters, or not support all possible codepoints, or not permit strings which are not in one of the (several available) canonical forms. In my opinion, it is too early to try to differentiate these various forms of optional conformance. At this stage in history, the optional syntaxes should be permitted -- but implementations providing them should not be constrained to support arbitrary codepoint characters or arbitrary Unicode strings.

|=== Pika Design:

Pika's character set will include all Unicode codepoints with all combinations of buckybits. Some characters will only be writable using hexadecimal escapes and/or buckybit prefixes.

\========

* Scheme Strings Meet Unicode

R5RS defines Scheme strings as sequences of Scheme characters. They have a length measured in characters and may be indexed (to retrieve or set a given character) by character offsets.

R5RS defines two lexical orderings of strings: once by the definitions of the ordering of characters; again by the definitions of the case-independent ordering of characters.

There is a strong expectation that strings are "random access", meaning that any character within them can be accessed or set in O(1) time. The _intuitive_ model of a Scheme string is a uniform array of characters.
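To make the "random access" expectation concrete, here is a minimal C sketch contrasting a uniform array of codepoints with a variable-width encoding such as UTF-8. Both function names are invented for this illustration and are not part of any proposed interface; the UTF-8 helper assumes well-formed, NUL-terminated input and an index no larger than the string's length in codepoints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Uniform array of codepoints: the k-th character is one array access,
   so a STRING-REF built on it runs in O(1). */
static uint32_t ref_utf32(const uint32_t *s, size_t k)
{
  return s[k];
}

/* UTF-8: characters occupy 1 to 4 bytes, so finding where the k-th
   character starts means walking past the k characters before it, O(k). */
static size_t utf8_offset_of(const unsigned char *s, size_t k)
{
  size_t off = 0;
  while (k > 0)
    {
      off++;                              /* step over the lead byte     */
      while ((s[off] & 0xC0) == 0x80)     /* skip 10xxxxxx continuations */
        off++;
      k--;
    }
  return off;
}
```

For the UTF-8 bytes of "aéz" (61 C3 A9 7A), `utf8_offset_of` places character 2 at byte offset 3, whereas the array form needs no scan at all. That difference is why the uniform-array model matters for the O(1) expectation.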
/=== R6RS Recommendation: R6RS should strongly encourage implementations to make the expected-case complexity of STRING-REF and STRING-SET! O(1). \======== There is an additional problem when designing portable APIs: the matter of string indexes. Most of the possible answers to "what is a Scheme character" are consistent with the view that characters correspond to (possibly a subset of) Unicode codepoints. One of the possible answers to that question has the CHAR? type correspond to a _sequence_ of Unicode code points. The difference between those two possibilities impacts on string indexes. An index to the beginning of some substring has one numeric value if characters are codepoints, another if they are combined sequences of codepoints. That difference is, unfortunately, intolerable for portable code. Consider, for example, a string constant written: "XYZ" where X, Y, and Z are arbitrary Unicode substrings. What is the index of the beginning of substring Y? The question becomes especially pressing for portable programs that want to encode that string index as an integer constant, for programs exchanging Scheme data which includes string indexes with other Scheme implementations, and for an FFI (in which FFI-using code may wish to compute and return to Scheme a string index -- or to interpret a string-index passed to C from Scheme). So, an ambiguity about string indexes should not be allowed to stand. /=== R6RS Recommendation: While R6RS should not require that CHAR? be a subset of Unicode, it should specify the semantics of string indexes for strings which _are_ subsets of Unicode. Specifically, if a Scheme string consists of nothing but Unicode codepoints (including substrings which form combining sequences), string indexes _must_ be Unicode codepoint offsets. \======== That proposed modification to R6RS presents a (hopefully small) problem for Ray Dillinger. He would like (for quite plausible reasons) to have CHAR? 
values which correspond to a _sequence_ of Unicode codepoints. While I have some ideas about how to _partially_ reconcile his ideas with this proposal, I'd like to hear his thoughts on the matter.

* The Scheme Interface to Unicode Characters and Strings

In the preceding, we have added no new procedures or types to Scheme -- only modified the requirements for existing procedures. We have added new syntax for characters and strings.

Certainly, further extensions to Scheme are desirable -- for example, to provide linguistically sensitive string processing. I suggest that such extensions are best regarded as a separate topic -- to be taken up "elsewhere".

* The Portable FFI Interface to Unicode Characters and Strings

This section is written in the style of Pika naming and calling conventions.

** Basic Types

~ typedef <unspecified unsigned integer type> t_unicode;

  An unsigned integer type sufficiently large to hold a Unicode codepoint.

~ enum uni_encoding_scheme;

  Valid values include but are not limited to:

    uni_utf8
    uni_utf16
    uni_utf16be
    uni_utf16le
    uni_utf32
    uni_iso8859_1
    ...
    uni_iso8859_16
    uni_ascii

** Character Conversions

~ t_scm_error scm_character_to_codepoint (t_unicode * answer,
                                          t_scm_arena instance,
                                          t_scm_word * chr)

  Normally, return (via `*answer') the Unicode codepoint value corresponding to Scheme value `*chr'. Return 0.

  If `*chr' is not a character or is not representable as a Unicode codepoint, set `*answer' to U+FFFD and return an error code.

~ t_scm_error scm_character_to_ascii (char * answer,
                                      t_scm_arena instance,
                                      t_scm_word * chr)

  Normally, return (via `*answer') the ASCII codepoint value corresponding to Scheme value `*chr'. Return 0.

  If `*chr' is not an ASCII character or is not representable as a Unicode codepoint, set `*answer' to 0 and return an error code.
~ t_scm_error scm_codepoint_to_character (t_scm_word * answer,
                                          t_scm_arena instance,
                                          t_unicode codepoint)

  Normally, return (via `*answer') the Scheme character corresponding to `codepoint' which must be in the range 0..(2^21-1). Return 0.

  If `codepoint' is not representable as a Scheme character, set `*answer' to an unspecified Scheme value and return an error code.

~ void scm_ascii_to_character (t_scm_word * answer,
                               t_scm_arena instance,
                               char chr)

  Return (via `*answer') the Scheme character corresponding to `chr' which must be representable as an ASCII character.

** String Conversions

~ t_scm_error scm_extract_string8 (t_uchar * answer,
                                   size_t * answer_len,
                                   enum uni_encoding_scheme enc,
                                   t_scm_arena instance,
                                   t_scm_word * str)

  Normally, convert `str' to the indicated encoding (which must be one of `uni_utf8', `uni_iso8859_*', or `uni_ascii') storing the result in the memory addressed by `answer' and the number of bytes stored in `*answer_len'. Return 0.

  On input, `*answer_len' should indicate the amount of storage available at the address `answer'. If there is insufficient memory available, `*answer_len' will be set to the number of bytes needed and the value `scm_err_too_short' returned.

~ t_scm_error scm_extract_string16 (t_uint16 * answer,
                                    size_t * answer_len,
                                    enum uni_encoding_scheme enc,
                                    t_scm_arena instance,
                                    t_scm_word * str)

  Normally, convert `str' to the indicated encoding (which must be one of `uni_utf16', `uni_utf16be', or `uni_utf16le') storing the result in the memory addressed by `answer' and the number of 16-bit values stored in `*answer_len'. Return 0.

  On input, `*answer_len' should indicate the amount of storage available at the address `answer' (measured in 16-bit values). If there is insufficient memory available, `*answer_len' will be set to the number of 16-bit values needed and the value `scm_err_too_short' returned.
~ t_scm_error scm_extract_string32 (t_uint32 * answer,
                                    size_t * answer_len,
                                    t_scm_arena instance,
                                    t_scm_word * str)

  Normally, convert `str' to UTF-32, storing the result in the memory addressed by `answer' and the number of 32-bit values stored in `*answer_len'. Return 0.

  On input, `*answer_len' should indicate the amount of storage available at the address `answer' (measured in 32-bit values). If there is insufficient memory available, `*answer_len' will be set to the number of 32-bit values needed and the value `scm_err_too_short' returned.

~ t_scm_error scm_make_string8_n (t_scm_word * answer,
                                  t_scm_arena instance,
                                  t_uchar * str,
                                  enum uni_encoding_scheme enc,
                                  size_t str_len)

  Convert `str' of length `str_len' and in encoding `enc' to a Scheme value (returned in `*answer') if possible, and return 0. Otherwise, return an error code.

~ t_scm_error scm_make_string16_n (t_scm_word * answer,
                                   t_scm_arena instance,
                                   t_uint16 * str,
                                   enum uni_encoding_scheme enc,
                                   size_t str_len)

  Convert `str' of length `str_len' and in encoding `enc' to a Scheme value (returned in `*answer') if possible, and return 0. Otherwise, return an error code.

~ t_scm_error scm_make_string32_n (t_scm_word * answer,
                                   t_scm_arena instance,
                                   t_uint32 * str,
                                   enum uni_encoding_scheme enc,
                                   size_t str_len)

  Convert `str' of length `str_len' and in encoding `enc' to a Scheme value (returned in `*answer') if possible, and return 0. Otherwise, return an error code.

** Other FFI Functions

Various standard Scheme procedures (e.g., `STRING-REF') ought to be present in the FFI as well. Their mappings into C are straightforward given the proposed requirement that string indexes refer to codepoint offsets.

* The Pika FFI Interface to Unicode Characters and Strings

The standard Scheme string procedures map into the native Pika FFI in the expected way (their C function prototypes and semantics can be inferred in a trivial way from their standard Scheme specifications).
As such, they are not specified here. ** Pika FFI Access to String Data (Note: In earlier discussions with some people I had suggested that Pika strings might be represented by a tree structure. I have since decided that that should not be the case: I want Pika to live up to the expectation that STRING? values are represented as arrays. The tree-structured-string-like type will be present in Pika, but it will be distinct from STRING?.) There is a touchy design issue regarding strings. On the one hand, for efficiency's sake, C code needs fairly direct access (for both reading and writing) to string data. On the other hand, that necessarily means exposing to C the otherwise internal details of value representation, and it means addressing the possibility of a multi-threaded Pika. ~ t_scm_error scm_lock_string_data (t_udstr * answer, t_scm_arena instance, t_scm_word * str) Obtain the `t_udstr' (see libhackerlab) corresponding to `str'. Return 0. If `*str' is not a string, set `*answer' to 0 and return an error code. A string obtained by this call must be released by `scm_release_string_data' (see below). It is critical that between calls to `scm_lock_string_data' and `scm_release_string_data' no other calls to Pika FFI functions be made, and that the number of instructions between those calls is sharply bounded. ~ t_scm_error scm_lock_string_data2 (t_udstr * answer1, t_udstr * answer2, t_scm_arena instance, t_scm_word * str1, t_scm_word * str2) Like `scm_lock_string_data' but return the `t_udstr's for two strings instead of one. ~ t_scm_error scm_release_string_data (t_scm_arena instance, t_scm_word * str) Release a string locked by `scm_lock_string_data'. ~ t_scm_error scm_release_string_data2 (t_scm_arena instance, t_scm_word * str1, t_scm_word * str2) Release strings locked by `scm_lock_string_data2'. 
** Pika String Internals libhackerlab contains a string-like type, `t_udstr' defined (internally) as a pointer to a structure of the form: struct udstr_handle { enum uni_encoding_scheme enc; size_t length; uni_string data; alloc_limits limits; int refs; }; The intention here is to represent a string in an arbitrary Unicode encoding, with an explicitly recorded length measured in coding units. libhackerlab contains a (currently small but growing) number of "string primitives" for operating on `t_udstr' values. For example, one can concatenate two strings without regard to what encoding scheme each is in. In Pika, strings shall be represented as `t_udstr' values, pointed to by an scm_vtable object. Further restrictions apply, however: Pika strings will have the property that: ~ any string including a character in the range U+10000..U+1FFFFF will be stored in a t_udstr using the uni_utf32 encoding. ~ any other string including a character in the range U+0100..U+FFFF will be stored in a t_udstr using the uni_utf16 encoding. ~ all remaining strings will be stored in a t_udstr using the uni_iso8859_1 encoding. In that way, the length of the t_udstr (measured in encoding units) will always be equal to the length of the Scheme string (measured in codepoints). Nearly all Scheme string operations involving string indexes can find the referenced characters in O(1) time. The primary exception is STRING-SET! if the character being stored requires the string it is being stored into to change encoding. At the same time, Scheme strings will have a space-efficient representation.
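The storage rule above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct wstr' and the helpers `wstr_make', `wstr_ref', and `wstr_set' are hypothetical stand-ins, not libhackerlab's `t_udstr' or any Pika function, and surrogate codepoints are not treated specially. It shows why indexing stays O(1) and why a single STRING-SET! can cost O(n) when the stored character forces a wider encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a width-tagged string (not t_udstr). */
enum width { W8 = 1, W16 = 2, W32 = 4 };   /* bytes per coding unit */

struct wstr {
  enum width width;
  size_t length;   /* codepoints, which here equals coding units */
  void *data;
};

/* Narrowest unit width able to hold codepoint c. */
enum width width_for(uint32_t c)
{
  if (c <= 0xFF) return W8;
  if (c <= 0xFFFF) return W16;
  return W32;
}

/* O(1) read: the index is scaled by the unit width, no scanning. */
uint32_t wstr_ref(const struct wstr *s, size_t i)
{
  assert(i < s->length);
  switch (s->width) {
  case W8:  return ((const uint8_t *)s->data)[i];
  case W16: return ((const uint16_t *)s->data)[i];
  default:  return ((const uint32_t *)s->data)[i];
  }
}

/* Store c at index i; the caller guarantees c fits the current width. */
void wstr_raw_set(struct wstr *s, size_t i, uint32_t c)
{
  switch (s->width) {
  case W8:  ((uint8_t *)s->data)[i] = (uint8_t)c; break;
  case W16: ((uint16_t *)s->data)[i] = (uint16_t)c; break;
  default:  ((uint32_t *)s->data)[i] = c; break;
  }
}

/* Create a string of n copies of fill. Returns 0, or -1 if out of memory. */
int wstr_make(struct wstr *s, size_t n, uint32_t fill)
{
  s->width = width_for(fill);
  s->length = n;
  s->data = malloc((n ? n : 1) * (size_t)s->width);
  if (s->data == NULL) return -1;
  for (size_t k = 0; k < n; k++) wstr_raw_set(s, k, fill);
  return 0;
}

/* Usually O(1); O(n) when c needs a wider unit than the string has. */
int wstr_set(struct wstr *s, size_t i, uint32_t c)
{
  assert(i < s->length);
  enum width w = width_for(c);
  if (w > s->width) {
    struct wstr wide = { w, s->length, malloc(s->length * (size_t)w) };
    if (wide.data == NULL) return -1;
    for (size_t k = 0; k < s->length; k++)
      wstr_raw_set(&wide, k, wstr_ref(s, k));   /* re-encode every unit */
    free(s->data);
    *s = wide;
  }
  wstr_raw_set(s, i, c);
  return 0;
}
```

The sketch never narrows a string again after a wide character is overwritten; whether Pika would do so is not stated in the proposal.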
* Re: Worrying development 2004-01-18 21:05 ` Marius Vollmer 2004-01-18 21:58 ` Tom Lord @ 2004-01-22 16:11 ` Dirk Herrmann 2004-01-22 18:42 ` Tom Lord 1 sibling, 1 reply; 26+ messages in thread From: Dirk Herrmann @ 2004-01-22 16:11 UTC (permalink / raw) Cc: guile-user, guile-devel Hi folks, first, sorry for the late answer... Marius Vollmer wrote: >Roland Orre <orre@nada.kth.se> writes: > > >>I suggest that shared substrings are moved back to guile. >> >> > >Agreed. > >I'm sorry for previously giving the impression that shared substrings >wont come back. > >There is no problem on the Scheme side of things: we can just add >shared substrings and make it a proper subtype of 'string'. > >The problem lies with C code and there only with the low level API >consisting of SCM_STRINGP, SCM_STRING_CHARS etc. Functions like >scm_c_string2str can be updated to just continue to work. > >Shared substrings also touch on the issues of using Unicode in Guile >and on making sure we have a nice type conversion API that can replace >gh_ in all respects. > >I'd like to do it in this order: > > - type conversion API (which allows for different encodings of > strings, but doesn't need it immediately) (the first part of this > was the 'frame' stuff for handling unwinds in C). > > - Unicode (with shared substrings in mind). > > - shared substrings > >Of course, we shouldn't do too much lest 1.8 wont happen... > >I'll try to put forth a proposal in the next days for the string part >of the type conversion API that allows Unicode and shared substrings. > I am not quite sure, everybody is talking about the same issues here: When talking about the re-introduction of shared substrings, Marius, do you think of implicitly shared copy-on-write substrings, or guile's explicitly shared substrings? 
Shared substrings as they have been provided by guile could have served two purposes: 1) saving resources (run time and memory) 2) communicating changes via something like a shared memory interface The first purpose is just a matter of performance and should not change the functional behaviour of strings. However, guile's former implementation was only imperfectly suited to this kind of usage: Whoever used shared substrings for this purpose needed to be well aware of which strings were actually shared and which were not, because modifications on the strings would cause side effects. Thus, for this purpose the mechanism of implicitly shared copy-on-write substrings is safer. And, this is what we had intended to implement for guile. It would not require the user to perform any explicit action to have substrings to be shared. The second purpose is about a change in behaviour, and to me it is not quite sure that it should be brought back in the old way. Providing shared substrings in that way changes the semantics of strings: Two strings s1 and s2, which are not eq? would become connected in a user-visible way such that modifications to s1 influence s2 and vice versa. What may users of a string data type expect? Shall it be granted that the following expression will always evaluate to true? (if (and (not (eq? s1 s2)) (equal? s1 s3)) (begin (string-set! s2 0 #\x) (equal? s1 s3)) #t) My assumption is that most users will assume the above expression to evaluate to true. If that was not the case, we would require users to perform aliasing checks in their code. Do we really want that? In which way would we extend the string API such that users are able to perform the necessary aliasing checks? IIRC, the old shared substring API did not provide a means for such aliasing checks. The shared substring feature was deprecated since we had considered that feature as a bug in guile's design. I propose not to officially re-introduce it in its former way. 
The best thing would be to change the code that used the old behaviour. To allow applications to be migrated incrementally, we have provided the feature as deprecated since guile-1.6. If that is not possible for some applications, then a workaround like the one that Mikael and Roland have developed can be used. With that solution, the feature may even remain part of guile - but deprecated, only provided for backwards compatibility! Whoever uses it should be aware of the fact that due to the aliasing it may lead to problems with other string libraries. Best regards Dirk Herrmann
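For readers wondering what an aliasing check could look like at all: the following C sketch assumes a hypothetical representation in which a substring is a (buffer, start, length) view. None of these names exist in Guile; `struct strbuf', `struct substr', `substr_set', and `substr_aliases' are made up for illustration. Two views alias when they share a buffer and their ranges overlap, which is exactly the situation in which a STRING-SET! on one is visible through the other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical storage shared by a string and its shared substrings. */
struct strbuf { char *chars; size_t size; };

/* Hypothetical view: `len` characters of `buf`, beginning at `start`. */
struct substr { struct strbuf *buf; size_t start; size_t len; };

/* Shared-mutation write: goes straight to the common buffer. */
void substr_set(struct substr *s, size_t i, char c)
{
  assert(i < s->len);
  s->buf->chars[s->start + i] = c;
}

/* Nonzero when a write through one view can be seen through the other. */
int substr_aliases(const struct substr *a, const struct substr *b)
{
  if (a->buf != b->buf) return 0;              /* different storage */
  if (a->len == 0 || b->len == 0) return 0;    /* nothing to overlap */
  return a->start < b->start + b->len && b->start < a->start + a->len;
}
```

Note that such a predicate needs both strings in hand. A procedure that receives only one string cannot enumerate the other views of its buffer without extra bookkeeping.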
* Re: Worrying development 2004-01-22 16:11 ` Dirk Herrmann @ 2004-01-22 18:42 ` Tom Lord 2004-01-23 11:45 ` Dirk Herrmann 0 siblings, 1 reply; 26+ messages in thread From: Tom Lord @ 2004-01-22 18:42 UTC (permalink / raw) Cc: guile-user, guile-devel, mvo > From: Dirk Herrmann <dirk@dirk-herrmanns-seiten.de> > What may users of a string data type expect? Shall it be granted > that the following expression will always evaluate to true? > (if (and > (not (eq? s1 s2)) > (equal? s1 s3)) > (begin > (string-set! s2 0 #\x) > (equal? s1 s3)) > #t) What about: (if (and (not (eq? l1 l2)) (equal? l1 l3)) (begin (set-car! l2 'x) (equal? l1 l3)) #t) > My assumption is that most users will assume the above expression to > evaluate to true. When do they need to make such assumptions? Why is it different from the case with lists? > If that was not the case, we would require users to > perform aliasing checks in their code. Do we really want that? I've never seen list-mutating code make such checks. Why would strings be different? > The shared substring feature was deprecated since we had > considered that feature as a bug in guile's design. I propose > not to officially re-introduce it in its former way. The best > thing was to have code changed that used the old behaviour. To > allow applications to be migrated incrementally, we have > provided the feature as deprecated since guile-1.6. If that is > not possible for some applications, then a workaround like the > one that Mikael and Roland have developed can be used. With that > solution, the feature may even remain part of guile - but > deprecated, only provided for backwards compatibility! Whoever > uses it, should be aware of the fact that due to the aliasing it > may lead to problems with other string libraries. I am having trouble imagining any libraries that would break. Let's suppose that, eventually, Guile has _both_ COW shared substrings and shared-mutation shared substrings. 
The only reason I would ever create a shared-mutation shared substring in the first place is if I know that I want to mutate it (or its parent or some other shared-mutation string) and have the effect on all of these strings. Now what if I have shared-mutation substrings but not COW? You say that the old implementation was flawed because it created _only_ shared mutation substrings. I don't think that that's a very serious flaw. In general, no procedure should mutate _any_ of its arguments unless it is advertised as doing so. Consequently, I simply shouldn't hand a shared-mutation substring to a mutating procedure unless I intend for that mutation to affect all sharing strings. And on the other hand, if I have a mutating procedure -- almost invariably the mutation is _unconditional_. The "copy" of a COW substring is guaranteed to take place. If I'm going to pass a substring to a mutating procedure and _don't_ want the mutations to propagate, then I may as well do the "copy" eagerly in the first place. That decision to share mutations or not is one I can make locally -- at the point where I create the substring in the first place. Libraries don't have to worry that I might have made that decision for some parameters at all. It's not their business. There's no need for aliasing checks. Even if a library _wanted_ to worry about aliasing it couldn't: it doesn't know what other strings to check for aliasing. So, no, sorry -- the old implementation (shared-mutation-only) was very good. Adding a COW behavior to SUBSTRING is an upwards-compatible improvement to the old way -- since many uses of SUBSTRING in portable Scheme programs will never need to perform the copy -- but if you have only one of the two kinds of shared substring, shared-mutation gives you the greater functionality at essentially no cost to correct standard programs. 
As I vaguely recall, the only reason COW didn't become the behavior of SUBSTRING "back then" was because of a tag-bit shortage (at the time). It was something I had planned to eventually squeeze in. -t
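The two kinds of sharing discussed in this message can be contrasted in a small C sketch. Everything here is a hypothetical model, not Guile's old implementation: `struct buf', `struct str', `str_share' (shared mutation), and `str_share_cow' (copy on write) are invented names. As a simplification, the sketch assumes the two kinds are not mixed on one buffer: making a COW substring also marks the parent as COW, so that whichever side writes first takes a private copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted storage and string view. */
struct buf { char *chars; int refs; };
struct str { struct buf *buf; size_t start, len; int cow; };

struct buf *buf_new(const char *src, size_t n)
{
  struct buf *b = malloc(sizeof *b);
  if (b == NULL) return NULL;
  b->chars = malloc(n ? n : 1);
  if (b->chars == NULL) { free(b); return NULL; }
  memcpy(b->chars, src, n);
  b->refs = 1;
  return b;
}

/* Fresh string with its own storage. Returns 0, or -1 if out of memory. */
int str_make(struct str *s, const char *src, size_t n)
{
  s->buf = buf_new(src, n);
  if (s->buf == NULL) return -1;
  s->start = 0; s->len = n; s->cow = 0;
  return 0;
}

void str_release(struct str *s)
{
  if (--s->buf->refs == 0) { free(s->buf->chars); free(s->buf); }
  s->buf = NULL;
}

/* Shared-mutation substring: writes are visible through the parent. */
void str_share(struct str *out, struct str *parent, size_t start, size_t len)
{
  assert(start + len <= parent->len);
  out->buf = parent->buf;
  out->buf->refs++;
  out->start = parent->start + start;
  out->len = len;
  out->cow = 0;
}

/* COW substring: storage is shared only until either side writes. */
void str_share_cow(struct str *out, struct str *parent, size_t start, size_t len)
{
  str_share(out, parent, start, len);
  out->cow = parent->cow = 1;
}

char str_ref(const struct str *s, size_t i)
{
  assert(i < s->len);
  return s->buf->chars[s->start + i];
}

/* Returns 0, or -1 if the private copy could not be allocated. */
int str_set(struct str *s, size_t i, char c)
{
  assert(i < s->len);
  if (s->cow && s->buf->refs > 1) {
    /* Copy on write: detach onto private storage before mutating. */
    struct buf *own = buf_new(s->buf->chars + s->start, s->len);
    if (own == NULL) return -1;
    s->buf->refs--;
    s->buf = own;
    s->start = 0;
  }
  s->buf->chars[s->start + i] = c;
  return 0;
}
```

With `str_share', a write to the substring changes the parent; with `str_share_cow', the first write on either side pays for a copy and the other side is left untouched, which matches the observation above that passing a COW substring to a mutating procedure makes the copy certain.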
* Re: Worrying development 2004-01-22 18:42 ` Tom Lord @ 2004-01-23 11:45 ` Dirk Herrmann 2004-01-23 17:16 ` Tom Lord 0 siblings, 1 reply; 26+ messages in thread From: Dirk Herrmann @ 2004-01-23 11:45 UTC (permalink / raw) Cc: guile-user, guile-devel, mvo Tom Lord wrote: > > From: Dirk Herrmann <dirk@dirk-herrmanns-seiten.de> > > > What may users of a string data type expect? Shall it be granted > > that the following expression will always evaluate to true? > > > (if (and > > (not (eq? s1 s2)) > > (equal? s1 s3)) > > (begin > > (string-set! s2 0 #\x) > > (equal? s1 s3)) > > #t) > >What about: > > (if (and (not (eq? l1 l2)) > (equal? l1 l3)) > (begin > (set-car! l2 'x) > (equal? l1 l3)) > #t) > > > > My assumption is that most users will assume the above expression to > > evaluate to true. > >When do they need to make such assumptions? Why is it different from >the case with lists? > First: It's not a matter of whether users *need* to make certain assumptions: It's a matter of interface definition. Scheme defines the string data type and together with it, it defines the semantics of operations on it. This gives users a set of properties they *can* rely on. Certainly, not all of the properties are *needed* in every piece of code. The difference with lists is, that lists are not _one_ object, but are made of several objects. That is, there is no statement saying that two different lists are independent of each other in the same way that two strings are independent of each other. >You say that the old implementation was flawed because it created >_only_ shared mutation substrings. I don't think that that's a very >serious flaw. In general, no procedure should mutate _any_ of its >arguments unless it is advertised as doing so. Consequently, I simply >shouldn't hand a shared-mutation substring to a mutating procedure >unless I intend for that mutation to effect all sharing strings. 
> No, I said that the old implementation was flawed because it changed the semantics of the standard scheme string type. It is certainly all right if someone wants to use something like shared mutation substrings, but IMO this should be achieved with a different data type. It may be a good idea to provide a library for this kind of feature, but it should not modify scheme's standard string type. Who operates on strings should be allowed to make all assumptions about string behaviour that belong to the definition of the string data type. Best regards Dirk Herrmann _______________________________________________ Guile-user mailing list Guile-user@gnu.org http://mail.gnu.org/mailman/listinfo/guile-user ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Worrying development 2004-01-23 11:45 ` Dirk Herrmann @ 2004-01-23 17:16 ` Tom Lord 2004-01-23 21:01 ` Marius Vollmer 2004-01-23 22:37 ` Worrying development Dirk Herrmann 0 siblings, 2 replies; 26+ messages in thread From: Tom Lord @ 2004-01-23 17:16 UTC (permalink / raw) Cc: guile-user, guile-devel, mvo > From: Dirk Herrmann <dirk@dirk-herrmanns-seiten.de> > First: It's not a matter of whether users *need* to make certain > assumptions: It's a matter of interface definition. Scheme defines the > string data type and together with it, it defines the semantics of > operations on it. Please show me what existing lines of the Scheme standard will have to change if mutation-sharing shared substrings are added. The standard (not very formally but clearly enough) says that the standard procedures which construct strings allocate fresh locations for the contents of those strings. That means that none of those procedures create mutation-sharing shared substrings -- nobody has proposed anything different. I think you are imagining that there is an additional requirement in the standard: that any procedure at all which creates a new string must allocate fresh locations for its contents. But that additional requirement isn't there. Scheme programmers can not assume that that requirement is part of Scheme. Mutation-sharing shared substrings are an upwards compatible extension to the Scheme standard. They break no correct programs. They enable new kinds of programs. -t _______________________________________________ Guile-user mailing list Guile-user@gnu.org http://mail.gnu.org/mailman/listinfo/guile-user ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Worrying development 2004-01-23 17:16 ` Tom Lord @ 2004-01-23 21:01 ` Marius Vollmer 2004-01-23 22:18 ` Tom Lord 2004-01-23 22:28 ` Paul Jarc 2004-01-23 22:37 ` Worrying development Dirk Herrmann 1 sibling, 2 replies; 26+ messages in thread From: Marius Vollmer @ 2004-01-23 21:01 UTC (permalink / raw) Cc: guile-user, guile-devel Tom Lord <lord@emf.net> writes: > Mutation-sharing shared substrings are an upwards compatible extension > to the Scheme standard. They break no correct programs. They enable > new kinds of programs. I'd say that the real 'trouble' is that strings are mutable at all. Mutation-sharing substrings are only a minor additional semantical annoyance. They do enable new kinds of programs, and that makes them valuable. Also, there is the possibility on the horizon that we turn string-ref etc into 'primitive generics' which means that people could implement new kinds of strings using GOOPS. Also, I still like the idea of using mutation-sharing substrings as markers that allow O(1) access into variable-width encoded strings. -- GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405 _______________________________________________ Guile-devel mailing list Guile-devel@gnu.org http://mail.gnu.org/mailman/listinfo/guile-devel ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Worrying development 2004-01-23 21:01 ` Marius Vollmer @ 2004-01-23 22:18 ` Tom Lord 2004-01-24 0:27 ` Marius Vollmer 2004-01-23 22:28 ` Paul Jarc 1 sibling, 1 reply; 26+ messages in thread From: Tom Lord @ 2004-01-23 22:18 UTC (permalink / raw) Cc: guile-user, guile-devel > From: Marius Vollmer <mvo@zagadka.de> > Tom Lord <lord@emf.net> writes: > > Mutation-sharing shared substrings are an upwards compatible extension > > to the Scheme standard. They break no correct programs. They enable > > new kinds of programs. > I'd say that the real 'trouble' is that strings are mutable at > all. Worried mostly about variable-length character encodings in string? Or you'd just rather be programming in an ML-family language? :-) If it's variable-length encodings that irk you: if strings were read-only you'd want to optimize the heck out of STRING-APPEND and SUBSTRING and, once you did that, you'd have essentially enough machinery to do mutations efficiently. > Also, I still like the idea of using mutation-sharing substrings as > markers that allow O(1) access into variable-width encoded strings. Interesting. The interaction with STRING-SET! will be tricky. I think you'll either have to "timestamp" strings (one tick per mutation -- and you'll likely have to use a GC'ed value rather than an inline integer for timestamps) or wind up with O(K) for mutations where K is the number of shared substrings. The same problem comes up if you add STRING-RESIZE!. I keep going back and forth on whether or not strings should be the same things as or a subset buffers vs. making buffers a completely separate type. (The latter certainly seems to be easier to implement.) > Also, there is the possibility on the horizon that we turn > string-ref etc into 'primitive generics' which means that people > could implement new kinds of strings using GOOPS. Well, heck. In that case, maybe consider what I'm planning for Pika (at least initially). Purely ASCII strings are stored 1-byte per character. 
Most other strings 2-bytes per character. Strings using characters outside the Basic Multilingual Plane, 4 bytes per character. You want some fancier-than-libc string functions in C for that -- but it gives you an expected-case O(1) for STRING-REF and STRING-SET! and pretty good space efficiency. It also gives you some performance glitches as when you store a U+0100 character in an otherwise purely ASCII 10MB string. (We're working on providing such fancier-than-libc functions in libhackerlab -- so they'd be available independently of Pika if you went this route.) -t _______________________________________________ Guile-user mailing list Guile-user@gnu.org http://mail.gnu.org/mailman/listinfo/guile-user ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Worrying development 2004-01-23 22:18 ` Tom Lord @ 2004-01-24 0:27 ` Marius Vollmer 2004-01-24 0:53 ` Tom Lord 0 siblings, 1 reply; 26+ messages in thread From: Marius Vollmer @ 2004-01-24 0:27 UTC (permalink / raw) Cc: guile-user, guile-devel Tom Lord <lord@emf.net> writes: > > I'd say that the real 'trouble' is that strings are mutable at > > all. > > Worried mostly about variable-length character encodings in string? > Or you'd just rather be programming in an ML-family language? :-) Heh, no, I'm not really worried, I was actually trying to comment on Dirk's concerns. > > Also, I still like the idea of using mutation-sharing substrings as > > markers that allow O(1) access into variable-width encoded strings. > > Interesting. The interaction with STRING-SET! will be tricky. I > think you'll either have to "timestamp" strings (one tick per mutation > -- and you'll likely have to use a GC'ed value rather than an inline > integer for timestamps) or wind up with O(K) for mutations where K is > the number of shared substrings. Yes. What I have in mind is that accessing strings is efficient as long as no mutations are performed. I.e., instead of indicating positions in a string with an integer index, you create a shared substring that starts at the desired position. (This could be done with COW substrings, tho.) > > Also, there is the possibility on the horizon that we turn > > string-ref etc into 'primitive generics' which means that people > > could implement new kinds of strings using GOOPS. > > Well, heck. In that case, maybe consider what I'm planning for Pika > (at least initially). Purely ASCII strings are stored 1-byte per > character. Most other strings 2-bytes per character. Strings using > characters outside the Basic Multilingual Plane, 4 bytes per > character. Yes, that's an attractive approach. But I also find simply using UTF-8 exclusively very attractive. 
It might fit better with what other people are doing and we might need fewer conversions when wrapping external libraries. Or maybe not. -- GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405
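The marker idea and the timestamp remark quoted above can be combined in a short C sketch. It is a guess at one possible shape, not anything that exists in Guile: `struct u8str', `struct marker', and all function names are hypothetical, the input is assumed to be valid UTF-8, and the timestamp is a plain integer counter even though the quoted text suggests a GC'ed value would be needed in practice. A marker remembers the byte offset of a character index, so access through it is O(1) while the string is unmodified; after a mutation the stale marker rescans once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical UTF-8 string with one "tick" per mutation. */
struct u8str { const unsigned char *bytes; size_t nbytes; unsigned long stamp; };

/* Hypothetical marker: a cached (character index -> byte offset) pair. */
struct marker { size_t char_index; size_t byte_offset; unsigned long stamp; };

/* Length of the sequence whose lead byte is b (assumes valid UTF-8). */
size_t u8_seq_len(unsigned char b)
{
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  return 4;
}

/* O(n) scan from the start of the string. */
size_t u8_offset_of(const struct u8str *s, size_t char_index)
{
  size_t off = 0;
  for (size_t k = 0; k < char_index && off < s->nbytes; k++)
    off += u8_seq_len(s->bytes[off]);
  return off;
}

void marker_set(struct marker *m, const struct u8str *s, size_t char_index)
{
  m->char_index = char_index;
  m->byte_offset = u8_offset_of(s, char_index);
  m->stamp = s->stamp;
}

/* O(1) while the string is unchanged; one rescan after a mutation. */
size_t marker_offset(struct marker *m, const struct u8str *s)
{
  if (m->stamp != s->stamp) marker_set(m, s, m->char_index);
  return m->byte_offset;
}

/* Decode the codepoint that starts at byte offset off. */
unsigned long u8_decode(const struct u8str *s, size_t off)
{
  unsigned char b = s->bytes[off];
  size_t n = u8_seq_len(b);
  unsigned long c = (n == 1) ? b : (unsigned long)(b & (0xFF >> (n + 1)));
  for (size_t k = 1; k < n; k++)
    c = (c << 6) | (s->bytes[off + k] & 0x3F);
  return c;
}

/* Stand-in for any mutation: swap the contents and advance the stamp. */
void u8_mutate(struct u8str *s, const unsigned char *bytes, size_t nbytes)
{
  s->bytes = bytes;
  s->nbytes = nbytes;
  s->stamp++;
}
```

The mutation itself stays cheap because it only bumps the stamp; the cost moves to the next use of each stale marker, which is the alternative to paying O(K) at mutation time for K markers.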
* Re: Worrying development 2004-01-24 0:27 ` Marius Vollmer @ 2004-01-24 0:53 ` Tom Lord 0 siblings, 0 replies; 26+ messages in thread From: Tom Lord @ 2004-01-24 0:53 UTC (permalink / raw) Cc: guile-user, guile-devel > From: Marius Vollmer <mvo@zagadka.de> > > Well, heck. In that case, maybe consider what I'm planning for Pika > > (at least initially). Purely ASCII strings are stored 1-byte per > > character. Most other strings 2-bytes per character. Strings using > > characters outside the Basic Multilingual Plane, 4 bytes per > > character. > Yes, that's an attractive approach. But I also find simply using > UTF-8 exclusively very attractive. It might fit better with what > other people are doing and we might need fewer conversions when > wrapping external libraries. Or maybe not. In case it helps seduce you to the dark side of the force just a little more: Having wrappings of external libraries mostly rely on copying/converting strings is a win for thread support. Having FFI-using routines directly access or munge string data is, in general, pretty touchy. It is, I admit, a total pain in the butt that so much existing code already does access string data directly -- but for the most part, that code is unlikely to be expecting UTF-8 anyway so...... -t _______________________________________________ Guile-devel mailing list Guile-devel@gnu.org http://mail.gnu.org/mailman/listinfo/guile-devel ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Worrying development 2004-01-23 21:01 ` Marius Vollmer 2004-01-23 22:18 ` Tom Lord @ 2004-01-23 22:28 ` Paul Jarc 2004-01-24 12:09 ` rm 1 sibling, 1 reply; 26+ messages in thread From: Paul Jarc @ 2004-01-23 22:28 UTC (permalink / raw) Cc: guile-user, guile-devel Marius Vollmer <mvo@zagadka.de> wrote: > Also, there is the possibility on the horizon that we turn string-ref > etc into 'primitive generics' which means that people could implement > new kinds of strings using GOOPS. Neat. Has/might that also be done for car & cdr? Then we could have Python-like generators. It would make SCM_C[AD]R less speedy, though they could still be pretty fast for actual pairs. paul _______________________________________________ Guile-user mailing list Guile-user@gnu.org http://mail.gnu.org/mailman/listinfo/guile-user ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Worrying development 2004-01-23 22:28 ` Paul Jarc @ 2004-01-24 12:09 ` rm 2004-01-24 13:29 ` Marius Vollmer 0 siblings, 1 reply; 26+ messages in thread From: rm @ 2004-01-24 12:09 UTC (permalink / raw) On Fri, Jan 23, 2004 at 05:28:03PM -0500, Paul Jarc wrote: > Marius Vollmer <mvo@zagadka.de> wrote: > > Also, there is the possibility on the horizon that we turn string-ref > > etc into 'primitive generics' which means that people could implement > > new kinds of strings using GOOPS. > > Neat. Has/might that also be done for car & cdr? Then we could have > Python-like generators. It would make SCM_C[AD]R less speedy, though > they could still be pretty fast for actual pairs. Wow, _that_ would be a real help! I have to build bindings for some heavy C++ libs that make excessive use of iterators. A generic 'sequence' type with the right generics would help a lot -- the bindings would look much more scheme-ish. Ralf Mattes > > paul > > > _______________________________________________ > Guile-devel mailing list > Guile-devel@gnu.org > http://mail.gnu.org/mailman/listinfo/guile-devel _______________________________________________ Guile-devel mailing list Guile-devel@gnu.org http://mail.gnu.org/mailman/listinfo/guile-devel ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Worrying development 2004-01-24 12:09 ` rm @ 2004-01-24 13:29 ` Marius Vollmer 2004-01-26 2:42 ` overriding car/cdr (was: Worrying development) Paul Jarc 0 siblings, 1 reply; 26+ messages in thread From: Marius Vollmer @ 2004-01-24 13:29 UTC (permalink / raw) Cc: guile-user, guile-devel rm@fabula.de writes: > On Fri, Jan 23, 2004 at 05:28:03PM -0500, Paul Jarc wrote: >> Marius Vollmer <mvo@zagadka.de> wrote: >> > Also, there is the possibility on the horizon that we turn string-ref >> > etc into 'primitive generics' which means that people could implement >> > new kinds of strings using GOOPS. >> >> Neat. Has/might that also be done for car & cdr? Then we could have >> Python-like generators. It would make SCM_C[AD]R less speedy, though >> they could still be pretty fast for actual pairs. > > Wow, _that_ would be a real help! I have to build bindings for some > heavy C++ libs that make excessive use of iterators. A generic 'sequence' > type with the right generics would help a lot -- the bindings would look > much more scheme-ish. Hmm, my immediate reaction is that car/cdr are too low-level for making them overrideable, but map and for-each and other operations that work on whole sequences look like good targets... -- GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405 _______________________________________________ Guile-devel mailing list Guile-devel@gnu.org http://mail.gnu.org/mailman/listinfo/guile-devel ^ permalink raw reply [flat|nested] 26+ messages in thread
* overriding car/cdr (was: Worrying development) 2004-01-24 13:29 ` Marius Vollmer @ 2004-01-26 2:42 ` Paul Jarc 2004-02-08 16:21 ` overriding car/cdr Dirk Herrmann 2004-02-08 18:09 ` Marius Vollmer 0 siblings, 2 replies; 26+ messages in thread From: Paul Jarc @ 2004-01-26 2:42 UTC (permalink / raw) Cc: guile-user, rm, guile-devel Marius Vollmer <mvo@zagadka.de> wrote: > Hmm, my immediate reaction is that car/cdr are too low-level for > making them overrideable, but map and for-each and other operations > that work on whole sequences look like good targets... Going that route, there will always be one more function that someone wants to be converted. Third-party libraries also often won't be able to handle generated lists without modification. OTOH, by modifying SCM_CAR/SCM_CDR, everything that handles lists automatically becomes able to handle generated lists, and the cost for normal lists is only the same cost as when compiling with -DSCM_DEBUG_PAIR_ACCESSES=1. paul _______________________________________________ Guile-devel mailing list Guile-devel@gnu.org http://mail.gnu.org/mailman/listinfo/guile-devel ^ permalink raw reply [flat|nested] 26+ messages in thread
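What a dispatching car/cdr might cost can be sketched in C. This is a toy model, not Guile's object representation: `struct obj', `seq_car', `seq_cdr', and the range generator are all hypothetical, elements are plain `long' values, and generated cells are allocated with malloc and never freed, where a real interpreter would rely on its garbage collector. The point is that real pairs pay for one tag comparison, and code written against the accessors walks a generated list unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum tag { TAG_NIL, TAG_PAIR, TAG_GENERIC };

struct obj;

/* Hypothetical method table for user-defined "generated" lists. */
struct seq_ops {
  long (*car)(const struct obj *);
  struct obj *(*cdr)(const struct obj *);
};

struct obj {
  enum tag tag;
  union {
    struct { long car; struct obj *cdr; } pair;
    struct { const struct seq_ops *ops; long lo, hi; } gen;
  } u;
};

struct obj nil_obj = { TAG_NIL, { { 0, NULL } } };

/* Fast path for real pairs is one tag test; otherwise dispatch. */
long seq_car(const struct obj *x)
{
  if (x->tag == TAG_PAIR) return x->u.pair.car;
  assert(x->tag == TAG_GENERIC);
  return x->u.gen.ops->car(x);
}

struct obj *seq_cdr(const struct obj *x)
{
  if (x->tag == TAG_PAIR) return x->u.pair.cdr;
  assert(x->tag == TAG_GENERIC);
  return x->u.gen.ops->cdr(x);
}

/* A generator for the integers lo, lo+1, ..., hi-1, produced lazily. */
struct obj *make_range(long lo, long hi);

long range_car(const struct obj *x) { return x->u.gen.lo; }

struct obj *range_cdr(const struct obj *x)
{
  return make_range(x->u.gen.lo + 1, x->u.gen.hi);
}

const struct seq_ops range_ops = { range_car, range_cdr };

struct obj *make_range(long lo, long hi)
{
  if (lo >= hi) return &nil_obj;          /* exhausted: the empty list */
  struct obj *x = malloc(sizeof *x);
  if (x == NULL) abort();
  x->tag = TAG_GENERIC;
  x->u.gen.ops = &range_ops;
  x->u.gen.lo = lo;
  x->u.gen.hi = hi;
  return x;
}

/* Ordinary list code: it cannot tell pairs from generated cells. */
long seq_sum(const struct obj *x)
{
  long total = 0;
  for (; x->tag != TAG_NIL; x = seq_cdr(x))
    total += seq_car(x);
  return total;
}
```

The sketch says nothing about set-car!, pair?, or structure sharing, which is where the objections elsewhere in the thread come in.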
* Re: overriding car/cdr 2004-01-26 2:42 ` overriding car/cdr (was: Worrying development) Paul Jarc @ 2004-02-08 16:21 ` Dirk Herrmann 2004-02-08 18:09 ` Marius Vollmer 1 sibling, 0 replies; 26+ messages in thread From: Dirk Herrmann @ 2004-02-08 16:21 UTC (permalink / raw) Cc: guile-user, rm, guile-devel, Marius Vollmer Paul Jarc wrote: >Marius Vollmer <mvo@zagadka.de> wrote: > > >>Hmm, my immediate reaction is that car/cdr are too low-level for >>making them overrideable, but map and for-each and other operations >>that work on whole sequences look like good targets... >> >> > >Going that route, there will always be one more function that someone >wants to be converted. Third-party libraries also often won't be able >to handle generated lists without modification. OTOH, by modifying >SCM_CAR/SCM_CDR, everything that handles lists automatically becomes >able to handle generated lists, and the cost for normal lists is only >the same cost as when compiling with -DSCM_DEBUG_PAIR_ACCESSES=1. > The macros SCM_CAR and SCM_CDR should IMO not be changed to handle generic pairs. In the guile kernel we need a layer to deal with the built-in low level data types. Best regards Dirk
* Re: overriding car/cdr 2004-01-26 2:42 ` overriding car/cdr (was: Worrying development) Paul Jarc 2004-02-08 16:21 ` overriding car/cdr Dirk Herrmann @ 2004-02-08 18:09 ` Marius Vollmer 2004-02-08 20:56 ` Paul Jarc 1 sibling, 1 reply; 26+ messages in thread From: Marius Vollmer @ 2004-02-08 18:09 UTC (permalink / raw) Cc: guile-user, guile-devel prj@po.cwru.edu (Paul Jarc) writes: > Marius Vollmer <mvo@zagadka.de> wrote: >> Hmm, my immediate reaction is that car/cdr are too low-level for >> making them overrideable, but map and for-each and other operations >> that work on whole sequences look like good targets... > > Going that route, there will always be one more function that someone > wants to be converted. But car/cdr are not a good way to work with general sequences. Think of vectors. We should not try hard to turn car/cdr into something abstract (which they are not, even their names come from the lowest level). Something like the 'sequence' concept of Common Lisp is a better approach, I'd say. -- GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405
* Re: overriding car/cdr
  2004-02-08 18:09 ` Marius Vollmer
@ 2004-02-08 20:56 ` Paul Jarc
  2004-03-20 22:28 ` Marius Vollmer
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Jarc @ 2004-02-08 20:56 UTC
Cc: guile-user, rm, guile-devel

Marius Vollmer <mvo@zagadka.de> wrote:
> But car/cdr are not a good way to work with general sequences. Think
> of vectors.

car/cdr certainly can be made to work with vectors. Making them work
with arbitrary user-defined structures would give us a lot of
flexibility without too much work. If there is a significant amount of
C code using SCM_CAR/SCM_CDR with non-pair objects, then it might be
best to leave them unchanged. But I think the Scheme-level cxr
functions would be a good place to make the change.

> We should not try hard to turn car/cdr into something abstract
> (which they are not, even their names come from the lowest level).

Of course they currently are not, but that doesn't (nor do the names)
tell us whether it would be good if they were more abstract.

> Something like the 'sequence' concept of Common Lisp is a better
> approach, I'd say.

AFAICT, that would involve a whole new set of procedures to work with
this new data type. I'm suggesting modifying the existing procedures
so that most existing code would automatically be able to take
advantage of the new flexibility, with no further changes needed.

paul
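The proposal in the message above can be modelled outside Guile. The following is only a sketch in Python, not Scheme or C, and nothing in it is a Guile API: `Pair`, `NIL`, `Countdown`, and `list_sum` are hypothetical names made up for the illustration. It shows the claimed benefit, namely that code written purely in terms of `car` and `cdr` keeps working when those two accessors dispatch on the argument's type.

```python
from functools import singledispatch


class Pair:
    """Stand-in for a built-in cons cell (hypothetical, not Guile's)."""

    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr


NIL = None  # stand-in for the empty list


@singledispatch
def car(obj):
    # Fallback for anything that has not registered a handler.
    raise TypeError(f"car: not a pair-like object: {obj!r}")


@singledispatch
def cdr(obj):
    raise TypeError(f"cdr: not a pair-like object: {obj!r}")


@car.register(Pair)
def _car_pair(obj):
    return obj.car


@cdr.register(Pair)
def _cdr_pair(obj):
    return obj.cdr


class Countdown:
    """Hypothetical generated list n, n-1, ..., 1; cells are made on demand."""

    def __init__(self, n):
        self.n = n


@car.register(Countdown)
def _car_countdown(obj):
    return obj.n


@cdr.register(Countdown)
def _cdr_countdown(obj):
    return Countdown(obj.n - 1) if obj.n > 1 else NIL


def list_sum(lst):
    """Ordinary list code that only ever calls car and cdr."""
    total = 0
    while lst is not NIL:
        total += car(lst)
        lst = cdr(lst)
    return total


plain = Pair(1, Pair(2, Pair(3, NIL)))
print(list_sum(plain))         # sums a normal linked list
print(list_sum(Countdown(4)))  # same code, generated list
```

The price is the one discussed in the thread: every `car`/`cdr` call now goes through a dispatch step, even for ordinary pairs.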
* Re: overriding car/cdr
  2004-02-08 20:56 ` Paul Jarc
@ 2004-03-20 22:28 ` Marius Vollmer
  2004-03-22 17:05 ` David Van Horn
  2004-03-22 17:24 ` Paul Jarc
  0 siblings, 2 replies; 26+ messages in thread
From: Marius Vollmer @ 2004-03-20 22:28 UTC
Cc: guile-user, guile-devel

prj@po.cwru.edu (Paul Jarc) writes:

> Marius Vollmer <mvo@zagadka.de> wrote:
>> But car/cdr are not a good way to work with general sequences. Think
>> of vectors.
>
> car/cdr certainly can be made to work with vectors. [...]

Yes, but would that be a _good_ way to work with them? It would be, in
my view, only a kluge to make routines work with vectors that were
originally written for lists. Such automatic code reuse or rather
code-napping doesn't sound like a good idea to me.

>> Something like the 'sequence' concept of Common Lisp is a better
>> approach, I'd say.
>
> AFAICT, that would involve a whole new set of procedures to work with
> this new data type. I'm suggesting modifying the existing procedures
> so that most existing code would automatically be able to take
> advantage of the new flexibility, with no further changes needed.

What would the advantage be? Some of the existing list routines will
sort of work with vectors, but vectors are not lists, and the results
will be strange at best.

Lists in Scheme and Lisp are not merely sequences, they are able to
form general trees with all kinds of intentional structure sharing.
Vectors are not at all like this. What would
(cons (car vec) (cdr vec)) be? A pair with a vector as its second
element? A vector?

Some lists are used as sequences and it would indeed make sense to
formalize this by introducing an abstract 'sequence' type for this,
I'd say. Maybe by going over SRFI-1 and picking out the procedures
that treat lists as sequences. Maybe such a thing already exists.

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405
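The question about `(cons (car vec) (cdr vec))` can be made concrete with a small model. This is a Python sketch under stated assumptions, not Guile behaviour: a Python `list` plays the role of a vector, `Pair` is a hypothetical cons cell, and `cdr` of a vector is assumed to return a fresh copy of the remaining elements, which is only one of several possible choices.

```python
class Pair:
    """Hypothetical cons cell used only for this illustration."""

    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr


def cons(a, d):
    return Pair(a, d)


def car(x):
    # Pairs return their first field; a "vector" returns element 0.
    return x.car if isinstance(x, Pair) else x[0]


def cdr(x):
    # A pair already owns a "rest" object. A vector does not, so this
    # sketch has to manufacture one; here it is a copy (an assumption).
    return x.cdr if isinstance(x, Pair) else x[1:]


vec = [1, 2, 3]

# Rebuilding from car and cdr does not give back a vector.
rebuilt = cons(car(vec), cdr(vec))
rebuilt_is_vector = isinstance(rebuilt, list)

# Mutating the "tail" of a vector does not reach the original ...
vec_tail = cdr(vec)
vec_tail[0] = 99

# ... whereas the tail of a real list is shared structure.
lst = cons(1, cons(2, cons(3, None)))
lst_tail = cdr(lst)
lst_tail.car = 99

print(rebuilt_is_vector, vec, car(cdr(lst)))
```

Under these assumptions the rebuilt object is a pair whose second field is a vector, and the structure sharing that list code may rely on is absent for vectors.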
* Re: overriding car/cdr
  2004-03-20 22:28 ` Marius Vollmer
@ 2004-03-22 17:05 ` David Van Horn
  2004-03-22 21:03 ` Marius Vollmer
  2004-03-22 17:24 ` Paul Jarc
  1 sibling, 1 reply; 26+ messages in thread
From: David Van Horn @ 2004-03-22 17:05 UTC
Cc: guile-user, rm, guile-devel

Marius Vollmer wrote:
> Some lists are used as sequences and it would indeed make sense to
> formalize this by introducing an abstract 'sequence' type for this,
> I'd say. Maybe by going over SRFI-1 and picking out the procedures
> that treat lists as sequences. Maybe such a thing already exists.

You might have a look at SRFI 44: Collections.

http://srfi.schemers.org/srfi-44/

David
* Re: overriding car/cdr
  2004-03-22 17:05 ` David Van Horn
@ 2004-03-22 21:03 ` Marius Vollmer
  0 siblings, 0 replies; 26+ messages in thread
From: Marius Vollmer @ 2004-03-22 21:03 UTC
Cc: guile-user, rm, guile-devel

David Van Horn <dvanhorn@cs.uvm.edu> writes:

> You might have a look at SRFI 44: Collections.
>
> http://srfi.schemers.org/srfi-44/

Yes, exactly!

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405
* Re: overriding car/cdr
  2004-03-20 22:28 ` Marius Vollmer
  2004-03-22 17:05 ` David Van Horn
@ 2004-03-22 17:24 ` Paul Jarc
  1 sibling, 0 replies; 26+ messages in thread
From: Paul Jarc @ 2004-03-22 17:24 UTC
Cc: guile-user, rm, guile-devel

Marius Vollmer <mvo@zagadka.de> wrote:
> Lists in Scheme and Lisp are not merely sequences, they are able to
> form general trees with all kinds of intentional structure sharing.
> Vectors are not at all like this.

Hmm, good point.

> Some lists are used as sequences and it would indeed make sense to
> formalize this by introducing an abstract 'sequence' type for this,
> I'd say.

Yes, that might be the best way. Ideally, I think, programmers
shouldn't have to worry about list vs. vector representation of
sequence objects any more than they have to worry about memory
management. We ought to be able to say "give me an object that
supports these operations, and favors these certain operations for
performance", and let the computer figure out what representation is
best.

paul
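One way to read "give me an object that supports these operations, and favors these certain operations" is a factory that picks the representation from a hint. The sketch below is a Python toy, not a proposal from the thread: `make_sequence` and the hint strings `"random-access"` and `"push-front"` are hypothetical, and the choice between `list` and `collections.deque` simply follows their documented performance characteristics.

```python
from collections import deque


def make_sequence(items, favor):
    """Return a sequence holding items, choosing a representation by hint.

    The hint names are invented for this sketch.
    """
    if favor == "random-access":
        return list(items)   # O(1) indexing anywhere
    if favor == "push-front":
        return deque(items)  # O(1) insertion at either end
    raise ValueError(f"unknown preference: {favor!r}")


def total(seq):
    # Caller code uses only the shared protocol (iteration), so it does
    # not care which representation was chosen.
    return sum(seq)


indexed = make_sequence(range(4), favor="random-access")
queue = make_sequence(range(4), favor="push-front")
queue.appendleft(10)

print(type(indexed).__name__, total(indexed))
print(type(queue).__name__, total(queue))
```

Code that sticks to the shared operations, here iteration, is unaffected by which representation the factory selected.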
* Re: Worrying development
  2004-01-23 17:16 ` Tom Lord
  2004-01-23 21:01 ` Marius Vollmer
@ 2004-01-23 22:37 ` Dirk Herrmann
  2004-01-23 23:25 ` Tom Lord
  1 sibling, 1 reply; 26+ messages in thread
From: Dirk Herrmann @ 2004-01-23 22:37 UTC
Cc: guile-user, guile-devel

Tom Lord wrote:

> > From: Dirk Herrmann <dirk@dirk-herrmanns-seiten.de>
> >
> > First: It's not a matter of whether users *need* to make certain
> > assumptions: It's a matter of interface definition. Scheme defines the
> > string data type and together with it, it defines the semantics of
> > operations on it.
>
>Please show me what existing lines of the Scheme standard will have to
>change if mutation-sharing shared substrings are added.
>
>The standard (not very formally but clearly enough) says that the
>standard procedures which construct strings allocate fresh locations
>for the contents of those strings.
>
>That means that none of those procedures create mutation-sharing
>shared substrings -- nobody has proposed anything different.
>
>I think you are imagining that there is an additional requirement in
>the standard: that any procedure at all which creates a new string
>must allocate fresh locations for its contents. But that additional
>requirement isn't there. Scheme programmers can not assume that that
>requirement is part of Scheme.

As you say, the standard only describes functions that, on creation of
strings, are required to allocate fresh locations for the contents of
the strings. That is, someone who only uses the functions that are
described by the standard is not able to create any mutation-sharing
substring. And this, implicitly, indicates that different (in the sense
of eq?) strings use different locations for their contents. To me this
seems like a valid assumption. Standard-conforming scheme programs may
IMO rely on this fact. However, I may be wrong in my interpretation of
the standard, which is why I suggest the srfi approach (see below).

>Mutation-sharing shared substrings are an upwards compatible extension
>to the Scheme standard. They break no correct programs. They enable
>new kinds of programs.

Introducing a separate data type for mutation-sharing character arrays
also enables new kinds of programs. The difference between a separate
data type and the former implementation is that mutation-sharing
substrings could be used everywhere where an ordinary string had been
used before. That is, the difference is a matter of being able to
re-use existing string-handling code rather than enabling new kinds of
programs. However, it is exactly this re-use of existing
string-handling code which becomes problematic when the semantics of
the string objects change.

Marius, would it be an acceptable compromise to require that the
mutation-sharing substring issue be submitted and discussed as an srfi
before it becomes an official part of guile's core? The discussion of
the topic in that forum would reduce the risk that the change
introduces problems. I would then ask those who are interested to have
it as a part of guile to submit a srfi proposal.

(Please note that, as I have said before, I have nothing against
providing mutation-sharing substrings as a deprecated feature for some
period - but not as an official part of guile's core.)

Best regards
Dirk
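The concern about re-using existing string-handling code can be illustrated with a routine that is correct only when its two arguments occupy disjoint storage. This is a Python model, not Guile code: `bytearray` stands in for a mutable string, a `memoryview` stands in for a mutation-sharing view onto it, and `reverse_into` is a hypothetical helper written for this sketch.

```python
def reverse_into(dst, src):
    """Write the reverse of src into dst.

    Correct only if dst and src do not share storage, which is the
    assumption under discussion.
    """
    n = len(src)
    for i in range(n):
        dst[i] = src[n - 1 - i]


# Case 1: dst was freshly allocated, as the standard constructors do.
original = bytearray(b"abcd")
fresh = bytearray(original)  # a copy with its own storage
reverse_into(fresh, original)

# Case 2: dst is a distinct object that shares storage with src.
buf = bytearray(b"abcd")
alias = memoryview(buf)      # not the same object, same bytes
reverse_into(alias, buf)

print(bytes(fresh), bytes(original))  # reversed copy, source intact
print(alias is buf, bytes(buf))       # distinct objects, garbled result
```

In the second case the routine overwrites bytes it has not read yet, so the result is neither the original nor its reverse, even though the two arguments are different objects.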
* Re: Worrying development
  2004-01-23 22:37 ` Worrying development Dirk Herrmann
@ 2004-01-23 23:25 ` Tom Lord
  0 siblings, 0 replies; 26+ messages in thread
From: Tom Lord @ 2004-01-23 23:25 UTC
Cc: guile-user, guile-devel, mvo

> From: Dirk Herrmann <dirk@dirk-herrmanns-seiten.de>

> As you say, the standard only describes functions that, on creation of
> strings, are required to allocate fresh locations for the contents of
> the strings. That is, someone who only uses the functions that are
> described by the standard is not able to create any mutation-sharing
> substring. And this, implicitly, indicates that different (in the
> sense of eq?) strings use different locations for their contents. To
> me this seems like a valid assumption.

It's subtly wrong.

A program (or, more importantly, a module) can assume that non-eq?
strings are disjoint in their storage _if_and_only_if_ either (1) it
knows how both strings were created or (2) its
assumption-of-disjointness is shared with the caller who passes the
strings to the module.

(1) is trivially obvious. (2) may sound onerous but it really isn't.
It falls out of the existing practice of procedures advertising clearly
what arguments they mutate. A caller passing around mutation-sharing
strings has to be careful to pass them to mutating procedures only when
they are certain that they want mutations to be shared. Ultimately,
that obligation propagates right back to the point of creation of the
mutation-sharing substring --- using mutation-sharing substrings is,
every step of the way, a perfectly localized (a perfectly "modular")
decision.

You keep worrying about what happens if a procedure mutates a string
and then is "surprised" when that mutation affects some other string.
No procedure need worry about that -- it is entirely up to code which
_creates_ a mutation-sharing string to only pass it to a mutating
procedure when it knows what it's doing.

> Standard-conforming scheme programs may IMO
> rely on this fact [that non-eq? strings can't possibly share
> mutations]

Freestanding portable standard programs can rely on that fact -- and
adding mutation-sharing substrings doesn't change that in the
slightest.

Modules intended to be combined with other modules that may be using
non-standard features can _not_ rely on that fact and have never been
able to rely on that fact. Mutation-sharing substrings drive that
point home but they aren't the only reason.

But that property of modules is not really a problem in practice. You
have to have a pretty twisted Scheme programming style to run into a
case where it will make a difference.

> >Mutation-sharing shared substrings are an upwards compatible extension
> >to the Scheme standard. They break no correct programs. They enable
> >new kinds of programs.

> Introducing a separate data type for mutation-sharing character arrays
> also enables new kinds of programs. The difference between a separate
> data type and the former implementation is, that mutation-sharing
> substrings could be used everywhere where an ordinary string had been
> used before. That is, the difference is a matter of being able to re-use
> existing string-handling code rather than enabling new kinds of
> programs. However, it is exactly this re-using of existing
> string-handling code issue which becomes problematic when the semantics
> of the string objects change.

I don't believe that it's problematic in the slightest. Certainly no
more so than re-using a standard Scheme module in a multi-threaded
Scheme.

> Marius, would it be an acceptable compromise to require that the
> mutation-sharing substring issue be submitted and discussed as an srfi
> before it becomes an official part of guile's core? The discussion of
> the topic in that forum would reduce the risk that the change introduces
> problems. I would then ask those who are interested to have it as a part
> of guile to submit a srfi proposal.

> (Please note that, as I have said before, I have nothing against
> providing mutation-sharing substrings as a deprecated feature for some
> period - but not as an official part of guile's core.)

I guess what really bugs me about the issue is that the feature was
(semi-) removed based on pure speculation, opinion, and apparently an
imperfect interpretation of the standard. I don't think that there was
a record of the feature causing any serious and sustained problems.
Its removal seemed a rather gratuitous snub.

-t
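The argument that sharing is a decision localized at the point where the shared substring is created can also be modelled. As before this is a Python sketch and not Guile code: `bytearray` stands in for a mutable string, a sliced `memoryview` for a mutation-sharing substring, and `upcase_in_place` is a hypothetical procedure that advertises that it mutates its argument.

```python
def upcase_in_place(chars):
    """Advertised mutator: upper-case the ASCII letters in chars."""
    for i, c in enumerate(chars):
        if 97 <= c <= 122:    # ord("a") .. ord("z")
            chars[i] = c - 32


line = bytearray(b"hello world")

# The caller creates a sharing substring on purpose, because it wants
# the mutation to show up in the containing string.
shared = memoryview(line)[6:11]
upcase_in_place(shared)
after_shared = bytes(line)

# A caller that does not want propagation passes a fresh copy instead.
copy = bytearray(line[0:5])
upcase_in_place(copy)
after_copy = bytes(line)

print(after_shared, bytes(copy), after_copy)
```

The mutating procedure is identical in both calls and needs no knowledge of sharing; whether the change propagates is settled entirely by what the caller chose to construct and pass in.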
end of thread, other threads:[~2004-03-22 21:03 UTC | newest]

Thread overview: 26+ messages
2004-01-16 9:41 Worrying development Roland Orre
2004-01-16 11:59 ` tomas
2004-01-18 21:05 ` Marius Vollmer
2004-01-18 21:58 ` Tom Lord
2004-01-22 21:47 ` Tom Lord
2004-01-22 16:11 ` Dirk Herrmann
2004-01-22 18:42 ` Tom Lord
2004-01-23 11:45 ` Dirk Herrmann
2004-01-23 17:16 ` Tom Lord
2004-01-23 21:01 ` Marius Vollmer
2004-01-23 22:18 ` Tom Lord
2004-01-24 0:27 ` Marius Vollmer
2004-01-24 0:53 ` Tom Lord
2004-01-23 22:28 ` Paul Jarc
2004-01-24 12:09 ` rm
2004-01-24 13:29 ` Marius Vollmer
2004-01-26 2:42 ` overriding car/cdr (was: Worrying development) Paul Jarc
2004-02-08 16:21 ` overriding car/cdr Dirk Herrmann
2004-02-08 18:09 ` Marius Vollmer
2004-02-08 20:56 ` Paul Jarc
2004-03-20 22:28 ` Marius Vollmer
2004-03-22 17:05 ` David Van Horn
2004-03-22 21:03 ` Marius Vollmer
2004-03-22 17:24 ` Paul Jarc
2004-01-23 22:37 ` Worrying development Dirk Herrmann
2004-01-23 23:25 ` Tom Lord