* Wide strings @ 2009-01-25 21:15 Mike Gran 2009-01-25 22:31 ` Ludovic Courtès 0 siblings, 1 reply; 17+ messages in thread From: Mike Gran @ 2009-01-25 21:15 UTC (permalink / raw) To: guile-devel Hi. I know there has been a lot of talk about wide characters and Unicode over the years. I'd like to see it happen, because how they are implemented will determine the future of a couple of my side projects. I could pitch in if you needed some help. I looked over the history of guile-devel, and there has been a tremendous amount of discussion about it. Also, the Schemes each seem to be inventing their own solution. Tom Lord's 2003 proposal http://lists.gnu.org/archive/html/guile-devel/2003-11/msg00036.html Marius Vollmer's idea http://lists.gnu.org/archive/html/guile-devel/2005-08/msg00029.html R6RS http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html#node_chap_1 MIT Scheme http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Internal-Representation-of-Characters.html There has also been some back-and-forth about to what extent the internal representation of strings should be accessible, whether the internal representation should be a vector or whether it can be something more efficient, and how not to completely break regular expressions. Also, there is the question of whether a wide character is a codepoint or a grapheme. Is there a current proposal on the table for how to achieve this? If you're suffering from a dearth of opinions, I certainly have some ideas. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-25 21:15 Wide strings Mike Gran @ 2009-01-25 22:31 ` Ludovic Courtès 2009-01-25 23:32 ` Neil Jerram 2009-01-26 0:16 ` Mike Gran 0 siblings, 2 replies; 17+ messages in thread From: Ludovic Courtès @ 2009-01-25 22:31 UTC (permalink / raw) To: guile-devel Hello! Mike Gran <spk121@yahoo.com> writes: > Hi. I know there has been a lot of talk about wide characters and > Unicode over the years. I'd like to see it happen because how the are > implemented will determine the future of a couple of my side-projects. > I could pitch in, if you needed some help. Indeed, it looks like you have some experience with GuCu! ;-) I agree it would be really nice to have Unicode support, but I'm not aware of any "plan", so please go ahead! :-) A few considerations regarding the inevitable debate about the internal string representation: 1. IMO it'd be nice to have ASCII strings special-cased so that they are always encoded in ASCII. This would allow for memory savings since, e.g., most symbols are expected to contain only ASCII characters. It might also simplify interaction with C in certain cases; for instance, it would make it easy to have statically initialized ASCII Scheme strings [0]. 2. O(1) `string-{ref,set!}' is somewhat mandated by parts of SRFI-13. For instance, `substring' takes indices as parameters, `string-index' returns an index, etc. (John Cowan once argued that an abstract type to represent the position would remove this limitation [1], but the fact is that we have to live with SRFI-13). 3. GLib et al. like UTF-8, and it'd be nice to minimize the overhead when interfacing with these libs (e.g., by avoiding translations from one string representation to another). 4. It might be nice to be friendly to `wchar_t' and friends. Interestingly, some of these things are contradictory. Will Clinger has a good summary of a range of possible implementations: https://trac.ccs.neu.edu/trac/larceny/wiki/StringRepresentations Thanks, Ludo'. 
[0] http://thread.gmane.org/gmane.lisp.guile.devel/7998 [1] http://lists.r6rs.org/pipermail/r6rs-discuss/2007-April/002252.html ^ permalink raw reply [flat|nested] 17+ messages in thread
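[Editor's note: Ludovic's considerations 1 and 2 — narrow special-casing plus O(1) `string-ref' — could be sketched roughly as below. This is illustrative C only; all names are invented and this is not Guile's actual stringbuf layout.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only -- not Guile's real internals.  A stringbuf
 * that special-cases narrow (ASCII) data for memory savings, while
 * keeping indexed access O(1) as SRFI-13 effectively requires. */
typedef struct {
    int wide;      /* 0: one byte per character; 1: four bytes */
    size_t len;    /* length in characters, not bytes */
    union {
        unsigned char *narrow;  /* one byte per character */
        uint32_t *chars;        /* one code point per character */
    } buf;
} stringbuf;

/* O(1) indexed access regardless of the representation in use. */
uint32_t stringbuf_ref(const stringbuf *s, size_t i)
{
    return s->wide ? s->buf.chars[i] : s->buf.narrow[i];
}

stringbuf stringbuf_from_ascii(const char *src)
{
    stringbuf s;
    s.wide = 0;
    s.len = strlen(src);
    s.buf.narrow = malloc(s.len + 1);
    memcpy(s.buf.narrow, src, s.len + 1);
    return s;
}
```

A wide constructor would fill `buf.chars` instead; either way, callers go through the same accessor.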
* Re: Wide strings 2009-01-25 22:31 ` Ludovic Courtès @ 2009-01-25 23:32 ` Neil Jerram 2009-01-26 20:24 ` Ludovic Courtès 2009-01-26 0:16 ` Mike Gran 1 sibling, 1 reply; 17+ messages in thread From: Neil Jerram @ 2009-01-25 23:32 UTC (permalink / raw) To: Ludovic Courtès; +Cc: guile-devel 2009/1/25 Ludovic Courtès <ludo@gnu.org>: > > I agree it would be really nice to have Unicode support, but I'm not > aware of any "plan", so please go ahead! :-) Indeed. > A few considerations regarding the inevitable debate about the internal > string representation: [...] But what about the other possible debate, about the API? Are you thinking that we should accept R6RS's choice? (I really haven't read up on all this enough - however when reading Tom Lord's analysis just now, I was thinking "why not just specify that things like char-upcase don't work in the difficult cases", and it seems to me that this is what R6RS chose to do. So at first glance the R6RS API looks OK to me. (Although I read them at the time, I can't remember now what Tom's remaining concerns with the R6RS proposal were; should probably go back and read those again. On the other hand, Tom did eventually vote for R6RS, so I would guess that they can't have been that bad.)) Regards, Neil ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-25 23:32 ` Neil Jerram @ 2009-01-26 20:24 ` Ludovic Courtès 0 siblings, 0 replies; 17+ messages in thread From: Ludovic Courtès @ 2009-01-26 20:24 UTC (permalink / raw) To: guile-devel Hello! Neil Jerram <neiljerram@googlemail.com> writes: > But what about the other possible debate, about the API? Are you > thinking that we should accept R6RS's choice? No, I think we have SRFI-1[34] to start with, both of which are well defined in the context of Unicode. > (I really haven't read up on all this enough - however when reading > Tom Lord's analysis just now, I was thinking "why not just specify > that things like char-upcase don't work in the difficult cases", and > it seems to me that this is what R6RS chose to do. So at first glance > the R6RS API looks OK to me. Regarding `ß' (German eszet), which is one of the "difficult cases" mentioned by Tom Lord, SRFI-13 reads: Some characters case-map to more than one character. For example, the Latin-1 German eszet character upper-cases to "SS." * This means that the R5RS function char-upcase is not well-defined, since it is defined to produce a (single) character result. * It means that an in-place string-upcase! procedure cannot be reliably defined, since the original string may not be long enough to contain the result -- an N-character string might upcase to a 2N-character result. * It means that case-insensitive string-matching or searching is quite tricky. For example, an N-character string s might match a 2N-character string s'. And then: SRFI 13 makes no attempt to deal with these issues; it uses a simple 1-1 locale- and context-independent case-mapping. I think it's reasonable to stick to this approach at first, at least. Locale-dependent case folding is part of `(ice-9 i18n)' anyway. Thanks, Ludo'. ^ permalink raw reply [flat|nested] 17+ messages in thread
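[Editor's note: the simple-vs-full case-mapping distinction above can be made concrete with a toy sketch. Only a couple of cases are handled here; real implementations use the full Unicode tables.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simple 1-1 mapping, as SRFI-13 uses: one character in, one out.
 * Unicode's *simple* uppercase mapping leaves U+00DF (eszet)
 * unchanged, because "SS" does not fit in a single character. */
uint32_t simple_upcase(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    return cp;  /* includes U+00DF; real tables cover far more */
}

/* Full mapping: string-valued, so an N-character string may upcase
 * to a 2N-character result -- the problem described above. */
size_t full_upcase(uint32_t cp, uint32_t out[2])
{
    if (cp == 0x00DF) {  /* eszet -> "SS" */
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    }
    out[0] = simple_upcase(cp);
    return 1;
}
```

The 1-1 function can back an in-place `string-upcase!`; the full mapping cannot, since the result length may differ from the input length.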
* Re: Wide strings 2009-01-25 22:31 ` Ludovic Courtès 2009-01-25 23:32 ` Neil Jerram @ 2009-01-26 0:16 ` Mike Gran 2009-01-26 15:21 ` Mike Gran 2009-01-26 21:40 ` Ludovic Courtès 1 sibling, 2 replies; 17+ messages in thread From: Mike Gran @ 2009-01-26 0:16 UTC (permalink / raw) To: guile-devel > From: Ludovic Courtès ludo@gnu.org I believe that we should aim for R6RS strings. I think the most important thing is to have humility in the face of an impossible problem: how to encode all textual information. It is important to "stand on the shoulders of giants" here. It becomes a matter of deciding which actively developed library of wide character functions is to be used and how to integrate it. There are 3 good, actively developed solutions of which I am aware. 1. Use GNU libc functionality. Encode wide strings as wchar_t. 2. Use GLib functionality. Encode wide strings as UTF-8. Possibly give up on O(1). Possibly add indexing information to string to allow O(1), which might negate the space advantage of UTF-8. 3. Use IBM's ICU4c. Encode wide strings as UTF-16. Thus, add an obscure dependency. Option 3 is likely a non-starter, because it seems that Guile has tried to avoid adding new non-GNU dependencies. It is technologically a great solution, IMHO. Option 1 is probably the way to go, because it keeps Guile close to the metal and keeps dependencies out of it. Unfortunately, UTF-8 strings would require conversion. > 1. IMO it'd be nice to have ASCII strings special-cased so that they > are always encoded in ASCII. This would allow for memory savings > since, e.g., most symbols are expected to contain only ASCII > characters. It might also simplify interaction with C in certain > cases; for instance, it would make it easy to have statically > initialized ASCII Scheme strings. Why not? It does solve the initialization problem of dealing with strings before setlocale has been called. Let's say that a string is a union of either an ASCII char vector or a wchar_t vector. 
A "character" then is just a Unicode codepoint. String-ref returns a wchar_t. This is all in line with R6RS as I understand it. There could then be a separate iterator and function set that does (likely O(n)) operations on the grapheme clusters of strings. A grapheme cluster is a single written symbol which may be made up of several codepoints. Unicode Standard Annex #29 describes how to partition a string into a set of graphemes.[1] There is the problem of systems where wchar_t is 2 bytes instead of 4 bytes, like Cygwin. For those systems, I'd recommend restricting functionality to 16-bit characters instead of trying to add an extra UTF-16 encoding/decoding step. I think there should always be a complete codepoint in each wchar_t. -- Mike Gran [1] http://www.unicode.org/reports/tr29/ ^ permalink raw reply [flat|nested] 17+ messages in thread
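[Editor's note: the codepoint/grapheme distinction Mike raises can be shown with a toy counter. Real UAX #29 segmentation needs the full Unicode property tables; here only combining diacritics (U+0300..U+036F) are treated as cluster continuations, which is enough to see why grapheme operations are O(n) over code points.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy approximation: a code point continues the previous grapheme
 * cluster iff it is a combining diacritical mark. */
int is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count grapheme clusters in a UTF-32 buffer: each non-combining
 * code point starts a new cluster. */
size_t count_graphemes(const uint32_t *s, size_t n)
{
    size_t clusters = 0;
    for (size_t i = 0; i < n; i++)
        if (!is_combining(s[i]))
            clusters++;
    return clusters;
}
```

So a two-codepoint sequence like "e" + U+0301 (combining acute) is one written symbol but two `string-ref' positions.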
* Re: Wide strings 2009-01-26 0:16 ` Mike Gran @ 2009-01-26 15:21 ` Mike Gran 2009-01-26 21:40 ` Ludovic Courtès 1 sibling, 0 replies; 17+ messages in thread From: Mike Gran @ 2009-01-26 15:21 UTC (permalink / raw) To: Mike Gran, guile-devel > > Ludo sez, > Mike sez, > > 1. IMO it'd be nice to have ASCII strings special-cased so that they > > are always encoded in ASCII. This would allow for memory savings > > since, e.g., most symbols are expected to contain only ASCII > > characters. It might also simplify interaction with C in certain > > cases; for instance, it would make it easy to have statically > > initialized ASCII Scheme strings. > > Why not? It does solve the initialization problem of dealing with strings > before setlocale has been called. One thing I only just noticed today is that the first 256 Unicode chars are ISO-8859-1. Maybe then the idea should be to make strings where the size of the character can be 1 or 4 bytes but the encoding is always in Unicode code points, which in the 1-byte-char case would then be coincidentally ISO-8859-1. Thanks, Mike Gran ^ permalink raw reply [flat|nested] 17+ messages in thread
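[Editor's note: the observation that U+0000..U+00FF coincide with ISO-8859-1 is what makes the 1-byte/4-byte scheme cheap. A minimal sketch of the consequence:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Because the first 256 Unicode code points are exactly ISO-8859-1,
 * "widening" a 1-byte-per-char string into code points is a plain
 * element-wise zero-extension -- no lookup table needed. */
void widen_latin1(const unsigned char *narrow, size_t n, uint32_t *wide)
{
    for (size_t i = 0; i < n; i++)
        wide[i] = narrow[i];  /* the byte value *is* the code point */
}
```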
* Re: Wide strings 2009-01-26 0:16 ` Mike Gran 2009-01-26 15:21 ` Mike Gran @ 2009-01-26 21:40 ` Ludovic Courtès 2009-01-27 5:38 ` Mike Gran 1 sibling, 1 reply; 17+ messages in thread From: Ludovic Courtès @ 2009-01-26 21:40 UTC (permalink / raw) To: guile-devel Hello, Mike Gran <spk121@yahoo.com> writes: > There are 3 good, actively developed solutions of which I am aware. > > 1. Use GNU libc functionality. Encode wide strings as wchar_t. That'd be POSIX functionality, actually. > 2. Use GLib functionality. Encode wide strings as UTF-8. Possibly > give up on O(1). Possibly add indexing information to string to allow > O(1), which might negate the space advantage of UTF-8. Technically, depending on GLib would seem unreasonable to me. :-) BTW, Gnulib has a wealth of modules that could be helpful here: http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode I used a few of them in Guile-R6RS-Libs to implement `string->utf8' and such like. > 3. Use IBM's ICU4c. Encode wide strings as UTF-16. Thus, add an > obscure dependency. > > Option 3 is likely a non-starter, because it seems that Guile has > tried to avoid adding new non-GNU dependencies. It is technologically > a great solution, IMHO. At first sight, I'd rather avoid it as a dependency, if that's possible, but that's mostly subjective. > Let's say that a string is a union of either an ASCII char vector or a > wchar_t vector. A "character" then is just a Unicode codepoint. > String-ref returns a wchar_t. This is all in line with R6RS as I > understand it. Yes, that seems easily doable. > There could then be a separate iterator and function set that does > (likely O(n)) operations on the grapheme clusters of strings. A > grapheme cluster is a single written symbol which may be made up of > several codepoints. Unicode Standard Annex #29 describes how to > partition a string into a set of graphemes.[1] Hmm, that seems like a difficult topic. It's not even mentioned in SRFI-13. 
I suppose it can be addressed at a later stage, possibly by providing a specific API. > There is the problem of systems where wchar_t is 2 bytes instead of 4 > bytes, like Cygwin. For those systems, I'd recommend > restricting functionality to 16-bit characters instead of trying to > add an extra UTF-16 encoding/decoding step. I think there should > always be a complete codepoint in each wchar_t. Agreed. The GNU libc doc concurs (info "(libc) Extended Char Intro"). However, given this limitation, and other potential portability issues, it's still unclear to me whether this would be a good choice. We need to look more closely at what Gnulib has to offer, IMO. Thanks, Ludo'. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-26 21:40 ` Ludovic Courtès @ 2009-01-27 5:38 ` Mike Gran 2009-01-27 5:52 ` Mike Gran 2009-01-27 18:59 ` Ludovic Courtès 0 siblings, 2 replies; 17+ messages in thread From: Mike Gran @ 2009-01-27 5:38 UTC (permalink / raw) To: guile-devel Hello, > Ludo' sez >> Mike Gran <spk121@yahoo.com> writes: > BTW, Gnulib has a wealth of modules that could be helpful here: > http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode > I used a few of them in Guile-R6RS-Libs to implement `string->utf8' > and such like. The Gnulib routines seem perfectly complete. I was unaware of them. It wasn't clear to me at first glance if wide regex is supported, but, otherwise, they are fine. >> There could then be a separate iterator and function set that does >> (likely O(n)) operations on the grapheme clusters of strings. A >> grapheme cluster is a single written symbol which may be made up of >> several codepoints. Unicode Standard Annex #29 describes how to >> partition a string into a set of graphemes.[1] > Hmm, that seems like a difficult topic. It's not even mentioned in > SRFI-13. I suppose it can be addressed at a later stage, possibly > by providing a specific API. Fair enough. With wide strings in place, this could all be done in pure Scheme anyway, and end up in some library. I brought it up really just to note the codepoint / grapheme problem. > [...] We need to look more closely at what Gnulib has to offer, IMO. Gnulib works for me. Bruno is the maintainer of those funcs, so I'm sure they work great. So really the first questions to answer are the encoding question and whether the R6RS string API is the goal. For the latter, I think the R6RS / SRFI-13 API is simple enough. I like it as a goal. For the former, I rather like the idea that a string will internally be encoded either as 4-byte chars of UTF-32 or 1-byte chars of ISO-8859-1. 
Since the first 256 chars of UTF-32 are ISO-8859-1, it makes it trivial for string-ref/set to work with codepoints. (Though, such a scheme would force scm_take_locale_string to become scm_take_iso88591_string.) > Thanks, > Ludo'. Thanks, Mike ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-27 5:38 ` Mike Gran @ 2009-01-27 5:52 ` Mike Gran 2009-01-27 9:50 ` Andy Wingo 2009-01-27 18:59 ` Ludovic Courtès 1 sibling, 1 reply; 17+ messages in thread From: Mike Gran @ 2009-01-27 5:52 UTC (permalink / raw) To: guile-devel I said > (Though, such a scheme would force scm_take_locale_string to become > scm_take_iso88591_string.) which is incorrect. Under the proposed scheme, scm_take_locale_string would only be able to use that storage directly if it happened to be ASCII or 8859-1. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-27 5:52 ` Mike Gran @ 2009-01-27 9:50 ` Andy Wingo 0 siblings, 0 replies; 17+ messages in thread From: Andy Wingo @ 2009-01-27 9:50 UTC (permalink / raw) To: Mike Gran; +Cc: guile-devel On Tue 27 Jan 2009 06:52, Mike Gran <spk121@yahoo.com> writes: > I said > >> (Though, such a scheme would force scm_take_locale_string to become >> scm_take_iso88591_string.) > > which is incorrect. Under the proposed scheme, scm_take_locale_string > would only be able to use that storage directly if it happened to be > ASCII or 8859-1. Perhaps as part of this, we should add scm_{from,take}_{ascii,iso88591,ucs32}_string. This would help greatly when you know the format of data that you're writing to object code or in C, but you don't know the locale of the user. Andy ps. Good luck! Having this problem looked at, with an eye to solutions, makes me very happy :-)) -- http://wingolog.org/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-27 5:38 ` Mike Gran 2009-01-27 5:52 ` Mike Gran @ 2009-01-27 18:59 ` Ludovic Courtès 2009-01-28 16:44 ` Mike Gran 1 sibling, 1 reply; 17+ messages in thread From: Ludovic Courtès @ 2009-01-27 18:59 UTC (permalink / raw) To: guile-devel Hi! Mike Gran <spk121@yahoo.com> writes: > Gnulib works for me. Bruno is the maintainer of those funcs, so I'm > sure they work great. Good! > So really the first questions to answer are the encoding question and > whether the R6RS string API is the goal. SRFI-1[34] (i.e., status quo in terms of supported APIs) seems like a reasonable milestone. > For the former, I rather like the idea that a string will internally > be encoded either as 4-byte chars of UTF-32 or 1-byte chars > of ISO-8859-1. Since the first 256 chars of UTF-32 are ISO-8859-1, it > makes it trivial for string-ref/set to work with codepoints. Good to know. That would give us O(1) ref/set!, and with Latin-1 special-cased, we'd have memory saving when interpreting Latin-1 code, which is good. > (Though, such a scheme would force scm_take_locale_string to become > scm_take_iso88591_string.) I think it would not *become* scm_take_iso88591_string, but scm_take_iso88591_string (and others, as Andy suggested) would *complement* it. Thanks, Ludo'. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-27 18:59 ` Ludovic Courtès @ 2009-01-28 16:44 ` Mike Gran 2009-01-28 18:36 ` Andy Wingo 2009-01-28 20:44 ` Clinton Ebadi 0 siblings, 2 replies; 17+ messages in thread From: Mike Gran @ 2009-01-28 16:44 UTC (permalink / raw) To: guile-devel Hi, Let's say that one possible goal is to add wide strings * using Gnulib functions * with minimal changes to the public Guile API * where chars become 4-byte codepoints and strings are internally either UTF-32 or ISO-8859-1 Since I need this functionality taken care of, and since I have some time to play with it, what's the procedure here? Should I mock something up and submit it as a patch? If I did, it would likely be a big patch. Do we need to talk more about what needs to be accomplished? Do we need a complete specification? Do we need a vote on if it is a good idea? Pragmatically, I see that this can be broken up into three steps. (Not for public use. Just as programming subtasks.) 1. Convert the internal char and string representation to be explicitly ISO 8859-1. Add the to/from locale conversion functionality while still retaining 8-bit strings. Replace C library funcs with Gnulib string funcs where appropriate. 2. Convert the internal representation of chars to 4-byte codepoints, while still retaining 8-bit strings. 3. Convert strings to be a union of 1 byte and 4 byte chars. Thanks, Mike Gran ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-28 16:44 ` Mike Gran @ 2009-01-28 18:36 ` Andy Wingo 2009-01-29 0:01 ` Ludovic Courtès 2009-01-28 20:44 ` Clinton Ebadi 1 sibling, 1 reply; 17+ messages in thread From: Andy Wingo @ 2009-01-28 18:36 UTC (permalink / raw) To: Mike Gran; +Cc: guile-devel Hi, On Wed 28 Jan 2009 17:44, Mike Gran <spk121@yahoo.com> writes: > Since I need this functionality taken care of, and since I have some > time to play with it, what's the procedure here? The best thing IMO would be to hack on it on a Git branch, with small and correct patches. We could get you commit access if you don't already have it (Ludo or Neil would have to reply on that). Then you could push your work directly to a branch, so we all can review it easily. > Do we need to talk more about what needs to be accomplished? Do we > need a complete specification? Do we need a vote on if it is a good > idea? I think you're going in the right direction. More importantly, although I can't speak for them, Neil and Ludo seem to think so too. > 1. Convert the internal char and string representation to be > explicitly ISO 8859-1. Add the to/from locale conversion functionality > while still retaining 8-bit strings. Replace C library funcs with > Gnulib string funcs where appropriate. Sounds appropriate to me. I am unfamiliar with the Gnulib code; where do the Unicode codepoint tables live? How does one update them? Do we get full introspection on characters and their classes, properties, etc? > 2. Convert the internal representation of chars to 4-byte > codepoints, while still retaining 8-bit strings. Currently, characters are immediate values, with an 8-bit tag. See tags.h:333. So it seems we have 24 bits remaining, and Unicode claims that 21 bits are the minimum necessary -- so we're good, if you can figure out a reasonable way to go from a 32-bit codepoint to a 24-bit codepoint. > 3. Convert strings to be a union of 1 byte and 4 byte chars. There's room on stringbufs to have a flag, I think. 
Dunno if that's the right way to do it. Converting the symbols and keywords code to do the right thing will be a little bit of work, too. Happy hacking, Andy -- http://wingolog.org/ ^ permalink raw reply [flat|nested] 17+ messages in thread
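[Editor's note: Andy's tag arithmetic — 8-bit tag, 24 bits left, 21 needed for Unicode — can be sketched as follows. The tag value here is invented; Guile's real tag layout lives in tags.h.]

```c
#include <assert.h>
#include <stdint.h>

/* With an 8-bit type tag in the low bits of an immediate, 24 bits
 * remain for payload, and Unicode scalar values need at most 21
 * (up to U+10FFFF) -- so every character fits as an immediate. */
#define CHAR_TAG 0xF5u  /* hypothetical 8-bit tag for characters */

uint32_t make_char(uint32_t codepoint)
{
    return (codepoint << 8) | CHAR_TAG;  /* 21 bits fit in the upper 24 */
}

uint32_t char_value(uint32_t immediate)
{
    return immediate >> 8;
}

int is_char(uint32_t immediate)
{
    return (immediate & 0xFFu) == CHAR_TAG;
}
```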
* Re: Wide strings 2009-01-28 18:36 ` Andy Wingo @ 2009-01-29 0:01 ` Ludovic Courtès 2009-01-30 0:15 ` Neil Jerram 0 siblings, 1 reply; 17+ messages in thread From: Ludovic Courtès @ 2009-01-29 0:01 UTC (permalink / raw) To: guile-devel Hi, Andy Wingo <wingo@pobox.com> writes: > On Wed 28 Jan 2009 17:44, Mike Gran <spk121@yahoo.com> writes: > >> Since I need this functionality taken care of, and since I have some >> time to play with it, what's the procedure here? > > The best thing IMO would be to hack on it on a Git branch, with small > and correct patches. We could get you commit access if you don't already > have it (Ludo or Neil would have to reply on that). Then you could push > your work directly to a branch, so we all can review it easily. Yep, setting up a branch is the easiest way. You can then post updates or requests for comments as things progress. We'll need you to assign the copyright for your changes to the FSF as well (I'll send you an email sometime later, I need to go to sleep now). In the meantime, you can browse the GNU Coding Standards. :-) >> Do we need to talk more about what needs to be accomplished? Do we >> need a complete specification? Do we need a vote on if it is a good >> idea? > > I think you're going in the right direction. More importantly, although > I can't speak for them, Neil and Ludo seem to think so too. Yes, as far as I'm concerned. I know you're probably more knowledgeable than I am on this issue and I'm confident. >> 1. Convert the internal char and string representation to be >> explicitly ISO 8859-1. Add the to/from locale conversion functionality >> while still retaining 8-bit strings. Replace C library funcs with >> Gnulib string funcs where appropriate. > > Sounds appropriate to me. +1. >> 2. Convert the internal representation of chars to 4-byte >> codepoints, while still retaining 8-bit strings. > > Currently, characters are immediate values, with an 8-bit tag. See > tags.h:333. 
So it seems we have 24 bits remaining, and unicode claims > that 21 bits are the minimum necessary -- so we're good, if you can > figure out a reasonable way to go from a 32-bit codepoint to a 24-bit > codepoint. Good (code)point. It might be that we'll have to resort to cells for chars themselves, while storing raw `wchar_t' in a stringbuf content. >> 3. Convert strings to be a union of 1 byte and 4 byte chars. > > There's room on stringbufs to have a flag, I think. Dunno if that's the > right way to do it. I had something like that in mind. > Converting the symbols and keywords code to do the > right thing will be a little bit of work, too. Not if it's handled at the level of stringbufs, I think. BTW, while BDW-GC isn't used, make sure to update `scm_i_stringbuf_free ()' and friends so that they pass the right number of bytes that are to be freed... Thanks, Ludo'. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-29 0:01 ` Ludovic Courtès @ 2009-01-30 0:15 ` Neil Jerram 0 siblings, 0 replies; 17+ messages in thread From: Neil Jerram @ 2009-01-30 0:15 UTC (permalink / raw) To: Mike Gran; +Cc: guile-devel ludo@gnu.org (Ludovic Courtès) writes: >>> Do we need to talk more about what needs to be accomplished? Do we >>> need a complete specification? Do we need a vote on if it is a good >>> idea? >> >> I think you're going in the right direction. More importantly, although >> I can't speak for them, Neil and Ludo seem to think so too. > > Yes, as far as I'm concerned. I know you're probably more knowledgeable > than I am on this issue and I'm confident. For the record, I'm happy too - in fact I'm excited that Guile is finally going to have wide strings. Technically I think I'm less of an expert here than everyone else who has commented, so I'm happy to defer to the apparent consensus. I see that Clinton's idea goes beyond that, but (IIUC) it looks like we don't lose anything by going with Latin-1/UTF-32 now, and then Clinton's idea could be added later; is that correct? Regards, Neil ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Wide strings 2009-01-28 16:44 ` Mike Gran 2009-01-28 18:36 ` Andy Wingo @ 2009-01-28 20:44 ` Clinton Ebadi 2009-01-28 23:49 ` Ludovic Courtès 1 sibling, 1 reply; 17+ messages in thread From: Clinton Ebadi @ 2009-01-28 20:44 UTC (permalink / raw) To: guile-devel Mike Gran <spk121@yahoo.com> writes: > Hi, > > Let's say that one possible goal is to add wide strings > * using Gnulib functions > * with minimal changes to the public Guile API > * where chars become 4-byte codepoints and strings are internally > either UTF-32 or ISO-8859-1 > > Since I need this functionality taken care of, and since I have some > time to play with it, what's the procedure here? Should I mock > something up and submit it as a patch? If I did, it would likely be > a big patch. Do we need to talk more about what needs to be > accomplished? Do we need a complete specification? Do we need > a vote on if it is a good idea? You should take a look at Common Lisp strings[0] and streams[1]. The gist is that a string is a uniform array of some subtype of `character'[2], and character streams have an :external-encoding--character data is converted to/from that format when writing/reading the stream. Guile should be a bit more opaque and just specify the string as being an ordered sequence of characters, and provide conversion functions to/from uniform byte[^0] arrays in some explicitly specified encoding. The `scm_{to|from}_locale_string' functions provide enough abstraction to make this doable without breaking anything that doesn't use `scm_take_locale_string' (and even then Guile can detect when the locale is not UCS-4, revert to `scm_from_locale_string' and `free' the taken string immediately after conversion). This could be enhanced with `scm_{to|from}_encoded_string ({char*|SCM} string, enum encoding)' functions. > Pragmatically, I see that this can be broken up into three steps. > (Not for public use. Just as programming subtasks.) > > 1. 
Convert the internal char and string representation to be > explicitly ISO 8859-1. Add the to/from locale conversion functionality > while still retaining 8-bit strings. Replace C library funcs with > Gnulib string funcs where appropriate. Initially, I would suggest just using UCS-4 internally and iconv[3] to handle conversion to/from the locale-dependent encodings for C. Converting to an external encoding within `scm_to_{}_string' has minimal overhead really--the stringbuf has to be copied anyway (likewise for `scm_from_{}_string'). If you are writing the externally encoded string to a stream it is even cheaper--no memory need be allocated during conversion. I think it is acceptable to restrict the encoding of the string passed to `scm_take_string'. If you are constructing strings that Guile can take possession of, you probably have a bit of control over the encoding; if you don't, generating a string and throwing it away more or less immediately is still pretty cheap if malloc doesn't suck. Adding a `scm_take_encoded_string' and removing the guarantee from `scm_take_locale_string' that Guile will not copy the string seems to be all that is needed to make taking strings work more or less transparently. > 2. Convert the internal representation of chars to 4-byte > codepoints, while still retaining 8-bit strings. > > 3. Convert strings to be a union of 1 byte and 4 byte chars. After getting a basic implementation done using a fixed-width internal encoding, rather than doing something like this, it seems better to make the internal encoding flexible. Basically `make-string' would be extended with an :internal-encoding argument, or a new `make-string-with-internal-encoding' (with a better name perhaps) introduced to explicitly specify the internal encoding the application desires. An encoding would be implemented as a protocol of some sort that implemented a few primitive operations: conversion to UCS-4[^1], length, substring, concatenate, indexed ref, and indexed set! 
seem to be the minimal set for an optimizable implementation. Indices would have an unspecified type to allow for fancy internal encodings--e.g. a tree of some sort of UTF-8 codepoints that supported fast substring and concatenation. Allowing an internal encoding to not implement a destructive set! opens up some interesting optimizations for purely functional strings (e.g. for representing things like Emacs buffers using fancy persistent trees that are efficiently updateable and can maintain an edit history with nearly nil overhead). Does this seem reasonable? [0] http://www.lispworks.com/documentation/HyperSpec/Body/16_a.htm [1] http://www.lispworks.com/documentation/HyperSpec/Body/21_a.htm [2] http://www.lispworks.com/documentation/HyperSpec/Body/13_a.htm [3] http://www.gnu.org/software/libiconv/ [4] http://www.lispworks.com/documentation/HyperSpec/Body/f_by_by.htm [5] http://www.lispworks.com/documentation/HyperSpec/Body/f_ldb.htm [6] http://www.lispworks.com/documentation/HyperSpec/Body/f_dpb.htm#dpb [^0] `byte'[4] in CL language is some arbitrary width sequence of bits; e.g. a /traditional/ byte would be of type `(byte 0 7)' and a 32-bit machine word `(byte 0 31)'. Unrelatedly, you can do some neat things using these arbitrary width bytes with `ldb'[5]/`dpb'[6]. [^1] Minimally; ideally an internal encoding would be passed any format iconv understands and if possible convert directly to that, but if not use UCS-4 and punt to the default conversion function instead. -- emacsen: "Like... windows are portals man... emacsen: Dude... let's yank this shit out of the kill ring" ^ permalink raw reply [flat|nested] 17+ messages in thread
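[Editor's note: Clinton's "encoding protocol" could be sketched as a small vtable that generic string procedures dispatch through. All names here are invented; only two of the primitive operations are shown.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Each internal encoding supplies a table of primitive operations;
 * generic string procedures know nothing about the representation. */
typedef struct {
    size_t   (*length)(const void *buf);
    uint32_t (*ref)(const void *buf, size_t idx);
    /* ... substring, concatenate, to-UCS-4, optional set! ... */
} str_encoding;

/* A Latin-1 instance of the protocol (NUL-terminated for simplicity). */
size_t latin1_length(const void *buf)
{
    return strlen((const char *) buf);
}

uint32_t latin1_ref(const void *buf, size_t i)
{
    return ((const unsigned char *) buf)[i];
}

const str_encoding latin1_encoding = { latin1_length, latin1_ref };

/* A generic string-ref dispatching through the protocol. */
uint32_t generic_ref(const str_encoding *enc, const void *buf, size_t i)
{
    return enc->ref(buf, i);
}
```

An encoding that omits `set!` would simply leave that slot NULL, which is how the purely functional representations Clinton mentions could opt out of destructive update.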
* Re: Wide strings 2009-01-28 20:44 ` Clinton Ebadi @ 2009-01-28 23:49 ` Ludovic Courtès 0 siblings, 0 replies; 17+ messages in thread From: Ludovic Courtès @ 2009-01-28 23:49 UTC (permalink / raw) To: guile-devel Hello, Clinton Ebadi <clinton@unknownlamer.org> writes: > The `scm_{to|from}_locale_string' functions provide enough abstraction > to make this doable without breaking anything that doesn't use > `scm_take_locale_string' (and even then Guile can detect when the locale > is not UCS-4, revert to `scm_from_locale_string' and `free' the taken > string immediately after conversion). This could be enhanced with > `scm_{to|from}_encoded_string ({char*|SCM} string, enum encoding)' > functions. Yes. > Basically `make-string' would be extended with an :internal-encoding > argument, or a new `make-string-with-internal-encoding' (with a better > name perhaps) introduced to explicitly specify the internal encoding > the application desires. I think the internal encoding of strings should not be exposed to applications, at least not at the Scheme level. Nevertheless, one could argue that it might be useful to expose it at the C level, precisely to allow the efficient interaction with C code, but that would need to be done with care, so that Guile is not over-constrained. Thanks, Ludo'. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2009-01-30 0:15 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-01-25 21:15 Wide strings Mike Gran 2009-01-25 22:31 ` Ludovic Courtès 2009-01-25 23:32 ` Neil Jerram 2009-01-26 20:24 ` Ludovic Courtès 2009-01-26 0:16 ` Mike Gran 2009-01-26 15:21 ` Mike Gran 2009-01-26 21:40 ` Ludovic Courtès 2009-01-27 5:38 ` Mike Gran 2009-01-27 5:52 ` Mike Gran 2009-01-27 9:50 ` Andy Wingo 2009-01-27 18:59 ` Ludovic Courtès 2009-01-28 16:44 ` Mike Gran 2009-01-28 18:36 ` Andy Wingo 2009-01-29 0:01 ` Ludovic Courtès 2009-01-30 0:15 ` Neil Jerram 2009-01-28 20:44 ` Clinton Ebadi 2009-01-28 23:49 ` Ludovic Courtès