* String internals sketch @ 2017-03-10 15:31 David Kastrup 2017-03-10 16:08 ` Andy Wingo 2017-03-11 0:30 ` Matt Wette 0 siblings, 2 replies; 6+ messages in thread From: David Kastrup @ 2017-03-10 15:31 UTC (permalink / raw) To: guile-user Hi, I've been mulling a bit over enabling UTF-8 as an internal string representation. There are several interesting points to consider: a) Guile already uses two different internal representations: basically UCS-8 and UCS-32. Adding more internal representations could be done using a number of tables indexed by the internal representation type, making string representations sort of a "plugin". b) Scheme, at least older dialects, have several O(1) guarantees. If the chosen internal representations can be configured with some fluid (like it is done with port encodings), possibly as a list in priority order, one can retain an O(1) guarantee as the default behavior while allowing different mechanisms for different requirements. c) Indexing is the most important thing one wants to be fast. For an utf-8 internal representation, a lot is achieved if one caches both last index and last byte offset, preferably also string length as index and byte length. For each indexing operation, one can then first check the last string-middle cache, and then scan forwards or backwards to do the indexing (if string start or string end are closer, one can scan from there), assuming reasonably sanitized behavior. That will make sequential processing close to O(1) behavior. d) a bad complication is write access to strings, for example with -- Scheme Procedure: string-map! proc s [start [end]] -- C Function: scm_string_map_x (proc, s, start, end) PROC is a char->char procedure, it is mapped over S. The order in which the procedure is applied to the string elements is not specified. The string S is modified in-place, the return value is not specified. The current string character can gain a longer or a shorter byte length in this process. This requires being able to deal with insertion/deletion of bytes at the current position. One way to do that is to have a _gap_ for the purpose of string-map! and similar functions and copy material from the end of the gap to its start while iterating. Now this starts looking so much like the memory management Emacs buffers that it isn't funny. It also is _the_ natural string representation to use for things like with-output-to-string-buffer and with-input-to-string-buffer. Basically, one valid string representation would coincide with an utf-8 based string buffer. This would seem like it would also give a boost to some Guile operations and would offer possibilities for making "string representation plugins" for special purposes. This should offer a lot of leeway to gradually suck up the best parts of things like Emacs' string representation and code conversion facilities into a Guile kernel without being tied down by the current Guile facilities and without having stuff like Emacs strings black boxes mostly inaccessible to Guile programming. It should also allow to bounce stuff like sanitized utf-8 through Guile without incurring constant conversion costs. So it should provide a _large_ opportunity for the sakes of applications with profiles akin to Emacs or LilyPond. -- David Kastrup ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: String internals sketch 2017-03-10 15:31 String internals sketch David Kastrup @ 2017-03-10 16:08 ` Andy Wingo 2017-03-10 16:51 ` David Kastrup 2017-03-15 15:46 ` David Kastrup 2017-03-11 0:30 ` Matt Wette 1 sibling, 2 replies; 6+ messages in thread From: Andy Wingo @ 2017-03-10 16:08 UTC (permalink / raw) To: David Kastrup; +Cc: guile-user Hi :) On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes: > a) Guile already uses two different internal representations: basically > UCS-8 and UCS-32. Adding more internal representations could be done > using a number of tables indexed by the internal representation type, > making string representations sort of a "plugin". I think we probably want to avoid this if we can. We gain a number of efficiencies if we can be concrete. Of course there are counterexamples in which specialization can help, like the 20-some string kinds in V8, for example: cons strings, latin1 strings, utf-16 strings, external strings, slices, and the product of all of those; but I am hesitant to take on this cost. If we switched to UTF-8 strings, I would like to use it as our only string representation. Sure would be nice to have cons strings though! (That would give O(1) string-append.) > b) Scheme, at least older dialects, have several O(1) guarantees. R7RS seems to have relaxed this FWIW. O(1) is great of course but there are reasonable cases to be made for O(log N) being a good tradeoff if you get other benefits. > c) Indexing is the most important thing one wants to be fast. For an > utf-8 internal representation, a lot is achieved if one caches both last > index and last byte offset, preferably also string length as index and > byte length. Consider threads though :/ Caches get a bit complicated there. > d) a bad complication is write access to strings, for example with > > -- Scheme Procedure: string-map! proc s [start [end]] > -- C Function: scm_string_map_x (proc, s, start, end) TBH I wouldn't worry too much about this function in particular; you could map characters into to a vector and then write those characters back to the string. Most modern languages of course have read-only strings, and destructive operations on strings are mostly used when filling buffers. That said, this point: > The current string character can gain a longer or a shorter byte length > in this process. Is especially gnarly in the threaded case; string updates are no longer atomic. One thread mutating a string might actually corrupt another thread. Right now on x86, updates are entirely atomic; on other processors that need barriers, the worst that could happen is that another thread could fail to read an update. We'd have to re-add the string write mutex, which would be a bit sad :) > So it should provide a _large_ opportunity for the sakes of applications > with profiles akin to Emacs or LilyPond. I'm sympathetic :) Lots of details to get right (or wrong!) though. WDYT about the threading aspects? Andy ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: String internals sketch 2017-03-10 16:08 ` Andy Wingo @ 2017-03-10 16:51 ` David Kastrup 2017-03-15 15:46 ` David Kastrup 1 sibling, 0 replies; 6+ messages in thread From: David Kastrup @ 2017-03-10 16:51 UTC (permalink / raw) To: Andy Wingo; +Cc: guile-user Andy Wingo <wingo@pobox.com> writes: > Hi :) > > On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes: > >> a) Guile already uses two different internal representations: basically >> UCS-8 and UCS-32. Adding more internal representations could be done >> using a number of tables indexed by the internal representation type, >> making string representations sort of a "plugin". > > I think we probably want to avoid this if we can. We gain a number of > efficiencies if we can be concrete. Well, the crucial question is where to hook the specialization for the best compromise between performance, versatilty and effort. > Of course there are counterexamples in which specialization can help, > like the 20-some string kinds in V8, for example: cons strings, latin1 > strings, utf-16 strings, external strings, slices, and the product of > all of those; but I am hesitant to take on this cost. If we switched > to UTF-8 strings, I would like to use it as our only string > representation. > > Sure would be nice to have cons strings though! (That would give O(1) > string-append.) String buffers should have O(1) string-append. >> b) Scheme, at least older dialects, have several O(1) guarantees. > > R7RS seems to have relaxed this FWIW. O(1) is great of course but there > are reasonable cases to be made for O(log N) being a good tradeoff if > you get other benefits. A guaranteed O(log N) is not exactly trivial to come by for utf-8 encoded strings either. >> c) Indexing is the most important thing one wants to be fast. For an >> utf-8 internal representation, a lot is achieved if one caches both >> last index and last byte offset, preferably also string length as >> index and byte length. > > Consider threads though :/ Caches get a bit complicated there. If there is actual access of several threads on the same string, a single-element cache is going to be mostly useless. Don't know about typical implementation costs and implementations of per-thread variables, so I can't make a useful proposal here. >> d) a bad complication is write access to strings, for example with >> >> -- Scheme Procedure: string-map! proc s [start [end]] >> -- C Function: scm_string_map_x (proc, s, start, end) > > TBH I wouldn't worry too much about this function in particular; It's certainly not an important function regarding its use frequency, but it's a good worst case study and is nicely covered by the gapped-string model. > That said, this point: > >> The current string character can gain a longer or a shorter byte >> length in this process. > > Is especially gnarly in the threaded case; string updates are no longer > atomic. One thread mutating a string might actually corrupt another > thread. Right now on x86, updates are entirely atomic; on other > processors that need barriers, the worst that could happen is that > another thread could fail to read an update. We'd have to re-add the > string write mutex, which would be a bit sad :) A gapped string does not have a lot of contention, but of course if two threads wanted to write in different places, there would need to be a way to either lock the gap until the end of the current operation or to support multiple gaps in progress. >> So it should provide a _large_ opportunity for the sakes of >> applications with profiles akin to Emacs or LilyPond. > > I'm sympathetic :) Lots of details to get right (or wrong!) though. Sure. Not all implementations need to get along without locks, by the way. I am not a fan of providing choice between several bad options instead of getting one version right, but "right" is so large a field here, particularly when considering embedding Guile into other systems, that it might be a worthwhile goal to at least have the "right for any purpose" goal met at the framework layer, and then try improving the implementations from there. > WDYT about the threading aspects? Good for headaches. But it might make sense diversifying the headaches into the several implementations. -- David Kastrup ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: String internals sketch 2017-03-10 16:08 ` Andy Wingo 2017-03-10 16:51 ` David Kastrup @ 2017-03-15 15:46 ` David Kastrup 1 sibling, 0 replies; 6+ messages in thread From: David Kastrup @ 2017-03-15 15:46 UTC (permalink / raw) To: guile-user Andy Wingo <wingo@pobox.com> writes: > Hi :) > > On Fri 10 Mar 2017 16:31, David Kastrup <dak@gnu.org> writes: > >> a) Guile already uses two different internal representations: basically >> UCS-8 and UCS-32. Adding more internal representations could be done >> using a number of tables indexed by the internal representation type, >> making string representations sort of a "plugin". > > I think we probably want to avoid this if we can. We gain a number of > efficiencies if we can be concrete. Maybe use GOOPS on the string ops by diversifying on string subtypes? One would start by making UCS-32 and UCS-8 string implementations subtypes of strings, then add UTF-8-internal (which is identical to UTF-8 for all-valid UTF-8)? I doubt that this would beat a vector of dedicated hooks but it might look more transparent to implementors with special needs? Maybe not: they might not approach this from a Guile angle of thinking. -- David Kastrup ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: String internals sketch 2017-03-10 15:31 String internals sketch David Kastrup 2017-03-10 16:08 ` Andy Wingo @ 2017-03-11 0:30 ` Matt Wette 2017-03-11 6:44 ` David Kastrup 1 sibling, 1 reply; 6+ messages in thread From: Matt Wette @ 2017-03-11 0:30 UTC (permalink / raw) To: David Kastrup; +Cc: guile-user > On Mar 10, 2017, at 7:31 AM, David Kastrup <dak@gnu.org> wrote: > I've been mulling a bit over enabling UTF-8 as an internal string > representation. There are several interesting points to consider: One point to consider on this topic is that guile is targeted (via the compiler tower) to support other languages. Strings in ecmascript, for example, are 16-bit code points, not 8 or 32. I personally don’t care that the guile-2.2 hosted ecmascript can’t be a pedantic implementation, but it may generate issues so some down the road. (BTW, the other area in which guile does not implement ecmascript is in representation of numbers ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: String internals sketch 2017-03-11 0:30 ` Matt Wette @ 2017-03-11 6:44 ` David Kastrup 0 siblings, 0 replies; 6+ messages in thread From: David Kastrup @ 2017-03-11 6:44 UTC (permalink / raw) To: guile-user Matt Wette <matt.wette@gmail.com> writes: >> On Mar 10, 2017, at 7:31 AM, David Kastrup <dak@gnu.org> wrote: >> I've been mulling a bit over enabling UTF-8 as an internal string >> representation. There are several interesting points to consider: > > One point to consider on this topic is that guile is targeted (via the > compiler tower) to support other languages. Strings in ecmascript, > for example, are 16-bit code points, not 8 or 32. I personally don’t > care that the guile-2.2 hosted ecmascript can’t be a pedantic > implementation, but it may generate issues so some down the road. Well, that's an "engine" one would need to create complete support for. Some things would be nicer to specify via C++ templates rather than function hooks. C macros can sort-of be used for providing similar functionality, but they are quite more opaque to compiler, debugger, and programmer and more ad-hoc. Not proposing to move to C++: that would seriously impact the ability to integrate into other systems. Just a musing. -- David Kastrup ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-03-15 15:46 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-03-10 15:31 String internals sketch David Kastrup 2017-03-10 16:08 ` Andy Wingo 2017-03-10 16:51 ` David Kastrup 2017-03-15 15:46 ` David Kastrup 2017-03-11 0:30 ` Matt Wette 2017-03-11 6:44 ` David Kastrup
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).