* scm_to_locale_stringbuf
From: Mike Gran @ 2009-02-03 16:25 UTC
To: guile-devel

Hi,

The description for scm_to_locale_stringbuf doesn't specify
what happens when the final multibyte character doesn't fit
in the provided string buffer.

    size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)

Say the locale is UTF-8, and the last position in BUF would be
the first byte of a two-byte character.  The right thing is not
to copy that first byte of the character into the last position
of BUF, but, instead copy a '\0'.  But there is no way to indicate
to the caller that the final '\0' is padding and not a true '\0'.

Ideas?

Thanks,

Mike Gran

* Re: scm_to_locale_stringbuf
From: Neil Jerram @ 2009-02-03 22:48 UTC
To: Mike Gran; +Cc: guile-devel

Mike Gran <spk121@yahoo.com> writes:

> Hi,
>
> The description for scm_to_locale_stringbuf doesn't specify
> what happens when the final multibyte character doesn't fit
> in the provided string buffer.
>
> size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)
>
> Say the locale is UTF-8, and the last position in BUF would be
> the first byte of a two-byte character.  The right thing is not
> to copy that first byte of the character into the last position
> of BUF, but, instead copy a '\0'.  But there is no way to indicate
> to the caller that the final '\0' is padding and not a true '\0'.

I'm afraid I don't understand the problem, on two counts.

1. The doc (in the manual) says that scm_to_locale_stringbuf doesn't
   add a terminating \0.  So presumably any \0s present must be padding.

2. The doc also says that if scm_to_locale_stringbuf's return value is > max_len
   (as it would be in your case), the caller should call it again
   with a larger buffer.

What is the caller scenario that you have in mind?

Regards,
        Neil

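The grow-and-retry protocol in Neil's second point can be sketched as follows. This is only an illustration: `copy_stringbuf` is a hypothetical stand-in that mimics the contract described in this thread for scm_to_locale_stringbuf (copy at most max_len bytes, add no terminating \0, return the length of the whole string), so that the example builds without libguile. `copy_whole_string` is likewise an invented name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in with the contract the thread describes for
   scm_to_locale_stringbuf: copy at most max_len bytes, add no
   terminating NUL, and return the byte length of the whole string. */
static size_t copy_stringbuf(const char *src, char *buf, size_t max_len)
{
    size_t needed = strlen(src);
    size_t n = needed < max_len ? needed : max_len;

    assert(buf != NULL || max_len == 0);
    if (n > 0)
        memcpy(buf, src, n);
    return needed;
}

/* Caller-side idiom: try a small stack buffer first; the return value
   tells us how many bytes the whole string needs, so a second call
   with a large enough buffer is only made when the first one fell short.
   Returns a malloc'd, NUL-terminated copy, or NULL if malloc fails. */
static char *copy_whole_string(const char *src)
{
    char small[8];
    size_t needed = copy_stringbuf(src, small, sizeof small);
    char *out = malloc(needed + 1);      /* +1 for the NUL we add ourselves */

    if (out == NULL)
        return NULL;
    if (needed <= sizeof small)
        memcpy(out, small, needed);      /* the first call got everything */
    else
        copy_stringbuf(src, out, needed);/* retry with enough room */
    out[needed] = '\0';
    return out;
}
```

With this protocol the caller never has to interpret a partially filled buffer: a return value larger than the buffer size simply means "discard and retry".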
* Re: scm_to_locale_stringbuf
From: Mike Gran @ 2009-02-03 23:46 UTC
To: guile-devel

> From: Neil Jerram neil@ossau.uklinux.net

> I'm afraid I don't understand the problem, on two counts.
>
> 1. The doc (in the manual) says that scm_to_locale_stringbuf doesn't
> add a terminating \0.  So presumably any \0s present must be padding.
>
> 2. The doc also says that if scm_to_locale_stringbuf's return value
> is > max_len (as it would be in your case), the caller should call it
> again with a larger buffer.

Right now, the internal coding of strings is an unspecified 8-bit
encoding, and is assumed to be compatible with the locale in which it
is being run.

So if I have a guile string with some 8-bit character that is between
128 and 255, it just gets passed through.  If I request the contents
of that string from C with scm_to_locale_string, it just returns the
buffer of the scheme string.

But, in future, scm_to_locale_string or scm_to_locale_stringbuf should
actually do the proper conversion to the current locale so that wide
characters are printed properly.

So, if we move the internal representation of strings away from
unspecified 8-bit data and toward something concrete, like ISO-8859-1
or UCS-4, and if a program is running in an environment whose locale
has a multibyte encoding like UTF-8, then the created locale
string could have multi-byte characters.

Consider a scheme string that is internally the single character
"LATIN SMALL LETTER A WITH ACUTE", which is U+00E1.  If the locale
were some sort of UTF-8, like en_US.utf-8, this letter should become
the two bytes 0xC3 and 0xA1 when converted to the locale.

So what should happen in this case if I call scm_to_locale_stringbuf
(str, buf, 1)?  Note that here BUF can only contain 1 byte.

Should the one byte 0xC3 be copied into it, which creates an illegal
string?  Or, should nothing be copied into it?

In either case, there should be some mechanism in the API to
provide information that an incomplete last character has occurred,
because outputting just the one byte 0xC3 would cause problems
somewhere down the road.

So what I was saying was that in this case maybe the best thing to do
would be to pad the output buffer with '\0' instead of putting in half
of a multibyte character, and then signal that there is some padding
at the end of the string.

For instance, one could have a function

    scm_to_locale_stringbufn (SCM str, char *buf, size_t max_len, size_t *len_used)

where LEN_USED is the number of bytes of the buffer that were actually used.

Sorry for the book-length explanation,

Mike Gran

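The byte values in the U+00E1 example can be checked with a small encoder. This is a minimal sketch of standard UTF-8 encoding, not code from Guile; `utf8_encode` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns the number of bytes
   written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx + three continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+00E1 the second branch applies: 0xE1 >> 6 is 3, giving the lead byte 0xC3, and the low six bits 0x21 give the continuation byte 0xA1. A one-byte buffer could therefore hold only the lead byte, which is not valid UTF-8 on its own.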
* Re: scm_to_locale_stringbuf
From: Neil Jerram @ 2009-02-04 0:23 UTC
To: Mike Gran; +Cc: guile-devel

Hi Mike,

Thanks for explaining...

Mike Gran <spk121@yahoo.com> writes:

> Right now, the internal coding of strings is an unspecified 8-bit
> encoding, and is assumed to be compatible with the locale in which it
> is being run.
>
> So if I have a guile string with some 8-bit character that is between
> 128 and 255, it just gets passed through.  If I request the contents
> of that string from C with scm_to_locale_string, it just returns the
> buffer of the scheme string.
>
> But, in future, scm_to_locale_string or scm_to_locale_stringbuf should
> actually do the proper conversion to the current locale so that wide
> characters are printed properly.
>
> So, if we move the internal representation of strings away from
> unspecified 8-bit data and toward something concrete, like ISO-8859-1
> or UCS-4, and if a program is running in an environment whose locale
> has a multibyte encoding like UTF-8, then the created locale
> string could have multi-byte characters.
>
> Consider a scheme string that is internally the single character
> "LATIN SMALL LETTER A WITH ACUTE", which is U+00E1.  If the locale
> were some sort of UTF-8, like en_US.utf-8, this letter should become
> the two bytes 0xC3 and 0xA1 when converted to the locale.

Right.  I'm happy with all this.

> So what should happen in this case if I call scm_to_locale_stringbuf
> (str, buf, 1)?  Note that here BUF can only contain 1 byte.

I think the key thing is that scm_to_locale_stringbuf () will return
2.  This tells the caller that BUF wasn't big enough.  Beyond that, we
shouldn't do something obviously misleading, but I don't think it
matters very much what we choose to do.

> Should the one byte 0xC3 be copied into it, which creates an illegal
> string?

No.  I agree that that would feel "obviously misleading".

> Or, should nothing be copied into it?

That - in other words no change to BUF at all - sounds good to me.

> In either case, there should be some mechanism in the API to
> provide information that an incomplete last character has occurred,
> because outputting just the one byte 0xC3 would cause problems
> somewhere down the road.

I don't follow your "in either case" - because in the second case we
haven't output 0xC3.  You may still be right that we need some
mechanism to say that some bytes at the end of BUF were not used, but
the case for this isn't obvious to me yet.

> So what I was saying was that in this case maybe the best thing to do
> would be to pad the output buffer with '\0' instead of putting in half
> of a multibyte character,

Padding feels wrong to me.  We wouldn't pad if the caller supplied a
BUF of length 10 and a string that needed only 3 bytes.

> Sorry for the book-length explanation,

No problem.  I think the key question remains: why is the existing API
(i.e. the existing return value) not good enough?

I guess there could be a scenario where the caller has a fixed size
buffer, and just wants to copy in as much of an arbitrary string as
will fit, and then use that possibly truncated string somehow.
Depending on the API that the string is being passed on to, any of the
following could be most useful:

- padding the unused bytes of BUF with \0 (or some other value)

- adding a single \0 (or other value) in the first unused byte

- returning a pointer (or offset in bytes) to the first unused byte

- returning the number of characters written.

Returning both <number of chars written> and <number of bytes used>
would allow the caller to do any of those efficiently, so perhaps we
should do that?

Regards,
        Neil

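To illustrate the last point, each of the four behaviours in Neil's list can be built by the caller once the number of bytes used is known. The sketch below assumes BUF holds valid UTF-8 in its first `used` bytes; all four function names are hypothetical and none of them is part of Guile.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option 1: fill every unused byte of buf with a pad value. */
static void pad_unused(char *buf, size_t max_len, size_t used, char pad)
{
    assert(buf != NULL && used <= max_len);
    memset(buf + used, pad, max_len - used);
}

/* Option 2: write a single terminator into the first unused byte.
   Returns 1 if there was room, 0 if the buffer is completely full. */
static int terminate_at(char *buf, size_t max_len, size_t used)
{
    assert(buf != NULL && used <= max_len);
    if (used == max_len)
        return 0;
    buf[used] = '\0';
    return 1;
}

/* Option 3: pointer to the first unused byte. */
static char *first_unused(char *buf, size_t used)
{
    return buf + used;
}

/* Option 4: number of characters written.  In UTF-8 every character
   has exactly one byte that is not a continuation byte (10xxxxxx),
   so counting those bytes counts the characters. */
static size_t utf8_count_chars(const char *buf, size_t used)
{
    size_t i, n = 0;

    for (i = 0; i < used; i++)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

Counting characters needs a pass over the written bytes, which is presumably why returning both counts from the conversion itself would be the more efficient choice.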
* Re: scm_to_locale_stringbuf
From: Ludovic Courtès @ 2009-02-05 22:26 UTC
To: guile-devel

Hi,

Neil Jerram <neil@ossau.uklinux.net> writes:

> I think the key thing is that scm_to_locale_stringbuf () will return
> 2.  This tells the caller that BUF wasn't big enough.  Beyond that, we
> shouldn't do something obviously misleading, but I don't think it
> matters very much what we choose to do.

Agreed.  The caller is already able to determine that something's wrong
if the return value is larger than MAX_LEN.

> I guess there could be a scenario where the caller has a fixed size
> buffer, and just wants to copy in as much of an arbitrary string as
> will fit, and then use that possibly truncated string somehow.
> Depending on the API that the string is being passed on to, any of the
> following could be most useful:
>
> - padding the unused bytes of BUF with \0 (or some other value)
>
> - adding a single \0 (or other value) in the first unused byte
>
> - returning a pointer (or offset in bytes) to the first unused byte
>
> - returning the number of characters written.
>
> Returning both <number of chars written> and <number of bytes used>
> would allow the caller to do any of those efficiently, so perhaps we
> should do that?

I would say returning both "number of bytes needed for the full string"
(as is the case) plus "number of bytes actually written" (which may be
smaller than MAX_LEN in the case of multi-byte encoding).  This would be
an addition to the API, IMO, while `scm_to_locale_stringbuf ()' would
keep behaving as described, with the limitations you outline above.

Thanks,
Ludo'.

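A function with the two results Ludovic describes might look like the sketch below. It is an assumption-laden illustration, not a Guile function: the name `latin1_to_utf8_bufn` is invented, the input is assumed to be ISO-8859-1 (one of the internal representations mentioned earlier in the thread), and the output encoding is fixed to UTF-8 rather than taken from the locale.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch.  Converts a Latin-1 string to UTF-8, writing
   only whole characters into buf.  Returns the number of bytes the
   full conversion needs and stores the number of bytes actually
   written in *len_used, which can be smaller than max_len when the
   next character is two bytes long and only one byte is left. */
static size_t latin1_to_utf8_bufn(const unsigned char *src, size_t src_len,
                                  char *buf, size_t max_len, size_t *len_used)
{
    size_t needed = 0, used = 0, i;
    int full = 0;                 /* set once a character fails to fit */

    assert(len_used != NULL);
    assert(buf != NULL || max_len == 0);

    for (i = 0; i < src_len; i++) {
        unsigned char c = src[i];
        size_t n = (c < 0x80) ? 1 : 2;   /* Latin-1 needs 1 or 2 UTF-8 bytes */

        needed += n;
        if (full || n > max_len - used) {
            full = 1;             /* stop writing, but keep counting */
            continue;
        }
        if (n == 1) {
            buf[used++] = (char)c;
        } else {
            buf[used++] = (char)(0xC0 | (c >> 6));
            buf[used++] = (char)(0x80 | (c & 0x3F));
        }
    }
    *len_used = used;
    return needed;
}
```

For the single character U+00E1 and a one-byte buffer, this returns 2 and sets `*len_used` to 0, leaving BUF untouched, which matches the behaviour Neil preferred above.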
* Re: scm_to_locale_stringbuf
From: Neil Jerram @ 2009-02-08 21:41 UTC
To: Ludovic Courtès; +Cc: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:

> I would say returning both "number of bytes needed for the full string"
> (as is the case) plus "number of bytes actually written" (which may be
> smaller than MAX_LEN in the case of multi-byte encoding).  This would be
> an addition to the API, IMO, while `scm_to_locale_stringbuf ()' would
> keep behaving as described, with the limitations you outline above.

That sounds good to me.

    Neil