scm_to_locale_stringbuf

unofficial mirror of guile-devel@gnu.org 

* scm_to_locale_stringbuf
From: Mike Gran @ 2009-02-03 16:25 UTC
  To: guile-devel

Hi,

The description for scm_to_locale_stringbuf doesn't specify
what happens when the final multibyte character doesn't fit 
in the provided string buffer.

size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)

Say the locale is UTF-8, and the last position in BUF would be
the first byte of a two-byte character.  The right thing is not
to copy that first byte of the character into the last position
of BUF but to copy a '\0' instead.  But there is no way to indicate
to the caller that the final '\0' is padding and not a true '\0'.

Ideas?

Thanks,

Mike Gran


* Re: scm_to_locale_stringbuf
From: Neil Jerram @ 2009-02-03 22:48 UTC
  To: Mike Gran; +Cc: guile-devel

Mike Gran <spk121@yahoo.com> writes:

> Hi,
>
> The description for scm_to_locale_stringbuf doesn't specify
> what happens when the final multibyte character doesn't fit 
> in the provided string buffer.
>
> size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)
>
> Say the locale is UTF-8, and the last position in BUF would be
> the first byte of a two-byte character.  The right thing is not
> to copy that first byte of the character into the last position
> of BUF but to copy a '\0' instead.  But there is no way to indicate
> to the caller that the final '\0' is padding and not a true '\0'.

I'm afraid I don't understand the problem, on two counts.

1. The doc (in the manual) says that scm_to_locale_stringbuf doesn't
add a terminating \0.  So presumably any \0s present must be padding.

2. The doc also says that if scm_to_locale_stringbuf's return value
is > max_len (as it would be in your case), the caller should call it
again with a larger buffer.
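
A minimal sketch of that "call it again with a larger buffer" pattern,
for illustration only.  It does not link against libguile:
demo_to_stringbuf is a hypothetical stand-in that follows the contract
described above (copy at most MAX_LEN bytes, add no terminating \0,
return the number of bytes the whole string needs), and demo_copy_all
is likewise an invented name.  In real code the call would be
scm_to_locale_stringbuf (str, buf, max_len).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for scm_to_locale_stringbuf.  SRC/SRC_LEN play
   the role of the locale-encoded bytes of the Scheme string.  Copies at
   most MAX_LEN bytes, adds no terminating '\0', and returns the number
   of bytes the whole string needs.  */
static size_t
demo_to_stringbuf (const char *src, size_t src_len, char *buf, size_t max_len)
{
  size_t n = src_len < max_len ? src_len : max_len;
  assert (buf != NULL || max_len == 0);
  if (n > 0)
    memcpy (buf, src, n);
  return src_len;
}

/* Try a small stack buffer first; if the return value says it was too
   small, call again with a buffer of exactly the needed size.  Returns a
   malloc'd, '\0'-terminated copy, or NULL if allocation fails.  */
static char *
demo_copy_all (const char *src, size_t src_len)
{
  char small[8];
  size_t needed = demo_to_stringbuf (src, src_len, small, sizeof small);
  char *out = malloc (needed + 1);
  if (out == NULL)
    return NULL;
  if (needed <= sizeof small)
    memcpy (out, small, needed);      /* first call already got it all */
  else
    demo_to_stringbuf (src, src_len, out, needed);  /* second call */
  out[needed] = '\0';                 /* the caller adds the terminator */
  return out;
}
```

With this pattern the caller never looks at a truncated buffer, so a
split multibyte character at the end of BUF cannot be observed.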

What is the caller scenario that you have in mind?

Regards,
        Neil


* Re: scm_to_locale_stringbuf
From: Mike Gran @ 2009-02-03 23:46 UTC
  To: guile-devel

> From: Neil Jerram neil@ossau.uklinux.net

> I'm afraid I don't understand the problem, on two counts.
> 
> 1. The doc (in the manual) says that scm_to_locale_stringbuf doesn't
> add a terminating \0.  So presumably any \0s present must be padding.
> 
> 2. The doc also says that if scm_to_locale_stringbuf's return value
> is > max_len (as it would be in your case), the caller should call it
> again with a larger buffer.
> 

Right now, the internal coding of strings is an unspecified 8-bit
encoding, and is assumed to be compatible with the locale in which
the program is being run.

So if I have a Guile string with some 8-bit character that is between
128 and 255, it just gets passed through.  If I request the contents
of that string from C with scm_to_locale_string, it just returns the
buffer of the Scheme string.

But, in the future, scm_to_locale_string or scm_to_locale_stringbuf
should actually do the proper conversion to the current locale so that
wide characters are printed properly.

So, if we move the internal representation of strings away
from unspecified 8-bit data and toward something concrete,
like ISO-8859-1 or UCS-4, and if a program is running in a
locale that has a multibyte encoding like UTF-8, then the
created locale string could have multibyte characters.

Consider a scheme string that is internally the single
character "LATIN SMALL LETTER A WITH ACUTE", which is
U+00E1.  If the locale were some sort of UTF-8, like
en_US.utf-8, this letter should become the two bytes 0xC3
and 0xA1 when converted to the locale.
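
To make the arithmetic concrete, here is a small sketch of the standard
UTF-8 encoding rule.  The function name demo_utf8_encode is invented
for this illustration and is not part of Guile; a real implementation
would go through the locale's conversion machinery rather than assume
UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Encode the code point CP as UTF-8 into OUT (room for 4 bytes).
   Returns the number of bytes written, or 0 if CP is a surrogate or
   lies beyond U+10FFFF.  */
static size_t
demo_utf8_encode (unsigned long cp, unsigned char out[4])
{
  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      /* 110xxxxx 10xxxxxx: U+00E1 lands here and becomes 0xC3 0xA1.  */
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not characters */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

For U+00E1, the top bits (0xE1 >> 6 = 3) give the lead byte 0xC3 and
the low six bits (0x21) give the continuation byte 0xA1.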

So what should happen in this case if I call
scm_to_locale_stringbuf (str, buf, 1)?  Note that here BUF
can only contain 1 byte.  Should the one byte 0xC3 be
copied into it, which creates an illegal string?  Or
should nothing be copied into it?  In either case, there
should be some mechanism in the API to provide information
that an incomplete last character has occurred, because
outputting just the one byte 0xC3 would cause problems
somewhere down the road.

So what I was saying was that in this case maybe the best
thing to do would be to pad the output buffer with '\0'
instead of putting in half of a multibyte character, and
then signal that there is some padding at the end of the
string.

For instance, one could have a function
scm_to_locale_stringbufn (SCM str, char *buf, size_t max_len, size_t *len_used)
where LEN_USED receives the number of bytes of BUF that were
actually used.
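
A rough sketch of how such a function might behave, under some loud
assumptions: demo_to_stringbufn is a hypothetical stand-in, it takes
already-converted bytes instead of an SCM, and it assumes the locale
encoding is valid UTF-8 so that a character boundary can be found by
inspecting bytes.  A real version would have to ask the locale's
conversion routines where characters end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the proposed scm_to_locale_stringbufn.
   SRC/SRC_LEN are assumed to be valid UTF-8.  Copies only whole
   characters into BUF, stores the number of bytes copied in *LEN_USED,
   and returns the number of bytes the whole string needs.  */
static size_t
demo_to_stringbufn (const char *src, size_t src_len,
                    char *buf, size_t max_len, size_t *len_used)
{
  size_t n = src_len < max_len ? src_len : max_len;
  assert (len_used != NULL);
  assert (buf != NULL || max_len == 0);

  /* If the string is cut short and the first byte left out is a UTF-8
     continuation byte (10xxxxxx), the cut falls inside a character:
     back up until the whole character is excluded.  */
  if (n < src_len)
    while (n > 0 && ((unsigned char) src[n] & 0xC0) == 0x80)
      n--;

  if (n > 0)
    memcpy (buf, src, n);
  *len_used = n;
  return src_len;
}
```

In the one-byte example above, this returns 2, leaves BUF untouched,
and sets LEN_USED to 0.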

Sorry for the book-length explanation,

Mike Gran


* Re: scm_to_locale_stringbuf
From: Neil Jerram @ 2009-02-04  0:23 UTC
  To: Mike Gran; +Cc: guile-devel

Hi Mike,

Thanks for explaining...

Mike Gran <spk121@yahoo.com> writes:

> Right now, the internal coding of strings is an unspecified 8-bit
> encoding, and is assumed to be compatible with the locale in which it
> is being run.
>
> So if I have a guile string with some 8-bit character that is between
> 128 and 255, it just gets passed through.  If I request the contents
> of that string from C with scm_to_locale_string, it just returns the
> buffer of the scheme string.
>
> But, in future, scm_to_locale_string or scm_to_locale_stringbuf should
> actually do the proper conversion to the current locale so that wide
> characters are printed properly.
>
> So, if we move the internal representation of strings away from
> unspecified 8-bit data and toward something concrete, like ISO-8859-1
> or UCS-4, and if a program is running in a locale that has a
> multibyte encoding like UTF-8, then the created locale string could
> have multibyte characters.
>
> Consider a scheme string that is internally the single character
> "LATIN SMALL LETTER A WITH ACUTE", which is U+00E1.  If the locale
> were some sort of UTF-8, like en_US.utf-8, this letter should become
> the two bytes 0xC3 and 0xA1 when converted to the locale.

Right.  I'm happy with all this.

> So what should happen in this case if I call scm_to_locale_stringbuf
> (str, buf, 1)?  Note that here BUF can only contain 1 byte. 

I think the key thing is that scm_to_locale_stringbuf () will return
2.  This tells the caller that BUF wasn't big enough.  Beyond that, we
shouldn't do something obviously misleading, but I don't think it
matters very much what we choose to do.

> Should
> the one byte 0xC3 be copied into it, which creates an illegal
> string? 

No.  I agree that that would feel "obviously misleading".

> Or should nothing be copied into it?

That - in other words no change to BUF at all - sounds good to me.

> In either case, there should be some mechanism in the API to
> provide information that an incomplete last character has occurred,
> because outputting just the one byte 0xC3 would cause problems
> somewhere down the road.

I don't follow your "in either case" - because in the second case we
haven't output 0xC3.

You may still be right that we need some mechanism to say that some
bytes at the end of BUF were not used, but the case for this isn't
obvious to me yet.

> So what I was saying was that in this case maybe the best thing to do
> would be to pad the output buffer with '\0' instead of putting in half
> of a multibyte character,

Padding feels wrong to me.  We wouldn't pad if the caller supplied a
BUF of length 10 and a string that needed only 3 bytes.

> Sorry for the book-length explanation,

No problem.  I think the key question remains: why is the existing API
(i.e. the existing return value) not good enough?

I guess there could be a scenario where the caller has a fixed size
buffer, and just wants to copy in as much of an arbitrary string as
will fit, and then use that possibly truncated string somehow.
Depending on the API that the string is being passed on to, any of the
following could be most useful:

- padding the unused bytes of BUF with \0 (or some other value)

- adding a single \0 (or other value) in the first unused byte

- returning a pointer (or offset in bytes) to the first unused byte

- returning the number of characters written.

Returning both <number of chars written> and <number of bytes used>
would allow the caller to do any of those efficiently, so perhaps we
should do that?
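
As an illustration of how a caller could get each of those behaviours
once it knows the number of bytes used, here is a sketch.  All four
helper names are invented for this example, and the character count
assumes the bytes are valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the characters in the first USED bytes of BUF, assuming valid
   UTF-8: every byte that is not a continuation byte (10xxxxxx) starts
   a character.  */
static size_t
demo_utf8_char_count (const char *buf, size_t used)
{
  size_t i, count = 0;
  assert (buf != NULL || used == 0);
  for (i = 0; i < used; i++)
    if (((unsigned char) buf[i] & 0xC0) != 0x80)
      count++;
  return count;
}

/* Pad all unused bytes of BUF with '\0'.  */
static void
demo_pad_unused (char *buf, size_t max_len, size_t used)
{
  assert (used <= max_len);
  memset (buf + used, 0, max_len - used);
}

/* Put a single '\0' in the first unused byte, if there is one.
   Returns 1 if a terminator was written, 0 if BUF was already full.  */
static int
demo_terminate (char *buf, size_t max_len, size_t used)
{
  assert (used <= max_len);
  if (used == max_len)
    return 0;
  buf[used] = '\0';
  return 1;
}

/* Pointer to the first unused byte of BUF.  */
static char *
demo_first_unused (char *buf, size_t used)
{
  return buf + used;
}
```

The padding, terminator and pointer cases are constant-time or a single
memset given the byte count, whereas the character count needs a scan
of the buffer, which is one argument for the API returning both numbers.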

Regards,
        Neil


* Re: scm_to_locale_stringbuf
From: Ludovic Courtès @ 2009-02-05 22:26 UTC
  To: guile-devel

Hi,

Neil Jerram <neil@ossau.uklinux.net> writes:

> I think the key thing is that scm_to_locale_stringbuf () will return
> 2.  This tells the caller that BUF wasn't big enough.  Beyond that, we
> shouldn't do something obviously misleading, but I don't think it
> matters very much what we choose to do.

Agreed.  The caller is already able to determine that something's wrong
if the return value is larger than MAX_LEN.

> I guess there could be a scenario where the caller has a fixed size
> buffer, and just wants to copy in as much of an arbitrary string as
> will fit, and then use that possibly truncated string somehow.
> Depending on the API that the string is being passed on to, any of the
> following could be most useful:
>
> - padding the unused bytes of BUF with \0 (or some other value)
>
> - adding a single \0 (or other value) in the first unused byte
>
> - returning a pointer (or offset in bytes) to the first unused byte
>
> - returning the number of characters written.
>
> Returning both <number of chars written> and <number of bytes used>
> would allow the caller to do any of those efficiently, so perhaps we
> should do that?

I would suggest returning both the "number of bytes needed for the full
string" (as is the case now) and the "number of bytes actually written"
(which may be smaller than MAX_LEN with a multi-byte encoding).  This
would be an addition to the API, IMO, while `scm_to_locale_stringbuf ()'
would keep behaving as described, with the limitations you outline above.

Thanks,
Ludo'.






* Re: scm_to_locale_stringbuf
From: Neil Jerram @ 2009-02-08 21:41 UTC
  To: Ludovic Courtès; +Cc: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:

> I would say returning both "number of bytes needed for the full string"
> (as is the case) plus "number of bytes actually written" (which may be
> smaller than MAX_LEN in the case of multi-byte encoding).  This would be
> an addition to the API, IMO, while `scm_to_locale_stringbuf ()' would
> keep behaving as described, with the limitations you outline above.

That sounds good to me.

     Neil



