* Unicode strings and symbols
[not found] <E1MZiTo-0002G2-W8@cvs.savannah.gnu.org>
@ 2009-08-09 18:22 ` Ludovic Courtès
2009-08-09 22:40 ` Mike Gran
0 siblings, 1 reply; 3+ messages in thread
From: Ludovic Courtès @ 2009-08-09 18:22 UTC (permalink / raw)
To: Michael Gran; +Cc: guile-devel
Hi Mike!
Glad to see Unicode has landed! :-)
Here's a quick review of the big patch.
"Michael Gran" <spk121@yahoo.com> writes:
> Add Unicode strings and symbols
>
> This adds full Unicode strings as a datatype, and it adds some
> minimal functionality. The terminal and port encoding is assumed
> to be ISO-8859-1. Non-ISO-8859-1 characters are written or
> input as string character escapes.
>
> The string character escapes now have 3 forms: \xXX \uXXXX and
> \UXXXXXX, for unprintable characters that have 2, 4 or 6 hex digits.
>
> The process for writing to strings has been modified. There is now a
> function scm_i_string_start_writing that does the copy-on-write
> conversion if necessary.
>
> To compile strings that may be wide, the VM storage of strings and
> string-likes has changed.
>
> Most string-using functions have not yet been updated and may break
> when used with wide strings.
Most of these comments really belong in the source code, close to the
things they refer to.
> libguile/ports.c | 90 +++-
> libguile/ports.h | 3 +
> libguile/print.c | 157 +++++--
> libguile/print.h | 1 +
> libguile/read.c | 233 ++++++----
> libguile/rw.c | 2 +
> libguile/socket.c | 3 +
> libguile/srfi-13.c | 23 +-
> libguile/strings.c | 649 +++++++++++++++++++++----
> libguile/strings.h | 59 ++-
> libguile/vm-engine.h | 1 +
> libguile/vm-i-loader.c | 87 +++-
> module/language/assembly.scm | 12 +-
> module/language/assembly/compile-bytecode.scm | 26 +-
> test-suite/tests/asm-to-bytecode.test | 6 +-
> 15 files changed, 1046 insertions(+), 306 deletions(-)
I would feel more confident if the number of lines of tests added was
proportional to the number of new C lines of code. Do you think some
more tests could be added? Or maybe this will come at a later stage?
> + else if (c == '\b')
> + {
> + SCM_DECCOL (port);
> + }
Style! ;-)
> +SCM_API void scm_charprint (scm_t_uint32 c, SCM port);
This ought to be internal, no?
> #define STRINGBUF_F_SHARED 0x100
> #define STRINGBUF_F_INLINE 0x200
> +#define STRINGBUF_F_WIDE 0x400
Although other flags miss this, can you add a comment to the right
saying that this means UCS-4 encoding?
> +static SCM
> +make_wide_stringbuf (size_t len)
> +{
> + scm_t_wchar *mem;
> +#if SCM_DEBUG
> + if (len < 1000)
> + lenhist[len]++;
> + else
> + lenhist[1000]++;
> +#endif
I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that. Conversely,
I'd make `%string-dump' and `%symbol-dump' always available (with
docstring and possibly manual entry).
> + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i];
Is the double cast needed?
> + if (scm_i_is_narrow_string (name))
"Narrow strings" are Latin-1, right?
> +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "")
> #define FUNC_NAME s_scm_sys_string_dump
> {
> SCM_VALIDATE_STRING (1, str);
> fprintf (stderr, "%p:\n", str);
> fprintf (stderr, " start: %u\n", STRING_START (str));
> fprintf (stderr, " len: %u\n", STRING_LENGTH (str));
> + if (scm_i_is_narrow_string (str))
> + fprintf (stderr, " format: narrow\n");
> + else
> + fprintf (stderr, " format: wide\n");
How about returning an alist with all this information instead of using
printf?
It would allow higher-level debugging functions to be implemented and
may also be useful in unit tests.
> +SCM_DEFINE (scm_sys_symbol_dump, "%symbol-dump", 1, 0, 0, (SCM sym), "")
Same here.
> +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0,
> + (SCM string),
> + "Return the bytes used to represent a character in @var{string}."
> + "This will return 1 or 4.")
I was wondering whether we should expose as much of the internal
representation, but I think it's a good debugging aid and it can't hurt.
I find the name slightly misleading, but I can't think of a better one.
> - "Return character @var{k} of @var{str} using zero-origin\n"
> - "indexing. @var{k} must be a valid index of @var{str}.")
> + "Return character @var{k} of @var{str} using zero-origin\n"
> + "indexing. @var{k} must be a valid index of @var{str}.")
Please avoid unneeded formatting changes, to keep the diff smaller.
> +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp,
> + const char *encoding,
> + enum iconv_ilseq_handler handler);
I suppose this would eventually become public. What do you think?
Should we use a different type for HANDLER before that happens?
> +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str);
Should be SCM_INTERNAL.
Thank you!
Ludo'.
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Unicode strings and symbols
2009-08-09 18:22 ` Unicode strings and symbols Ludovic Courtès
@ 2009-08-09 22:40 ` Mike Gran
2009-08-10 21:27 ` Ludovic Courtès
0 siblings, 1 reply; 3+ messages in thread
From: Mike Gran @ 2009-08-09 22:40 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guile-devel
On Sun, 2009-08-09 at 20:22 +0200, Ludovic Courtès wrote:
> Hi Mike!
> I would feel more confident if the number of lines of tests added was
> proportional to the number of new C lines of code. Do you think some
> more tests could be added? Or maybe this will come at a later stage?
I should probably add some tests now. They would have to use hex
escapes, since the character encoding/decoding functionality is not
ready. That's not a problem, but, it doesn't look as cool as using
actual non-latin glyphs in the tests.
>
> > + else if (c == '\b')
> > + {
> > + SCM_DECCOL (port);
> > + }
>
> Style! ;-)
OK. The SCM_DECCOL et al macros are missing enclosing do/while
statements, making them conflict with the if statement, so do/while
statements will have to be added.
>
> > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port);
>
> This ought to be internal, no?
Could be. A couple of the types are given their own print functions:
scm_intprint and an scm_uintprint. Most types don't have their own
print functions. Are int and uint given special treatment because of
their radix term?
>
> > #define STRINGBUF_F_SHARED 0x100
> > #define STRINGBUF_F_INLINE 0x200
> > +#define STRINGBUF_F_WIDE 0x400
>
> Although other flags miss this, can you add a comment to the right
> saying that this means UCS-4 encoding?
OK.
>
> > +#if SCM_DEBUG
> > + if (len < 1000)
> > + lenhist[len]++;
> > + else
> > + lenhist[1000]++;
> > +#endif
>
> I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that. Conversely,
> I'd make `%string-dump' and `%symbol-dump' always available (with
> docstring and possibly manual entry).
I like that idea.
>
> > + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i];
>
> Is the double cast needed?
Sort of. Unsigned char will successfully be implicitly cast to
scm_t_wchar, so the scm_t_wchar term is just for clarity. The unsigned
char term is definitely needed. Negative 8-bit chars are the upper half
of the 8-bit charset (128 - 255). Casting them directly to scm_t_wchar
may return 0xFFFFFF80 - 0xFFFFFFFF instead of 128-255. I don't have any
problem removing the scm_t_wchar cast. Would you prefer that?
>
> > + if (scm_i_is_narrow_string (name))
>
> "Narrow strings" are Latin-1, right?
Right.
>
> > +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "")
>
> How about returning an alist with all this information instead of using
> printf?
OK
> > +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0,
> > + (SCM string),
> > + "Return the bytes used to represent a character in @var{string}."
> > + "This will return 1 or 4.")
>
> I was wondering whether we should expose as much of the internal
> representation, but I think it's a good debugging aid and it can't hurt.
>
> I find the name slightly misleading, but I can't think of a better one.
I put it in because that information needs to be available in the
bytecode compiler. A slightly clearer name would probably be
string-bytes-per-character, I suppose.
> > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp,
> > + const char *encoding,
> > + enum iconv_ilseq_handler handler);
>
> I suppose this would eventually become public. What do you think?
> Should we use a different type for HANDLER before that happens?
The simplest thing would be to make some constants like
scm_c_define ("STRING_ESCAPE", scm_from_int(iconveh_escape_sequence))
Something similar is done in the scm_seek function's constants, such as
SEEK_CUR.
>
> > +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str);
>
> Should be SCM_INTERNAL.
OK
>
> Thank you!
Thanks for taking the time to check it out.
>
> Ludo'.
Mike
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Unicode strings and symbols
2009-08-09 22:40 ` Mike Gran
@ 2009-08-10 21:27 ` Ludovic Courtès
0 siblings, 0 replies; 3+ messages in thread
From: Ludovic Courtès @ 2009-08-10 21:27 UTC (permalink / raw)
To: guile-devel
Hey,
Mike Gran <spk121@yahoo.com> writes:
[...]
>> > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port);
>>
>> This ought to be internal, no?
>
> Could be. A couple of the types are given their own print functions:
> scm_intprint and an scm_uintprint. Most types don't have their own
> print functions. Are int and uint given special treatment because of
> their radix term?
Dunno. Anyway, they're not really meant to be public either. Feel free
to make them internal as well, while you're at it. ;-)
>> > + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i];
>>
>> Is the double cast needed?
>
> Sort of. Unsigned char will successfully be implicitly cast to
> scm_t_wchar, so the scm_t_wchar term is just for clarity. The unsigned
> char term is definitely needed. Negative 8-bit chars are the upper half
> of the 8-bit charset (128 - 255). Casting them directly to scm_t_wchar
> may return 0xFFFFFF80 - 0xFFFFFFFF instead of 128-255. I don't have any
> problem removing the scm_t_wchar cast. Would you prefer that?
How about:
#define STRINGBUF_INLINE_CHARS(buf) \
((unsigned char *) SCM_CELL_OBJECT_LOC ((buf), 1))
and changing the caller to:
for (i = 0; i < len; i++)
mem[i] = (scm_t_wchar) STRINGBUF_INLINE_CHARS (buf)[i];
?
That would make the intent clearer to me.
> I put it in because that information needs to be available in the
> bytecode compiler. A slightly clearer name would probably be
> string-bytes-per-character, I suppose.
Agreed, let's take this name.
>> > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp,
>> > + const char *encoding,
>> > + enum iconv_ilseq_handler handler);
>>
>> I suppose this would eventually become public. What do you think?
>> Should we use a different type for HANDLER before that happens?
>
> The simplest thing would be to make some constants like
>
> scm_c_define ("STRING_ESCAPE", scm_from_int(iconveh_escape_sequence))
>
> Something similar is done in the scm_seek function's constants, such as
> SEEK_CUR.
It's a C API so Scheme-level constants don't matter.
I was wondering whether using `enum iconv_ilseq_handler' in the public
API would be a good idea because that means that public headers include
either the system's or GNU libiconv's <iconv.h> (or some libunistring
header), in which case `guile.pc' must include the right `-I' flag, etc.
This may slightly complicate compilation of Guile apps. Another
downside is that Guile's API would be bound to the values and semantics
of `iconv_ilseq_handler', and bound to iconv.
One possibility to avoid th would be to define our own type similar to
`iconv_ilseq_handler'.
Thanks,
Ludo'.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2009-08-10 21:27 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <E1MZiTo-0002G2-W8@cvs.savannah.gnu.org>
2009-08-09 18:22 ` Unicode strings and symbols Ludovic Courtès
2009-08-09 22:40 ` Mike Gran
2009-08-10 21:27 ` Ludovic Courtès
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).