* Unicode strings and symbols [not found] <E1MZiTo-0002G2-W8@cvs.savannah.gnu.org> @ 2009-08-09 18:22 ` Ludovic Courtès 2009-08-09 22:40 ` Mike Gran 0 siblings, 1 reply; 3+ messages in thread From: Ludovic Courtès @ 2009-08-09 18:22 UTC (permalink / raw) To: Michael Gran; +Cc: guile-devel Hi Mike! Glad to see Unicode has landed! :-) Here's a quick review of the big patch. "Michael Gran" <spk121@yahoo.com> writes: > Add Unicode strings and symbols > > This adds full Unicode strings as a datatype, and it adds some > minimal functionality. The terminal and port encoding is assumed > to be ISO-8859-1. Non-ISO-8859-1 characters are written or > input as string character escapes. > > The string character escapes now have 3 forms: \xXX \uXXXX and > \UXXXXXX, for unprintable characters that have 2, 4 or 6 hex digits. > > The process for writing to strings has been modified. There is now a > function scm_i_string_start_writing that does the copy-on-write > conversion if necessary. > > To compile strings that may be wide, the VM storage of strings and > string-likes has changed. > > Most string-using functions have not yet been updated and may break > when used with wide strings. Most of these comments really belong in the source code, close to the things they refer to. > libguile/ports.c | 90 +++- > libguile/ports.h | 3 + > libguile/print.c | 157 +++++-- > libguile/print.h | 1 + > libguile/read.c | 233 ++++++---- > libguile/rw.c | 2 + > libguile/socket.c | 3 + > libguile/srfi-13.c | 23 +- > libguile/strings.c | 649 +++++++++++++++++++++---- > libguile/strings.h | 59 ++- > libguile/vm-engine.h | 1 + > libguile/vm-i-loader.c | 87 +++- > module/language/assembly.scm | 12 +- > module/language/assembly/compile-bytecode.scm | 26 +- > test-suite/tests/asm-to-bytecode.test | 6 +- > 15 files changed, 1046 insertions(+), 306 deletions(-) I would feel more confident if the number of lines of tests added was proportional to the number of new C lines of code. Do you think some more tests could be added? Or maybe this will come at a later stage? > + else if (c == '\b') > + { > + SCM_DECCOL (port); > + } Style! ;-) > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port); This ought to be internal, no? > #define STRINGBUF_F_SHARED 0x100 > #define STRINGBUF_F_INLINE 0x200 > +#define STRINGBUF_F_WIDE 0x400 Although other flags miss this, can you add a comment to the right saying that this means UCS-4 encoding? > +static SCM > +make_wide_stringbuf (size_t len) > +{ > + scm_t_wchar *mem; > +#if SCM_DEBUG > + if (len < 1000) > + lenhist[len]++; > + else > + lenhist[1000]++; > +#endif I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that. Conversely, I'd make `%string-dump' and `%symbol-dump' always available (with docstring and possibly manual entry). > + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i]; Is the double cast needed? > + if (scm_i_is_narrow_string (name)) "Narrow strings" are Latin-1, right? > +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "") > #define FUNC_NAME s_scm_sys_string_dump > { > SCM_VALIDATE_STRING (1, str); > fprintf (stderr, "%p:\n", str); > fprintf (stderr, " start: %u\n", STRING_START (str)); > fprintf (stderr, " len: %u\n", STRING_LENGTH (str)); > + if (scm_i_is_narrow_string (str)) > + fprintf (stderr, " format: narrow\n"); > + else > + fprintf (stderr, " format: wide\n"); How about returning an alist with all this information instead of using printf? It would allow higher-level debugging functions to be implemented and may also be useful in unit tests. > +SCM_DEFINE (scm_sys_symbol_dump, "%symbol-dump", 1, 0, 0, (SCM sym), "") Same here. > +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0, > + (SCM string), > + "Return the bytes used to represent a character in @var{string}." > + "This will return 1 or 4.") I was wondering whether we should expose as much of the internal representation, but I think it's a good debugging aid and it can't hurt. I find the name slightly misleading, but I can't think of a better one. > - "Return character @var{k} of @var{str} using zero-origin\n" > - "indexing. @var{k} must be a valid index of @var{str}.") > + "Return character @var{k} of @var{str} using zero-origin\n" > + "indexing. @var{k} must be a valid index of @var{str}.") Please avoid unneeded formatting changes, to keep the diff smaller. > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp, > + const char *encoding, > + enum iconv_ilseq_handler handler); I suppose this would eventually become public. What do you think? Should we use a different type for HANDLER before that happens? > +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str); Should be SCM_INTERNAL. Thank you! Ludo'. ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Unicode strings and symbols 2009-08-09 18:22 ` Unicode strings and symbols Ludovic Courtès @ 2009-08-09 22:40 ` Mike Gran 2009-08-10 21:27 ` Ludovic Courtès 0 siblings, 1 reply; 3+ messages in thread From: Mike Gran @ 2009-08-09 22:40 UTC (permalink / raw) To: Ludovic Courtès; +Cc: guile-devel On Sun, 2009-08-09 at 20:22 +0200, Ludovic Courtès wrote: > Hi Mike! > I would feel more confident if the number of lines of tests added was > proportional to the number of new C lines of code. Do you think some > more tests could be added? Or maybe this will come at a later stage? I should probably add some tests now. They would have to use hex escapes, since the character encoding/decoding functionality is not ready. That's not a problem, but, it doesn't look as cool as using actual non-latin glyphs in the tests. > > > + else if (c == '\b') > > + { > > + SCM_DECCOL (port); > > + } > > Style! ;-) OK. The SCM_DECCOL et al macros are missing enclosing do/while statements, making them conflict with the if statement, so do/while statements will have to be added. > > > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port); > > This ought to be internal, no? Could be. A couple of the types are given their own print functions: scm_intprint and an scm_uintprint. Most types don't have their own print functions. Are int and uint given special treatment because of their radix term? > > > #define STRINGBUF_F_SHARED 0x100 > > #define STRINGBUF_F_INLINE 0x200 > > +#define STRINGBUF_F_WIDE 0x400 > > Although other flags miss this, can you add a comment to the right > saying that this means UCS-4 encoding? OK. > > > +#if SCM_DEBUG > > + if (len < 1000) > > + lenhist[len]++; > > + else > > + lenhist[1000]++; > > +#endif > > I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that. Conversely, > I'd make `%string-dump' and `%symbol-dump' always available (with > docstring and possibly manual entry). I like that idea. > > > + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i]; > > Is the double cast needed? Sort of. Unsigned char will successfully be implicitly cast to scm_t_wchar, so the scm_t_wchar term is just for clarity. The unsigned char term is definitely needed. Negative 8-bit chars are the upper half of the 8-bit charset (128 - 255). Casting them directly to scm_t_wchar may return 0xFFFFFF80 - 0xFFFFFFFF instead of 128-255. I don't have any problem removing the scm_t_wchar cast. Would you prefer that? > > > + if (scm_i_is_narrow_string (name)) > > "Narrow strings" are Latin-1, right? Right. > > > +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "") > > How about returning an alist with all this information instead of using > printf? OK > > +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0, > > + (SCM string), > > + "Return the bytes used to represent a character in @var{string}." > > + "This will return 1 or 4.") > > I was wondering whether we should expose as much of the internal > representation, but I think it's a good debugging aid and it can't hurt. > > I find the name slightly misleading, but I can't think of a better one. I put it in because that information needs to be available in the bytecode compiler. A slightly clearer name would probably be string-bytes-per-character, I suppose. > > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp, > > + const char *encoding, > > + enum iconv_ilseq_handler handler); > > I suppose this would eventually become public. What do you think? > Should we use a different type for HANDLER before that happens? The simplest thing would be to make some constants like scm_c_define ("STRING_ESCAPE", scm_from_int(iconveh_escape_sequence)) Something similar is done in the scm_seek function's constants, such as SEEK_CUR. > > > +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str); > > Should be SCM_INTERNAL. OK > > Thank you! Thanks for taking the time to check it out. > > Ludo'. Mike ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Unicode strings and symbols 2009-08-09 22:40 ` Mike Gran @ 2009-08-10 21:27 ` Ludovic Courtès 0 siblings, 0 replies; 3+ messages in thread From: Ludovic Courtès @ 2009-08-10 21:27 UTC (permalink / raw) To: guile-devel Hey, Mike Gran <spk121@yahoo.com> writes: [...] >> > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port); >> >> This ought to be internal, no? > > Could be. A couple of the types are given their own print functions: > scm_intprint and an scm_uintprint. Most types don't have their own > print functions. Are int and uint given special treatment because of > their radix term? Dunno. Anyway, they're not really meant to be public either. Feel free to make them internal as well, while you're at it. ;-) >> > + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i]; >> >> Is the double cast needed? > > Sort of. Unsigned char will successfully be implicitly cast to > scm_t_wchar, so the scm_t_wchar term is just for clarity. The unsigned > char term is definitely needed. Negative 8-bit chars are the upper half > of the 8-bit charset (128 - 255). Casting them directly to scm_t_wchar > may return 0xFFFFFF80 - 0xFFFFFFFF instead of 128-255. I don't have any > problem removing the scm_t_wchar cast. Would you prefer that? How about: #define STRINGBUF_INLINE_CHARS(buf) \ ((unsigned char *) SCM_CELL_OBJECT_LOC ((buf), 1)) and changing the caller to: for (i = 0; i < len; i++) mem[i] = (scm_t_wchar) STRINGBUF_INLINE_CHARS (buf)[i]; ? That would make the intent clearer to me. > I put it in because that information needs to be available in the > bytecode compiler. A slightly clearer name would probably be > string-bytes-per-character, I suppose. Agreed, let's take this name. >> > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp, >> > + const char *encoding, >> > + enum iconv_ilseq_handler handler); >> >> I suppose this would eventually become public. What do you think? >> Should we use a different type for HANDLER before that happens? > > The simplest thing would be to make some constants like > > scm_c_define ("STRING_ESCAPE", scm_from_int(iconveh_escape_sequence)) > > Something similar is done in the scm_seek function's constants, such as > SEEK_CUR. It's a C API so Scheme-level constants don't matter. I was wondering whether using `enum iconv_ilseq_handler' in the public API would be a good idea because that means that public headers include either the system's or GNU libiconv's <iconv.h> (or some libunistring header), in which case `guile.pc' must include the right `-I' flag, etc. This may slightly complicate compilation of Guile apps. Another downside is that Guile's API would be bound to the values and semantics of `iconv_ilseq_handler', and bound to iconv. One possibility to avoid th would be to define our own type similar to `iconv_ilseq_handler'. Thanks, Ludo'. ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2009-08-10 21:27 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <E1MZiTo-0002G2-W8@cvs.savannah.gnu.org> 2009-08-09 18:22 ` Unicode strings and symbols Ludovic Courtès 2009-08-09 22:40 ` Mike Gran 2009-08-10 21:27 ` Ludovic Courtès
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).