From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: ludo@gnu.org (Ludovic =?iso-8859-1?Q?Court=E8s?=) Newsgroups: gmane.lisp.guile.devel Subject: Unicode strings and symbols Date: Sun, 09 Aug 2009 20:22:22 +0200 Message-ID: <87skg0sv8h.fsf@gnu.org> References: NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: ger.gmane.org 1249842178 13772 80.91.229.12 (9 Aug 2009 18:22:58 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Sun, 9 Aug 2009 18:22:58 +0000 (UTC) Cc: guile-devel@gnu.org To: "Michael Gran" Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Sun Aug 09 20:22:51 2009 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1MaD2h-0001VM-5D for guile-devel@m.gmane.org; Sun, 09 Aug 2009 20:22:51 +0200 Original-Received: from localhost ([127.0.0.1]:46708 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1MaD2g-0002r9-H8 for guile-devel@m.gmane.org; Sun, 09 Aug 2009 14:22:50 -0400 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1MaD2M-0002kN-Ou for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:30 -0400 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1MaD2J-0002iS-4V for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:30 -0400 Original-Received: from [199.232.76.173] (port=42256 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1MaD2I-0002iJ-MV for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:26 -0400 Original-Received: from mail1-relais-roc.national.inria.fr ([192.134.164.82]:44723) by monty-python.gnu.org with esmtps (TLS-1.0:RSA_ARCFOUR_SHA1:16) (Exim 4.60) (envelope-from ) id 1MaD2I-00005W-06 for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:26 -0400 X-IronPort-AV: E=Sophos;i="4.43,349,1246831200"; d="scan'208";a="34208092" Original-Received: from reverse-83.fdn.fr (HELO nixey) ([80.67.176.83]) by mail1-relais-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES128-SHA; 09 Aug 2009 20:22:24 +0200 X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 22 Thermidor an 217 de la =?iso-8859-1?Q?R=E9volutio?= =?iso-8859-1?Q?n?= X-PGP-Key-ID: 0xEA52ECF4 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 821D 815D 902A 7EAB 5CEE D120 7FBA 3D4F EB1F 5364 X-OS: x86_64-unknown-linux-gnu In-Reply-To: (Michael Gran's message of "Sat, 08 Aug 2009 09:44:49 +0000") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux) X-detected-operating-system: by monty-python.gnu.org: Genre and OS details not recognized. X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:9044 Archived-At: Hi Mike! Glad to see Unicode has landed! :-) Here's a quick review of the big patch. "Michael Gran" writes: > Add Unicode strings and symbols > > This adds full Unicode strings as a datatype, and it adds some > minimal functionality. The terminal and port encoding is assumed > to be ISO-8859-1. Non-ISO-8859-1 characters are written or > input as string character escapes. > > The string character escapes now have 3 forms: \xXX \uXXXX and > \UXXXXXX, for unprintable characters that have 2, 4 or 6 hex digits. > > The process for writing to strings has been modified. There is now a > function scm_i_string_start_writing that does the copy-on-write > conversion if necessary. > > To compile strings that may be wide, the VM storage of strings and > string-likes has changed. > > Most string-using functions have not yet been updated and may break > when used with wide strings. Most of these comments really belong in the source code, close to the things they refer to. > libguile/ports.c | 90 +++- > libguile/ports.h | 3 + > libguile/print.c | 157 +++++-- > libguile/print.h | 1 + > libguile/read.c | 233 ++++++---- > libguile/rw.c | 2 + > libguile/socket.c | 3 + > libguile/srfi-13.c | 23 +- > libguile/strings.c | 649 +++++++++++++++++++++---- > libguile/strings.h | 59 ++- > libguile/vm-engine.h | 1 + > libguile/vm-i-loader.c | 87 +++- > module/language/assembly.scm | 12 +- > module/language/assembly/compile-bytecode.scm | 26 +- > test-suite/tests/asm-to-bytecode.test | 6 +- > 15 files changed, 1046 insertions(+), 306 deletions(-) I would feel more confident if the number of lines of tests added was proportional to the number of new C lines of code. Do you think some more tests could be added? Or maybe this will come at a later stage? > + else if (c == '\b') > + { > + SCM_DECCOL (port); > + } Style! ;-) > +SCM_API void scm_charprint (scm_t_uint32 c, SCM port); This ought to be internal, no? > #define STRINGBUF_F_SHARED 0x100 > #define STRINGBUF_F_INLINE 0x200 > +#define STRINGBUF_F_WIDE 0x400 Although other flags miss this, can you add a comment to the right saying that this means UCS-4 encoding? > +static SCM > +make_wide_stringbuf (size_t len) > +{ > + scm_t_wchar *mem; > +#if SCM_DEBUG > + if (len < 1000) > + lenhist[len]++; > + else > + lenhist[1000]++; > +#endif I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that. Conversely, I'd make `%string-dump' and `%symbol-dump' always available (with docstring and possibly manual entry). > + (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i]; Is the double cast needed? > + if (scm_i_is_narrow_string (name)) "Narrow strings" are Latin-1, right? > +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "") > #define FUNC_NAME s_scm_sys_string_dump > { > SCM_VALIDATE_STRING (1, str); > fprintf (stderr, "%p:\n", str); > fprintf (stderr, " start: %u\n", STRING_START (str)); > fprintf (stderr, " len: %u\n", STRING_LENGTH (str)); > + if (scm_i_is_narrow_string (str)) > + fprintf (stderr, " format: narrow\n"); > + else > + fprintf (stderr, " format: wide\n"); How about returning an alist with all this information instead of using printf? It would allow higher-level debugging functions to be implemented and may also be useful in unit tests. > +SCM_DEFINE (scm_sys_symbol_dump, "%symbol-dump", 1, 0, 0, (SCM sym), "") Same here. > +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0, > + (SCM string), > + "Return the bytes used to represent a character in @var{string}." > + "This will return 1 or 4.") I was wondering whether we should expose as much of the internal representation, but I think it's a good debugging aid and it can't hurt. I find the name slightly misleading, but I can't think of a better one. > - "Return character @var{k} of @var{str} using zero-origin\n" > - "indexing. @var{k} must be a valid index of @var{str}.") > + "Return character @var{k} of @var{str} using zero-origin\n" > + "indexing. @var{k} must be a valid index of @var{str}.") Please avoid unneeded formatting changes, to keep the diff smaller. > +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp, > + const char *encoding, > + enum iconv_ilseq_handler handler); I suppose this would eventually become public. What do you think? Should we use a different type for HANDLER before that happens? > +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str); Should be SCM_INTERNAL. Thank you! Ludo'.