From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: ludo@gnu.org (Ludovic =?iso-8859-1?Q?Court=E8s?=)
Newsgroups: gmane.lisp.guile.devel
Subject: Unicode strings and symbols
Date: Sun, 09 Aug 2009 20:22:22 +0200
Message-ID: <87skg0sv8h.fsf@gnu.org>
References: <E1MZiTo-0002G2-W8@cvs.savannah.gnu.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1249842178 13772 80.91.229.12 (9 Aug 2009 18:22:58 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 9 Aug 2009 18:22:58 +0000 (UTC)
Cc: guile-devel@gnu.org
To: "Michael Gran" <spk121@yahoo.com>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Sun Aug 09 20:22:51 2009
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1MaD2h-0001VM-5D
	for guile-devel@m.gmane.org; Sun, 09 Aug 2009 20:22:51 +0200
Original-Received: from localhost ([127.0.0.1]:46708 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1MaD2g-0002r9-H8
	for guile-devel@m.gmane.org; Sun, 09 Aug 2009 14:22:50 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1MaD2M-0002kN-Ou
	for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:30 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1MaD2J-0002iS-4V
	for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:30 -0400
Original-Received: from [199.232.76.173] (port=42256 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1MaD2I-0002iJ-MV
	for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:26 -0400
Original-Received: from mail1-relais-roc.national.inria.fr ([192.134.164.82]:44723)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_ARCFOUR_SHA1:16)
	(Exim 4.60) (envelope-from <ludo@gnu.org>) id 1MaD2I-00005W-06
	for guile-devel@gnu.org; Sun, 09 Aug 2009 14:22:26 -0400
X-IronPort-AV: E=Sophos;i="4.43,349,1246831200"; d="scan'208";a="34208092"
Original-Received: from reverse-83.fdn.fr (HELO nixey) ([80.67.176.83])
	by mail1-relais-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES128-SHA;
	09 Aug 2009 20:22:24 +0200
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 22 Thermidor an 217 de la =?iso-8859-1?Q?R=E9volutio?=
	=?iso-8859-1?Q?n?=
X-PGP-Key-ID: 0xEA52ECF4
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 821D 815D 902A 7EAB 5CEE  D120 7FBA 3D4F EB1F 5364
X-OS: x86_64-unknown-linux-gnu
In-Reply-To: <E1MZiTo-0002G2-W8@cvs.savannah.gnu.org> (Michael Gran's message
	of "Sat, 08 Aug 2009 09:44:49 +0000")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)
X-detected-operating-system: by monty-python.gnu.org: Genre and OS details not
	recognized.
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:9044
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/9044>

Hi Mike!

Glad to see Unicode has landed!  :-)

Here's a quick review of the big patch.

"Michael Gran" <spk121@yahoo.com> writes:

>     Add Unicode strings and symbols
>     
>     This adds full Unicode strings as a datatype, and it adds some
>     minimal functionality.  The terminal and port encoding is assumed
>     to be ISO-8859-1.  Non-ISO-8859-1 characters are written or
>     input as string character escapes.
>     
>     The string character escapes now have 3 forms: \xXX \uXXXX and
>     \UXXXXXX, for unprintable characters that have 2, 4 or 6 hex digits.
>     
>     The process for writing to strings has been modified.  There is now a
>     function scm_i_string_start_writing that does the copy-on-write
>     conversion if necessary.
>     
>     To compile strings that may be wide, the VM storage of strings and
>     string-likes has changed.
>     
>     Most string-using functions have not yet been updated and may break
>     when used with wide strings.

Most of these comments really belong in the source code, close to the
things they refer to.

>  libguile/ports.c                              |   90 +++-
>  libguile/ports.h                              |    3 +
>  libguile/print.c                              |  157 +++++--
>  libguile/print.h                              |    1 +
>  libguile/read.c                               |  233 ++++++----
>  libguile/rw.c                                 |    2 +
>  libguile/socket.c                             |    3 +
>  libguile/srfi-13.c                            |   23 +-
>  libguile/strings.c                            |  649 +++++++++++++++++++++----
>  libguile/strings.h                            |   59 ++-
>  libguile/vm-engine.h                          |    1 +
>  libguile/vm-i-loader.c                        |   87 +++-
>  module/language/assembly.scm                  |   12 +-
>  module/language/assembly/compile-bytecode.scm |   26 +-
>  test-suite/tests/asm-to-bytecode.test         |    6 +-
>  15 files changed, 1046 insertions(+), 306 deletions(-)

I would feel more confident if the number of lines of tests added was
proportional to the number of new C lines of code.  Do you think some
more tests could be added?  Or maybe this will come at a later stage?

> +  else if (c == '\b')
> +    {
> +      SCM_DECCOL (port);
> +    }

Style!  ;-)

> +SCM_API void scm_charprint (scm_t_uint32 c, SCM port);

This ought to be internal, no?

>  #define STRINGBUF_F_SHARED      0x100
>  #define STRINGBUF_F_INLINE      0x200
> +#define STRINGBUF_F_WIDE        0x400

Although other flags miss this, can you add a comment to the right
saying that this means UCS-4 encoding?

> +static SCM
> +make_wide_stringbuf (size_t len)
> +{
> +  scm_t_wchar *mem;
> +#if SCM_DEBUG
> +  if (len < 1000)
> +    lenhist[len]++;
> +  else
> +    lenhist[1000]++;
> +#endif

I would use "#ifdef SCM_STRING_LENGTH_HISTOGRAM" for that.  Conversely,
I'd make `%string-dump' and `%symbol-dump' always available (with
docstring and possibly manual entry).

> +          (scm_t_wchar) (unsigned char) STRINGBUF_INLINE_CHARS (buf)[i];

Is the double cast needed?

> +      if (scm_i_is_narrow_string (name))

"Narrow strings" are Latin-1, right?

> +SCM_DEFINE (scm_sys_string_dump, "%string-dump", 1, 0, 0, (SCM str), "")
>  #define FUNC_NAME s_scm_sys_string_dump
>  {
>    SCM_VALIDATE_STRING (1, str);
>    fprintf (stderr, "%p:\n", str);
>    fprintf (stderr, " start: %u\n", STRING_START (str));
>    fprintf (stderr, " len:   %u\n", STRING_LENGTH (str));
> +  if (scm_i_is_narrow_string (str))
> +    fprintf (stderr, " format: narrow\n");
> +  else
> +    fprintf (stderr, " format: wide\n");

How about returning an alist with all this information instead of using
printf?

It would allow higher-level debugging functions to be implemented and
may also be useful in unit tests.

> +SCM_DEFINE (scm_sys_symbol_dump, "%symbol-dump", 1, 0, 0, (SCM sym), "")

Same here.

> +SCM_DEFINE (scm_string_width, "string-width", 1, 0, 0,
> +            (SCM string),
> +            "Return the bytes used to represent a character in @var{string}."
> +            "This will return 1 or 4.")

I was wondering whether we should expose as much of the internal
representation, but I think it's a good debugging aid and it can't hurt.

I find the name slightly misleading, but I can't think of a better one.

> -	    "Return character @var{k} of @var{str} using zero-origin\n"
> -	    "indexing. @var{k} must be a valid index of @var{str}.")
> +            "Return character @var{k} of @var{str} using zero-origin\n"
> +            "indexing. @var{k} must be a valid index of @var{str}.")

Please avoid unneeded formatting changes, to keep the diff smaller.

> +SCM_INTERNAL char *scm_to_stringn (SCM str, size_t *lenp, 
> +                                   const char *encoding,
> +                                   enum iconv_ilseq_handler handler);

I suppose this would eventually become public.  What do you think?
Should we use a different type for HANDLER before that happens?

> +SCM_API const scm_t_wchar *scm_i_string_wide_chars (SCM str);

Should be SCM_INTERNAL.

Thank you!

Ludo'.