From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mike Gran <spk121@yahoo.com>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: unicode status
Date: Sun, 06 Sep 2009 08:02:25 -0700
Message-ID: <1252249345.17414.21280.camel@localhost.localdomain>
References: <m3fxb0ic6f.fsf@pobox.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Cc: guile-devel <guile-devel@gnu.org>
To: Andy Wingo <wingo@pobox.com>
In-Reply-To: <m3fxb0ic6f.fsf@pobox.com>
Xref: news.gmane.org gmane.lisp.guile.devel:9274
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/9274>

On Sun, 2009-09-06 at 12:45 +0200, Andy Wingo wrote:
> Hey Mike,
> 
> Would you mind posting to the list a "state of unicode & guile" summary?
> I'm very excited about finally being able to say "Guile does unicode",
> and was wondering what was left to do :)
> 
> Andy

OK.  

First, here's the stuff I've already put in NEWS:

** Characters

Characters can take the whole Unicode range.  char-upcase and
char-downcase use default Unicode casing rules.  Character comparisons
such as char<? and char-ci<? now sort based on Unicode code points.
Combining characters are printed with a dotted circle, e.g. #\◌́

** Strings

String and SRFI-13 functions can operate on Unicode strings.  Strings
can contain the new string escapes \uHHHH and \UHHHHHH for 4 and 6 hex
digit characters.
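
As a rough illustration of the fixed-width escapes described above, here
is a small sketch in Python (not Guile, and not Guile's reader).  The
helper name expand_escapes is hypothetical; it expands \xHH, \uHHHH and
\UHHHHHH, on the assumption that each escape is followed by exactly 2,
4 or 6 hex digits.

```python
import re

# Hypothetical helper, for illustration only: matches the three
# fixed-width escape forms \xHH, \uHHHH and \UHHHHHH.
_ESCAPE = re.compile(
    r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{6})"
)

def expand_escapes(text):
    """Replace each escape with the character at that code point."""
    def replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(digits, 16))
    return _ESCAPE.sub(replace, text)

print(expand_escapes(r"caf\u00E9"))   # 4-digit escape
print(expand_escapes(r"\U01D11E"))    # 6-digit escape, outside the BMP
```

Because the digit count is fixed by the escape letter, no terminator is
needed after the digits.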

** SRFI-14 char-sets are modified for Unicode

The default char-sets are no longer locale dependent and contain
characters from the whole Unicode range.  There is a new char-set,
char-set:designated, which contains all assigned Unicode characters.
There is a new debugging function: %char-set-dump.

** Ports do transcoding

Ports now have an associated character encoding, and port read/write
operations do conversion to/from locales automatically.  Ports also
have an associated strategy for how to deal with locale conversion
failures.  Four functions support this: set-port-encoding!,
port-encoding, set-port-conversion-strategy!,
port-conversion-strategy.
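
To make the idea of "a port with an encoding and a conversion strategy"
concrete, here is an analogy in Python rather than Guile.  Python's
io.TextIOWrapper takes an encoding and an errors policy, which stands in
here for the port encoding and the conversion strategy.  The helper name
read_text and the sample bytes are made up for the example, and the
Python policy names are not Guile's.

```python
import io

# ISO-8859-1 bytes for "café ÿ" plus a newline; not valid UTF-8.
raw = b"caf\xe9 \xff\n"

def read_text(data, encoding, errors):
    """Read all text from a byte stream, transcoding as we go."""
    # The wrapper plays the role of a port: it carries an encoding and
    # a policy for what to do when conversion fails.
    port = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, errors=errors)
    return port.read()

# With the right encoding, every byte converts cleanly.
latin1 = read_text(raw, "iso-8859-1", "strict")

# With the wrong encoding and a substituting policy, each undecodable
# byte becomes U+FFFD instead of raising an error.
substituted = read_text(raw, "utf-8", "replace")

print(repr(latin1))
print(repr(substituted))
```

With the "strict" policy and the wrong encoding, the same read raises
UnicodeDecodeError instead of substituting.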

** Non-ASCII source code files can be read, but require coding
   declarations

The default reader now handles source code files for some of the
non-ASCII character encodings, such as UTF-8.  A non-ASCII source file
should have an encoding declaration near the top of the file.  Also,
there is a new function 'file-encoding' that scans a port for a coding
declaration.
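
A sketch of what scanning for a coding declaration can look like,
written in Python and not taken from Guile's 'file-encoding'.  The
function name scan_encoding, the regular expression, and the choice to
look only at the first two lines are all assumptions made for the
example.

```python
import re

# Assumed pattern: an Emacs-style "coding: NAME" declaration.
_CODING = re.compile(rb"coding:\s*([-\w.]+)")

def scan_encoding(data, max_lines=2):
    """Return the declared encoding name, or None if none is found.

    Only the first max_lines lines are examined; how far "near the
    top" really extends is an assumption of this sketch.
    """
    for line in data.splitlines()[:max_lines]:
        match = _CODING.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None

source = b";;; -*- coding: utf-8 -*-\n(display \"hello\")\n"
print(scan_encoding(source))
```

The scan works on raw bytes because the encoding is not yet known when
the declaration is being looked for.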

The pre-1.9.3 reader handled 8-bit clean but otherwise unspecified
source code.  This use is now discouraged.

-------------------------------------------------------------

Here's some stuff that is complete but not working quite right:

** There are undocumented things: %string-dump, %symbol-dump, setbinary,
and a discussion about why ISO-8859-1 is the fastest encoding to process
and why it should be used by default.

** Non-ASCII symbols and keywords are supported and variables and
procedures can have non-ASCII names.

These probably need wide-symbol and wide-keyword support in the VM to
replace the locale-specific implementation that they have now, which
would avoid some corner cases where locales switch.

-------------------------------------------------------------

Here's the stuff left to be done, in no particular order:

* The disassembler doesn't handle wide strings gracefully

* Some parts of GOOPS expect 8-bit strings.  This is probably fine for
now, but it needs to be documented.  I've avoided touching this because
I've never used GOOPS for anything, so I'm not sure what does what.

* The i18n library hasn't been touched.  It should probably move to use
functions like u32_casecmp from libunistring for Unicode-capable
locale-specific sorting.  But the #ifdef and locale madness in i18n is
deep.  I've avoided hacking it.  Also we'll have to write our own
functions for locale-string->double and locale-string->int.  Bruno has
some suggestions on how to do that at
http://savannah.gnu.org/support/?106998

* I haven't done any testing on readline or gettext

* Unicode-capable regex has not been implemented.  Libunistring might do
this someday.  Until then, we will probably need the hack where strings
are converted to UTF-8 before being passed through regex.  This doesn't
get you Unicode regex, but it keeps non-ASCII text from being mangled
by regex.

* Emacs has a lot of aliases that can be used in the "-*- coding: XXXXX
-*-" line, like latin-1, that aren't valid encoding names.  The reader
should be modified to understand the common ones.

* The whole issue of R6RS compliance will have to be dealt with some
day.  For example, I went with \xHH \uHHHH and \UHHHHHH escapes because
they were backwards compatible with the \xHH we already had.  R6RS uses
a variable length hex escape terminated by a semicolon: \xHH; \xHHH;.
These are not backward compatible.  There are some R6RS functions that
are missing: string-foldcase, string normalization routines.

Also, R6RS and R5RS seem to disagree on the definition of string-upcase
et al.  R6RS is clear that the result of string-upcase can have more
letters than its input, and it gets rid of string-upcase! for the same
reason.
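
The length change is easy to see with German sharp s, whose full
Unicode uppercase mapping is two letters.  This is shown with Python's
str.upper as a stand-in, and says nothing about what Guile's
string-upcase currently returns.

```python
word = "stra\u00dfe"          # "straße", six characters
upper = word.upper()          # full case mapping turns ß into SS

print(upper, len(word), len(upper))

# A strict one-character-to-one-character mapping cannot produce this,
# which is why an in-place upcase does not fit full case mapping.
per_char = [c.upper() for c in word]
print(per_char)
```

Here the uppercased string is one character longer than its input, so
the result cannot be written back into the original string's storage.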

That's all I remember off the top of my head.

Thanks,

Mike